三维场景点云理解与重建技术

Translated title of the contribution: Scene point cloud understanding and reconstruction technologies in 3D space

Jingyu Gong, Yujing Lou, Fengqi Liu, Zhiwei Zhang, Haoming Chen, Zhizhong Zhang, Xin Tan, Yuan Xie, Lizhuang Ma

Research output: Contribution to journal › Article › peer-review

11 Scopus citations

Abstract

3D scene understanding and reconstruction are essential for machine vision and intelligence. They aim to reconstruct complete models of real scenes from multiple scene scans and to understand the semantic meaning of each functional component in the scene. This technology is indispensable for real-world digitalization and simulation and can be widely applied in domains such as robotics, navigation systems, and virtual tourism. Its key challenges lie in three aspects: 1) recognizing the same area across multiple real scans and fusing all the scans into an integrated scene point cloud; 2) making sense of the whole scene and recognizing the semantics of its functional components; and 3) completing the regions of the original point cloud that are missing due to occlusion during scanning.

Fusing multiple real scene scans into an integrated point cloud requires extracting point cloud features that are invariant to scanning position and rotation. Intrinsic geometric quantities, such as point distances and the singular values of the neighborhood covariance matrix, are therefore often used in rotation-invariant feature design. A contrastive learning scheme is usually adopted so that features learned from the same area are close to each other while features extracted from different areas are pushed apart. To improve generalization, data augmentation of the scanned point clouds can also be applied during feature learning. With the learned features, the pose of the scanning device can be estimated to compute the transformation matrix between point cloud pairs, and once the transformation relationship is known, point cloud fusion can be carried out on the raw point cloud scans (see the sketch below).

To understand the whole scene from raw point clouds and segment it into functional parts according to multiple semantics, an effective and efficient network with appropriate 3D convolution operations is required to parse the entire point-based scene hierarchically, and specific learning schemes are needed to adapt to various situations. The definition and formulation of the basic convolution operation in 3D space is recognized as the core of pattern recognition for 3D scene point clouds. It is closely tied to the approximated convolution kernel in 3D space, on which feature extraction can be built through appropriate point cloud grouping and down-/up-sampling. The discrete approximation of the continuous 3D convolution aims to recognize diverse geometric patterns while keeping as few parameters as possible. Network design based on these elementary 3D convolution operations is also a fundamental part of outstanding scene parsing. Furthermore, point-level semantic segmentation of a scanned scene can be linked to related tasks such as boundary detection, instance segmentation, and scene coloring, whose auxiliary regularization provides additional supervision for the network parameters. Semi-supervised and weakly supervised methods are required to overcome the lack of annotations for real data. The segmentation results and semantic hints can then strengthen the fine-grained completion of object point clouds from the scanned scene: segmented objects can be handled separately, and semantics can provide structural and geometric priors when occlusion-induced missing regions are completed.
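As a rough illustration of the fusion pipeline described above, the sketch below shows (1) a rotation-invariant local descriptor built from point distances and the singular values of the neighborhood covariance matrix, and (2) recovery of the rigid transform between two scans from matched points via SVD (the classic Kabsch/Procrustes solution). This is a minimal sketch under assumed shapes and names, not the surveyed methods themselves.

```python
# Minimal sketch (illustrative, not the paper's method): rotation-invariant
# local features for scan matching, and rigid-transform estimation from
# correspondences found by matching those features.
import numpy as np

def rotation_invariant_descriptor(neighborhood: np.ndarray) -> np.ndarray:
    """neighborhood: (k, 3) points around a query point.
    Returns features unchanged by any rotation/translation of the scan."""
    centered = neighborhood - neighborhood.mean(axis=0)
    # Singular values of the covariance matrix capture local shape
    # (linear / planar / spherical) independent of orientation.
    cov = centered.T @ centered / len(centered)
    sigma = np.linalg.svd(cov, compute_uv=False)
    # Distances to the centroid are likewise rotation-invariant.
    dists = np.linalg.norm(centered, axis=1)
    return np.concatenate([sigma, [dists.mean(), dists.max()]])

def estimate_rigid_transform(src: np.ndarray, dst: np.ndarray):
    """Least-squares R, t such that R @ src_i + t ~= dst_i, given
    (n, 3) corresponding points obtained from feature matching."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)       # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                  # reject reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t
```

With R and t in hand, one scan can be mapped into the other's coordinate frame and the raw points merged into a single scene point cloud.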
For learning object point cloud completion, it is crucial to learn a compact latent code space that represents all complete shapes and to design a versatile decoder that reconstructs both the structure and the fine-grained geometric details of the object point cloud. The learned latent code space should cover as many complete shapes as possible, which requires large-scale synthetic model datasets for training to ensure generalization. The encoder should be designed to recognize the structure of the original point cloud and extract the specific geometric patterns that preserve this information in the latent code, while the decoder recovers the overall skeleton of the originally scanned objects and completes all the details according to the existing local geometric hints (see the sketch below). For real scanned object completion, the latent code spaces of synthetic models and real scanned point clouds must be further integrated and optimized: a cross-domain learning scheme transfers the completion knowledge to real object scans while preserving the details of the real scanned objects in the completed version.

We analyze the current state of scene understanding and reconstruction, including point cloud fusion, 3D convolution operations, whole-scene segmentation, and fine-grained object completion, review the frontier technologies, and predict promising future research trends. Future research should pay more attention to more open settings, with further challenges in computing efficiency, out-of-domain knowledge, and more complex situations involving human-scene interaction. 3D scene understanding and reconstruction technology will help machines understand the real world in a more natural way, facilitating application domains such as robotics and navigation. It also has the potential to support plausible simulation of the real world based on the reconstruction and parsing of real scenes, making it a useful tool for various decision-making tasks.
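As a rough illustration of the encoder-decoder completion pipeline described above: a PointNet-style encoder maps a partial scan to a latent code, and an MLP decoder emits a coarse complete point set, trained with a symmetric Chamfer distance. This is a minimal sketch; the network sizes, names, and loss implementation are illustrative assumptions, not the surveyed architectures.

```python
# Minimal completion sketch (illustrative): partial scan -> latent code ->
# coarse complete point set, supervised by Chamfer distance.
import torch
import torch.nn as nn

class CompletionNet(nn.Module):
    def __init__(self, latent_dim=256, num_out=1024):
        super().__init__()
        self.num_out = num_out
        # Shared per-point MLP + max-pooling: an order-invariant encoder.
        self.encoder = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(),
            nn.Linear(128, latent_dim))
        # Decoder reconstructs a fixed-size complete point set from the code.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, num_out * 3))

    def forward(self, partial):                           # partial: (B, N, 3)
        code = self.encoder(partial).max(dim=1).values    # (B, latent_dim)
        return self.decoder(code).view(-1, self.num_out, 3)

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between sets a: (B, N, 3), b: (B, M, 3)."""
    d = torch.cdist(a, b)                 # (B, N, M) pairwise distances
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()

# Usage: one supervised step on (partial scan, complete ground truth) pairs.
model = CompletionNet()
partial = torch.rand(4, 512, 3)
complete = torch.rand(4, 1024, 3)
loss = chamfer_distance(model(partial), complete)
loss.backward()
```

Cross-domain schemes for real scans would add further terms (e.g., preserving the observed partial input in the output), which are omitted here.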

Original language: Chinese (Traditional)
Pages (from-to): 1741-1766
Number of pages: 26
Journal: Journal of Image and Graphics
Volume: 28
Issue number: 6
State: Published - 2023
