
A Survey of Indoor 3D Reconstruction Based on RGB-D Cameras




Abstract:

With the advancement of consumer-grade RGB-D cameras, obtaining depth information for indoor 3D spaces has become increasingly accessible. This paper systematically reviews 3D reconstruction algorithms for indoor scenes using these cameras, serving as a reference for future research. We cover reconstruction processes and optimization algorithms for both static and dynamic scenes. Additionally, we discuss commonly used datasets, evaluation metrics, and the performance of various reconstruction algorithms. Findings indicate that the balance between reconstruction quality and speed in static scene reconstruction, as well as deformation, occlusion, and fast motion of objects in dynamic scenes are currently major concerns. Deep learning and Neural Radiance Fields (NeRF) are poised to provide new perspectives and methods to address these challenges.
This graphic abstract provides an overview of indoor 3D reconstruction using RGB-D cameras. It highlights the distinction between static and dynamic indoor environments, ...
Published in: IEEE Access ( Volume: 12)
Page(s): 112742 - 112766
Date of Publication: 13 August 2024
Electronic ISSN: 2169-3536

SECTION I.

Introduction

In recent years, with the rapid development of computer vision and artificial intelligence technologies, the application of 3D reconstruction technology in indoor environments has received widespread attention. Depending on the input data, 3D reconstruction algorithms can be divided into RGB-D camera-based, stereo array-based, visual-inertial-based, and monocular pure RGB-based types. Compared to other reconstruction methods, the RGB-D camera-based approach can directly obtain the color and depth information of each pixel, reducing the complex depth calculation process and more accurately capturing the geometric structure of objects. Additionally, this method has advantages such as high reconstruction accuracy, fast speed, and high system integration, making it very suitable for complex indoor reconstruction tasks. With the advent of consumer-grade depth cameras like Kinect and RealSense, their lower cost and higher real-time performance have greatly promoted the development and application of indoor 3D reconstruction. In smart home systems, using RGB-D cameras for indoor 3D reconstruction can generate accurate home models, enhancing the automation and intelligence levels of smart home systems. In logistics and warehouse management, it can increase the automation level of warehouse management and logistics operations, improving efficiency and accuracy. In robot navigation applications, using RGB-D cameras to generate 3D maps of the environment can enhance the robot’s autonomous navigation capabilities and task execution efficiency, improving its adaptability in complex environments. This has led many scholars to research indoor 3D reconstruction algorithms based on RGB-D cameras.

Reference [1] provides a comprehensive overview of the application of visual odometry and visual SLAM in the field of mobile robotics, discussing various sensor data fusion methods and emphasizing their application in actual robot navigation. Reference [2] summarizes the latest advancements in indoor scene modeling and discusses public datasets and programming libraries, but the technologies covered are only up to 2015. Reference [3] divides indoor scenes into static and dynamic scenes, mainly focusing on summarizing traditional reconstruction algorithms, with less attention to emerging deep learning methods. Reference [4] discusses in detail the working principles, applications, and role of RGB-D cameras in 3D reconstruction, introducing relevant datasets and future research directions. However, it lacks specific performance comparisons of the latest algorithms and technologies. Reference [5] is a recent review article on the latest indoor reconstruction algorithms, covering various RGB-D camera technologies and their application scenarios, but this article mainly focuses on static 3D reconstruction algorithms, with less consideration for applications in dynamic environments.

As we can see, some scholars have already summarized indoor reconstruction algorithms based on RGB-D cameras, but these studies have their limitations. Moreover, the development of deep learning technologies and neural radiance fields (NeRF) has also provided new directions for this field. Therefore, it is necessary to comprehensively review and summarize the applications of RGB-D cameras in indoor 3D reconstruction, providing a systematic knowledge framework to help researchers quickly understand the latest advancements and key technologies in this field. The main contributions of this paper are as follows: First, we classify indoor 3D reconstruction algorithms based on RGB-D cameras into static and dynamic scenes, revealing the advantages and disadvantages of each method through classification and comparison of different technical approaches, aiding researchers in choosing the most suitable technical path and optimizing existing methods. Second, we summarize the general process of static and dynamic 3D reconstruction algorithms and outline different reconstruction algorithms at each stage. Third, we update the applications of deep learning and neural radiance fields in this field, analyzing their advantages and disadvantages, providing new directions and solutions for future research. Fourth, we provide comprehensive RGB-D datasets and evaluation standards, offering reliable resources and tools to facilitate researchers in technical validation and performance comparison.

The main structure of this paper is as follows: Section II briefly reviews the development history of major static and dynamic 3D reconstruction algorithms in recent years. Section III provides a description of static scene reconstruction, dividing the reconstruction pipeline into different steps, and detailing the optimizations made by various researchers in each step. Section IV focuses on three-dimensional reconstruction of dynamic scenes, with the processing of dynamic objects being the main research content of this chapter. Section V introduces the datasets and evaluation metrics used in the research process. Finally, Section VI summarizes this paper.

SECTION II.

Related Work

In this section, we briefly review the development history of the main algorithms for indoor 3D reconstruction based on RGB-D cameras. Figure 1 shows some classic algorithms organized according to the timeline.

FIGURE 1. The research history of 3D reconstruction in static and dynamic scenes.

In static scenes, RGB-D cameras are the only moving objects. By capturing the camera’s trajectory, we can fuse the obtained depth data into a reconstructed model, and then extract the surface to generate a static 3D model. Reference [6] proposed the first algorithm, KinectFusion, which utilizes an RGB-D camera for real-time 3D reconstruction. Additionally, they outlined a typical pipeline for static 3D reconstruction, comprising depth map processing, camera pose estimation, scene reconstruction, and surface extraction. However, KinectFusion is limited by the voxel model and memory, and it can only reconstruct small scenes. Kintinuous [7] extended KinectFusion to large scene reconstruction by moving the voxel model. In addition, it integrated loop detection and optimization, greatly improving the reconstruction quality. Moreover, VoxelHashing [8] employed voxel hashing as a model storage approach, significantly enhancing storage efficiency. Redwood [9] used an offline method to segment the input RGB-D sequence, reconstructed each segment separately, and then registered the segments using overlapping keyframes, thereby reducing the accumulated error and obtaining high-quality 3D models. These methods are all based on voxel models. ElasticFusion [10] creatively used a surfel representation to continuously optimize the reconstructed map and improve the accuracy of reconstruction and pose estimation. It can achieve real-time high-quality surface reconstruction of small scenes. BundleFusion [11] integrated the research ideas of its predecessors and proposed a parallel optimization framework that fully utilizes sparse features together with dense geometric and photometric terms to perform sparse-to-dense correspondence matching. In terms of pose optimization, they used a local-to-global blocking strategy and added robust tracking to recover from tracking failures (i.e., relocalization), generating higher-quality reconstructions in real time compared to offline methods [9]. With the development of advanced deep learning models [12], [13] and artificial intelligence models in multimodal learning [14], [15], applying advanced neural networks to scene reconstruction has also become a significant trend. PointGroup [16] and 3D-MPA [17] applied U-Net and graph convolutional networks to 3D scenes, respectively, achieving segmentation of 3D point clouds. Reference [18] transferred pre-trained ViTs to the RGB-D domain for 3D object recognition, cross-modally fusing the RGB and depth representations co-encoded by the ViT. TR3D [19] used fusion modules to transform traditional 3D object detection methods into multimodal detection methods, demonstrating impressive performance improvements. An overview of static 3D reconstruction algorithms is shown in Table 1.

TABLE 1. Overview of RGB-D-Based Static 3D Reconstruction Methods: This Table Discusses the Details of Different Algorithms From the Aspects of Camera Tracking, Model Fusion, and Loop Closure, All of Which Are Key Technologies for Static 3D Reconstruction Methods Based on RGB-D Cameras.

In practical situations, it is inevitable to encounter dynamic objects in a scene, such as people walking or pets playing. Therefore, the assumption of a completely static environment can be easily broken. In this case, not only is the camera moving, but also the dynamic objects in the scene are moving, which makes it difficult to track the camera trajectory and leads to reconstruction failure. Therefore, we must deal with these dynamic objects. Before processing dynamic objects, it is necessary to first identify them. Since dynamic objects in a scene have different motion tendencies than static backgrounds, we can distinguish them by analyzing such motion characteristics [29], [30], [31], [33]. Another method to identify dynamic objects is based on deep learning [32], [34], [35], using prior knowledge and semantic information to directly segment dynamic objects. For camera pose estimation, a straightforward method is to treat the data of dynamic objects as outliers and remove them to eliminate their influence on camera pose [36], [37], [39], [40]. However, direct removal of dynamic objects may result in information loss and affect the quality of scene reconstruction. In contrast, using the features of dynamic objects for pose estimation is more meaningful and beneficial [38], [41]. Additionally, the model fusion strategy for dynamic scene reconstruction is also improved accordingly based on the static fusion strategy [29], [42], [43]. An overview of the dynamic 3D reconstruction algorithm is shown in Table 2.

TABLE 2. Overview of RGB-D-Based Dynamic 3D Reconstruction Methods: This Table Discusses the Details of Different Algorithms From the Aspects of Segmentation of Dynamic Objects, Camera Tracking, and Model Fusion, All of Which Are Key Technologies for Dynamic 3D Reconstruction Methods Based on RGB-D Cameras.

Different from traditional reconstruction methods, [44] proposed the implicit Neural Radiance Field (NeRF) representation for three-dimensional scenes. It utilizes a Multilayer Perceptron (MLP) to learn the 3D information of the scene and can synthesize images from new viewpoints through volume rendering. Compared to complex traditional reconstruction pipelines, NeRF’s reconstruction process is simpler and provides a more continuous representation of the scene. This implicit representation offers a new direction for indoor 3D reconstruction based on RGB-D cameras and further improves the quality of scene reconstruction. In static reconstruction, iMAP [23] first demonstrated that an MLP can be the sole scene representation in real-time SLAM systems with handheld RGB-D cameras. NICE-SLAM [26] combined hierarchical scene representation and neural implicit representation to achieve real-time, efficient, and detailed RGB-D surface reconstruction in large-scale scenes. Reference [45] effectively utilized RGB-D data by combining implicit functions (the truncated signed distance function, TSDF) and volumetric radiance fields, improving the accuracy and completeness of geometric reconstruction. However, due to the use of multilayer perceptrons and complex optimization algorithms, this method has a long computation time and is not suitable for real-time applications. GO-Surf [46] built on [45] by directly optimizing multiresolution feature grids and the signed distance function (SDF) to achieve fast and accurate surface reconstruction. Recently, [47] proposed a 3D Gaussian-based scene representation. It retains the desirable properties of continuous radiance fields while avoiding unnecessary computation in empty space, greatly improving rendering speed while maintaining rendering quality. In dynamic reconstruction, D-NeRF [48] extended the application domain of NeRF from static to dynamic scenes by introducing the time dimension and learning canonical representations of dynamic scenes. Although this method has high computational complexity, it excels in handling non-rigid motion and generating highly detailed images, showcasing the potential of neural radiance fields in dynamic scene applications. Recursive-NeRF [49] introduced uncertainty prediction, recursively passing query points to different levels of neural networks based on complexity to achieve adaptive representation at the level of detail, balancing efficiency and quality. Although this method improves computational efficiency, it requires storing multiple levels of neural networks, resulting in high memory consumption, especially when handling large-scale scenes. Reference [50] proposed a method called NDR (Neural Dynamic Reconstruction) for recovering high-fidelity geometry and motion of dynamic scenes from a monocular RGB-D camera. Although the method is effective, it has high computational complexity, long training times, and a large demand for computational resources.

SECTION III.

Reconstruction of Static Scenes

As shown in Figure 2, the process of static 3D reconstruction mainly includes depth image enhancement, camera tracking, model fusion, and surface extraction. In this chapter, we will use the basic process of static 3D reconstruction as a framework to introduce the improvements made by different reconstruction algorithms in each step of the reconstruction process.

FIGURE 2. Overview of the static indoor reconstruction pipeline. The first step is to input RGB-D images and enhance the depth images. The second step is to use RGB-D data for camera tracking, which estimates the camera pose. If camera tracking fails, the camera relocalization function is launched to recover from the failure. The third step involves incorporating surface information of the scene into the model using the tracked poses. The fourth step is to extract smooth and dense surfaces using surface extraction algorithms. Finally, camera pose is globally optimized through loop detection and processing.

A. Depth Image Enhancement

Currently, RGB-D cameras are mainly divided into two types: structured light and time-of-flight (TOF). As shown in Figure 3, both types of RGB-D cameras can easily acquire color images and depth information. However, the depth images obtained are often marred by “holes” caused by factors such as the material and structure of the measured object, as well as rapid camera movement, resulting in data loss. This phenomenon is more common in consumer-level depth cameras. The goal of depth image enhancement is to denoise, refine, and enhance the initially measured depth.

FIGURE 3. The Principles of Depth Cameras Based on Structured Light (a) and Time-of-Flight (b). Structured light-based approaches involve projecting patterned light onto an object to generate distinct phase information, which is then translated into depth data by a computational unit. Time-of-flight methods determine the distance from the camera to the target object by emitting light pulses and measuring the duration before their reflection is detected.

Traditional methods such as median filtering, Gaussian smoothing, and bilateral filtering have been used to improve depth image quality. Reference [51] begins with median filtering in the 2D image space for noise reduction, followed by a two-step algorithm employing Gaussian smoothing on the 3D surface to enhance depth videos. Additionally, KinectFusion [6] utilizes bilateral filtering [52] to remove noise from the original depth images, enhancing image quality. Since the RGB image obtained by the camera is often clear, [53], [54], [55] improve the accuracy of the depth image or fill in its missing parts by registering the RGB image to the depth image. To achieve smooth object surfaces, [56] reconstructs locally smooth scene segments and deforms them for alignment, effectively addressing high-frequency noise and low-frequency distortion in depth images. With the advancement of super-resolution techniques, this technology has also been applied to enhance the resolution of depth images [57], [58], [59], thereby improving the precision of reconstructions.
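
To make the traditional filtering step concrete, the following minimal Python sketch (assuming OpenCV, a 16-bit depth map in millimetres, and purely illustrative parameter values) applies bilateral filtering to a raw depth image while leaving holes untouched rather than inventing depth for them:

import cv2
import numpy as np

def enhance_depth(depth_mm):
    # depth_mm: raw depth image, uint16, in millimetres; zeros mark holes.
    depth_m = depth_mm.astype(np.float32) / 1000.0     # convert to metres
    valid = depth_m > 0

    # d: neighbourhood diameter; sigmaColor: depth difference (metres) still treated
    # as "similar"; sigmaSpace: spatial extent of the kernel in pixels.
    smoothed = cv2.bilateralFilter(depth_m, d=5, sigmaColor=0.03, sigmaSpace=4.5)

    smoothed[~valid] = 0.0                             # do not invent depth inside holes
    return smoothed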

In recent years, depth-enhancement algorithms based on deep learning have made great progress. These methods leverage the powerful learning capabilities of neural networks. For example, a deep network [60] was developed to predict object surface normals and occlusion boundaries from RGB images; the predictions were then merged with the depth maps captured by the depth camera to complete the missing parts of the original depth maps. Moreover, [61] proposed a cascaded CNN structure (DDRNet) to enhance low- and high-frequency information in depth data. Supervised deep learning approaches often necessitate ground-truth data from actual scenes, a requirement that poses significant acquisition challenges. Training networks with synthetic data presents a potential solution; however, the domain gap between synthetic and real data may impair performance. Reference [62] introduced three methods for unsupervised domain adaptation of a depth denoising network, transitioning from synthetic to real-world data. Addressing the challenge of acquiring real datasets, researchers have turned to unsupervised [63], [64] and self-supervised [65], [66], [67] learning techniques to directly denoise depth maps in the absence of ground truth.

B. Camera Tracking

In a static scene, the scene can be scanned by moving the camera to obtain RGB-D data. For each frame of image captured by the camera, the camera trajectory and poses need to be tracked to fuse the RGB-D data into the model. Nevertheless, the accuracy of this process can be compromised by factors including the precision of algorithms, occlusions, and the velocity of camera motion, necessitating subsequent optimization of the pose estimation.

1) Pose Estimation

a: ICP-Based

To accurately estimate the camera pose between different frames, [68] introduced the Iterative Closest Point (ICP) algorithm. ICP is a classical point cloud registration technique that iteratively aligns two or more point clouds to minimize the error between them. Assume there are two sets of points: the target point set $\mathbf{Q} = \{q_{1}, q_{2}, \dots, q_{n}\}$ and the source point set $\mathbf{P} = \{p_{1}, p_{2}, \dots, p_{m}\}$. The goal of the ICP algorithm is to find a rotation matrix $R$ and a translation vector $t$ that minimize the following mean squared error:
\begin{equation*} E(R, t) = \sum_{i=1}^{m} \| Rp_{i} + t - q_{\text{match}(i)} \|^{2} \tag{1}\end{equation*}
Here, $q_{\text{match}(i)}$ denotes the nearest neighbor of $p_{i}$, that is, the point in $\mathbf{Q}$ closest to $p_{i}$. Because the point-to-point ICP algorithm depends heavily on its initial values, a suboptimal starting point may increase the number of iterations or produce inaccurate results. Therefore, [69] introduced a point-to-plane ICP algorithm (Figure 4), which improves camera positioning by minimizing the sum of squared distances between each source point and the tangent plane of its corresponding target point, thereby accelerating convergence:
\begin{equation*} E(R, t) = \sum_{i=1}^{m} \left( n_{\text{match}(i)}^{T} \left(Rp_{i} + t - q_{\text{match}(i)}\right) \right)^{2} \tag{2}\end{equation*}
where $n_{\text{match}(i)}$ is the normal vector at $q_{\text{match}(i)}$. Building upon these advancements, [70] integrated point-to-point ICP and point-to-plane ICP into a single probabilistic framework, forming a new algorithm called GICP that is more robust against incorrect matches. Because frame-to-frame matching accumulates error, the camera trajectory can drift, severely affecting the accuracy of pose estimation. KinectFusion [6] used a frame-to-model matching method, greatly reducing cumulative error. Subsequently, [7], [9], [71], [72], [73], [74], and [75] added dense photometric validation on top of ICP geometric registration to further optimize the matching. To account for local features of the point clouds (normal vectors, curvature), [76] defined an error function that includes not only the projected point-to-plane distance but also the direction error between corresponding normal vectors, making pose estimation more robust. In addition, there are other ICP variants, such as efficient ICP [77] and non-rigid ICP [78].
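
To make the point-to-plane objective of Eq. (2) concrete, the sketch below (an illustrative simplification, not the implementation of any cited system) performs one linearized least-squares update given already-matched source points P, target points Q, and target normals N as NumPy arrays of shape (m, 3); the standard small-angle approximation R p ≈ p + ω × p is assumed:

import numpy as np

def point_to_plane_icp_step(P, Q, N):
    # One linearized solve of Eq. (2): the residual n_i . (R p_i + t - q_i), with the
    # small-angle approximation R p ~ p + w x p, becomes linear in the unknowns (w, t).
    A = np.hstack([np.cross(P, N), N])          # (m, 6) rows: [p_i x n_i, n_i]
    b = np.einsum('ij,ij->i', N, Q - P)         # (m,)  entries: n_i . (q_i - p_i)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)   # least-squares solve for [w, t]
    return x[:3], x[3:]                         # rotation vector w, translation t

In a full ICP loop, the correspondences are re-established after each such update and the step is repeated until convergence.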

FIGURE 4. The point-to-plane ICP algorithm. It optimizes the camera’s pose by minimizing the distance between the source point and the tangent plane of the corresponding destination point.

b: Feature-Based

The ICP algorithm performs well under the assumption of small motion between frames. However, it tends to converge to local optima during rapid camera movements, where the difference between successive frames is significant. To cope with this situation, [79], [80], [81], [82], [83], [84] extracted feature points (SIFT, SURF, ORB) from color images and used these sparse features to quickly match the pose of each frame. ORB-SLAM [82] is a classic feature-based visual SLAM system. It combines FAST feature detection and BRIEF feature description, ensuring both the speed of feature point detection and the stability of the description. However, it is limited to feature matching between image frames and is unstable under varying lighting conditions and when feature points are missing. ORB-SLAM2 [83] improves pose estimation by supporting stereo and RGB-D sensors, utilizing depth information for further optimization. Building on this, ORB-SLAM3 [84] further improves the pose estimation algorithm, supports the construction and management of multiple maps, and enhances pose estimation accuracy through the collaborative use of multiple sensors. In addition to using point features for matching, useful edge features [85], [86] can also be extracted from depth images to establish corresponding constraints, thereby enhancing the robustness of pose estimation. To ensure both the real-time performance and accuracy of pose tracking, [87] introduced an information-theoretic approach for point selection in direct RGB-D odometry. This approach simplifies the optimization process while maintaining accuracy by identifying and utilizing the data points that carry the most information. CPA-SLAM [89] models the environment using a global model composed of planes, which significantly reduces drift. Reference [88] extracts 3D facial landmarks during face reconstruction for model fine-tuning to ensure the accuracy of head pose estimation. To achieve high-precision feature tracking under rapid sensor motion, [24] performed feature tracking within an extended Kalman filter framework. This framework integrates IMU data to better estimate sensor motion.

c: Hybrid Method

Pose matching algorithms based on ICP usually require aligning the entire point cloud data, resulting in high computational costs, which are not suitable for real-time 3D modeling of large scenes. In contrast, feature-based matching algorithms can better cope with the limitations of real-time requirements. However, feature-based matching algorithms often require dense features, and the reconstruction quality is significantly affected when the number of matching features in the scene decreases [90]. Combining sparse feature matching with ICP is a good method to balance real-time performance and reconstruction quality. BundleFusion [11] first utilizes sparse SIFT features for coarse pose alignment, and then refines the estimated pose using dense photometric and geometric terms similar to the ICP algorithm, achieving real-time accurate pose estimation and solving the real-time issue of high-quality reconstruction. References [91] and [92] combined edge information with the ICP algorithm to enhance robustness and accuracy. Recently, [93] introduced an enhanced 3D scene reconstruction method using Fast Point Feature Histograms (FPFH) and Iterative Closest Point (ICP) techniques. It improves model robustness and accuracy by modifying the weight calculation formula and employing an enhanced FPFH descriptor for initial registration estimation. To further increase the ICP iteration speed, it also utilizes a Best Bin First (BBF) strategy to reduce data dimensionality.

d: NeRF-Based

The advent of NeRF offers a new paradigm for camera pose optimization. These methods leverage the power of neural networks to synthesize novel views and provide accurate pose estimates, significantly enhancing the robustness and accuracy of camera tracking in static scenes. iNeRF [94] estimates the camera pose by inverting NeRF. Specifically, NeRF optimizes the scene parameters $\Theta$ using a given set of camera poses $T$ and observed images $I$, while iNeRF inversely solves the problem of recovering the camera pose $T$ given the weights $\Theta$ and an image $I$ as inputs:
\begin{equation*} \hat{T} = \mathop{\mathrm{arg\,min}}_{T \in SE(3)} \mathcal{L}(T \mid I, \Theta) \tag{3}\end{equation*}
To solve this optimization problem, iNeRF takes estimated camera poses $T \in SE(3)$ in the coordinate system of the NeRF model and renders the corresponding image observations. To update the pose $T$, the same photometric loss function $\mathcal{L}$ used in NeRF is employed:
\begin{equation*} \mathcal{L} = \sum_{r \in R} \| \hat{C}(r) - C(r) \|^{2}_{2} \tag{4}\end{equation*}
Here, $r \in R$ denotes a set of sampled rays, and $C(r)$ is the observed RGB value of the pixel corresponding to ray $r$ in an image. Although iNeRF successfully applied NeRF to pose estimation and achieved excellent results, it still requires an initial pose as a starting point, which affects the convergence of the optimization and the final accuracy. With the development of deep learning, and considering the exceptional performance of Generative Adversarial Networks (GANs) in image generation, [95] combines GANs with NeRF to optimize initial pose estimates during reconstruction. It does not rely on known camera poses and can optimize from a completely random initialization, which is particularly useful in uncertain and complex scenes. Additionally, [96] employs a coarse-to-fine camera registration strategy and demonstrates the impact of positional encoding on alignment, effectively optimizing the neural scene representation while addressing camera pose misalignment in large-scale scenes. To further address errors arising from drastic camera movements, [97] introduced an undistorted monocular depth prior into NeRF and proposed novel loss functions to constrain the relative poses between adjacent frames.
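
The core of this inversion is an ordinary gradient-descent loop on the photometric loss of Eq. (4) with the NeRF weights frozen. The following PyTorch sketch illustrates the idea under the assumption that render_fn is some differentiable renderer mapping a pose parameterization to the colors of a fixed set of sampled rays; it is a schematic illustration, not iNeRF's actual code:

import torch

def invert_nerf_for_pose(render_fn, observed, pose0, steps=200, lr=1e-2):
    # Recover a camera pose by inverting a frozen NeRF, in the spirit of Eq. (3).
    # render_fn(pose) is assumed to return the predicted colors \hat{C}(r) of the
    # sampled rays under the given pose; `observed` holds the measured colors C(r).
    pose = pose0.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        rendered = render_fn(pose)                  # \hat{C}(r) under the current pose
        loss = ((rendered - observed) ** 2).sum()   # photometric loss of Eq. (4)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return pose.detach()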

Bundle Adjustment is a technique used to optimize camera parameters and 3D point coordinates in 3D reconstruction. Its main purpose is to improve the accuracy of 3D reconstruction by minimizing the reprojection error. Specifically, the optimization problem can be written as:
\begin{equation*} \min_{P, X} \sum_{i,j} \left\| x_{ij} - \pi(P_{i}, X_{j}) \right\|^{2} \tag{5}\end{equation*}
where $P_{i}$ represents the parameters of the $i$-th camera, $X_{j}$ represents the coordinates of the $j$-th 3D point, $x_{ij}$ is the observed 2D coordinate of the $j$-th 3D point in the $i$-th camera, and $\pi(P_{i}, X_{j})$ is the projection of the 3D point $X_{j}$ onto the 2D image plane through the camera parameters $P_{i}$.
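
As a concrete reading of Eq. (5), the following NumPy sketch (with an assumed shared pinhole intrinsic matrix K and illustrative variable names) stacks the reprojection residuals that a bundle adjuster would minimize over all camera poses and 3D points, e.g. with a nonlinear least-squares solver:

import numpy as np

def reprojection_residuals(poses, points, observations, K):
    # poses        : list of (R, t) world-to-camera transforms, one per camera i
    # points       : (M, 3) array of 3D points X_j
    # observations : list of (i, j, u, v) tuples, the measured pixel of point j in camera i
    # K            : (3, 3) pinhole intrinsic matrix, assumed shared by all cameras
    residuals = []
    for i, j, u, v in observations:
        R, t = poses[i]
        Xc = R @ points[j] + t                 # point X_j in camera-i coordinates
        uvw = K @ Xc
        proj = uvw[:2] / uvw[2]                # perspective division -> pixel coordinates
        residuals.append([u - proj[0], v - proj[1]])
    return np.asarray(residuals).ravel()       # stacked residuals x_ij - pi(P_i, X_j)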

Inspired by the Bundle Adjustment (BA) algorithm, the NBA (Neural Bundle Adjustment) method proposed by [98] optimizes the implicit surface and camera poses without relying on known camera extrinsics. Specifically, NBA updates each 3D point $X$ at every step as follows:
\begin{equation*} X \leftarrow X - \phi(X) \nabla \phi(X) \tag{6}\end{equation*}
where $\phi(X)$ is the distance value output by the SDF (Signed Distance Field) network in the neural radiance field, and $\nabla\phi(X)$ is the gradient at that point. After updating the 3D point $X$, the reprojection error is calculated based on the feature trajectory $T$, jointly optimizing the SDF network $\phi$, the estimated camera poses $P$, and the updated 3D point set $X$.

2) Loop Closure

During the pose matching of each frame, both ICP-based and feature-based algorithms produce errors, and these errors accumulate with the number of frames. After the camera completes a full loop around the scene, the accumulated error can cause a misalignment between the starting and ending points. Therefore, loop closure must be handled after pose matching to ensure global consistency and reduce the impact of accumulated errors on reconstruction quality. Reference [99] defined keyframes, registered the frames between them to eliminate local errors, and used an entropy ratio criterion to check loop closure. Reference [71] utilized efficient pose graph optimization and sparse bundle adjustment for globally consistent alignment. However, this global optimization distributes the residual error over the entire path, which can destroy details of the object surface. To further optimize pose estimation, [9] divided all frames into equally sized blocks with one overlapping frame between adjacent blocks. Each small block is reconstructed first, then the overlapping frames are used to register the blocks and detect loop closures, and finally erroneous loops are removed to achieve globally consistent reconstruction. Subsequently, BundleFusion [11] performed sparse-to-dense global pose optimization and solved loop closure by integrating and re-integrating previous RGB-D frames during movement, enabling it to correct all drift. Although this method produces better pose optimization, it requires a large amount of computational resources. To enhance pose estimation accuracy in large-scale scenes, [100] introduced a two-pass loop closure detection method that integrates global and local image features to identify loop closure candidates. Recently, [24] utilized subgraph-based depth image encoding and 3D graph deformation for loop closure to maintain global consistency in the reconstructed model. Reference [101] introduced local 3D deep descriptors (L3Ds) for loop closure handling. L3Ds are compact representations of patches extracted from point clouds, learned using deep learning algorithms, significantly enhancing loop closure detection accuracy.

Another way to address cumulative errors is to assume a scene structure in the world frame and directly align each tracked frame with this structure, rather than with keyframes or the last frame. One of the most common assumptions is the Manhattan assumption [102], [103], which represents the scene using a set of orthogonal planes aligned with the world’s three main axes, simplifying scene understanding and enabling efficient inference of scene geometry and object position. Structure-SLAM [104] employed a convolutional neural network (CNN) to predict normals and compute drift-free rotations leveraging geometric features under the Manhattan assumption, effectively addressing low-texture regions in indoor settings. Building upon Structure-SLAM, [105] incorporated planar features within the Manhattan framework and introduced an advanced meshing module for reconstructing scene structures, thereby enhancing localization and mapping accuracy. To make the Manhattan assumption more suitable for real-world scenes, ManhattanSLAM [22] directly detected Manhattan frames (MFs) from planes and modeled the scene as a Mixture of Manhattan Frames (MMF), estimating unbiased rotation by observing MFs across frames.

3) Relocalization

Due to factors such as high camera movement speed or changes in viewpoint, camera tracking may fail. Therefore, the ability to quickly recover and perform relocalization when camera tracking fails is essential in the 3D reconstruction process. There are several methods for camera relocalization, including the following:

a: Keyframe-Based

This method requires defining and storing keyframes. When camera tracking fails, the system queries the current image and estimates the camera pose by measuring its overall similarity to a known set of keyframes. Reference [106] explored an effective keyframe-based relocalization method. In the keyframe selection stage, besides a threshold based on spatial distance, a similarity check against previous keyframes was added to avoid collecting redundant information. In order to quickly retrieve candidate poses in case of tracking loss, this method uses an efficient frame encoding based on ferns. Keyframe-based methods can perform camera relocalization in real time, but they rely on matching the input image against a keyframe database and cannot relocalize from poses that differ significantly from the stored keyframes.

b: Keypoint-Based

This relocalization method mainly exploits the sparsity of feature points. During successful tracking, feature points are detected in the image, and their corresponding descriptors and positions in the world coordinate system are stored in a database. When camera tracking is lost, the current frame’s keypoints and descriptors are computed and matched against the database. After a successful match, the current image’s pose can be obtained to complete camera relocalization [107], [108], [109], [110], [111], [112], [113]. The challenges of this method include: (1) the choice of feature point and descriptor computation method, (2) how to store keypoints and their corresponding descriptors, and (3) how to perform feature matching between frames. Inspired by the idea of the visual bag-of-words, [108] stored the extracted SIFT feature descriptors in a vocabulary during successful tracking and used the Term Frequency-Inverse Document Frequency (TF-IDF) of the visual words in each node to rank the nodes. When tracking is lost, refined relocalization poses are obtained by matching the descriptor set in each node against the descriptors extracted from the query image to recover from tracking failure. Reference [110] proposed using a regression forest to directly predict the 3D correspondence of every pixel in the current image to the scene. Compared with traditional keypoint-based methods, this method does not require explicit detection, description, or matching of keypoints, making it simpler and faster. However, the regression forest must be trained offline on the scene of interest in advance, so the method cannot be deployed on the fly in a new scene. Reference [112] overcame this limitation by dynamically adapting pre-trained forests to new scenes.
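
The following OpenCV sketch illustrates the keypoint-based relocalization pipeline just described in simplified form: detect and describe features in the query frame, match them against descriptors stored during successful tracking, and recover the pose with PnP and RANSAC. The database layout, feature choice (ORB), and thresholds are assumptions for illustration rather than the design of any specific cited method:

import cv2
import numpy as np

def relocalize(frame_gray, db_descriptors, db_points3d, K):
    # db_descriptors : ORB descriptors stored while tracking was successful
    # db_points3d    : (N, 3) world coordinates associated with those descriptors
    # Returns (rvec, tvec) of the query frame, or None if matching/PnP fails.
    orb = cv2.ORB_create(2000)
    keypoints, descriptors = orb.detectAndCompute(frame_gray, None)
    if descriptors is None:
        return None

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(descriptors, db_descriptors)       # query -> database
    if len(matches) < 6:
        return None

    obj_pts = np.float32([db_points3d[m.trainIdx] for m in matches])
    img_pts = np.float32([keypoints[m.queryIdx].pt for m in matches])
    ok, rvec, tvec, _ = cv2.solvePnPRansac(obj_pts, img_pts, K, None)
    return (rvec, tvec) if ok else None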

c: Hybrid Method

Researchers have integrated keyframes and keypoints to enhance relocation accuracy while maintaining real-time performance. Upon tracking failure, [82] adopted the DBoW2 algorithm [114] to identify matching candidate keyframes, subsequently calculating ORB features within these keyframes and employing the PnP algorithm [115] to alternately estimate the current frame’s pose. Reference [86] merged edge features with the keyframe-based method [106], securing robust loop closure and relocalization capabilities.

C. Model Fusion

The pose matching algorithm calculates an initial pose, which is further refined by loop closure processing. After that, the surfaces of the scene need to be fused into the 3D model according to the camera’s pose. Currently, two types of surface fusion models are mainly used: voxel-based and surfel-based.

1) Voxel-Based

As shown in Figure 5(a), an image can be represented by square pixels in 2D space; extending a pixel to 3D gives a voxel, which can intuitively reflect the shape of an object. Reference [116] was the first to propose using the TSDF (Truncated Signed Distance Function) grid model to fuse depth information on the basis of the voxel representation. KinectFusion [6] further applied this model to 3D reconstruction using RGB-D cameras. This method requires fixing the size of the scene before reconstruction, making it difficult to scale the scene. For large-scale scene reconstruction, which requires substantial memory, KinectFusion falls short. Therefore, various scholars have extended the original TSDF voxel model:

FIGURE 5. (a) is a schematic diagram of 2D pixels and 3D voxels, and (b) is a schematic diagram of the octree structure.
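
Before turning to these extensions, the following sketch recalls the basic per-voxel operation that KinectFusion-style systems perform for every incoming depth frame: project each voxel center into the depth image, compute a truncated signed distance along the viewing ray, and blend it into the stored value with a weighted running average. It is a simplified illustration with assumed variable names, not the GPU implementation of any cited system:

import numpy as np

def integrate_frame(tsdf, weight, vox_centers, depth, K, T_cw, trunc=0.04, w_new=1.0):
    # tsdf, weight : flat arrays with one entry per voxel
    # vox_centers  : (N, 3) voxel centers in world coordinates
    # depth        : (H, W) depth image in meters; K, T_cw: intrinsics and
    #                world-to-camera transform of the current frame
    H, W = depth.shape
    pts_c = (T_cw[:3, :3] @ vox_centers.T + T_cw[:3, 3:4]).T   # voxels in camera frame
    z = pts_c[:, 2]
    z_safe = np.where(z > 1e-6, z, 1.0)                        # avoid division by zero
    uvw = (K @ pts_c.T).T
    u = np.round(uvw[:, 0] / z_safe).astype(int)
    v = np.round(uvw[:, 1] / z_safe).astype(int)

    valid = (z > 1e-6) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    d = np.where(valid, depth[np.clip(v, 0, H - 1), np.clip(u, 0, W - 1)], 0.0)
    valid &= d > 0

    sdf = d - z                                   # signed distance along the viewing ray
    valid &= sdf > -trunc                         # ignore voxels far behind the surface
    tsdf_obs = np.clip(sdf / trunc, -1.0, 1.0)    # truncate to [-1, 1]

    # Weighted running average (the standard KinectFusion-style update).
    tsdf[valid] = (tsdf[valid] * weight[valid] + tsdf_obs[valid] * w_new) / (weight[valid] + w_new)
    weight[valid] += w_new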

a: Moving Volume

To overcome the limitations of the voxel representation, [7], [72], [117] expanded the reconstruction area to unbounded space by moving the voxel volume. Whelan et al. [7] utilized a cyclic buffer data structure to efficiently recycle GPU memory, addressing the issue of insufficient memory for large-scale scene reconstruction with voxel models. The algorithm enables camera translation and rotation in the real world, incrementally enlarging the reconstructed surface. Reference [117] proposed the Moving Volume KinectFusion method, which establishes a TSDF buffer and a swap buffer. Utilizing a double-buffering mechanism to map between volumetric models during camera movement, the method allows online processing of volume rotations and translations through voxel interpolation.

b: Octree-Based

The geometry of most objects is very sparse with respect to the whole scene volume, which means that the voxels in the TSDF model are mostly empty and their storage space is wasted. The octree structure is a data model first proposed by Hunter [118] in 1978. As shown in Figure 5(b), this structure can effectively utilize memory by hierarchically dividing the scene space, thereby improving storage efficiency. Although its definition is simple, the sparsity of its nodes makes it difficult to maintain GPU parallelism. References [20] and [119] designed novel octree data structures to improve the reconstruction update and surface prediction parts of KinectFusion, fully utilizing the parallelism of the GPU, greatly improving storage efficiency, and further expanding the reconstruction scale. To reduce memory consumption, [99] fused the acquired depth and color information into a multiscale octree representation of a signed distance function, which can maintain low memory usage while achieving high accuracy. To further improve storage efficiency, [120] defined an octree data structure that supports volumetric multiresolution 3D mapping and mesh partitioning, reducing memory consumption by only allocating units close to the surface.
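
The toy sketch below (not the data structure of [20] or [119]) illustrates the key idea behind the memory savings: children are allocated lazily, so storage is spent only on the octants that actually receive measurements:

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class OctreeNode:
    center: tuple     # (x, y, z) center of this cube
    half: float       # half of the cube's edge length
    depth: int
    value: float = 0.0                                   # payload at leaves (e.g. a TSDF sample)
    children: Optional[List["OctreeNode"]] = None        # None until subdivided

    def child_index(self, p):
        # Which of the eight octants does point p fall into?
        return (p[0] > self.center[0]) | ((p[1] > self.center[1]) << 1) | ((p[2] > self.center[2]) << 2)

    def insert(self, p, value, max_depth=8):
        if self.depth == max_depth:        # leaf reached: store the sample
            self.value = value
            return
        if self.children is None:          # subdivide lazily, only where data arrives
            h = self.half / 2.0
            self.children = [OctreeNode((self.center[0] + (h if i & 1 else -h),
                                         self.center[1] + (h if i & 2 else -h),
                                         self.center[2] + (h if i & 4 else -h)),
                                        h, self.depth + 1) for i in range(8)]
        self.children[self.child_index(p)].insert(p, value, max_depth)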

c: Voxel Hashing

Although the octree structure can improve the storage efficiency of the model to some extent, complex octree structures still have additional computational complexity and pointer overhead. A simple spatial hashing scheme is used in [8] to compress space, which allows data to flow efficiently in and out of a hash table, enabling real-time access and updates of surface data in the scene without the need for complex hierarchical data structures. Voxel hashing has been widely used in real-time 3D reconstruction [21], [121], [122].
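
The following sketch illustrates the idea in Python: the integer coordinates of a voxel block are hashed to a bucket, and only blocks that actually contain surface data are ever allocated. The prime constants and the 8x8x8 block size follow common spatial-hashing practice and are given for illustration only:

# Primes commonly used for spatial hashing; the hash has the same form as in [8].
P1, P2, P3 = 73856093, 19349669, 83492791

def block_hash(x: int, y: int, z: int, num_buckets: int) -> int:
    # Map the integer coordinates of a voxel block to a hash-table bucket.
    return ((x * P1) ^ (y * P2) ^ (z * P3)) % num_buckets

voxel_blocks = {}   # bucket -> list of (block coords, dense 8x8x8 TSDF block)

def get_or_allocate(coords, num_buckets=2**20):
    # Only blocks that actually contain surface data are ever allocated.
    bucket = block_hash(*coords, num_buckets)
    for stored_coords, block in voxel_blocks.setdefault(bucket, []):
        if stored_coords == coords:        # collisions are resolved by chaining
            return block
    block = [0.0] * (8 * 8 * 8)            # a fresh voxel block near the surface
    voxel_blocks[bucket].append((coords, block))
    return block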

d: Deep Learning-Based

The ability of neural networks to learn rich prior knowledge provides new directions for the development of scene representation. When exploring cluttered indoor scenes with an RGB-D camera, [123] initialized the truncated signed distance function (TSDF) reconstruction of each object with compact instance segmentation using Mask R-CNN, resulting in a resolution related to object size and novel 3D foreground masks. Reference [124] reconstructed scenes in real time with both geometry and semantic information by incorporating semantic predictions from neural networks into the voxel-based model built on voxel hashing.

2) Surfel-Based

Voxel-based methods are expensive for handling loop closures in real-time 3D reconstruction because precise compensation may involve changing the entire volume. Moreover, the size of the voxel volume is typically fixed in practice, which limits the adaptivity of representation. If an object is relatively small or thin compared to the voxel size, it can seriously affect the reconstruction quality.

In surfel-based methods, the scene surface is represented by a set of surfels (Figure 6). This representation has the following advantages: (1) Flexibility. When performing point fusion updates, the data is updated using weighted fusion, where the radius of the surface patch is related to the distance between the camera center and the scene surface: the farther the distance, the larger the radius of the surface patch. This updating scheme can effectively reconstruct the entire surface. (2) High adaptability. Densely distributed points can be measured at high resolution. (3) It can easily handle thin objects.

FIGURE 6. Representation of object surface by Surfels. The position information, radius of the surface patch, normal vector, color information, and time information of each point are stored.
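
The sketch below lists the attributes typically stored per surfel (cf. Figure 6) and a weighted running-average fusion step applied when a new measurement is associated with an existing surfel; the field names and update rules are illustrative, and the exact rules differ between systems such as [10] and [125]:

from dataclasses import dataclass
import numpy as np

@dataclass
class Surfel:
    position: np.ndarray    # 3D position
    normal: np.ndarray      # unit normal
    color: np.ndarray       # RGB
    radius: float           # patch radius (grows with camera-to-surface distance)
    weight: float           # confidence accumulated over observations
    timestamp: int          # time of the last update

def fuse(s, p, n, c, r, w, t):
    # Weighted running-average update with a new measurement (p, n, c, r, w) at time t.
    total = s.weight + w
    s.position = (s.weight * s.position + w * p) / total
    s.normal = s.weight * s.normal + w * n
    s.normal = s.normal / np.linalg.norm(s.normal)
    s.color = (s.weight * s.color + w * c) / total
    s.radius = min(s.radius, r)    # keep the finer (smaller) patch radius
    s.weight = total
    s.timestamp = t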

Reference [125] first introduced the concept of surfels and provided a detailed description of this surface representation. ElasticFusion [10] utilized this representation for real-time dense scene reconstruction. The system continuously optimizes the reconstructed map to improve the accuracy of reconstruction and pose estimation, and employs Random Ferns to detect loop closures for global consistency.

With the development of deep learning, and in order to fully utilize the semantic information of the scene, SemanticFusion [127] combined a CNN with ElasticFusion to successfully fuse semantic predictions from multiple viewpoints into a surfel-based representation. Reference [126] proposed an indoor RGB-D image semantic segmentation network with multi-scale feature fusion based on ElasticFusion. It integrates the visual color features and depth geometric features of RGB-D images, improving the accuracy of image semantic segmentation. The segmentation results are shown in Figure 7. DeepSurfels [128] integrated feature information learned from RGB images into a detailed surfel-based representation, making it possible to reconstruct large-scale scenes in real time. To obtain high-quality surface texture, [129] employed Shape-from-Shading (SfS) and spatially-varying spherical harmonics (SVSH) techniques to simultaneously optimize geometry, texture, and camera poses. The main drawback of the surfel representation is its discreteness, which can be addressed by meshing approaches. Reference [130] created a triangle mesh and performed real-time mesh reconstruction from RGB-D video, which works well for reconstructing thin objects. However, this method requires camera poses as additional input. In contrast, [25] utilized Hermite Radial Basis Function (HRBF) implicits for direct camera tracking and RGB-D reconstruction, a dynamic surface representation that effectively reduces the influence of noise and reconstructs using surface photometric constraints.

FIGURE 7. Indoor semantic segmentation results. Figure taken from [126].

3) NeRF-Based

Explicit representations such as voxels and surfels allow real-time scene reconstruction, but they face challenges in mapping accuracy and in balancing memory consumption. Moreover, they lack novel view synthesis capabilities. In recent years, with the introduction of NeRF, implicit representations have overcome limitations associated with explicit representations, generating high-fidelity reconstructions with reduced memory usage. These implicit representations achieve this by continuously querying scene properties to generate high-quality images from novel viewpoints. iMAP [23] demonstrated for the first time that Multi-Layer Perceptrons (MLPs) can serve as the sole scene representation in real-time SLAM systems using handheld RGB-D cameras, utilizing keyframe structures and multi-processing computation flows. Reference [131] utilized the point cloud provided by COLMAP and reprojection errors to enforce depth constraints in NeRF, effectively enhancing the rendering speed and reconstruction quality of NeRF. To reduce computational costs and enhance scalability, NICE-SLAM [26] applied the hierarchical scene representation concept to NeRF. However, because NICE-SLAM's feature grid performs only local updates, it fails to achieve reasonable hole filling. Co-SLAM [27] combined coordinate and sparse parametric encodings for scene representation and employed dense global bundle adjustment using rays sampled from all keyframes. Simultaneously, [132] proposed a NeRF-based mapping approach using a hierarchical hybrid representation, leveraging implicit multiresolution hash encoding and explicit octree Signed Distance Function (SDF) priors to describe scenes at different levels of detail, achieving real-time high-fidelity dense mapping and dynamic expansion capabilities. Since NeRF does not reconstruct actual surfaces and pseudo-shadows occur when using Marching Cubes to extract voxel-based surfaces, [45] used Truncated Signed Distance Functions (TSDF) to represent surfaces, extending them to commodity RGB-D sensors to reconstruct high-quality 3D scenes. Recently, NGEL-SLAM [133] employed a sparse octree grid integrated with implicit neural maps, ensuring memory efficiency and precise environmental depiction.

D. Surface Extraction

Once the surface information of a scene has been fused into a model based on the camera pose, a surface extraction algorithm is required to obtain a visual representation of the surface. Depending on how the reconstructed scene is represented and stored, surface extraction algorithms can be classified into raycasting and marching cubes.

1) Raycasting

The surface extraction method proposed by [134] based on Raycasting primarily involves using rays emitted from the camera center and passing through pixels to project onto the object surface to find the iso-surface.

The basic process of this method is as follows (Figure 8): firstly, a ray is projected along the viewing direction from each pixel on the image plane, which passes through the surface of the object. Then, sampling is performed at a certain step size, and linear interpolation algorithms are used to find the intersection point with the surface. This essentially means checking the value of the truncated signed distance function at each voxel along the ray until the first zero-crossing is found. This algorithm is widely used for surface extraction in voxel models [6], [7], [8], [11], [117], [129].
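
A minimal sketch of this procedure is given below, assuming a callable sample_tsdf(p) that returns the interpolated TSDF value at a 3D point; the ray is marched at a fixed step until the sign of the TSDF flips, and the zero-crossing is then refined by linear interpolation:

import numpy as np

def raycast_tsdf(sample_tsdf, origin, direction, t_max=5.0, step=0.01):
    # March from the camera center `origin` along the unit ray `direction`,
    # querying the TSDF until its sign flips from positive (in front of the
    # surface) to non-positive (behind it), then refine by linear interpolation.
    t, prev_t, prev_val = 0.0, None, None
    while t < t_max:
        val = sample_tsdf(origin + t * direction)
        if prev_val is not None and prev_val > 0 >= val:
            t_hit = prev_t + step * prev_val / (prev_val - val)   # zero-crossing
            return origin + t_hit * direction
        prev_t, prev_val = t, val
        t += step
    return None   # the ray left the volume without hitting a surface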

FIGURE 8. Diagram of Raycasting. It detects and calculates intersections with objects in the scene by casting rays, thereby extracting surface information.

2) Marching Cubes

The marching cubes algorithm was initially proposed by Lorensen [135], who divided the three-dimensional volume into small cubes called voxels and defined each voxel by the scalar values at its eight corners. As shown in Figure 9, if the data value at a vertex of the cube is greater than or equal to the value of the isosurface we are constructing, the vertex is assigned a value of 1, and 0 otherwise. Under this assumption, when the surface intersects a cube, the intersection points between the isosurface and the edges of the cube are calculated using interpolation, and these intersection points are then connected in a prescribed way to represent the isosurface inside the cube. After finding the isosurface passing through this cube, the algorithm moves to the next cube and continues searching for the isosurface. This is the process of extracting a surface using the marching cubes algorithm.
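
The fragment below shows the two core operations of the algorithm just described: building the 8-bit corner configuration index that selects a triangulation pattern, and interpolating where the isosurface crosses a cube edge. The full 256-entry lookup tables are omitted, so this is only an illustrative sketch:

import numpy as np

def cube_config(corner_values, iso=0.0):
    # Build the 8-bit configuration index: each corner whose value is >= iso
    # contributes one bit. The resulting index (0-255) selects the triangle
    # pattern from the standard marching-cubes lookup tables (omitted here).
    index = 0
    for i, v in enumerate(corner_values):
        if v >= iso:
            index |= 1 << i
    return index

def edge_vertex(p1, p2, v1, v2, iso=0.0):
    # Linearly interpolate where the isosurface crosses the edge (p1, p2).
    t = (iso - v1) / (v2 - v1)
    return np.asarray(p1) + t * (np.asarray(p2) - np.asarray(p1))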

FIGURE 9. Diagram of Marching Cubes.

Due to the huge amount of storage required to reconstruct high-quality models, the octree storage scheme has been applied to reconstruction models to overcome the memory limitations of commodity computers. However, extracting the reconstructed surface from an octree representation is more complicated than extracting it from a regular voxel grid. References [136] and [137] extended the marching cubes algorithm by addressing the inconsistencies that arise when adjacent leaf nodes in the octree have different depths. Reference [138] proposed a method of marking edges with Hermite data to generate signed grid contours, and extended this method to octrees. By aligning the vertices of the dual grid with the features of the implicit function, [139] can extract isosurfaces that capture small, thin, and even sharp features of the surface without excessively refining the octree. Reference [140] introduced the concept of an edge tree to provide a method for directly extracting a watertight mesh without restricting the octree topology or modifying vertex values.

With the application of deep learning and NeRF technology in 3D reconstruction, [141] proposed a data-driven method called Neural Marching Cubes (NMC) for extracting triangular meshes from discrete implicit fields. This method addresses the shortcomings of traditional surface extraction methods in recovering geometric features such as sharp edges and smooth curves. Specifically, NMC redesigns the mesh subdivision template and introduces neural networks to learn vertex positions and mesh topology, thereby better preserving geometric features. Recently, [142] proposed another method called NeuralMeshing. This method generates meshes iteratively, making it suitable for shapes of various scales and capable of adapting to local curvature, thereby significantly improving the quality of surface extraction.

In conclusion, the integration of deep learning into static 3D reconstruction has brought significant advancements, providing robust and accurate solutions for depth image enhancement, camera tracking, model fusion, and surface extraction. These methods leverage the powerful learning capabilities of neural networks to improve the quality and efficiency of 3D reconstructions, offering new perspectives and directions for future research in this field.

SECTION IV.

Reconstruction of Dynamic Scenes

Dynamic scenes consist of both dynamic objects and static backgrounds. Figure 10 illustrates the general process of dynamic reconstruction. In this section, we discuss recent developments in dynamic 3D reconstruction, focusing on three aspects: segmentation of dynamic objects, camera tracking, and model fusion.

FIGURE 10. Overview of the dynamic indoor reconstruction pipeline. The first step is data acquisition, which is the same as in static 3D reconstruction. The second step is data preprocessing, which involves not only denoising the raw image data but also separating the dynamic objects from the scene. Then, camera tracking is performed using the static background information to align the current frame data with the previous frame or model, finding the correspondences between them and reconstructing the static background. Meanwhile, the dynamic objects are reconstructed separately. Finally, the dynamic objects and static background are merged to complete the reconstruction of the entire scene.

A. Segmentation of Dynamic Objects

In contrast to static 3D reconstruction, dynamic scenes contain freely moving objects that significantly affect camera pose estimation. Moreover, entities such as human beings and animals undergo non-rigid deformations while in motion. To handle the reconstruction of these dynamic objects, the first step is to distinguish between dynamic and static features, a process known as motion segmentation. Various approaches, including motion analysis-based methods and deep learning-based methods, are employed for motion segmentation in the scene to identify the dynamic characteristics.

1) Motion Analysis-Based Methods

Methods based on motion analysis separate dynamic objects from the static background by detecting object movement within the scene; examples include geometric methods and optical flow methods. DynamicFusion [29] utilized geometric features to separate dynamic objects and defined a canonical model specifically for reconstructing non-rigidly deforming dynamic objects; the canonical model was transformed to the live frame using voxel deformation fields. This method addresses the deformation of dynamic objects during motion, enhancing the robustness and accuracy of reconstruction. Similarly, Nerfies [143] enhanced NeRF by optimizing an additional continuous volume deformation field, which warps each observed point into a canonical 5D NeRF representation. D-NeRF [48] incorporated time as an additional input to the system and divided the learning process into two main stages: one stage encodes the scene into a canonical space, and the other maps the canonical representation into the deformed scene at a particular time. VolumeDeform [30] combined SIFT features extracted from RGB images with depth maps for motion tracking, enhancing the robustness of matching-point recognition. References [33] and [144] applied K-means clustering to perform visual clustering and assigned static weights to each clustered pixel or point. Reference [145] estimated static weights based on the distances between corresponding point and line features and filtered the data related to dynamic targets using these static weights, achieving precise localization and tracking of the targets. Reference [146] constructed a foreground model based on the mutual motion between two frames and combined it with RGB-D frame information to segment dynamic and static feature points. Reference [36] first employed a simple and efficient clustering algorithm to group spatially and appearance-related pixels of each keyframe into regions, then identified Candidate Dynamic Keypoints (CDK) in consecutive frames with large reprojection errors and recognized regions with a high CDK ratio as dynamic regions. Reference [147] observed that, regardless of camera movement, the triangle formed by any three fixed points on a static object keeps its shape, so the triangles built from the same three points in different camera coordinate systems are similar; the authors therefore determined whether a feature point is static or dynamic by comparing the similarity of the triangles formed by corresponding feature-point triplets in two keyframes (a sketch of this rigidity test follows this paragraph). Reference [148] introduced a grid-based feature extraction approach that enables fast and efficient extraction of high-quality FAST feature points and additionally combined inertial measurement units for motion prediction, achieving feature tracking and motion consistency detection.
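To illustrate the rigidity test attributed to [147], the following sketch votes on whether feature points are dynamic by checking whether the side lengths of triangles formed by point triplets are preserved between two keyframes. The back-projected 3D points, the tolerance, the number of random triplets, and the voting threshold are assumptions of this example, not values reported in the cited work.

```python
# Minimal sketch of a triangle-rigidity test for motion segmentation.
# pts_a, pts_b: corresponding Nx3 points in the camera coordinates of two
# keyframes (back-projection from depth is assumed to have been done already).
import numpy as np

def triangle_consistent(pts_a, pts_b, i, j, k, tol=0.02):
    """True if the triangle (i, j, k) keeps its side lengths across frames."""
    def sides(p):
        return np.array([np.linalg.norm(p[i] - p[j]),
                         np.linalg.norm(p[j] - p[k]),
                         np.linalg.norm(p[k] - p[i])])
    return bool(np.all(np.abs(sides(pts_a) - sides(pts_b)) < tol))

def label_dynamic(pts_a, pts_b, n_trials=200, tol=0.02, seed=0):
    """Vote per point over random triplets; low consistency => likely dynamic."""
    rng = np.random.default_rng(seed)
    n = len(pts_a)
    votes, counts = np.zeros(n), np.zeros(n)
    for _ in range(n_trials):
        i, j, k = rng.choice(n, size=3, replace=False)
        ok = triangle_consistent(pts_a, pts_b, i, j, k, tol)
        for idx in (i, j, k):
            votes[idx] += ok
            counts[idx] += 1
    ratio = votes / np.maximum(counts, 1)
    return ratio < 0.5          # threshold of 0.5 is an assumption
```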

2) Deep Learning-Based Methods

Unlike traditional methods based on motion analysis, deep learning-based methods can learn semantic information as priors from training datasets, and the way this semantic information is extracted with different image processing techniques affects how well dynamic-scene problems are handled. Currently, many methods use semantics to make motion segmentation more robust [35], [39], [43], [149], [150], [151], [152], [153], [154], [155], [156], [157], [158]. These methods employ deep neural networks for semantic segmentation and object recognition on RGB-D images to achieve dynamic object detection and tracking.

Mask R-CNN [39] is an image-based instance-level segmentation algorithm that can provide prior information on dynamic objects in a scene. As shown in Figure 11(a), it can provide bounding boxes for dynamic objects. However, within the bounding boxes, some static background areas are classified as dynamic foreground, and some dynamic foreground objects are classified as static background. In RGB-D data, the depth difference between dynamic regions and the static background can be used to refine the segmentation; in addition, compared with smooth areas inside objects, the normal and variance differences at the boundary between a dynamic region and its adjacent static background are larger. Reference [39] therefore uses connected component analysis to optimize the segmentation results. Specifically, the dynamic weight of a pixel is given by
\begin{equation*} \mathcal {O} = \mathcal {O}_{d} + \mathcal {O}_{\sigma } + \gamma _{1} \mathcal {O}_{e} \tag {7}\end{equation*}
where \mathcal {O}_{d} is the depth difference, \mathcal {O}_{\sigma } is the variance, and \gamma _{1} is the weight of the normal difference \mathcal {O}_{e} ; these terms are obtained as
\begin{align*} \mathcal {O}_{d} & = \max _{i \in N} |(\mathbf {v}_{i} - \mathbf {v}) \cdot \mathbf {n}| \tag {8}\\ \mathcal {O}_{e} & = \max _{i \in N} \begin{cases} 0, & \text {if } ((\mathbf {v}_{i} - \mathbf {v}) \cdot \mathbf {n}) \lt 0 \\ 1 - (\mathbf {n}_{i} \cdot \mathbf {n}), & \text {otherwise} \end{cases} \tag {9}\\ \mathcal {O}_{\sigma } & = \sqrt {\frac {1}{N} \sum _{i=1}^{N} (\mathbf {v}_{i} - \mathbf {v})^{2}} \tag {10}\end{align*}
where \mathbf {v} is a point on the depth map, \mathbf {n} is its normal, N denotes the set of neighborhood point indices of \mathbf {v} , \mathbf {v}_{i} denotes a neighboring point of \mathbf {v} , and \mathbf {n}_{i} is its normal. Figure 11(b) shows the results optimized with the connected component method based on the value of \mathcal {O} .
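The sketch below evaluates Eqs. (7)-(10) for a single pixel with NumPy, assuming that a vertex map and a normal map have already been computed from the depth image; the 3x3 neighborhood, the value of \gamma _{1} , and the border handling are assumptions of this example rather than choices reported in [39].

```python
# Minimal sketch of the per-pixel dynamic weight in Eqs. (7)-(10).
# V:   H x W x 3 vertex map (back-projected depth), Nrm: H x W x 3 normal map.
# (u, v) is assumed not to lie on the image border.
import numpy as np

def dynamic_weight(V, Nrm, u, v, gamma1=0.5):
    p, n = V[u, v], Nrm[u, v]
    # 3x3 neighbourhood around (u, v), excluding the centre pixel
    nb = [V[i, j] for i in range(u - 1, u + 2) for j in range(v - 1, v + 2)
          if (i, j) != (u, v)]
    nb_n = [Nrm[i, j] for i in range(u - 1, u + 2) for j in range(v - 1, v + 2)
            if (i, j) != (u, v)]
    diff = np.stack(nb) - p                              # v_i - v
    o_d = np.max(np.abs(diff @ n))                       # Eq. (8): depth difference
    dots = diff @ n
    o_e = np.max(np.where(dots < 0, 0.0,                 # Eq. (9): normal difference
                          1.0 - np.stack(nb_n) @ n))
    o_sigma = np.sqrt(np.mean(np.sum(diff**2, axis=1)))  # Eq. (10): variance
    return o_d + o_sigma + gamma1 * o_e                  # Eq. (7)
```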

FIGURE 11. Segmentation of dynamic regions. (a) is the result of Mask R-CNN. (b) is the result optimized using the connected component analysis method. Figure taken from [39].

Additionally, PSPNet-SLAM [151] and DDL-SLAM [152] use PSP-Net and DU-Net respectively as deep learning (DL) models for segmenting dynamic scenes and static backgrounds. However, these segmentation methods using DL models require high memory consumption and computational cost. To improve the real-time performance of the reconstruction system, LRD-SLAM [153] proposed a fast deep convolutional neural network (FNet) for semantic segmentation, which can quickly and accurately identify pedestrian information in a given scene.

Moreover, [154] found that most classic semantic SLAM methods generate semantic results for each frame individually, such as DynaSLAM [155], DS-SLAM [156], and DM-SLAM [35], leading to redundant operations. Since the input to visual SLAM is a sequence of continuous frames, the segmentation results of consecutive frames have many similarities, making it unnecessary to segment each frame. Reference [154] segments only the keyframes and propagates the segmentation results of the keyframes to their adjacent frames, significantly avoiding the time delay caused by segmenting each frame while ensuring segmentation accuracy. The experimental results are shown in Figure 12. Recently, [43] and [157] employed YOLO v5 for detecting dynamic objects in the scene, further enhancing segmentation accuracy. DDN-SLAM [158] leveraged deep semantic system priors and conditional probability fields for effective segmentation. Through the creation of depth-guided static masks and the use of joint multi-resolution hashing encoding, it achieves rapid hole filling and superior mapping quality, effectively reducing the impact of dynamic information.

FIGURE 12. The propagation of dynamic probabilities during tracking, where green points indicate feature points with initial dynamic probabilities, blue points represent identified static feature points, and red points represent dynamic feature points. Figure taken from [154].

B. Camera Tracking

In static scene reconstruction, slightly moving objects are treated as static and their motion is ignored during pose estimation by relying on rigid alignment. However, because the motion of dynamic objects in dynamic scenes varies significantly, directly applying static algorithms for camera pose tracking leads to failure. Handling dynamic objects is therefore crucial for dynamic 3D reconstruction. One approach considers dynamic objects as outliers and removes them directly during reconstruction, focusing pose estimation on the static scene; this is referred to as the direct approach. Another approach exploits the features of dynamic objects and reconstructs both static and dynamic objects in the scene simultaneously; this is known as the indirect approach. In this section, we discuss how to handle dynamic objects from these two perspectives in order to achieve accurate pose estimation in dynamic scenes.

1) Direct Approach

In a real indoor environment, it is inevitable to encounter dynamic objects, such as freely moving people or rolling balls. These dynamic objects can significantly impact camera tracking: during localization, the camera struggles to acquire sufficient static features due to occlusions caused by dynamic objects, leading to localization failure. When dealing with loop closure, the displacement of dynamic objects confuses the camera as the scene exhibits different visual appearances compared to when the camera last observed the dynamic objects. Direct approaches solve the interference of dynamic objects on camera pose estimation by removing the data of dynamic objects during reconstruction.

Reference [159] introduced a robust background model-based dense visual odometry (BaMVO) algorithm that estimates the background of each frame and performs camera pose estimation after eliminating foreground moving objects. This method effectively reduces the impact of dynamic objects on the camera trajectory, enhancing reconstruction accuracy. Similarly, [146] filtered out the data associated with dynamic objects directly in the preprocessing stage to enhance the robustness of RGB-D SLAM. Reference [160] utilized a Bayesian framework for dynamic region detection, considering prior knowledge and observation information generated during object detection; after obtaining the detection results, dynamic regions are removed, and only features from static regions are extracted for camera tracking. Reference [150] combined ORB-SLAM2 with the PSPNet [161] semantic segmentation network to propose the PSPNet-SLAM system, which first removes dynamic points with large optical-flow values and then performs a secondary filtering step with PSPNet to ensure more accurate matching. Most of these methods detect dynamic objects by analyzing only a few consecutive frames; however, since many dynamic objects remain static for short periods of time, this can lead to failures in detecting moving objects. Based on this observation, LC-CRF SLAM [162] constructed a long-term consistent conditional random field (CRF) that provides more accurate camera trajectory estimation through long-term observations across multiple frames. Semantic information based on deep learning can eliminate the influence of dynamic objects, but it involves high computational cost and cannot handle unknown objects. Reference [163] proposed a real-time semantic RGB-D SLAM system tailored for dynamic environments, which performs semantic segmentation only on keyframes to remove known dynamic objects and maintains a static map for robust camera tracking. After removing dynamic objects, [153] repaired the missing static background using information from keyframes to facilitate subsequent point cloud reconstruction. Reference [42] introduced a hierarchical representation that segments images into planar and non-planar regions; by removing dynamic non-planar objects, it segments and tracks multiple dynamic planar rigid objects. Reference [40] created a sparse graph from all map points using Delaunay triangulation and used the correlations among the mapped points in the graph to divide the points in the scene into groups, where the largest group is considered to contain the static map points; only these static points were then used to estimate the camera motion. Recently, [43] incorporated an improved dense point cloud generation module into ORB-SLAM3 [84], which removes dynamic objects using dynamic object information extracted by YOLO v5, yielding a point cloud representation of the static scene and more accurate camera poses.
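A minimal sketch of the direct strategy is given below: keypoints that fall inside a dynamic-object mask are discarded before pose estimation with OpenCV's PnP-RANSAC. The mask source, the ORB settings, and the match_3d_points data-association placeholder are assumptions of this example, not the pipeline of any specific method above.

```python
# Minimal sketch: mask out dynamic keypoints, then estimate the camera pose
# from the remaining static 2D-3D correspondences.
import cv2
import numpy as np

def estimate_pose_static_only(gray, dyn_mask, K, match_3d_points):
    """dyn_mask: HxW uint8, nonzero where a pixel belongs to a dynamic object.
    match_3d_points(indices) is a placeholder for data association against the
    map; it should return the Nx3 points paired with the kept keypoints."""
    orb = cv2.ORB_create(2000)
    kps, _ = orb.detectAndCompute(gray, None)

    static_pts, static_idx = [], []
    for i, kp in enumerate(kps):
        u, v = int(round(kp.pt[0])), int(round(kp.pt[1]))
        if dyn_mask[v, u] == 0:            # keep only static-background keypoints
            static_pts.append(kp.pt)
            static_idx.append(i)

    obj_pts = np.asarray(match_3d_points(static_idx), dtype=np.float32)
    img_pts = np.asarray(static_pts, dtype=np.float32)

    # Requires enough static correspondences to succeed.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(obj_pts, img_pts, K, None)
    return ok, rvec, tvec
```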

2) Indirect Approach

Instead of removing dynamic objects, the indirect method analyzes features, assigning weights to both static and dynamic objects. Based on these weights, it fully utilizes static and dynamic features to achieve scene tracking and mapping. References [41], [164], and [165] assigned weights to static points and utilized these static weights to eliminate the influence of dynamic objects. Reference [164] employed depth edge points for frame-to-keyframe registration, where each edge point is assigned a static weight that is then used in the Intensity-assisted Iterative Closest Point (IAICP) algorithm for motion estimation, thereby reducing the impact of dynamic components. Additionally, effective loop closure detection was incorporated to decrease tracking errors. Reference [165] utilized double K-means clustering to detect dynamic objects, followed by establishing static weights for the feature points in the current frame, which comprehensively consider static probability and static observation value (SON). Finally, the traditional RANSAC algorithm was modified to suit dynamic reconstruction. With the advancement of deep learning, the integration of semantic knowledge into dynamic reconstruction yields excellent results. Reference [41] proposed the DPF-SLAM algorithm, which combines the dynamic prior probability obtained from semantic segmentation with the dynamic probability obtained from dynamic point detection, thereby reducing the influence of dynamic objects on camera localization.
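To make the weighting idea concrete, the sketch below solves for a rigid transform with a confidence-weighted Kabsch/SVD step so that correspondences with low static weights contribute little to the estimate; the correspondences and the weights themselves are assumed to come from one of the schemes described above.

```python
# Minimal sketch of weight-based rigid alignment for indirect methods.
import numpy as np

def weighted_rigid_transform(src, dst, w):
    """src, dst: Nx3 corresponding 3D points; w: N static weights in [0, 1].
    Returns R, t minimising the weighted sum of ||R @ src_i + t - dst_i||^2."""
    w = w / np.sum(w)
    mu_s = (w[:, None] * src).sum(axis=0)
    mu_d = (w[:, None] * dst).sum(axis=0)
    S = (w[:, None] * (src - mu_s)).T @ (dst - mu_d)    # weighted cross-covariance
    U, _, Vt = np.linalg.svd(S)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                                  # proper rotation
    t = mu_d - R @ mu_s
    return R, t
```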

In dynamic scene reconstruction, it is commonly assumed that the predominant portion of image frames represents the static background. However, in complex scenes with numerous dynamic objects, without semantic segmentation, a significant portion of dynamic objects may be mistakenly identified as static backgrounds. Moreover, dynamic objects can occlude a substantial amount of color and depth information, leading to insufficient static information in visual input to support accurate self-motion estimation of the camera. Reference [32] employed a multi-model fitting approach to identify dynamic objects in the scene using motion segmentation and semantic segmentation. Each object was reconstructed individually, and over time, increasingly refined dynamic models can be obtained. Complete geometric objects play a crucial role in tracking camera trajectories. Reference [166] inferred the complete geometric shapes of each object to establish correspondences among instances, enabling the estimation of object poses in each frame. Reference [167] employed multiple motion segmentation methods to segment the motion models of different moving objects, obtaining accurate masks for the moving objects and generating 4D models and trajectories of the moving objects in a global reference frame, while reconstructing dense maps of the static background. Reference [168] introduced rigid and motion constraints to model articulated objects, allowing for the joint optimization of camera pose, object motion, and object 3D structure. This approach corrects estimation errors in camera pose, prevents tracking loss, and generates a 4D spatiotemporal map that includes both dynamic targets and static scenes.

C. Model Fusion

The fusion of dynamic scenes extends the fusion of static scenes and includes three approaches: voxel-based, surfel-based, and NeRF-based.

1) Voxel-Based

DynamicFusion [29], as a pioneering real-time dynamic 3D reconstruction algorithm, extended Projective TSDF to reconstruct and fuse scenes. VolumeDeform [30] utilized a voxel model that not only stores scene data in the undeformed pose, such as TSDF values, color, and confidence value, but also stores information about the current spatial deformation, represented by deformation field parameters. These mainly focus on scenes with individually deformable objects moving in a non-rigid manner. When multiple dynamic objects are present in the scene, MID-Fusion [169] integrates depth, color, semantic, and foreground probability information into an object model based on an octree volume representation using foreground and background masks. EM-Fusion [34] reconstructed dynamic objects based on the TSDF model and creatively used Expectation-Maximization (EM) to determine the unknown association between pixels and objects. Reference [170] introduced a neural scene flow field, which defines a set of voxel boundary implicit fields using a sparse voxel octree, to simulate local properties and achieve the reconstruction of complex dynamic scenes. These TSDF representations store different dynamic objects of the entire scene as multiple 3D models, enabling easy fusion and updating based on their respective poses. However, during surface extraction, ray casting needs to be performed separately for each object model. Additionally, occlusion handling during ray casting is required to determine surface visibility when objects are occluded. To address these issues, [171] proposed a novel map representation method called TSDF++, which allows simultaneous reconstruction of both static scenes and dynamic objects within a 3D volume model.
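The following sketch shows the weighted running-average TSDF update that underlies the volumetric methods above (object-level systems such as MID-Fusion would keep one such volume per object). The voxel layout, the camera-to-world pose convention, and the truncation distance are assumptions of this example.

```python
# Minimal sketch of a projective TSDF fusion step over a regular voxel grid.
import numpy as np

def fuse_frame(tsdf, weight, depth, K, T_wc, origin, voxel_size, trunc=0.04):
    """tsdf, weight: DxDxD contiguous volumes updated in place; depth: HxW metres;
    K: 3x3 intrinsics; T_wc: 4x4 camera-to-world pose (assumed convention)."""
    T_cw = np.linalg.inv(T_wc)
    idx = np.indices(tsdf.shape).reshape(3, -1).T              # all voxel indices
    pts_w = origin + (idx + 0.5) * voxel_size                  # voxel centres (world)
    pts_c = pts_w @ T_cw[:3, :3].T + T_cw[:3, 3]               # camera frame
    z = pts_c[:, 2]
    z_safe = np.where(z > 1e-6, z, 1.0)
    u = np.round(K[0, 0] * pts_c[:, 0] / z_safe + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * pts_c[:, 1] / z_safe + K[1, 2]).astype(int)
    H, W = depth.shape
    valid = (z > 1e-6) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    d_obs = np.zeros_like(z)
    d_obs[valid] = depth[v[valid], u[valid]]
    sdf = d_obs - z                                            # signed distance along z
    upd = valid & (d_obs > 0) & (sdf > -trunc)
    phi = np.clip(sdf[upd], -trunc, trunc) / trunc
    t_flat, w_flat = tsdf.reshape(-1), weight.reshape(-1)      # views into the volumes
    t_flat[upd] = (w_flat[upd] * t_flat[upd] + phi) / (w_flat[upd] + 1.0)
    w_flat[upd] += 1.0
```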

2) Surfel-Based

Volumetric methods can generate smooth triangle meshes, but they suffer from high computational and memory costs. Surfel-based methods, on the other hand, are more efficient, although post-processing is required if a mesh model is desired [172]. For non-rigid objects in dynamic scenes the computation becomes even more complex, making surfel-based representations a promising solution for real-time dynamic reconstruction. Co-Fusion [32] extended the surfel-based mapping framework of ElasticFusion to handle dynamic scenes, enabling tracking and reconstruction of segmented dynamic objects based on motion and semantic cues in each frame. However, Co-Fusion lacks real-time capability for dynamic scene reconstruction. MaskFusion [173] built upon Co-Fusion to create a real-time dynamic 3D reconstruction system, representing each geometric entity as a set of surfels and incorporating semantic information into the map. Reference [33] proposed a strategy to assess the validity of each surfel by estimating the dynamic probability of each input point and eliminating surfels that match dynamic input points. Reference [174] achieved real-time non-rigid reconstruction using a stream of depth images as input, along with a surfel-based scene representation, effectively handling topological changes and tracking failures and resulting in efficient dynamic 3D reconstruction. Surfel maps typically consist of a large number of surfels, requiring powerful GPUs for online processing. Reference [175] introduced the use of superpixels to generate surfel maps, significantly improving storage efficiency and speed: if the observation or fusion count of a surfel falls below a threshold within a certain time window, it is considered an outlier and removed to mitigate the influence of dynamic objects on the reconstruction results. The SLAM system proposed in [176] relies on super-surface elements [177], planar patches generated from superpixels, to model the static parts of the environment.
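The sketch below illustrates a basic surfel update by confidence-weighted averaging of position, normal, and color, together with a simple count-based outlier check; the data-association step, the confidence increment, and the pruning threshold are assumptions and are simplified relative to the systems discussed above.

```python
# Minimal sketch of a surfel data structure and its fusion/pruning rules.
import numpy as np

class Surfel:
    def __init__(self, pos, normal, color, radius, conf=1.0):
        self.pos, self.normal, self.color = pos, normal, color
        self.radius, self.conf = radius, conf

    def fuse(self, pos, normal, color, radius, alpha=1.0):
        """Merge an associated measurement by confidence-weighted averaging."""
        w = self.conf
        self.pos = (w * self.pos + alpha * pos) / (w + alpha)
        n = w * self.normal + alpha * normal
        self.normal = n / np.linalg.norm(n)
        self.color = (w * self.color + alpha * color) / (w + alpha)
        self.radius = min(self.radius, radius)   # keep the finer radius estimate
        self.conf = w + alpha

    def is_outlier(self, min_conf=3.0):
        """Surfels whose accumulated confidence stays low can be pruned,
        mirroring the count-threshold removal described above."""
        return self.conf < min_conf
```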

3) NeRF-Based

There have been numerous attempts to apply NeRF to scenes from RGB video streams [178], [179]. However, these methods reconstruct scenes under the assumption of accurate camera poses, heavily relying on previous camera registration, which can fail when objects undergo significant motion. Unlike NeRF reconstruction solely from RGB data, the introduction of depth information makes these issues more manageable. Simultaneously, to handle geometries beyond the sensor’s explicit range and regions with low reflectivity, raw continuous-wave ToF images are used instead of direct depth maps, effectively enhancing the view synthesis quality in dynamic scenes [180].

Reference [181] proposed a framework called Time-Aware Neural Voxels (TiNeuVox), which explicitly represents temporal information in dynamic scenes, making their modeling and rendering more efficient. Specifically, it feeds a time embedding \mathbf {t_{i}} obtained from the frequency-encoded time \gamma (t_{i}) together with the coordinates of a sampled point (x, y, z) into a compact deformation network \Phi _{d} , producing the deformed coordinates (x', y', z') :
\begin{equation*} x', y', z' = \Phi _{d}(x, y, z, \mathbf {t_{i}}) \tag {11}\end{equation*}
Additionally, to reconstruct the motion trajectories of points more accurately and to improve training speed, [181] uses a multi-distance interpolation scheme to capture motion at different scales:
\begin{align*} v & = v_{1} \oplus \cdots v_{m} \cdots \oplus v_{M}, \tag {12}\\ v_{m} & = \text {interp}(x, y, z, V[::s_{m}]), \tag {13}\end{align*}
where v is the concatenation of the interpolated features v_{m} across multiple scales, and v_{m} is obtained by interpolating at the point (x, y, z) after sampling the voxel grid V with a stride of s_{m} . Moreover, Neural-Dynamic Reconstruction (NDR) was proposed to recover high-fidelity geometry and motion of dynamic scenes from a monocular RGB-D camera [182]. It employs a novel neural reversible deformation network to represent and constrain non-rigid deformations, and a topology-aware strategy is used to establish correspondences for fused frames. Building upon this, DNA-Net [183] modeled dynamic motion using articulated bones, helping the model converge faster and making it more suitable for applications such as human pose manipulation. Because SfM algorithms often produce pose-estimation errors in highly dynamic scenarios with poorly textured surfaces, [184] reconstructed the entire scene using both a static NeRF and a dynamic NeRF: the static NeRF reconstructs the static parts and estimates camera poses and focal lengths, while the dynamic NeRF models the dynamic aspects of the scene from the video. Recently, the effectiveness of depth-constrained NeRF in dynamic operating rooms has been validated [185], demonstrating the generation of geometrically consistent views from novel perspectives.
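The PyTorch sketch below mirrors the structure of Eqs. (11)-(13): a small MLP predicts deformed coordinates from a point and its time embedding (written here as a residual offset, an implementation choice), and voxel features are gathered by interpolating the grid at several strides and concatenating the results. Network sizes, the stride set, and the use of grid_sample for trilinear interpolation are assumptions of this example, not the exact architecture of [181].

```python
# Minimal sketch of a deformation network and multi-distance interpolation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformNet(nn.Module):                       # Eq. (11), residual form
    def __init__(self, t_dim=8, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3 + t_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 3))
    def forward(self, xyz, t_emb):                # xyz: (N, 3), t_emb: (N, t_dim)
        return xyz + self.mlp(torch.cat([xyz, t_emb], dim=-1))

def multi_distance_interp(V, xyz, strides=(1, 2, 4)):   # Eqs. (12)-(13)
    """V: (C, D, H, W) voxel features; xyz: (N, 3) normalised to [-1, 1],
    ordered (x, y, z) as grid_sample expects."""
    feats = []
    for s in strides:
        Vs = V[:, ::s, ::s, ::s].unsqueeze(0)           # V[::s_m]
        grid = xyz.view(1, -1, 1, 1, 3)
        f = F.grid_sample(Vs, grid, mode="bilinear", align_corners=True)
        feats.append(f.view(V.shape[0], -1).t())        # (N, C) per scale
    return torch.cat(feats, dim=-1)                     # v = v_1 (+) ... (+) v_M
```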

SECTION V.

Dataset and Evaluation Metrics

A. Datasets

In this section, we present the relevant datasets used by the static and dynamic reconstruction algorithms. Table 3 briefly summarizes the scenes, details, applications and publication years included in the relevant datasets.

TABLE 3 Overview of Static and Dynamic Datasets

The TUM dataset [186] utilizes Microsoft Kinect to capture RGB-D data of office scenes and an industrial hall. The dataset consists of 39 sequences, including color images, depth maps, and ground truth camera poses associated with time. The fr1 and fr2 sequences primarily feature static scenes. The fr3 sequences are used for quality evaluation of dynamic 3D reconstruction. In the fr3/sitting sequence, two people are sitting and lightly moving while chatting at a table. The fr3/walking sequence involves two people walking around a table, exhibiting highly dynamic motion. Additionally, the dataset provides two evaluation metrics, Relative Pose Error (RPE) and Absolute Trajectory Error (ATE), which can be used to assess the performance of visual SLAM systems.

The ICL-NUIM dataset [187] is used for evaluating RGB-D visual odometry, 3D reconstruction, and SLAM systems. The data are acquired by rendering images along camera trajectories through 3D models using POVRay. The dataset includes two scenes, a living room and an office; the living room scene additionally provides the corresponding 3D polygon model so that the accuracy of the final reconstruction can be assessed. The Aug-ICL-NUIM dataset [9] extends ICL-NUIM with challenging camera trajectories and realistic noise models, enhancing the dataset in various ways to support the evaluation of complete scene reconstruction pipelines.

The CoRBS dataset [188] provides real depth and color data, along with ground truth camera trajectories and 3D models of the scenes. It includes 20 KinectV2 image sequences captured from four different scenes. The data is directly provided in a global coordinate system, enabling direct evaluation without the need for further alignment or calibration.

The NYUD v2 dataset [189] consists of 1449 indoor RGB-D images captured using Microsoft Kinect devices, accompanied by detailed annotations. However, due to its relatively small scale, it is challenging to apply it to deep learning architectures. The SUN3D dataset [190] offers a large-scale RGB-D video database that includes semantic information and corresponding pose information for objects in each scene. The RGB-D Object dataset [191] contains 300 objects from 51 categories, providing high-quality color and depth images for each frame in the videos. Additionally, it introduces RGB-D-based object recognition and detection techniques that significantly improve the reconstruction quality by leveraging both color and depth information.

Widely used RGB-D datasets often lack comprehensive and fine-grained annotations. Binh-Son Hua et al. introduced SceneNN [192], which collects RGB-D data from more than 100 indoor scenes, including dormitories, offices, and classrooms. The authors reconstructed each scene as a triangle mesh and added per-vertex and per-pixel annotations, enriching the dataset with fine-grained information. John McCormac et al. introduced SceneNet RGB-D [193], which provides accurate pixel-level semantic information for scene understanding tasks such as object detection, semantic segmentation, and instance segmentation, along with camera poses and depth data to facilitate the study of geometric problems in computer vision.

The PASCAL VOC dataset [194] provides more than 20 object classes that can be used for handling potentially moving objects, such as humans, cats, and dogs. This dataset can be used to train segmentation networks for segmenting dynamic objects. Compared to PASCAL VOC, the MS COCO dataset [195] offers a larger number of categories and instances, enabling models to better learn contextual information. ShapeNet [196] is a large-scale 3D model dataset that includes 3D models from various semantic categories, making it suitable for part segmentation tasks.

The Bonn RGB-D dynamic dataset [197] provides rich and complex dynamic data, including 24 highly dynamic scenes. To apply data-driven methods to non-rigid 3D reconstruction, DeepDeform [198] utilizes a semi-supervised labeling approach and obtains a large dataset of 400 scenes, consisting of over 390,000 RGB-D frames and 5,533 densely aligned frame pairs. HRPSlam [199] is the first system to capture dynamic RGB-D data using a humanoid robot, simulating scenarios such as human walking with jitter or falling. The provided dataset includes two complete loops and can be used for evaluating global loop closure or local reconstruction in dynamic environments. The Oxford-IHM dataset [200] uses multiple large objects as static obstacles and records the walking trajectories of people in indoor environments.

B. Evaluation Metrics

In 3D reconstruction, the accuracy of the camera pose directly affects the reconstruction accuracy, making it crucial to evaluate pose accuracy. If the pose error is large, the reconstructed model may be distorted or inaccurate, degrading the overall quality of the model. The metrics for evaluating pose accuracy mainly include Relative Pose Error (RPE), Absolute Trajectory Error (ATE), and Alignment Error (AE). The overall quality of the reconstruction can be measured by surface accuracy. Next, we introduce these evaluation metrics in detail.

Relative Pose Error (RPE): RPE measures the local accuracy of the trajectory over a fixed time interval \Delta . For the estimated trajectory P_{1}, \ldots, P_{n} \in SE(3) and the ground truth trajectory Q_{1}, \ldots, Q_{n} \in SE(3) , the relative pose error at time step i is given by
\begin{equation*} E_{i} = \left ( Q_{i}^{-1} Q_{i+\Delta } \right)^{-1} \left ( P_{i}^{-1} P_{i+\Delta } \right) \tag {14}\end{equation*}
From a sequence of n camera poses, we can obtain m = n - \Delta relative pose errors. The overall relative pose error is evaluated by calculating the root mean square error (RMSE) of the translation components across all time indices:
\begin{equation*} \text {RMSE}(E_{1:n}, \Delta) = \left ( \frac {1}{m} \sum _{i=1}^{m} \|\text {trans}(E_{i})\|^{2} \right)^{1/2} \tag {15}\end{equation*}
where \text {trans}(E_{i}) represents the translation component of the relative pose error E_{i} . RPE is an important metric for evaluating the local accuracy of trajectory estimation: by comparing the relative motion over a fixed interval, it effectively measures the system's pose accuracy over short time spans.
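A compact NumPy sketch of Eqs. (14)-(15) is given below for trajectories stored as lists of 4x4 homogeneous poses; the interval \Delta defaults to one frame in this example.

```python
# Minimal sketch of the RPE RMSE over the translation components.
import numpy as np

def rpe_rmse(P, Q, delta=1):
    """P: estimated poses, Q: ground-truth poses, both lists of 4x4 arrays."""
    errs = []
    for i in range(len(P) - delta):
        dQ = np.linalg.inv(Q[i]) @ Q[i + delta]
        dP = np.linalg.inv(P[i]) @ P[i + delta]
        E = np.linalg.inv(dQ) @ dP                   # Eq. (14)
        errs.append(np.linalg.norm(E[:3, 3]))        # trans(E_i)
    return np.sqrt(np.mean(np.square(errs)))         # Eq. (15)
```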

Absolute Trajectory Error (ATE): ATE provides a direct numerical measure that intuitively reflects the algorithm's accuracy and the global consistency of the trajectory. When calculating ATE, the trajectories are first aligned using the Horn method [201], which finds the rigid-body transformation S corresponding to the least-squares solution that maps the estimated trajectory P_{i} onto the ground truth trajectory Q_{i} . After obtaining this transformation, the absolute trajectory error F_{i} at time step i can be computed as
\begin{equation*} F_{i}:= Q_{i}^{-1} S P_{i} \tag {16}\end{equation*}
The absolute trajectory error is evaluated by calculating the root mean square error (RMSE) of the translation components across all time indices:
\begin{equation*} \text {RMSE}(F_{1:n}):= \left ( \frac {1}{n} \sum _{i=1}^{n} \|\text {trans}(F_{i})\|^{2} \right)^{1/2} \tag {17}\end{equation*}
ATE can effectively measure the accuracy and consistency of reconstruction systems, providing a standardized evaluation tool for the comparison of different algorithms.
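The sketch below computes Eqs. (16)-(17) for trajectories given as 4x4 poses; the least-squares rigid alignment S is obtained here from the translation parts of the two trajectories with a standard SVD-based (Kabsch) solution, which is one common way to realize Horn's alignment.

```python
# Minimal sketch of the ATE RMSE with a rigid least-squares pre-alignment.
import numpy as np

def ate_rmse(P, Q):
    """P, Q: lists of 4x4 estimated / ground-truth poses."""
    p = np.array([T[:3, 3] for T in P])              # estimated positions
    q = np.array([T[:3, 3] for T in Q])              # ground-truth positions
    mu_p, mu_q = p.mean(0), q.mean(0)
    U, _, Vt = np.linalg.svd((p - mu_p).T @ (q - mu_q))
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                               # rotation mapping P onto Q
    t = mu_q - R @ mu_p
    S = np.eye(4); S[:3, :3] = R; S[:3, 3] = t       # rigid alignment S

    errs = [np.linalg.norm((np.linalg.inv(Qi) @ S @ Pi)[:3, 3])   # Eq. (16)
            for Pi, Qi in zip(P, Q)]
    return np.sqrt(np.mean(np.square(errs)))         # Eq. (17)
```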

Alignment Error (AE): AE [202] is a comprehensive metric that balances the effects of scale, rotation, and translational drift on the trajectory. Suppose p_{1}, \ldots, p_{n} \in \mathbb {R}^{3} represent the tracked positions from frame 1 to frame n. Let S \subset [1; n] and E \subset [1; n] represent the frame indices of the start and end segments, respectively, for which aligned ground truth positions \hat {p} \in \mathbb {R}^{3} are provided. Independently aligning the tracked trajectory with the start and end segments provides two relative transformations:
\begin{align*} T_{s}^{gt} & := \mathop {\mathrm {arg\,min}} _{T \in \text {Sim}(3)} \sum _{i \in S} \left ( T p_{i} - \hat {p}_{i} \right)^{2} \tag {18}\\ T_{e}^{gt} & := \mathop {\mathrm {arg\,min}} _{T \in \text {Sim}(3)} \sum _{i \in E} \left ( T p_{i} - \hat {p}_{i} \right)^{2} \tag {19}\end{align*}
where \text {Sim}(3) represents the group of similarity transformations in three-dimensional space. The alignment error between the trajectories aligned at the start and end segments is given by
\begin{equation*} e_{\text {align}}:= \sqrt { \frac {1}{n} \sum _{i=1}^{n} \| T_{s}^{gt} p_{i} - T_{e}^{gt} p_{i} \|_{2}^{2} } \tag {20}\end{equation*}
Surface accuracy: Surface accuracy measures the quality of a reconstruction by calculating the distance between the estimated surface and the ground truth surface. Over the distances of all vertices in the reconstruction, five standard statistics are reported: mean, median, standard deviation, minimum, and maximum. A smaller value of this metric indicates a better-quality reconstruction.
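The following sketch approximates surface accuracy by measuring, for every reconstructed vertex, the distance to the nearest point of a densely sampled ground-truth model, and then reports the five statistics listed above; the nearest-neighbor approximation of the point-to-surface distance is an assumption of this example.

```python
# Minimal sketch of per-vertex surface accuracy statistics.
import numpy as np
from scipy.spatial import cKDTree

def surface_accuracy(recon_verts, gt_points):
    """recon_verts: Nx3 reconstructed vertices; gt_points: Mx3 densely sampled
    ground-truth surface points (both in metres)."""
    d, _ = cKDTree(gt_points).query(recon_verts)     # nearest-neighbour distance
    return {"mean": d.mean(), "median": np.median(d),
            "std": d.std(), "min": d.min(), "max": d.max()}
```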

1) Evaluation in Static Scenarios

In this section, we quantitatively compared the ATE and surface accuracy of the following algorithms: [6], [7], [8], [9], [10], [11], [22], [23], [24], [25], [26], [27], [74], [75], [83], [87], [122], [133], [203], [204].

As shown in Table 4, we tested different algorithms on four sequences from the TUM dataset, fr1/desk, fr2/xyz, fr3/office, and fr3/nst, and calculated the absolute trajectory error (ATE) for each algorithm. By comparing the ATE results, we can assess the accuracy of each algorithm's pose estimation and thereby evaluate its performance. The data in the table are sourced from the respective papers, where "-" indicates that the corresponding value was not reported. The results are given with three decimal places of precision. From the experimental results, HRBF-Fusion [25] achieved the best results on the fr1/desk and fr3/office sequences and also performed well on fr2/xyz and fr3/nst. This is because it uses a dynamic implicit Hermite radial basis function (HRBF) representation of continuous surfaces, unlike explicit methods such as Kintinuous [7], Voxelhashing [8], and ElasticFusion [10]. Additionally, NGEL-SLAM [133], which uses a neural implicit representation, also achieved good results on fr1/desk, fr2/xyz, and fr3/office. The results show that implicit representation methods can better align reconstructed trajectories, reduce trajectory errors, and thus improve reconstruction quality.

For evaluating the quality of surface reconstruction, the ICL-NUIM dataset provides ground truth 3D models for generating virtual scan sequences. We use the lr_kt0, lr_kt1, lr_kt2, and lr_kt3 sequences from the living room scene of this dataset as a benchmark for estimating each algorithm's surface reconstruction performance. The surface accuracy (median distance) of each method is shown in Table 5. From the experimental results, the implicit method HRBF-Fusion [25] achieved the best results on lr_kt0 and lr_kt1 and the second-best results on lr_kt2 and lr_kt3. Its surface accuracy surpasses explicit representation methods such as KinectFusion [6], DVO SLAM [74], and Kintinuous [7], indicating that the implicit representation used by HRBF-Fusion can significantly reduce reconstruction errors and improve surface accuracy.

TABLE 4 Absolute Trajectory Error (ATE) of Different Algorithms on the TUM Dataset (m). The Best Results are in Bold and the Second Best Results are in Italics
TABLE 5 Surface Accuracy of Different Algorithms on the ICL-NUIM Dataset (m). The Best Results are in Bold and the Second Best Results are in Italics

2) Evaluation in Dynamic Scenarios

In this section, we quantitatively compared the ATE and RPE of the following algorithms: [32], [33], [34], [35], [36], [38], [39], [43], [83], [84], [155], [156], [157], [158], [163], [165], [173], [197], [205], [206], [207], [208], [209].

We used dynamic sequences from the TUM dataset to evaluate the performance of dynamic reconstruction algorithms. The sequences in the sitting (s) category, where two people are conversing at a desk, are used to assess the robustness of the algorithms to slowly moving dynamic objects. The sequences in the walking (w) category, where two people are walking in an office scene, can be used to evaluate the robustness of the algorithms to quickly moving dynamic objects. Tables 6 and 7 respectively list the performance of outstanding algorithms on the TUM dataset in recent years. We use ATE and RPE as evaluation metrics for the algorithms. In the tables, static, xyz, and half represent different camera movement modes. The data in the tables are sourced from the respective papers, where “-” indicates that corresponding data was not found in the paper. The results are reported with three decimal places of precision. The best results are in bold and the second best results are in italics. From the experimental results, with the development of deep learning and the introduction of semantic information, models can achieve excellent results not only in low dynamic scenes such as Fr3_s_static, Fr3_s_xyz, and Fr3_s_half but also in highly dynamic environments like Fr3_w_static, Fr3_w_xyz, and Fr3_w_half, obtaining accurate camera trajectories. Examples include PLD-SLAM [205], RTCB-SLAM [39], RTDSLAM [207], SEG-SLAM [157], and DDN-SLAM [158].

TABLE 6 Absolute Trajectory Error (ATE) of Different Algorithms on the TUM Dataset (m). The Best Results are in Bold and the Second Best Results are in Italics
TABLE 7 Results of Relative Pose Error (RPE) in Translation Error for Different Algorithms on the TUM Dataset (m/s). The Best Results are in Bold and the Second Best Results are in Italics

SECTION VI.

Conclusion

This study investigates and analyzes indoor scenes, classifying them into static and dynamic scenes, and provides a comprehensive survey of recent reconstruction algorithms.

For static scenes, we outline the general reconstruction process and introduce various optimization algorithms employed in each step. From the reviewed methods, it can be seen that traditional static reconstruction tasks often use explicit scene representations, such as voxels and surfels, which enable real-time scene reconstruction but result in artifacts and holes. This is due to the discontinuous nature of explicit representation methods. The emergence of neural radiance fields (NeRF) provides an implicit representation for 3D reconstruction, making the scene representation more continuous. However, because it requires densely sampling points in space and using MLP to learn scene information, it consumes a lot of training resources and time. Currently, to balance training time and quality, a promising direction is to combine explicit and implicit representations, such as Point-NeRF [210], H2-Mapping [132], and NGEL-SLAM [133].

In contrast, dynamic scenes involve not only camera motion but also other moving objects, which may interfere with camera tracking. Therefore, for reconstructing dynamic scenes, it is necessary to identify dynamic objects through motion segmentation, eliminate or utilize dynamic features for pose estimation, and integrate the scene data into the reconstruction model. Currently, a popular approach is to use deep learning methods to leverage semantic information of the scene to segment dynamic objects, thereby improving the quality of scene reconstruction. Additionally, NeRF also provides a new direction for dynamic reconstruction. Because NeRF uses MLP to represent the scene more continuously, it can even fill in information that the camera has not observed.

Declaration of Competing Interest

The authors have no competing interests to declare that are relevant to the content of this article.
