
A Survey of Indoor 3D Reconstruction Based on RGB-D Cameras




Abstract:

With the advancement of consumer-grade RGB-D cameras, obtaining depth information for indoor 3D spaces has become increasingly accessible. This paper systematically reviews 3D reconstruction algorithms for indoor scenes using these cameras, serving as a reference for future research. We cover reconstruction processes and optimization algorithms for both static and dynamic scenes. Additionally, we discuss commonly used datasets, evaluation metrics, and the performance of various reconstruction algorithms. Findings indicate that the balance between reconstruction quality and speed in static scene reconstruction, as well as deformation, occlusion, and fast motion of objects in dynamic scenes are currently major concerns. Deep learning and Neural Radiance Fields (NeRF) are poised to provide new perspectives and methods to address these challenges.
This graphic abstract provides an overview of indoor 3D reconstruction using RGB-D cameras. It highlights the distinction between static and dynamic indoor environments, ...
Published in: IEEE Access ( Volume: 12)
Page(s): 112742 - 112766
Date of Publication: 13 August 2024
Electronic ISSN: 2169-3536

SECTION I.

Introduction

In recent years, with the rapid development of computer vision and artificial intelligence technologies, the application of 3D reconstruction technology in indoor environments has received widespread attention. Depending on the input data, 3D reconstruction algorithms can be divided into RGB-D camera-based, stereo array-based, visual-inertial-based, and monocular pure RGB-based types. Compared to other reconstruction methods, the RGB-D camera-based approach can directly obtain the color and depth information of each pixel, reducing the complex depth calculation process and more accurately capturing the geometric structure of objects. Additionally, this method has advantages such as high reconstruction accuracy, fast speed, and high system integration, making it very suitable for complex indoor reconstruction tasks. With the advent of consumer-grade depth cameras like Kinect and RealSense, their lower cost and higher real-time performance have greatly promoted the development and application of indoor 3D reconstruction. In smart home systems, using RGB-D cameras for indoor 3D reconstruction can generate accurate home models, enhancing the automation and intelligence levels of smart home systems. In logistics and warehouse management, it can increase the automation level of warehouse management and logistics operations, improving efficiency and accuracy. In robot navigation applications, using RGB-D cameras to generate 3D maps of the environment can enhance the robot’s autonomous navigation capabilities and task execution efficiency, improving its adaptability in complex environments. This has led many scholars to research indoor 3D reconstruction algorithms based on RGB-D cameras.

Reference [1] provides a comprehensive overview of the application of visual odometry and visual SLAM in the field of mobile robotics, discussing various sensor data fusion methods and emphasizing their application in actual robot navigation. Reference [2] summarizes the latest advancements in indoor scene modeling and discusses public datasets and programming libraries, but the technologies covered are only up to 2015. Reference [3] divides indoor scenes into static and dynamic scenes, mainly focusing on summarizing traditional reconstruction algorithms, with less attention to emerging deep learning methods. Reference [4] discusses in detail the working principles, applications, and role of RGB-D cameras in 3D reconstruction, introducing relevant datasets and future research directions. However, it lacks specific performance comparisons of the latest algorithms and technologies. Reference [5] is a recent review article on the latest indoor reconstruction algorithms, covering various RGB-D camera technologies and their application scenarios, but this article mainly focuses on static 3D reconstruction algorithms, with less consideration for applications in dynamic environments.

As we can see, some scholars have already summarized indoor reconstruction algorithms based on RGB-D cameras, but these studies have their limitations. Moreover, the development of deep learning technologies and neural radiance fields (NeRF) has also provided new directions for this field. Therefore, it is necessary to comprehensively review and summarize the applications of RGB-D cameras in indoor 3D reconstruction, providing a systematic knowledge framework to help researchers quickly understand the latest advancements and key technologies in this field. The main contributions of this paper are as follows: First, we classify indoor 3D reconstruction algorithms based on RGB-D cameras into static and dynamic scenes, revealing the advantages and disadvantages of each method through classification and comparison of different technical approaches, aiding researchers in choosing the most suitable technical path and optimizing existing methods. Second, we summarize the general process of static and dynamic 3D reconstruction algorithms and outline different reconstruction algorithms at each stage. Third, we update the applications of deep learning and neural radiance fields in this field, analyzing their advantages and disadvantages, providing new directions and solutions for future research. Fourth, we provide comprehensive RGB-D datasets and evaluation standards, offering reliable resources and tools to facilitate researchers in technical validation and performance comparison.

The main structure of this paper is as follows: Section II briefly reviews the development history of major static and dynamic 3D reconstruction algorithms in recent years. Section III provides a description of static scene reconstruction, dividing the reconstruction pipeline into different steps, and detailing the optimizations made by various researchers in each step. Section IV focuses on three-dimensional reconstruction of dynamic scenes, with the processing of dynamic objects being the main research content of this chapter. Section V introduces the datasets and evaluation metrics used in the research process. Finally, Section VI summarizes this paper.

SECTION II.

Related Work

In this section, we briefly review the development history of the main algorithms for indoor 3D reconstruction based on RGB-D cameras. Figure 1 shows some classic algorithms organized according to the timeline.

FIGURE 1. The research history of 3D reconstruction in static and dynamic scenes.

In static scenes, RGB-D cameras are the only moving objects. By capturing the camera’s trajectory, we can fuse the obtained depth data into a reconstructed model, and then extract the surface to generate a static 3D model. Reference [6] proposed the first algorithm, KinectFusion, which utilizes an RGB-D camera for real-time 3D reconstruction. Additionally, they outlined a typical pipeline for static 3D reconstruction, comprising depth map processing, camera pose estimation, scene reconstruction, and surface extraction. However, KinectFusion is limited by the voxel model and memory, and it can only reconstruct small scenes. Kintinuous [7] extended KinectFusion to large scene reconstruction by moving the voxel model. In addition, it integrated loop detection and optimization, greatly improving the reconstruction quality. Moreover, VoxelHashing [8] employed voxel hashing as a model storage approach, significantly enhancing storage efficiency. Redwood [9] used an offline method to segment the input RGB-D sequence, reconstructed each segment separately, and then registered the segments using overlapping keyframes, thereby reducing the accumulated error and obtaining high-quality 3D models. These methods are all based on voxel models. ElasticFusion [10] creatively used a surfel representation to continuously optimize the reconstructed map and improve the accuracy of reconstruction and pose estimation. It can achieve real-time high-quality surface reconstruction of small scenes. BundleFusion [11] integrated the research ideas of its predecessors and proposed a parallel optimization framework that fully utilizes sparse features together with dense geometric and photometric terms to perform sparse-to-dense correspondence matching. In terms of pose optimization, they used a local-to-global blocking strategy and added robust tracking to recover from tracking failures (i.e., relocalization), generating higher-quality reconstructions in real time compared to offline methods [9]. With the development of advanced deep learning models [12], [13] and artificial intelligence models in multimodal learning [14], [15], applying advanced neural networks to scene reconstruction has also become a significant trend. PointGroup [16] and 3D-MPA [17] applied U-Net and graph convolutional networks to 3D scenes, respectively, achieving segmentation of 3D point clouds. Reference [18] transferred pre-trained ViTs to the RGB-D domain for 3D object recognition, cross-modally fusing the RGB and depth representations co-encoded by the ViT. TR3D [19] used fusion modules to transform traditional 3D object detection methods into multimodal detection methods, demonstrating impressive performance improvements. An overview of static 3D reconstruction algorithms is shown in Table 1.

TABLE 1. Overview of RGB-D-Based Static 3D Reconstruction Methods: This Table Discusses the Details of Different Algorithms From the Aspects of Camera Tracking, Model Fusion, and Loop Closure, All of Which Are Key Technologies for Static 3D Reconstruction Methods Based on RGB-D Cameras.

In practical situations, it is inevitable to encounter dynamic objects in a scene, such as people walking or pets playing. Therefore, the assumption of a completely static environment can be easily broken. In this case, not only is the camera moving, but also the dynamic objects in the scene are moving, which makes it difficult to track the camera trajectory and leads to reconstruction failure. Therefore, we must deal with these dynamic objects. Before processing dynamic objects, it is necessary to first identify them. Since dynamic objects in a scene have different motion tendencies than static backgrounds, we can distinguish them by analyzing such motion characteristics [29], [30], [31], [33]. Another method to identify dynamic objects is based on deep learning [32], [34], [35], using prior knowledge and semantic information to directly segment dynamic objects. For camera pose estimation, a straightforward method is to treat the data of dynamic objects as outliers and remove them to eliminate their influence on camera pose [36], [37], [39], [40]. However, direct removal of dynamic objects may result in information loss and affect the quality of scene reconstruction. In contrast, using the features of dynamic objects for pose estimation is more meaningful and beneficial [38], [41]. Additionally, the model fusion strategy for dynamic scene reconstruction is also improved accordingly based on the static fusion strategy [29], [42], [43]. An overview of the dynamic 3D reconstruction algorithm is shown in Table 2.

TABLE 2. Overview of RGB-D-Based Dynamic 3D Reconstruction Methods: This Table Discusses the Details of Different Algorithms From the Aspects of Segmentation of Dynamic Objects, Camera Tracking, and Model Fusion, All of Which Are Key Technologies for Dynamic 3D Reconstruction Methods Based on RGB-D Cameras.

Different from traditional reconstruction methods, [44] proposed the implicit Neural Radiance Field (NeRF) representation for three-dimensional scenes. It utilizes a Multilayer Perceptron (MLP) to learn the 3D information of the scene and can synthesize images from new viewpoints through volume rendering. Compared to complex traditional reconstruction pipelines, NeRF’s reconstruction process is simpler and provides a more continuous representation of the scene. This implicit representation offers a new direction for indoor 3D reconstruction based on RGB-D cameras and further improves the quality of scene reconstruction. In static reconstruction, iMAP [23] first demonstrated that an MLP can be the sole scene representation in real-time SLAM systems with handheld RGB-D cameras. NICE-SLAM [26] combined hierarchical scene representation and neural implicit representation to achieve real-time, efficient, and detailed RGB-D surface reconstruction in large-scale scenes. Reference [45] effectively utilized RGB-D data by combining implicit functions (the truncated signed distance function, TSDF) and volumetric radiance fields, improving the accuracy and completeness of geometric reconstruction. However, due to the use of multilayer perceptrons and complex optimization algorithms, this method has a long computation time and is not suitable for real-time applications. GO-Surf [46] built on [45] by directly optimizing multiresolution feature grids and the signed distance function (SDF) to achieve fast and accurate surface reconstruction. Recently, [47] proposed a 3D Gaussian-based scene representation. It retains the desirable properties of continuous radiance fields while avoiding unnecessary computation in empty space, greatly improving rendering speed while maintaining rendering quality. In dynamic reconstruction, D-NeRF [48] extended the application domain of NeRF from static to dynamic scenes by introducing the time dimension and learning canonical representations of dynamic scenes. Although this method has high computational complexity, it excels in handling non-rigid motion and generating highly detailed images, showcasing the potential of neural radiance fields in dynamic scene applications. Recursive-NeRF [49] introduced uncertainty prediction, recursively passing query points to different levels of neural networks based on complexity to achieve adaptive representation at the level of detail, balancing efficiency and quality. Although this method improves computational efficiency, it requires storing multiple levels of neural networks, resulting in high memory consumption, especially when handling large-scale scenes. Reference [50] proposed a method called NDR (Neural Dynamic Reconstruction) for recovering high-fidelity geometry and motion of dynamic scenes from a monocular RGB-D camera. Although the method is effective, it has high computational complexity, long training times, and a large demand for computational resources.

SECTION III.

Reconstruction of Static Scenes

As shown in Figure 2, the process of static 3D reconstruction mainly includes depth image enhancement, camera tracking, model fusion, and surface extraction. In this chapter, we will use the basic process of static 3D reconstruction as a framework to introduce the improvements made by different reconstruction algorithms in each step of the reconstruction process.

FIGURE 2. Overview of the static indoor reconstruction pipeline. The first step is to input RGB-D images and enhance the depth images. The second step is to use RGB-D data for camera tracking, which estimates the camera pose. If camera tracking fails, the camera relocalization function is launched to recover from the failure. The third step involves incorporating surface information of the scene into the model using the tracked poses. The fourth step is to extract smooth and dense surfaces using surface extraction algorithms. Finally, camera pose is globally optimized through loop detection and processing.

A. Depth Image Enhancement

Currently, RGB-D cameras are mainly divided into two types: structured light and time-of-flight (TOF). As shown in Figure 3, both types of RGB-D cameras can easily acquire color images and depth information. However, the depth images obtained are often marred by “holes” caused by factors such as the material and structure of the measured object, as well as rapid camera movement, resulting in data loss. This phenomenon is more common in consumer-level depth cameras. The goal of depth image enhancement is to denoise, refine, and enhance the initially measured depth.

FIGURE 3. The Principles of Depth Cameras Based on Structured Light (a) and Time-of-Flight (b). Structured light-based approaches involve projecting patterned light onto an object to generate distinct phase information, which is then translated into depth data by a computational unit. Time-of-flight methods determine the distance from the camera to the target object by emitting light pulses and measuring the duration before their reflection is detected.

Traditional methods such as median filtering, Gaussian smoothing, and bilateral filtering have been used to improve depth image quality. Reference [51] begins with median filtering in the 2D image space for noise reduction, followed by a two-step algorithm employing Gaussian smoothing on the 3D surface to enhance depth videos. Additionally, KinectFusion [6] utilizes bilateral filtering [52] to remove noise from the original depth images, enhancing image quality. Since the RGB image obtained by the camera is often clear, [53], [54], [55] improve the accuracy of the depth image or fill in its missing parts by registering the RGB image to the depth image. To achieve smooth object surfaces, [56] reconstructs locally smooth scene segments and deforms them for alignment, effectively addressing high-frequency noise and low-frequency distortion in depth images. With the advancement of super-resolution techniques, this technology has also been applied to enhance the resolution of depth images [57], [58], [59], thereby improving the precision of reconstructions.
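
To make the traditional filtering step concrete, the following minimal Python sketch (assuming OpenCV, a 16-bit depth map in millimetres, and purely illustrative parameter values) applies bilateral filtering to a raw depth image while leaving holes untouched rather than inventing depth for them:

import cv2
import numpy as np

def enhance_depth(depth_mm):
    # depth_mm: raw depth image, uint16, in millimetres; zeros mark holes.
    depth_m = depth_mm.astype(np.float32) / 1000.0     # convert to metres
    valid = depth_m > 0

    # d: neighbourhood diameter; sigmaColor: depth difference (metres) still treated
    # as "similar"; sigmaSpace: spatial extent of the kernel in pixels.
    smoothed = cv2.bilateralFilter(depth_m, d=5, sigmaColor=0.03, sigmaSpace=4.5)

    smoothed[~valid] = 0.0                             # do not invent depth inside holes
    return smoothed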

In recent years, depth-enhancement algorithms based on deep learning have made great progress. These methods leverage the powerful learning capabilities of neural networks. For example, a deep network [60] was developed to predict object surface normals and occlusion boundaries from RGB images; the predictions were then merged with the depth maps captured by the depth camera to complete the missing parts of the original depth maps. Moreover, [61] proposed a cascaded CNN structure (DDRNet) to enhance low- and high-frequency information in depth data. Supervised deep learning approaches often necessitate ground-truth data from actual scenes, a requirement that poses significant acquisition challenges. Training networks with synthetic data presents a potential solution; however, the domain gap between synthetic and real data may impair performance. Reference [62] introduced three methods for unsupervised domain adaptation of a depth denoising network, transitioning from synthetic to real-world data. Addressing the challenge of acquiring real datasets, researchers have turned to unsupervised [63], [64] and self-supervised [65], [66], [67] learning techniques to directly denoise depth maps in the absence of ground truth.

B. Camera Tracking

In a static scene, the scene can be scanned by moving the camera to obtain RGB-D data. For each frame of image captured by the camera, the camera trajectory and poses need to be tracked to fuse the RGB-D data into the model. Nevertheless, the accuracy of this process can be compromised by factors including the precision of algorithms, occlusions, and the velocity of camera motion, necessitating subsequent optimization of the pose estimation.

1) Pose Estimation

a: ICP-Based

To accurately estimate the camera pose between different frames, [68] introduced the Iterative Closest Point (ICP) algorithm. ICP is a classical point cloud registration technique that iteratively aligns two or more point clouds to minimize the error between them. Assume there are two sets of points: the target point set $\mathbf{Q} = \{q_{1}, q_{2}, \dots, q_{n}\}$ and the source point set $\mathbf{P} = \{p_{1}, p_{2}, \dots, p_{m}\}$. The goal of the ICP algorithm is to find a rotation matrix $R$ and a translation vector $t$ that minimize the following mean squared error:
\begin{equation*} E(R, t) = \sum_{i=1}^{m} \| Rp_{i} + t - q_{\text{match}(i)} \|^{2} \tag{1}\end{equation*}
Here, $q_{\text{match}(i)}$ denotes the nearest neighbor of $p_{i}$, that is, the point in $\mathbf{Q}$ closest to $p_{i}$. Because the point-to-point ICP algorithm depends heavily on its initial values, a suboptimal starting point may increase the number of iterations or produce inaccurate results. Therefore, [69] introduced a point-to-plane ICP algorithm (Figure 4), which improves camera positioning by minimizing the sum of squared distances between each source point and the tangent plane of its corresponding target point, thereby accelerating convergence:
\begin{equation*} E(R, t) = \sum_{i=1}^{m} \left( n_{\text{match}(i)}^{T} \left(Rp_{i} + t - q_{\text{match}(i)}\right) \right)^{2} \tag{2}\end{equation*}
where $n_{\text{match}(i)}$ is the normal vector at $q_{\text{match}(i)}$. Building upon these advancements, [70] integrated point-to-point ICP and point-to-plane ICP into a single probabilistic framework, forming a new algorithm called GICP that is more robust against incorrect matches. Because frame-to-frame matching accumulates error, the camera trajectory can drift, severely affecting the accuracy of pose estimation. KinectFusion [6] used a frame-to-model matching method, greatly reducing cumulative error. Subsequently, [7], [9], [71], [72], [73], [74], and [75] added dense photometric validation on top of ICP geometric registration to further optimize the matching. To account for local features of the point clouds (normal vectors, curvature), [76] defined an error function that includes not only the projected point-to-plane distance but also the direction error between corresponding normal vectors, making pose estimation more robust. In addition, there are other ICP variants, such as efficient ICP [77] and non-rigid ICP [78].
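
To make the point-to-plane objective of Eq. (2) concrete, the sketch below (an illustrative simplification, not the implementation of any cited system) performs one linearized least-squares update given already-matched source points P, target points Q, and target normals N as NumPy arrays of shape (m, 3); the standard small-angle approximation R p ≈ p + ω × p is assumed:

import numpy as np

def point_to_plane_icp_step(P, Q, N):
    # One linearized solve of Eq. (2): the residual n_i . (R p_i + t - q_i), with the
    # small-angle approximation R p ~ p + w x p, becomes linear in the unknowns (w, t).
    A = np.hstack([np.cross(P, N), N])          # (m, 6) rows: [p_i x n_i, n_i]
    b = np.einsum('ij,ij->i', N, Q - P)         # (m,)  entries: n_i . (q_i - p_i)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)   # least-squares solve for [w, t]
    return x[:3], x[3:]                         # rotation vector w, translation t

In a full ICP loop, the correspondences are re-established after each such update and the step is repeated until convergence.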

FIGURE 4. The point-to-plane ICP algorithm. It optimizes the camera’s pose by minimizing the distance between the source point and the tangent plane of the corresponding destination point.

b: Feature-Based

The ICP algorithm performs well under the assumption of small motion between frames. However, it tends to converge to local optima during rapid camera movements, where the difference between successive frames is significant. To cope with this situation, [79], [80], [81], [82], [83], [84] extracted feature points (SIFT, SURF, ORB) from color images and used these sparse features to quickly match the pose of each frame. ORB-SLAM [82] is a classic feature-based visual SLAM system. It combines FAST feature detection and BRIEF feature description, ensuring both the speed of feature point detection and the stability of the description. However, it is limited to feature matching between image frames and is unstable under varying lighting conditions and when feature points are missing. ORB-SLAM2 [83] improves pose estimation by supporting stereo and RGB-D sensors, utilizing depth information for further optimization. Building on this, ORB-SLAM3 [84] further improves the pose estimation algorithm, supports the construction and management of multiple maps, and enhances pose estimation accuracy through the collaborative use of multiple sensors. In addition to using point features for matching, useful edge features [85], [86] can also be extracted from depth images to establish corresponding constraints, thereby enhancing the robustness of pose estimation. To ensure both the real-time performance and accuracy of pose tracking, [87] introduced an information-theoretic approach for point selection in direct RGB-D odometry. This approach simplifies the optimization process while maintaining accuracy by identifying and utilizing the data points that carry the most information. CPA-SLAM [89] models the environment using a global model composed of planes, which significantly reduces drift. Reference [88] extracts 3D facial landmarks during face reconstruction for model fine-tuning to ensure the accuracy of head pose estimation. To achieve high-precision feature tracking under rapid sensor motion, [24] performed feature tracking within an extended Kalman filter framework. This framework integrates IMU data to better estimate sensor motion.

c: Hybrid Method

Pose matching algorithms based on ICP usually require aligning the entire point cloud data, resulting in high computational costs, which are not suitable for real-time 3D modeling of large scenes. In contrast, feature-based matching algorithms can better cope with the limitations of real-time requirements. However, feature-based matching algorithms often require dense features, and the reconstruction quality is significantly affected when the number of matching features in the scene decreases [90]. Combining sparse feature matching with ICP is a good method to balance real-time performance and reconstruction quality. BundleFusion [11] first utilizes sparse SIFT features for coarse pose alignment, and then refines the estimated pose using dense photometric and geometric terms similar to the ICP algorithm, achieving real-time accurate pose estimation and solving the real-time issue of high-quality reconstruction. References [91] and [92] combined edge information with the ICP algorithm to enhance robustness and accuracy. Recently, [93] introduced an enhanced 3D scene reconstruction method using Fast Point Feature Histograms (FPFH) and Iterative Closest Point (ICP) techniques. It improves model robustness and accuracy by modifying the weight calculation formula and employing an enhanced FPFH descriptor for initial registration estimation. To further increase the ICP iteration speed, it also utilizes a Best Bin First (BBF) strategy to reduce data dimensionality.

d: NeRF-Based

The advent of NeRF offers a new paradigm for camera pose optimization. These methods leverage the power of neural networks to synthesize novel views and provide accurate pose estimates, significantly enhancing the robustness and accuracy of camera tracking in static scenes. iNeRF [94] estimates the camera pose by inverting NeRF. Specifically, NeRF optimizes the scene parameters $\Theta$ using a given set of camera poses $T$ and observed images $I$, while iNeRF inversely solves the problem of recovering the camera pose $T$ given the weights $\Theta$ and an image $I$ as inputs:
\begin{equation*} \hat{T} = \mathop{\mathrm{arg\,min}}_{T \in SE(3)} \mathcal{L}(T \mid I, \Theta) \tag{3}\end{equation*}
To solve this optimization problem, iNeRF takes estimated camera poses $T \in SE(3)$ in the coordinate system of the NeRF model and renders the corresponding image observations. To update the pose $T$, the same photometric loss function $\mathcal{L}$ used in NeRF is employed:
\begin{equation*} \mathcal{L} = \sum_{r \in R} \| \hat{C}(r) - C(r) \|^{2}_{2} \tag{4}\end{equation*}
Here, $r \in R$ denotes a set of sampled rays, and $C(r)$ is the observed RGB value of the pixel corresponding to ray $r$ in an image. Although iNeRF successfully applied NeRF to pose estimation and achieved excellent results, it still requires an initial pose as a starting point, which affects the convergence of the optimization and the final accuracy. With the development of deep learning, and considering the exceptional performance of Generative Adversarial Networks (GANs) in image generation, [95] combines GANs with NeRF to optimize initial pose estimates during reconstruction. It does not rely on known camera poses and can optimize from a completely random initialization, which is particularly useful in uncertain and complex scenes. Additionally, [96] employs a coarse-to-fine camera registration strategy and demonstrates the impact of positional encoding on alignment, effectively optimizing the neural scene representation while addressing camera pose misalignment in large-scale scenes. To further address errors arising from drastic camera movements, [97] introduced an undistorted monocular depth prior into NeRF and proposed novel loss functions to constrain the relative poses between adjacent frames.
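
The core of this inversion is an ordinary gradient-descent loop on the photometric loss of Eq. (4) with the NeRF weights frozen. The following PyTorch sketch illustrates the idea under the assumption that render_fn is some differentiable renderer mapping a pose parameterization to the colors of a fixed set of sampled rays; it is a schematic illustration, not iNeRF's actual code:

import torch

def invert_nerf_for_pose(render_fn, observed, pose0, steps=200, lr=1e-2):
    # Recover a camera pose by inverting a frozen NeRF, in the spirit of Eq. (3).
    # render_fn(pose) is assumed to return the predicted colors \hat{C}(r) of the
    # sampled rays under the given pose; `observed` holds the measured colors C(r).
    pose = pose0.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        rendered = render_fn(pose)                  # \hat{C}(r) under the current pose
        loss = ((rendered - observed) ** 2).sum()   # photometric loss of Eq. (4)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return pose.detach()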

Bundle Adjustment is a technique used to optimize camera parameters and 3D point coordinates in 3D reconstruction. Its main purpose is to improve the accuracy of 3D reconstruction by minimizing the reprojection error. Specifically, the optimization problem can be written as:
\begin{equation*} \min_{P, X} \sum_{i,j} \left\| x_{ij} - \pi(P_{i}, X_{j}) \right\|^{2} \tag{5}\end{equation*}
where $P_{i}$ represents the parameters of the $i$-th camera, $X_{j}$ represents the coordinates of the $j$-th 3D point, $x_{ij}$ is the observed 2D coordinate of the $j$-th 3D point in the $i$-th camera, and $\pi(P_{i}, X_{j})$ is the projection of the 3D point $X_{j}$ onto the 2D image plane through the camera parameters $P_{i}$.
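
As a concrete reading of Eq. (5), the following NumPy sketch (with an assumed shared pinhole intrinsic matrix K and illustrative variable names) stacks the reprojection residuals that a bundle adjuster would minimize over all camera poses and 3D points, e.g. with a nonlinear least-squares solver:

import numpy as np

def reprojection_residuals(poses, points, observations, K):
    # poses        : list of (R, t) world-to-camera transforms, one per camera i
    # points       : (M, 3) array of 3D points X_j
    # observations : list of (i, j, u, v) tuples, the measured pixel of point j in camera i
    # K            : (3, 3) pinhole intrinsic matrix, assumed shared by all cameras
    residuals = []
    for i, j, u, v in observations:
        R, t = poses[i]
        Xc = R @ points[j] + t                 # point X_j in camera-i coordinates
        uvw = K @ Xc
        proj = uvw[:2] / uvw[2]                # perspective division -> pixel coordinates
        residuals.append([u - proj[0], v - proj[1]])
    return np.asarray(residuals).ravel()       # stacked residuals x_ij - pi(P_i, X_j)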

Inspired by the Bundle Adjustment (BA) algorithm, the NBA (Neural Bundle Adjustment) method proposed by [98] optimizes the implicit surface and camera poses without relying on known camera extrinsics. Specifically, NBA updates each 3D point $X$ at every step as follows:
\begin{equation*} X \leftarrow X - \phi(X) \nabla \phi(X) \tag{6}\end{equation*}
where $\phi(X)$ is the distance value output by the SDF (Signed Distance Field) network in the neural radiance field, and $\nabla\phi(X)$ is the gradient at that point. After updating the 3D point $X$, the reprojection error is calculated based on the feature trajectory $T$, jointly optimizing the SDF network $\phi$, the estimated camera poses $P$, and the updated 3D point set $X$.

2) Loop Closure

During the pose matching of each frame, both ICP-based and feature-based algorithms produce errors, and these errors accumulate with the number of frames. After the camera completes a full loop around the scene, the accumulated error can cause a misalignment between the starting and ending points. Therefore, loop closure must be handled after pose matching to ensure global consistency and reduce the impact of accumulated errors on reconstruction quality. Reference [99] defined keyframes, registered the frames between them to eliminate local errors, and used an entropy ratio criterion to check loop closure. Reference [71] utilized efficient pose graph optimization and sparse bundle adjustment for globally consistent alignment. However, this global optimization distributes the residual error over the entire path, which can destroy details of the object surface. To further optimize pose estimation, [9] divided all frames into equally sized blocks with one overlapping frame between adjacent blocks. Each small block is reconstructed first, then the overlapping frames are used to register the blocks and detect loop closures, and finally erroneous loops are removed to achieve globally consistent reconstruction. Subsequently, BundleFusion [11] performed sparse-to-dense global pose optimization and solved loop closure by integrating and re-integrating previous RGB-D frames during movement, enabling it to correct all drift. Although this method produces better pose optimization, it requires a large amount of computational resources. To enhance pose estimation accuracy in large-scale scenes, [100] introduced a two-pass loop closure detection method that integrates global and local image features to identify loop closure candidates. Recently, [24] utilized subgraph-based depth image encoding and 3D graph deformation for loop closure to maintain global consistency in the reconstructed model. Reference [101] introduced local 3D deep descriptors (L3Ds) for loop closure handling. L3Ds are compact representations of patches extracted from point clouds, learned using deep learning algorithms, significantly enhancing loop closure detection accuracy.

Another way to address cumulative errors is to assume a scene structure in the world frame and directly align each tracked frame with this structure, rather than with keyframes or the last frame. One of the most common assumptions is the Manhattan assumption [102], [103], which represents the scene using a set of orthogonal planes aligned with the world’s three main axes, simplifying scene understanding and enabling efficient inference of scene geometry and object position. Structure-SLAM [104] employed a convolutional neural network (CNN) to predict normals and compute drift-free rotations leveraging geometric features under the Manhattan assumption, effectively addressing low-texture regions in indoor settings. Building upon Structure-SLAM, [105] incorporated planar features within the Manhattan framework and introduced an advanced meshing module for reconstructing scene structures, thereby enhancing localization and mapping accuracy. To make the Manhattan assumption more suitable for real-world scenes, ManhattanSLAM [22] directly detected Manhattan frames (MFs) from planes and modeled the scene as a Mixture of Manhattan Frames (MMF), estimating unbiased rotation by observing MFs across frames.

3) Relocalization

Due to factors such as high camera movement speed or changes in viewpoint, camera tracking may fail. Therefore, the ability to quickly recover and perform relocalization when camera tracking fails is essential in the 3D reconstruction process. There are several methods for camera relocalization, including the following:

a: Keyframe-Based

This method requires defining and storing keyframes. When camera tracking fails, the system queries the current image and estimates the camera pose by measuring its overall similarity to a known set of keyframes. Reference [106] explored an effective keyframe-based relocalization method. In the keyframe selection stage, besides a threshold based on spatial distance, a similarity check against previous keyframes was added to avoid collecting redundant information. In order to quickly retrieve candidate poses in case of tracking loss, this method uses an efficient frame encoding based on ferns. Keyframe-based methods can perform camera relocalization in real time, but they rely on matching the input image against a keyframe database and cannot relocalize from poses that differ significantly from the stored keyframes.

b: Keypoint-Based

This relocalization method mainly exploits the sparsity of feature points. During successful tracking, feature points are detected in the image, and their corresponding descriptors and positions in the world coordinate system are stored in a database. When camera tracking is lost, the current frame’s keypoints and descriptors are computed and matched against the database. After a successful match, the current image’s pose can be obtained to complete camera relocalization [107], [108], [109], [110], [111], [112], [113]. The challenges of this method include: (1) the choice of feature point and descriptor computation method, (2) how to store keypoints and their corresponding descriptors, and (3) how to perform feature matching between frames. Inspired by the idea of the visual bag-of-words, [108] stored the extracted SIFT feature descriptors in a vocabulary during successful tracking and used the Term Frequency-Inverse Document Frequency (TF-IDF) of the visual words in each node to rank the nodes. When tracking is lost, refined relocalization poses are obtained by matching the descriptor set in each node against the descriptors extracted from the query image to recover from tracking failure. Reference [110] proposed using a regression forest to directly predict the 3D correspondence of every pixel in the current image to the scene. Compared with traditional keypoint-based methods, this method does not require explicit detection, description, or matching of keypoints, making it simpler and faster. However, the regression forest must be trained offline on the scene of interest in advance, so the method cannot be deployed on the fly in a new scene. Reference [112] overcame this limitation by dynamically adapting pre-trained forests to new scenes.
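
The following OpenCV sketch illustrates the keypoint-based relocalization pipeline just described in simplified form: detect and describe features in the query frame, match them against descriptors stored during successful tracking, and recover the pose with PnP and RANSAC. The database layout, feature choice (ORB), and thresholds are assumptions for illustration rather than the design of any specific cited method:

import cv2
import numpy as np

def relocalize(frame_gray, db_descriptors, db_points3d, K):
    # db_descriptors : ORB descriptors stored while tracking was successful
    # db_points3d    : (N, 3) world coordinates associated with those descriptors
    # Returns (rvec, tvec) of the query frame, or None if matching/PnP fails.
    orb = cv2.ORB_create(2000)
    keypoints, descriptors = orb.detectAndCompute(frame_gray, None)
    if descriptors is None:
        return None

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(descriptors, db_descriptors)       # query -> database
    if len(matches) < 6:
        return None

    obj_pts = np.float32([db_points3d[m.trainIdx] for m in matches])
    img_pts = np.float32([keypoints[m.queryIdx].pt for m in matches])
    ok, rvec, tvec, _ = cv2.solvePnPRansac(obj_pts, img_pts, K, None)
    return (rvec, tvec) if ok else None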

c: Hybrid Method

Researchers have integrated keyframes and keypoints to enhance relocation accuracy while maintaining real-time performance. Upon tracking failure, [82] adopted the DBoW2 algorithm [114] to identify matching candidate keyframes, subsequently calculating ORB features within these keyframes and employing the PnP algorithm [115] to alternately estimate the current frame’s pose. Reference [86] merged edge features with the keyframe-based method [106], securing robust loop closure and relocalization capabilities.

C. Model Fusion

The pose matching algorithm calculates an initial pose, which is further refined by loop closure processing. After that, the surfaces of the scene need to be fused into the 3D model according to the camera’s pose. Currently, two types of surface fusion models are mainly used: voxel-based and surfel-based.

1) Voxel-Based

As shown in Figure 5(a), an image can be represented by square pixels in 2D space; extending a pixel to 3D gives a voxel, which can intuitively reflect the shape of an object. Reference [116] was the first to propose using the TSDF (Truncated Signed Distance Function) grid model to fuse depth information on the basis of the voxel representation. KinectFusion [6] further applied this model to 3D reconstruction using RGB-D cameras. This method requires fixing the size of the scene before reconstruction, making it difficult to scale the scene. For large-scale scene reconstruction, which requires substantial memory, KinectFusion falls short. Therefore, various scholars have extended the original TSDF voxel model:

FIGURE 5. (a) is a schematic diagram of 2D pixels and 3D voxels, and (b) is a schematic diagram of the octree structure.
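
Before turning to these extensions, the following sketch recalls the basic per-voxel operation that KinectFusion-style systems perform for every incoming depth frame: project each voxel center into the depth image, compute a truncated signed distance along the viewing ray, and blend it into the stored value with a weighted running average. It is a simplified illustration with assumed variable names, not the GPU implementation of any cited system:

import numpy as np

def integrate_frame(tsdf, weight, vox_centers, depth, K, T_cw, trunc=0.04, w_new=1.0):
    # tsdf, weight : flat arrays with one entry per voxel
    # vox_centers  : (N, 3) voxel centers in world coordinates
    # depth        : (H, W) depth image in meters; K, T_cw: intrinsics and
    #                world-to-camera transform of the current frame
    H, W = depth.shape
    pts_c = (T_cw[:3, :3] @ vox_centers.T + T_cw[:3, 3:4]).T   # voxels in camera frame
    z = pts_c[:, 2]
    z_safe = np.where(z > 1e-6, z, 1.0)                        # avoid division by zero
    uvw = (K @ pts_c.T).T
    u = np.round(uvw[:, 0] / z_safe).astype(int)
    v = np.round(uvw[:, 1] / z_safe).astype(int)

    valid = (z > 1e-6) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    d = np.where(valid, depth[np.clip(v, 0, H - 1), np.clip(u, 0, W - 1)], 0.0)
    valid &= d > 0

    sdf = d - z                                   # signed distance along the viewing ray
    valid &= sdf > -trunc                         # ignore voxels far behind the surface
    tsdf_obs = np.clip(sdf / trunc, -1.0, 1.0)    # truncate to [-1, 1]

    # Weighted running average (the standard KinectFusion-style update).
    tsdf[valid] = (tsdf[valid] * weight[valid] + tsdf_obs[valid] * w_new) / (weight[valid] + w_new)
    weight[valid] += w_new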

a: Moving Volume

To overcome the limitations of the voxel representation, [7], [72], [117] expanded the reconstruction area to unbounded space by moving the voxel volume. Whelan et al. [7] utilized a cyclic buffer data structure to efficiently recycle GPU memory, addressing the issue of insufficient memory for large-scale scene reconstruction with voxel models. The algorithm enables camera translation and rotation in the real world, incrementally enlarging the reconstructed surface. Reference [117] proposed the Moving Volume KinectFusion method, which establishes a TSDF buffer and a swap buffer. Utilizing a double-buffering mechanism to map between volumetric models during camera movement, the method allows online processing of volume rotations and translations through voxel interpolation.

b: Octree-Based

The geometry of most objects is very sparse with respect to the whole scene volume, which means that the voxels in the TSDF model are mostly empty and their storage space is wasted. The octree structure is a data model first proposed by Hunter [118] in 1978. As shown in Figure 5(b), this structure can effectively utilize memory by hierarchically dividing the scene space, thereby improving storage efficiency. Although its definition is simple, the sparsity of its nodes makes it difficult to maintain GPU parallelism. References [20] and [119] designed novel octree data structures to improve the reconstruction update and surface prediction parts of KinectFusion, fully utilizing the parallelism of the GPU, greatly improving storage efficiency, and further expanding the reconstruction scale. To reduce memory consumption, [99] fused the acquired depth and color information into a multiscale octree representation of a signed distance function, which can maintain low memory usage while achieving high accuracy. To further improve storage efficiency, [120] defined an octree data structure that supports volumetric multiresolution 3D mapping and mesh partitioning, reducing memory consumption by only allocating units close to the surface.
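
The toy sketch below (not the data structure of [20] or [119]) illustrates the key idea behind the memory savings: children are allocated lazily, so storage is spent only on the octants that actually receive measurements:

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class OctreeNode:
    center: tuple     # (x, y, z) center of this cube
    half: float       # half of the cube's edge length
    depth: int
    value: float = 0.0                                   # payload at leaves (e.g. a TSDF sample)
    children: Optional[List["OctreeNode"]] = None        # None until subdivided

    def child_index(self, p):
        # Which of the eight octants does point p fall into?
        return (p[0] > self.center[0]) | ((p[1] > self.center[1]) << 1) | ((p[2] > self.center[2]) << 2)

    def insert(self, p, value, max_depth=8):
        if self.depth == max_depth:        # leaf reached: store the sample
            self.value = value
            return
        if self.children is None:          # subdivide lazily, only where data arrives
            h = self.half / 2.0
            self.children = [OctreeNode((self.center[0] + (h if i & 1 else -h),
                                         self.center[1] + (h if i & 2 else -h),
                                         self.center[2] + (h if i & 4 else -h)),
                                        h, self.depth + 1) for i in range(8)]
        self.children[self.child_index(p)].insert(p, value, max_depth)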

c: Voxel Hashing

Although the octree structure can improve the storage efficiency of the model to some extent, complex octree structures still have additional computational complexity and pointer overhead. A simple spatial hashing scheme is used in [8] to compress space, which allows data to flow efficiently in and out of a hash table, enabling real-time access and updates of surface data in the scene without the need for complex hierarchical data structures. Voxel hashing has been widely used in real-time 3D reconstruction [21], [121], [122].
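
The following sketch illustrates the idea in Python: the integer coordinates of a voxel block are hashed to a bucket, and only blocks that actually contain surface data are ever allocated. The prime constants and the 8x8x8 block size follow common spatial-hashing practice and are given for illustration only:

# Primes commonly used for spatial hashing; the hash has the same form as in [8].
P1, P2, P3 = 73856093, 19349669, 83492791

def block_hash(x: int, y: int, z: int, num_buckets: int) -> int:
    # Map the integer coordinates of a voxel block to a hash-table bucket.
    return ((x * P1) ^ (y * P2) ^ (z * P3)) % num_buckets

voxel_blocks = {}   # bucket -> list of (block coords, dense 8x8x8 TSDF block)

def get_or_allocate(coords, num_buckets=2**20):
    # Only blocks that actually contain surface data are ever allocated.
    bucket = block_hash(*coords, num_buckets)
    for stored_coords, block in voxel_blocks.setdefault(bucket, []):
        if stored_coords == coords:        # collisions are resolved by chaining
            return block
    block = [0.0] * (8 * 8 * 8)            # a fresh voxel block near the surface
    voxel_blocks[bucket].append((coords, block))
    return block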

d: Deep Learning-Based

The ability of neural networks to learn rich prior knowledge provides new directions for the development of scene representation. When exploring cluttered indoor scenes with an RGB-D camera, [123] initialized the truncated signed distance function (TSDF) reconstruction of each object with compact instance segmentation using Mask R-CNN, resulting in a resolution related to object size and novel 3D foreground masks. Reference [124] reconstructed scenes in real time with both geometry and semantic information by incorporating semantic predictions from neural networks into the voxel-based model built on voxel hashing.

2) Surfel-Based

Voxel-based methods are expensive for handling loop closures in real-time 3D reconstruction because precise compensation may involve changing the entire volume. Moreover, the size of the voxel volume is typically fixed in practice, which limits the adaptivity of representation. If an object is relatively small or thin compared to the voxel size, it can seriously affect the reconstruction quality.

In surfel-based methods, the scene surface is represented by a set of surfels (Figure 6). This representation has the following advantages: (1) Flexibility. When performing point fusion updates, the data is updated using weighted fusion, where the radius of the surface patch is related to the distance between the camera center and the scene surface: the farther the distance, the larger the radius of the surface patch. This updating scheme can effectively reconstruct the entire surface. (2) High adaptability. Densely distributed points can be measured at high resolution. (3) It can easily handle thin objects.

FIGURE 6. Representation of object surface by Surfels. The position information, radius of the surface patch, normal vector, color information, and time information of each point are stored.
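
The sketch below lists the attributes typically stored per surfel (cf. Figure 6) and a weighted running-average fusion step applied when a new measurement is associated with an existing surfel; the field names and update rules are illustrative, and the exact rules differ between systems such as [10] and [125]:

from dataclasses import dataclass
import numpy as np

@dataclass
class Surfel:
    position: np.ndarray    # 3D position
    normal: np.ndarray      # unit normal
    color: np.ndarray       # RGB
    radius: float           # patch radius (grows with camera-to-surface distance)
    weight: float           # confidence accumulated over observations
    timestamp: int          # time of the last update

def fuse(s, p, n, c, r, w, t):
    # Weighted running-average update with a new measurement (p, n, c, r, w) at time t.
    total = s.weight + w
    s.position = (s.weight * s.position + w * p) / total
    s.normal = s.weight * s.normal + w * n
    s.normal = s.normal / np.linalg.norm(s.normal)
    s.color = (s.weight * s.color + w * c) / total
    s.radius = min(s.radius, r)    # keep the finer (smaller) patch radius
    s.weight = total
    s.timestamp = t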

Reference [125] first introduced the concept of surfels and provided a detailed description of this surface representation. ElasticFusion [10] utilized this representation for real-time dense scene reconstruction. The system continuously optimizes the reconstructed map to improve the accuracy of reconstruction and pose estimation, and employs Random Ferns to detect loop closures for global consistency.

With the development of deep learning, and in order to fully utilize the semantic information of the scene, SemanticFusion [127] combined a CNN with ElasticFusion to successfully fuse semantic predictions from multiple viewpoints into a surfel-based representation. Reference [126] proposed an indoor RGB-D image semantic segmentation network with multi-scale feature fusion based on ElasticFusion. It integrates the visual color features and depth geometric features of RGB-D images, improving the accuracy of image semantic segmentation. The segmentation results are shown in Figure 7. DeepSurfels [128] integrated feature information learned from RGB images into a detailed surfel-based representation, making it possible to reconstruct large-scale scenes in real time. To obtain high-quality surface texture, [129] employed Shape-from-Shading (SfS) and spatially-varying spherical harmonics (SVSH) techniques to simultaneously optimize geometry, texture, and camera poses. The main drawback of the surfel representation is its discreteness, which can be addressed by meshing approaches. Reference [130] created a triangle mesh and performed real-time mesh reconstruction from RGB-D video, which works well for reconstructing thin objects. However, this method requires camera poses as additional input. In contrast, [25] utilized Hermite Radial Basis Function (HRBF) implicits for direct camera tracking and RGB-D reconstruction, a dynamic surface representation that effectively reduces the influence of noise and reconstructs using surface photometric constraints.

FIGURE 7. Indoor semantic segmentation results. Figure taken from [126].

3) NeRF-Based

Explicit representations such as voxels and surfels allow real-time scene reconstruction, but they face challenges in mapping accuracy and in balancing memory consumption. Moreover, they lack novel view synthesis capabilities. In recent years, with the introduction of NeRF, implicit representations have overcome limitations associated with explicit representations, generating high-fidelity reconstructions with reduced memory usage. These implicit representations achieve this by continuously querying scene properties to generate high-quality images from novel viewpoints. iMAP [23] demonstrated for the first time that Multi-Layer Perceptrons (MLPs) can serve as the sole scene representation in real-time SLAM systems using handheld RGB-D cameras, utilizing keyframe structures and multi-processing computation flows. Reference [131] utilized the point cloud provided by COLMAP and reprojection errors to enforce depth constraints in NeRF, effectively enhancing the rendering speed and reconstruction quality of NeRF. To reduce computational costs and enhance scalability, NICE-SLAM [26] applied the hierarchical scene representation concept to NeRF. However, because NICE-SLAM's feature grid performs only local updates, it fails to achieve reasonable hole filling. Co-SLAM [27] combined coordinate and sparse parametric encodings for scene representation and employed dense global bundle adjustment using rays sampled from all keyframes. Simultaneously, [132] proposed a NeRF-based mapping approach using a hierarchical hybrid representation, leveraging implicit multiresolution hash encoding and explicit octree Signed Distance Function (SDF) priors to describe scenes at different levels of detail, achieving real-time high-fidelity dense mapping and dynamic expansion capabilities. Since NeRF does not reconstruct actual surfaces and pseudo-shadows occur when using Marching Cubes to extract voxel-based surfaces, [45] used Truncated Signed Distance Functions (TSDF) to represent surfaces, extending them to commodity RGB-D sensors to reconstruct high-quality 3D scenes. Recently, NGEL-SLAM [133] employed a sparse octree grid integrated with implicit neural maps, ensuring memory efficiency and precise environmental depiction.

D. Surface Extraction

Once the surface information of a scene has been fused into a model based on the camera pose, a surface extraction algorithm is required to obtain a visual representation of the surface. Depending on how the reconstructed scene is represented and stored, surface extraction algorithms can be classified into raycasting and marching cubes.

1) Raycasting

The surface extraction method proposed by [134] based on Raycasting primarily involves using rays emitted from the camera center and passing through pixels to project onto the object surface to find the iso-surface.

The basic process of this method is as follows (Figure 8): firstly, a ray is projected along the viewing direction from each pixel on the image plane, which passes through the surface of the object. Then, sampling is performed at a certain step size, and linear interpolation algorithms are used to find the intersection point with the surface. This essentially means checking the value of the truncated signed distance function at each voxel along the ray until the first zero-crossing is found. This algorithm is widely used for surface extraction in voxel models [6], [7], [8], [11], [117], [129].
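
A minimal sketch of this procedure is given below, assuming a callable sample_tsdf(p) that returns the interpolated TSDF value at a 3D point; the ray is marched at a fixed step until the sign of the TSDF flips, and the zero-crossing is then refined by linear interpolation:

import numpy as np

def raycast_tsdf(sample_tsdf, origin, direction, t_max=5.0, step=0.01):
    # March from the camera center `origin` along the unit ray `direction`,
    # querying the TSDF until its sign flips from positive (in front of the
    # surface) to non-positive (behind it), then refine by linear interpolation.
    t, prev_t, prev_val = 0.0, None, None
    while t < t_max:
        val = sample_tsdf(origin + t * direction)
        if prev_val is not None and prev_val > 0 >= val:
            t_hit = prev_t + step * prev_val / (prev_val - val)   # zero-crossing
            return origin + t_hit * direction
        prev_t, prev_val = t, val
        t += step
    return None   # the ray left the volume without hitting a surface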

FIGURE 8. Diagram of Raycasting. It detects and calculates intersections with objects in the scene by casting rays, thereby extracting surface information.

2) Marching Cubes

The marching cubes algorithm was initially proposed by Lorensen [135], who divided the three-dimensional volume into small cubes called voxels and defined each voxel by the scalar values at its eight corners. As shown in Figure 9, if the data value at a vertex of the cube is greater than or equal to the value of the isosurface we are constructing, the vertex is assigned a value of 1, and 0 otherwise. Under this assumption, when the surface intersects a cube, the intersection points between the isosurface and the edges of the cube are calculated using interpolation, and these intersection points are then connected in a prescribed way to represent the isosurface inside the cube. After finding the isosurface passing through this cube, the algorithm moves to the next cube and continues searching for the isosurface. This is the process of extracting a surface using the marching cubes algorithm.
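
The fragment below shows the two core operations of the algorithm just described: building the 8-bit corner configuration index that selects a triangulation pattern, and interpolating where the isosurface crosses a cube edge. The full 256-entry lookup tables are omitted, so this is only an illustrative sketch:

import numpy as np

def cube_config(corner_values, iso=0.0):
    # Build the 8-bit configuration index: each corner whose value is >= iso
    # contributes one bit. The resulting index (0-255) selects the triangle
    # pattern from the standard marching-cubes lookup tables (omitted here).
    index = 0
    for i, v in enumerate(corner_values):
        if v >= iso:
            index |= 1 << i
    return index

def edge_vertex(p1, p2, v1, v2, iso=0.0):
    # Linearly interpolate where the isosurface crosses the edge (p1, p2).
    t = (iso - v1) / (v2 - v1)
    return np.asarray(p1) + t * (np.asarray(p2) - np.asarray(p1))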

FIGURE 9. Diagram of Marching Cubes.

Due to the huge amount of storage required to reconstruct high-quality models, the octree storage scheme has been applied to reconstruction models to overcome the memory limitations of commodity computers. However, extracting the reconstructed surface from an octree representation is more complicated than extracting it from a regular voxel grid. References [136] and [137] extended the marching cubes algorithm by addressing the inconsistencies that arise when adjacent leaf nodes in the octree have different depths. Reference [138] proposed a method of marking edges with Hermite data to generate signed grid contours, and extended this method to octrees. By aligning the vertices of the dual grid with the features of the implicit function, [139] can extract isosurfaces that capture small, thin, and even sharp features of the surface without excessively refining the octree. Reference [140] introduced the concept of an edge tree to provide a method for directly extracting a watertight mesh without restricting the octree topology or modifying vertex values.

With the application of deep learning and NeRF technology in 3D reconstruction, [141] proposed a data-driven method called Neural Marching Cubes (NMC) for extracting triangular meshes from discrete implicit fields. This method addresses the shortcomings of traditional surface extraction methods in recovering geometric features such as sharp edges and smooth curves. Specifically, NMC redesigns the mesh subdivision template and introduces neural networks to learn vertex positions and mesh topology, thereby better preserving geometric features. Recently, [142] proposed another method called NeuralMeshing. This method generates meshes iteratively, making it suitable for shapes of various scales and capable of adapting to local curvature, thereby significantly improving the quality of surface extraction.

In conclusion, the integration of deep learning into static 3D reconstruction has brought significant advancements, providing robust and accurate solutions for depth image enhancement, camera tracking, model fusion, and surface extraction. These methods leverage the powerful learning capabilities of neural networks to improve the quality and efficiency of 3D reconstructions, offering new perspectives and directions for future research in this field.

SECTION IV.

Reconstruction of Dynamic Scenes

Dynamic scenes consist of both dynamic objects and static backgrounds. Figure 10 illustrates the general process of dynamic reconstruction. In this section, we discuss recent developments in dynamic 3D reconstruction, focusing on three aspects: segmentation of dynamic objects, camera tracking, and model fusion.

FIGURE 10. Overview of the dynamic indoor reconstruction pipeline. The first step is data acquisition, which is the same as in static 3D reconstruction. The second step is data preprocessing, which involves not only denoising the raw image data but also separating the dynamic objects from the scene. Then, camera tracking is performed using the static background information to align the current frame data with the previous frame or model, finding the correspondences between them and reconstructing the static background. Meanwhile, the dynamic objects are reconstructed separately. Finally, the dynamic objects and static background are merged to complete the reconstruction of the entire scene.

A. Segmentation of Dynamic Objects

In contrast to static 3D reconstruction, dynamic scenes contain freely moving objects that significantly affect camera pose estimation. Moreover, entities such as human beings and animals undergo non-rigid deformations while in motion. To handle the reconstruction of these dynamic objects, the first step is to distinguish between dynamic and static features, a process known as motion segmentation. Various approaches, including motion analysis-based methods and deep learning-based methods, are employed for motion segmentation in the scene to identify the dynamic characteristics.

1) Motion Analysis-Based Methods

Methods based on motion analysis separate dynamic objects from the static background by detecting object movement within the scene; examples include geometric methods and optical flow methods. DynamicFusion [29] utilized geometric features to separate dynamic objects and defined a canonical model specifically for reconstructing non-rigidly deforming dynamic objects; the canonical model was transformed to the live frame using voxel deformation fields. This method addresses the deformation of dynamic objects during motion, enhancing the robustness and accuracy of reconstruction. Similarly, Nerfies [143] enhanced NeRF by optimizing an additional continuous volume deformation field, which warps each observed point into a canonical 5D NeRF representation. D-NeRF [48] incorporated time as an additional input to the system and divided the learning process into two main stages: one stage encodes the scene into a canonical space, and the other maps the canonical representation into the deformed scene at a particular time. VolumeDeform [30] combined SIFT features extracted from RGB images with depth maps for motion tracking, enhancing the robustness of matching-point recognition. References [33] and [144] applied K-means clustering to perform visual clustering and assigned static weights to each clustered pixel or point. Reference [145] estimated static weights based on the distances between corresponding point and line features and filtered the data related to dynamic targets using these static weights, achieving precise localization and tracking of the targets. Reference [146] constructed a foreground model based on the mutual motion between two frames and combined it with RGB-D frame information to segment dynamic and static feature points. Reference [36] first employed a simple and efficient clustering algorithm to group spatially and appearance-related pixels of each keyframe into regions, then identified Candidate Dynamic Keypoints (CDK) in consecutive frames with large reprojection errors and recognized regions with a high CDK ratio as dynamic regions. Reference [147] observed that, regardless of camera movement, the triangle formed by any three fixed points on a static object keeps its shape, so the triangles built from the same three points in different camera coordinate systems are similar; the authors therefore determined whether a feature point is static or dynamic by comparing the similarity of the triangles formed by corresponding feature-point triplets in two keyframes (a sketch of this rigidity test follows this paragraph). Reference [148] introduced a grid-based feature extraction approach that enables fast and efficient extraction of high-quality FAST feature points and additionally combined inertial measurement units for motion prediction, achieving feature tracking and motion consistency detection.
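To illustrate the rigidity test attributed to [147], the following sketch votes on whether feature points are dynamic by checking whether the side lengths of triangles formed by point triplets are preserved between two keyframes. The back-projected 3D points, the tolerance, the number of random triplets, and the voting threshold are assumptions of this example, not values reported in the cited work.

```python
# Minimal sketch of a triangle-rigidity test for motion segmentation.
# pts_a, pts_b: corresponding Nx3 points in the camera coordinates of two
# keyframes (back-projection from depth is assumed to have been done already).
import numpy as np

def triangle_consistent(pts_a, pts_b, i, j, k, tol=0.02):
    """True if the triangle (i, j, k) keeps its side lengths across frames."""
    def sides(p):
        return np.array([np.linalg.norm(p[i] - p[j]),
                         np.linalg.norm(p[j] - p[k]),
                         np.linalg.norm(p[k] - p[i])])
    return bool(np.all(np.abs(sides(pts_a) - sides(pts_b)) < tol))

def label_dynamic(pts_a, pts_b, n_trials=200, tol=0.02, seed=0):
    """Vote per point over random triplets; low consistency => likely dynamic."""
    rng = np.random.default_rng(seed)
    n = len(pts_a)
    votes, counts = np.zeros(n), np.zeros(n)
    for _ in range(n_trials):
        i, j, k = rng.choice(n, size=3, replace=False)
        ok = triangle_consistent(pts_a, pts_b, i, j, k, tol)
        for idx in (i, j, k):
            votes[idx] += ok
            counts[idx] += 1
    ratio = votes / np.maximum(counts, 1)
    return ratio < 0.5          # threshold of 0.5 is an assumption
```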

2) Deep Learning-Based Methods

Unlike traditional methods based on motion analysis, deep learning-based methods can learn semantic information as priors from training datasets, and the way this semantic information is extracted with different image processing techniques affects how well dynamic-scene problems are handled. Currently, many methods use semantics to make motion segmentation more robust [35], [39], [43], [149], [150], [151], [152], [153], [154], [155], [156], [157], [158]. These methods employ deep neural networks for semantic segmentation and object recognition on RGB-D images to achieve dynamic object detection and tracking.

Mask R-CNN [39] is an image-based instance-level segmentation algorithm that can provide prior information on dynamic objects in a scene. As shown in Figure 11(a), it can provide bounding boxes for dynamic objects. However, within the bounding boxes, some static background areas are classified as dynamic foreground, and some dynamic foreground objects are classified as static background. In RGB-D data, the depth difference between dynamic regions and the static background can be used to refine the segmentation; in addition, compared with smooth areas inside objects, the normal and variance differences at the boundary between a dynamic region and its adjacent static background are larger. Reference [39] therefore uses connected component analysis to optimize the segmentation results. Specifically, the dynamic weight of a pixel is given by
\begin{equation*} \mathcal {O} = \mathcal {O}_{d} + \mathcal {O}_{\sigma } + \gamma _{1} \mathcal {O}_{e} \tag {7}\end{equation*}
where \mathcal {O}_{d} is the depth difference, \mathcal {O}_{\sigma } is the variance, and \gamma _{1} is the weight of the normal difference \mathcal {O}_{e} ; these terms are obtained as
\begin{align*} \mathcal {O}_{d} & = \max _{i \in N} |(\mathbf {v}_{i} - \mathbf {v}) \cdot \mathbf {n}| \tag {8}\\ \mathcal {O}_{e} & = \max _{i \in N} \begin{cases} 0, & \text {if } ((\mathbf {v}_{i} - \mathbf {v}) \cdot \mathbf {n}) \lt 0 \\ 1 - (\mathbf {n}_{i} \cdot \mathbf {n}), & \text {otherwise} \end{cases} \tag {9}\\ \mathcal {O}_{\sigma } & = \sqrt {\frac {1}{N} \sum _{i=1}^{N} (\mathbf {v}_{i} - \mathbf {v})^{2}} \tag {10}\end{align*}
where \mathbf {v} is a point on the depth map, \mathbf {n} is its normal, N denotes the set of neighborhood point indices of \mathbf {v} , \mathbf {v}_{i} denotes a neighboring point of \mathbf {v} , and \mathbf {n}_{i} is its normal. Figure 11(b) shows the results optimized with the connected component method based on the value of \mathcal {O} .
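The sketch below evaluates Eqs. (7)-(10) for a single pixel with NumPy, assuming that a vertex map and a normal map have already been computed from the depth image; the 3x3 neighborhood, the value of \gamma _{1} , and the border handling are assumptions of this example rather than choices reported in [39].

```python
# Minimal sketch of the per-pixel dynamic weight in Eqs. (7)-(10).
# V:   H x W x 3 vertex map (back-projected depth), Nrm: H x W x 3 normal map.
# (u, v) is assumed not to lie on the image border.
import numpy as np

def dynamic_weight(V, Nrm, u, v, gamma1=0.5):
    p, n = V[u, v], Nrm[u, v]
    # 3x3 neighbourhood around (u, v), excluding the centre pixel
    nb = [V[i, j] for i in range(u - 1, u + 2) for j in range(v - 1, v + 2)
          if (i, j) != (u, v)]
    nb_n = [Nrm[i, j] for i in range(u - 1, u + 2) for j in range(v - 1, v + 2)
            if (i, j) != (u, v)]
    diff = np.stack(nb) - p                              # v_i - v
    o_d = np.max(np.abs(diff @ n))                       # Eq. (8): depth difference
    dots = diff @ n
    o_e = np.max(np.where(dots < 0, 0.0,                 # Eq. (9): normal difference
                          1.0 - np.stack(nb_n) @ n))
    o_sigma = np.sqrt(np.mean(np.sum(diff**2, axis=1)))  # Eq. (10): variance
    return o_d + o_sigma + gamma1 * o_e                  # Eq. (7)
```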

FIGURE 11. Segmentation of dynamic regions. (a) is the result of Mask R-CNN. (b) is the result optimized using the connected component analysis method. Figure taken from [39].

Additionally, PSPNet-SLAM [151] and DDL-SLAM [152] use PSP-Net and DU-Net respectively as deep learning (DL) models for segmenting dynamic scenes and static backgrounds. However, these segmentation methods using DL models require high memory consumption and computational cost. To improve the real-time performance of the reconstruction system, LRD-SLAM [153] proposed a fast deep convolutional neural network (FNet) for semantic segmentation, which can quickly and accurately identify pedestrian information in a given scene.

Moreover, [154] found that most classic semantic SLAM methods generate semantic results for each frame individually, such as DynaSLAM [155], DS-SLAM [156], and DM-SLAM [35], leading to redundant operations. Since the input to visual SLAM is a sequence of continuous frames, the segmentation results of consecutive frames have many similarities, making it unnecessary to segment each frame. Reference [154] segments only the keyframes and propagates the segmentation results of the keyframes to their adjacent frames, significantly avoiding the time delay caused by segmenting each frame while ensuring segmentation accuracy. The experimental results are shown in Figure 12. Recently, [43] and [157] employed YOLO v5 for detecting dynamic objects in the scene, further enhancing segmentation accuracy. DDN-SLAM [158] leveraged deep semantic system priors and conditional probability fields for effective segmentation. Through the creation of depth-guided static masks and the use of joint multi-resolution hashing encoding, it achieves rapid hole filling and superior mapping quality, effectively reducing the impact of dynamic information.

FIGURE 12. The propagation of dynamic probabilities during tracking, where green points indicate feature points with initial dynamic probabilities, blue points represent identified static feature points, and red points represent dynamic feature points. Figure taken from [154].

B. Camera Tracking

In static scene reconstruction, slightly moving objects are treated as static and their motion is ignored during pose estimation by relying on rigid alignment. However, because the motion of dynamic objects in dynamic scenes varies significantly, directly applying static algorithms for camera pose tracking leads to failure. Handling dynamic objects is therefore crucial for dynamic 3D reconstruction. One approach considers dynamic objects as outliers and removes them directly during reconstruction, focusing pose estimation on the static scene; this is referred to as the direct approach. Another approach exploits the features of dynamic objects and reconstructs both static and dynamic objects in the scene simultaneously; this is known as the indirect approach. In this section, we discuss how to handle dynamic objects from these two perspectives in order to achieve accurate pose estimation in dynamic scenes.

1) Direct Approach

In a real indoor environment, it is inevitable to encounter dynamic objects, such as freely moving people or rolling balls. These dynamic objects can significantly impact camera tracking: during localization, the camera struggles to acquire sufficient static features due to occlusions caused by dynamic objects, leading to localization failure. When dealing with loop closure, the displacement of dynamic objects confuses the camera as the scene exhibits different visual appearances compared to when the camera last observed the dynamic objects. Direct approaches solve the interference of dynamic objects on camera pose estimation by removing the data of dynamic objects during reconstruction.

Reference [159] introduced a robust background model-based dense visual odometry (BaMVO) algorithm that estimates the background of each frame and performs camera pose estimation after eliminating foreground moving objects. This method effectively reduces the impact of dynamic objects on the camera trajectory, enhancing reconstruction accuracy. Similarly, [146] filtered out the data associated with dynamic objects directly in the preprocessing stage to enhance the robustness of RGB-D SLAM. Reference [160] utilized a Bayesian framework for dynamic region detection, considering prior knowledge and observation information generated during object detection; after obtaining the detection results, dynamic regions are removed, and only features from static regions are extracted for camera tracking. Reference [150] combined ORB-SLAM2 with the PSPNet [161] semantic segmentation network to propose the PSPNet-SLAM system, which first removes dynamic points with large optical-flow values and then performs a secondary filtering step with PSPNet to ensure more accurate matching. Most of these methods detect dynamic objects by analyzing only a few consecutive frames; however, since many dynamic objects remain static for short periods of time, this can lead to failures in detecting moving objects. Based on this observation, LC-CRF SLAM [162] constructed a long-term consistent conditional random field (CRF) that provides more accurate camera trajectory estimation through long-term observations across multiple frames. Semantic information based on deep learning can eliminate the influence of dynamic objects, but it involves high computational cost and cannot handle unknown objects. Reference [163] proposed a real-time semantic RGB-D SLAM system tailored for dynamic environments, which performs semantic segmentation only on keyframes to remove known dynamic objects and maintains a static map for robust camera tracking. After removing dynamic objects, [153] repaired the missing static background using information from keyframes to facilitate subsequent point cloud reconstruction. Reference [42] introduced a hierarchical representation that segments images into planar and non-planar regions; by removing dynamic non-planar objects, it segments and tracks multiple dynamic planar rigid objects. Reference [40] created a sparse graph from all map points using Delaunay triangulation and used the correlations among the mapped points in the graph to divide the points in the scene into groups, where the largest group is considered to contain the static map points; only these static points were then used to estimate the camera motion. Recently, [43] incorporated an improved dense point cloud generation module into ORB-SLAM3 [84], which removes dynamic objects using dynamic object information extracted by YOLO v5, yielding a point cloud representation of the static scene and more accurate camera poses.
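A minimal sketch of the direct strategy is given below: keypoints that fall inside a dynamic-object mask are discarded before pose estimation with OpenCV's PnP-RANSAC. The mask source, the ORB settings, and the match_3d_points data-association placeholder are assumptions of this example, not the pipeline of any specific method above.

```python
# Minimal sketch: mask out dynamic keypoints, then estimate the camera pose
# from the remaining static 2D-3D correspondences.
import cv2
import numpy as np

def estimate_pose_static_only(gray, dyn_mask, K, match_3d_points):
    """dyn_mask: HxW uint8, nonzero where a pixel belongs to a dynamic object.
    match_3d_points(indices) is a placeholder for data association against the
    map; it should return the Nx3 points paired with the kept keypoints."""
    orb = cv2.ORB_create(2000)
    kps, _ = orb.detectAndCompute(gray, None)

    static_pts, static_idx = [], []
    for i, kp in enumerate(kps):
        u, v = int(round(kp.pt[0])), int(round(kp.pt[1]))
        if dyn_mask[v, u] == 0:            # keep only static-background keypoints
            static_pts.append(kp.pt)
            static_idx.append(i)

    obj_pts = np.asarray(match_3d_points(static_idx), dtype=np.float32)
    img_pts = np.asarray(static_pts, dtype=np.float32)

    # Requires enough static correspondences to succeed.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(obj_pts, img_pts, K, None)
    return ok, rvec, tvec
```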

2) Indirect Approach

Instead of removing dynamic objects, the indirect method analyzes features, assigning weights to both static and dynamic objects. Based on these weights, it fully utilizes static and dynamic features to achieve scene tracking and mapping. References [41], [164], and [165] assigned weights to static points and utilized these static weights to eliminate the influence of dynamic objects. Reference [164] employed depth edge points for frame-to-keyframe registration, where each edge point is assigned a static weight that is then used in the Intensity-assisted Iterative Closest Point (IAICP) algorithm for motion estimation, thereby reducing the impact of dynamic components. Additionally, effective loop closure detection was incorporated to decrease tracking errors. Reference [165] utilized double K-means clustering to detect dynamic objects, followed by establishing static weights for the feature points in the current frame, which comprehensively consider static probability and static observation value (SON). Finally, the traditional RANSAC algorithm was modified to suit dynamic reconstruction. With the advancement of deep learning, the integration of semantic knowledge into dynamic reconstruction yields excellent results. Reference [41] proposed the DPF-SLAM algorithm, which combines the dynamic prior probability obtained from semantic segmentation with the dynamic probability obtained from dynamic point detection, thereby reducing the influence of dynamic objects on camera localization.
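To make the weighting idea concrete, the sketch below solves for a rigid transform with a confidence-weighted Kabsch/SVD step so that correspondences with low static weights contribute little to the estimate; the correspondences and the weights themselves are assumed to come from one of the schemes described above.

```python
# Minimal sketch of weight-based rigid alignment for indirect methods.
import numpy as np

def weighted_rigid_transform(src, dst, w):
    """src, dst: Nx3 corresponding 3D points; w: N static weights in [0, 1].
    Returns R, t minimising the weighted sum of ||R @ src_i + t - dst_i||^2."""
    w = w / np.sum(w)
    mu_s = (w[:, None] * src).sum(axis=0)
    mu_d = (w[:, None] * dst).sum(axis=0)
    S = (w[:, None] * (src - mu_s)).T @ (dst - mu_d)    # weighted cross-covariance
    U, _, Vt = np.linalg.svd(S)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                                  # proper rotation
    t = mu_d - R @ mu_s
    return R, t
```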

In dynamic scene reconstruction, it is commonly assumed that the predominant portion of image frames represents the static background. However, in complex scenes with numerous dynamic objects, without semantic segmentation, a significant portion of dynamic objects may be mistakenly identified as static backgrounds. Moreover, dynamic objects can occlude a substantial amount of color and depth information, leading to insufficient static information in visual input to support accurate self-motion estimation of the camera. Reference [32] employed a multi-model fitting approach to identify dynamic objects in the scene using motion segmentation and semantic segmentation. Each object was reconstructed individually, and over time, increasingly refined dynamic models can be obtained. Complete geometric objects play a crucial role in tracking camera trajectories. Reference [166] inferred the complete geometric shapes of each object to establish correspondences among instances, enabling the estimation of object poses in each frame. Reference [167] employed multiple motion segmentation methods to segment the motion models of different moving objects, obtaining accurate masks for the moving objects and generating 4D models and trajectories of the moving objects in a global reference frame, while reconstructing dense maps of the static background. Reference [168] introduced rigid and motion constraints to model articulated objects, allowing for the joint optimization of camera pose, object motion, and object 3D structure. This approach corrects estimation errors in camera pose, prevents tracking loss, and generates a 4D spatiotemporal map that includes both dynamic targets and static scenes.

C. Model Fusion

The fusion of dynamic scenes extends the fusion of static scenes and includes three approaches: voxel-based, surfel-based, and NeRF-based.

1) Voxel-Based

DynamicFusion [29], as a pioneering real-time dynamic 3D reconstruction algorithm, extended Projective TSDF to reconstruct and fuse scenes. VolumeDeform [30] utilized a voxel model that not only stores scene data in the undeformed pose, such as TSDF values, color, and confidence value, but also stores information about the current spatial deformation, represented by deformation field parameters. These mainly focus on scenes with individually deformable objects moving in a non-rigid manner. When multiple dynamic objects are present in the scene, MID-Fusion [169] integrates depth, color, semantic, and foreground probability information into an object model based on an octree volume representation using foreground and background masks. EM-Fusion [34] reconstructed dynamic objects based on the TSDF model and creatively used Expectation-Maximization (EM) to determine the unknown association between pixels and objects. Reference [170] introduced a neural scene flow field, which defines a set of voxel boundary implicit fields using a sparse voxel octree, to simulate local properties and achieve the reconstruction of complex dynamic scenes. These TSDF representations store different dynamic objects of the entire scene as multiple 3D models, enabling easy fusion and updating based on their respective poses. However, during surface extraction, ray casting needs to be performed separately for each object model. Additionally, occlusion handling during ray casting is required to determine surface visibility when objects are occluded. To address these issues, [171] proposed a novel map representation method called TSDF++, which allows simultaneous reconstruction of both static scenes and dynamic objects within a 3D volume model.
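The following sketch shows the weighted running-average TSDF update that underlies the volumetric methods above (object-level systems such as MID-Fusion would keep one such volume per object). The voxel layout, the camera-to-world pose convention, and the truncation distance are assumptions of this example.

```python
# Minimal sketch of a projective TSDF fusion step over a regular voxel grid.
import numpy as np

def fuse_frame(tsdf, weight, depth, K, T_wc, origin, voxel_size, trunc=0.04):
    """tsdf, weight: DxDxD contiguous volumes updated in place; depth: HxW metres;
    K: 3x3 intrinsics; T_wc: 4x4 camera-to-world pose (assumed convention)."""
    T_cw = np.linalg.inv(T_wc)
    idx = np.indices(tsdf.shape).reshape(3, -1).T              # all voxel indices
    pts_w = origin + (idx + 0.5) * voxel_size                  # voxel centres (world)
    pts_c = pts_w @ T_cw[:3, :3].T + T_cw[:3, 3]               # camera frame
    z = pts_c[:, 2]
    z_safe = np.where(z > 1e-6, z, 1.0)
    u = np.round(K[0, 0] * pts_c[:, 0] / z_safe + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * pts_c[:, 1] / z_safe + K[1, 2]).astype(int)
    H, W = depth.shape
    valid = (z > 1e-6) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    d_obs = np.zeros_like(z)
    d_obs[valid] = depth[v[valid], u[valid]]
    sdf = d_obs - z                                            # signed distance along z
    upd = valid & (d_obs > 0) & (sdf > -trunc)
    phi = np.clip(sdf[upd], -trunc, trunc) / trunc
    t_flat, w_flat = tsdf.reshape(-1), weight.reshape(-1)      # views into the volumes
    t_flat[upd] = (w_flat[upd] * t_flat[upd] + phi) / (w_flat[upd] + 1.0)
    w_flat[upd] += 1.0
```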

2) Surfel-Based

Volumetric methods can generate smooth triangle meshes, but they suffer from high computational and memory costs. Surfel-based methods, on the other hand, are more efficient, although post-processing is required if a mesh model is desired [172]. For non-rigid objects in dynamic scenes the computation becomes even more complex, making surfel-based representations a promising solution for real-time dynamic reconstruction. Co-Fusion [32] extended the surfel-based mapping framework of ElasticFusion to handle dynamic scenes, enabling tracking and reconstruction of segmented dynamic objects based on motion and semantic cues in each frame. However, Co-Fusion lacks real-time capability for dynamic scene reconstruction. MaskFusion [173] built upon Co-Fusion to create a real-time dynamic 3D reconstruction system, representing each geometric entity as a set of surfels and incorporating semantic information into the map. Reference [33] proposed a strategy to assess the validity of each surfel by estimating the dynamic probability of each input point and eliminating surfels that match dynamic input points. Reference [174] achieved real-time non-rigid reconstruction using a stream of depth images as input, along with a surfel-based scene representation, effectively handling topological changes and tracking failures and resulting in efficient dynamic 3D reconstruction. Surfel maps typically consist of a large number of surfels, requiring powerful GPUs for online processing. Reference [175] introduced the use of superpixels to generate surfel maps, significantly improving storage efficiency and speed: if the observation or fusion count of a surfel falls below a threshold within a certain time window, it is considered an outlier and removed to mitigate the influence of dynamic objects on the reconstruction results. The SLAM system proposed in [176] relies on super-surface elements [177], planar patches generated from superpixels, to model the static parts of the environment.
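The sketch below illustrates a basic surfel update by confidence-weighted averaging of position, normal, and color, together with a simple count-based outlier check; the data-association step, the confidence increment, and the pruning threshold are assumptions and are simplified relative to the systems discussed above.

```python
# Minimal sketch of a surfel data structure and its fusion/pruning rules.
import numpy as np

class Surfel:
    def __init__(self, pos, normal, color, radius, conf=1.0):
        self.pos, self.normal, self.color = pos, normal, color
        self.radius, self.conf = radius, conf

    def fuse(self, pos, normal, color, radius, alpha=1.0):
        """Merge an associated measurement by confidence-weighted averaging."""
        w = self.conf
        self.pos = (w * self.pos + alpha * pos) / (w + alpha)
        n = w * self.normal + alpha * normal
        self.normal = n / np.linalg.norm(n)
        self.color = (w * self.color + alpha * color) / (w + alpha)
        self.radius = min(self.radius, radius)   # keep the finer radius estimate
        self.conf = w + alpha

    def is_outlier(self, min_conf=3.0):
        """Surfels whose accumulated confidence stays low can be pruned,
        mirroring the count-threshold removal described above."""
        return self.conf < min_conf
```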

3) NeRF-Based

There have been numerous attempts to apply NeRF to scenes from RGB video streams [178], [179]. However, these methods reconstruct scenes under the assumption of accurate camera poses, heavily relying on previous camera registration, which can fail when objects undergo significant motion. Unlike NeRF reconstruction solely from RGB data, the introduction of depth information makes these issues more manageable. Simultaneously, to handle geometries beyond the sensor’s explicit range and regions with low reflectivity, raw continuous-wave ToF images are used instead of direct depth maps, effectively enhancing the view synthesis quality in dynamic scenes [180].

Reference [181] proposed a framework called Time-Aware Neural Voxels (TiNeuVox), which explicitly represents temporal information in dynamic scenes, making their modeling and rendering more efficient. Specifically, it feeds a time embedding \mathbf {t_{i}} obtained from the frequency-encoded time \gamma (t_{i}) together with the coordinates of a sampled point (x, y, z) into a compact deformation network \Phi _{d} , producing the deformed coordinates (x', y', z') :
\begin{equation*} x', y', z' = \Phi _{d}(x, y, z, \mathbf {t_{i}}) \tag {11}\end{equation*}
Additionally, to reconstruct the motion trajectories of points more accurately and to improve training speed, [181] uses a multi-distance interpolation scheme to capture motion at different scales:
\begin{align*} v & = v_{1} \oplus \cdots v_{m} \cdots \oplus v_{M}, \tag {12}\\ v_{m} & = \text {interp}(x, y, z, V[::s_{m}]), \tag {13}\end{align*}
where v is the concatenation of the interpolated features v_{m} across multiple scales, and v_{m} is obtained by interpolating at the point (x, y, z) after sampling the voxel grid V with a stride of s_{m} . Moreover, Neural-Dynamic Reconstruction (NDR) was proposed to recover high-fidelity geometry and motion of dynamic scenes from a monocular RGB-D camera [182]. It employs a novel neural reversible deformation network to represent and constrain non-rigid deformations, and a topology-aware strategy is used to establish correspondences for fused frames. Building upon this, DNA-Net [183] modeled dynamic motion using articulated bones, helping the model converge faster and making it more suitable for applications such as human pose manipulation. Because SfM algorithms often produce pose-estimation errors in highly dynamic scenarios with poorly textured surfaces, [184] reconstructed the entire scene using both a static NeRF and a dynamic NeRF: the static NeRF reconstructs the static parts and estimates camera poses and focal lengths, while the dynamic NeRF models the dynamic aspects of the scene from the video. Recently, the effectiveness of depth-constrained NeRF in dynamic operating rooms has been validated [185], demonstrating the generation of geometrically consistent views from novel perspectives.
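The PyTorch sketch below mirrors the structure of Eqs. (11)-(13): a small MLP predicts deformed coordinates from a point and its time embedding (written here as a residual offset, an implementation choice), and voxel features are gathered by interpolating the grid at several strides and concatenating the results. Network sizes, the stride set, and the use of grid_sample for trilinear interpolation are assumptions of this example, not the exact architecture of [181].

```python
# Minimal sketch of a deformation network and multi-distance interpolation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformNet(nn.Module):                       # Eq. (11), residual form
    def __init__(self, t_dim=8, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3 + t_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 3))
    def forward(self, xyz, t_emb):                # xyz: (N, 3), t_emb: (N, t_dim)
        return xyz + self.mlp(torch.cat([xyz, t_emb], dim=-1))

def multi_distance_interp(V, xyz, strides=(1, 2, 4)):   # Eqs. (12)-(13)
    """V: (C, D, H, W) voxel features; xyz: (N, 3) normalised to [-1, 1],
    ordered (x, y, z) as grid_sample expects."""
    feats = []
    for s in strides:
        Vs = V[:, ::s, ::s, ::s].unsqueeze(0)           # V[::s_m]
        grid = xyz.view(1, -1, 1, 1, 3)
        f = F.grid_sample(Vs, grid, mode="bilinear", align_corners=True)
        feats.append(f.view(V.shape[0], -1).t())        # (N, C) per scale
    return torch.cat(feats, dim=-1)                     # v = v_1 (+) ... (+) v_M
```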

SECTION V.

Dataset and Evaluation Metrics

A. Datasets

In this section, we present the relevant datasets used by the static and dynamic reconstruction algorithms. Table 3 briefly summarizes the scenes, details, applications and publication years included in the relevant datasets.

TABLE 3 Overview of Static and Dynamic Datasets

The TUM dataset [186] utilizes Microsoft Kinect to capture RGB-D data of office scenes and an industrial hall. The dataset consists of 39 sequences, including color images, depth maps, and ground truth camera poses associated with time. The fr1 and fr2 sequences primarily feature static scenes. The fr3 sequences are used for quality evaluation of dynamic 3D reconstruction. In the fr3/sitting sequence, two people are sitting and lightly moving while chatting at a table. The fr3/walking sequence involves two people walking around a table, exhibiting highly dynamic motion. Additionally, the dataset provides two evaluation metrics, Relative Pose Error (RPE) and Absolute Trajectory Error (ATE), which can be used to assess the performance of visual SLAM systems.

The ICL-NUIM dataset [187] is used for evaluating RGB-D visual odometry, 3D reconstruction, and SLAM systems. The data are acquired by rendering images along camera trajectories through 3D models using POVRay. The dataset includes two scenes, a living room and an office; the living room scene additionally provides the corresponding 3D polygon model so that the accuracy of the final reconstruction can be assessed. The Aug-ICL-NUIM dataset [9] extends ICL-NUIM with challenging camera trajectories and realistic noise models, enhancing the dataset in various ways to support the evaluation of complete scene reconstruction pipelines.

The CoRBS dataset [188] provides real depth and color data, along with ground truth camera trajectories and 3D models of the scenes. It includes 20 KinectV2 image sequences captured from four different scenes. The data is directly provided in a global coordinate system, enabling direct evaluation without the need for further alignment or calibration.

The NYUD v2 dataset [189] consists of 1449 indoor RGB-D images captured using Microsoft Kinect devices, accompanied by detailed annotations. However, due to its relatively small scale, it is challenging to apply it to deep learning architectures. The SUN3D dataset [190] offers a large-scale RGB-D video database that includes semantic information and corresponding pose information for objects in each scene. The RGB-D Object dataset [191] contains 300 objects from 51 categories, providing high-quality color and depth images for each frame in the videos. Additionally, it introduces RGB-D-based object recognition and detection techniques that significantly improve the reconstruction quality by leveraging both color and depth information.

Widely used RGB-D datasets often lack comprehensive and fine-grained annotations. Binh-Son Hua et al. introduced SceneNN [192], which collects RGB-D data from more than 100 indoor scenes, including dormitories, offices, and classrooms. The authors reconstructed each scene as a triangle mesh and added per-vertex and per-pixel annotations, enriching the dataset with fine-grained information. John McCormac et al. introduced SceneNet RGB-D [193], which provides accurate pixel-level semantic information for scene understanding tasks such as object detection, semantic segmentation, and instance segmentation, along with camera poses and depth data to facilitate the study of geometric problems in computer vision.

The PASCAL VOC dataset [194] provides more than 20 object classes that can be used for handling potentially moving objects, such as humans, cats, and dogs. This dataset can be used to train segmentation networks for segmenting dynamic objects. Compared to PASCAL VOC, the MS COCO dataset [195] offers a larger number of categories and instances, enabling models to better learn contextual information. ShapeNet [196] is a large-scale 3D model dataset that includes 3D models from various semantic categories, making it suitable for part segmentation tasks.

The Bonn RGB-D dynamic dataset [197] provides rich and complex dynamic data, including 24 highly dynamic scenes. To apply data-driven methods to non-rigid 3D reconstruction, DeepDeform [198] utilizes a semi-supervised labeling approach and obtains a large dataset of 400 scenes, consisting of over 390,000 RGB-D frames and 5,533 densely aligned frame pairs. HRPSlam [199] is the first system to capture dynamic RGB-D data using a humanoid robot, simulating scenarios such as human walking with jitter or falling. The provided dataset includes two complete loops and can be used for evaluating global loop closure or local reconstruction in dynamic environments. The Oxford-IHM dataset [200] uses multiple large objects as static obstacles and records the walking trajectories of people in indoor environments.

B. Evaluation Metrics

In 3D reconstruction, the accuracy of the camera pose directly affects the reconstruction accuracy, making it crucial to evaluate pose accuracy. If the pose error is large, the reconstructed model may be distorted or inaccurate, degrading the overall quality of the model. The metrics for evaluating pose accuracy mainly include Relative Pose Error (RPE), Absolute Trajectory Error (ATE), and Alignment Error (AE). The overall quality of the reconstruction can be measured by surface accuracy. Next, we introduce these evaluation metrics in detail.

Relative Pose Error (RPE): RPE measures the local accuracy of the trajectory over a fixed time interval \Delta . For the estimated trajectory P_{1}, \ldots, P_{n} \in SE(3) and the ground truth trajectory Q_{1}, \ldots, Q_{n} \in SE(3) , the relative pose error at time step i is given by
\begin{equation*} E_{i} = \left ( Q_{i}^{-1} Q_{i+\Delta } \right)^{-1} \left ( P_{i}^{-1} P_{i+\Delta } \right) \tag {14}\end{equation*}
From a sequence of n camera poses, we can obtain m = n - \Delta relative pose errors. The overall relative pose error is evaluated by calculating the root mean square error (RMSE) of the translation components across all time indices:
\begin{equation*} \text {RMSE}(E_{1:n}, \Delta) = \left ( \frac {1}{m} \sum _{i=1}^{m} \|\text {trans}(E_{i})\|^{2} \right)^{1/2} \tag {15}\end{equation*}
where \text {trans}(E_{i}) represents the translation component of the relative pose error E_{i} . RPE is an important metric for evaluating the local accuracy of trajectory estimation: by comparing the relative motion over a fixed interval, it effectively measures the system's pose accuracy over short time spans.
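A compact NumPy sketch of Eqs. (14)-(15) is given below for trajectories stored as lists of 4x4 homogeneous poses; the interval \Delta defaults to one frame in this example.

```python
# Minimal sketch of the RPE RMSE over the translation components.
import numpy as np

def rpe_rmse(P, Q, delta=1):
    """P: estimated poses, Q: ground-truth poses, both lists of 4x4 arrays."""
    errs = []
    for i in range(len(P) - delta):
        dQ = np.linalg.inv(Q[i]) @ Q[i + delta]
        dP = np.linalg.inv(P[i]) @ P[i + delta]
        E = np.linalg.inv(dQ) @ dP                   # Eq. (14)
        errs.append(np.linalg.norm(E[:3, 3]))        # trans(E_i)
    return np.sqrt(np.mean(np.square(errs)))         # Eq. (15)
```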

Absolute Trajectory Error (ATE): ATE provides a direct numerical measure that intuitively reflects the algorithm's accuracy and the global consistency of the trajectory. When calculating ATE, the trajectories are first aligned using the Horn method [201], which finds the rigid-body transformation S corresponding to the least-squares solution that maps the estimated trajectory P_{i} onto the ground truth trajectory Q_{i} . After obtaining this transformation, the absolute trajectory error F_{i} at time step i can be computed as
\begin{equation*} F_{i}:= Q_{i}^{-1} S P_{i} \tag {16}\end{equation*}
The absolute trajectory error is evaluated by calculating the root mean square error (RMSE) of the translation components across all time indices:
\begin{equation*} \text {RMSE}(F_{1:n}):= \left ( \frac {1}{n} \sum _{i=1}^{n} \|\text {trans}(F_{i})\|^{2} \right)^{1/2} \tag {17}\end{equation*}
ATE can effectively measure the accuracy and consistency of reconstruction systems, providing a standardized evaluation tool for the comparison of different algorithms.
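The sketch below computes Eqs. (16)-(17) for trajectories given as 4x4 poses; the least-squares rigid alignment S is obtained here from the translation parts of the two trajectories with a standard SVD-based (Kabsch) solution, which is one common way to realize Horn's alignment.

```python
# Minimal sketch of the ATE RMSE with a rigid least-squares pre-alignment.
import numpy as np

def ate_rmse(P, Q):
    """P, Q: lists of 4x4 estimated / ground-truth poses."""
    p = np.array([T[:3, 3] for T in P])              # estimated positions
    q = np.array([T[:3, 3] for T in Q])              # ground-truth positions
    mu_p, mu_q = p.mean(0), q.mean(0)
    U, _, Vt = np.linalg.svd((p - mu_p).T @ (q - mu_q))
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                               # rotation mapping P onto Q
    t = mu_q - R @ mu_p
    S = np.eye(4); S[:3, :3] = R; S[:3, 3] = t       # rigid alignment S

    errs = [np.linalg.norm((np.linalg.inv(Qi) @ S @ Pi)[:3, 3])   # Eq. (16)
            for Pi, Qi in zip(P, Q)]
    return np.sqrt(np.mean(np.square(errs)))         # Eq. (17)
```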

Alignment Error (AE): AE [202] is a comprehensive metric that balances the effects of scale, rotation, and translational drift on the trajectory. Suppose p_{1}, \ldots, p_{n} \in \mathbb {R}^{3} represent the tracked positions from frame 1 to frame n. Let S \subset [1; n] and E \subset [1; n] represent the frame indices of the start and end segments, respectively, for which aligned ground truth positions \hat {p} \in \mathbb {R}^{3} are provided. Independently aligning the tracked trajectory with the start and end segments provides two relative transformations:
\begin{align*} T_{s}^{gt} & := \mathop {\mathrm {arg\,min}} _{T \in \text {Sim}(3)} \sum _{i \in S} \left ( T p_{i} - \hat {p}_{i} \right)^{2} \tag {18}\\ T_{e}^{gt} & := \mathop {\mathrm {arg\,min}} _{T \in \text {Sim}(3)} \sum _{i \in E} \left ( T p_{i} - \hat {p}_{i} \right)^{2} \tag {19}\end{align*}
where \text {Sim}(3) represents the group of similarity transformations in three-dimensional space. The alignment error between the trajectories aligned at the start and end segments is given by
\begin{equation*} e_{\text {align}}:= \sqrt { \frac {1}{n} \sum _{i=1}^{n} \| T_{s}^{gt} p_{i} - T_{e}^{gt} p_{i} \|_{2}^{2} } \tag {20}\end{equation*}
Surface accuracy: Surface accuracy measures the quality of a reconstruction by calculating the distance between the estimated surface and the ground truth surface. Over the distances of all vertices in the reconstruction, five standard statistics are reported: mean, median, standard deviation, minimum, and maximum. A smaller value of this metric indicates a better-quality reconstruction.
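The following sketch approximates surface accuracy by measuring, for every reconstructed vertex, the distance to the nearest point of a densely sampled ground-truth model, and then reports the five statistics listed above; the nearest-neighbor approximation of the point-to-surface distance is an assumption of this example.

```python
# Minimal sketch of per-vertex surface accuracy statistics.
import numpy as np
from scipy.spatial import cKDTree

def surface_accuracy(recon_verts, gt_points):
    """recon_verts: Nx3 reconstructed vertices; gt_points: Mx3 densely sampled
    ground-truth surface points (both in metres)."""
    d, _ = cKDTree(gt_points).query(recon_verts)     # nearest-neighbour distance
    return {"mean": d.mean(), "median": np.median(d),
            "std": d.std(), "min": d.min(), "max": d.max()}
```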

1) Evaluation in Static Scenarios

In this section, we quantitatively compared the ATE and surface accuracy of the following algorithms: [6], [7], [8], [9], [10], [11], [22], [23], [24], [25], [26], [27], [74], [75], [83], [87], [122], [133], [203], [204].

As shown in Table 4, we tested different algorithms on four sequences from the TUM dataset, fr1/desk, fr2/xyz, fr3/office, and fr3/nst, and calculated the absolute trajectory error (ATE) for each algorithm. By comparing the ATE results, we can assess the accuracy of each algorithm's pose estimation and thereby evaluate its performance. The data in the table are sourced from the respective papers, where "-" indicates that the corresponding value was not reported. The results are given with three decimal places of precision. From the experimental results, HRBF-Fusion [25] achieved the best results on the fr1/desk and fr3/office sequences and also performed well on fr2/xyz and fr3/nst. This is because it uses a dynamic implicit Hermite radial basis function (HRBF) representation of continuous surfaces, unlike explicit methods such as Kintinuous [7], Voxelhashing [8], and ElasticFusion [10]. Additionally, NGEL-SLAM [133], which uses a neural implicit representation, also achieved good results on fr1/desk, fr2/xyz, and fr3/office. The results show that implicit representation methods can better align reconstructed trajectories, reduce trajectory errors, and thus improve reconstruction quality.

For evaluating the quality of surface reconstruction, the ICL-NUIM dataset provides ground truth 3D models for generating virtual scan sequences. We use the lr_kt0, lr_kt1, lr_kt2, and lr_kt3 sequences from the living room scene of this dataset as a benchmark for estimating each algorithm's surface reconstruction performance. The surface accuracy (median distance) of each method is shown in Table 5. From the experimental results, the implicit method HRBF-Fusion [25] achieved the best results on lr_kt0 and lr_kt1 and the second-best results on lr_kt2 and lr_kt3. Its surface accuracy surpasses explicit representation methods such as KinectFusion [6], DVO SLAM [74], and Kintinuous [7], indicating that the implicit representation used by HRBF-Fusion can significantly reduce reconstruction errors and improve surface accuracy.

TABLE 4 Absolute Trajectory Error (ATE) of Different Algorithms on the TUM Dataset (m). The Best Results are in Bold and the Second Best Results are in Italics
TABLE 5 Surface Accuracy of Different Algorithms on the ICL-NUIM Dataset (m). The Best Results are in Bold and the Second Best Results are in Italics

2) Evaluation in Dynamic Scenarios

In this section, we quantitatively compared the ATE and RPE of the following algorithms: [32], [33], [34], [35], [36], [38], [39], [43], [83], [84], [155], [156], [157], [158], [163], [165], [173], [197], [205], [206], [207], [208], [209].

We used dynamic sequences from the TUM dataset to evaluate the performance of dynamic reconstruction algorithms. The sequences in the sitting (s) category, where two people are conversing at a desk, are used to assess the robustness of the algorithms to slowly moving dynamic objects. The sequences in the walking (w) category, where two people are walking in an office scene, can be used to evaluate the robustness of the algorithms to quickly moving dynamic objects. Tables 6 and 7 respectively list the performance of outstanding algorithms on the TUM dataset in recent years. We use ATE and RPE as evaluation metrics for the algorithms. In the tables, static, xyz, and half represent different camera movement modes. The data in the tables are sourced from the respective papers, where “-” indicates that corresponding data was not found in the paper. The results are reported with three decimal places of precision. The best results are in bold and the second best results are in italics. From the experimental results, with the development of deep learning and the introduction of semantic information, models can achieve excellent results not only in low dynamic scenes such as Fr3_s_static, Fr3_s_xyz, and Fr3_s_half but also in highly dynamic environments like Fr3_w_static, Fr3_w_xyz, and Fr3_w_half, obtaining accurate camera trajectories. Examples include PLD-SLAM [205], RTCB-SLAM [39], RTDSLAM [207], SEG-SLAM [157], and DDN-SLAM [158].

TABLE 6 Absolute Trajectory Error (ATE) of Different Algorithms on the TUM Dataset (m). The Best Results are in Bold and the Second Best Results are in Italics
TABLE 7 Results of Relative Pose Error (RPE) in Translation Error for Different Algorithms on the TUM Dataset (m/s). The Best Results are in Bold and the Second Best Results are in Italics

SECTION VI.

Conclusion

This study investigates and analyzes indoor scenes, classifying them into static and dynamic scenes, and provides a comprehensive survey of recent reconstruction algorithms.

For static scenes, we outline the general reconstruction process and introduce various optimization algorithms employed in each step. From the reviewed methods, it can be seen that traditional static reconstruction tasks often use explicit scene representations, such as voxels and surfels, which enable real-time scene reconstruction but result in artifacts and holes. This is due to the discontinuous nature of explicit representation methods. The emergence of neural radiance fields (NeRF) provides an implicit representation for 3D reconstruction, making the scene representation more continuous. However, because it requires densely sampling points in space and using MLP to learn scene information, it consumes a lot of training resources and time. Currently, to balance training time and quality, a promising direction is to combine explicit and implicit representations, such as Point-NeRF [210], H2-Mapping [132], and NGEL-SLAM [133].

In contrast, dynamic scenes involve not only camera motion but also other moving objects, which may interfere with camera tracking. Therefore, for reconstructing dynamic scenes, it is necessary to identify dynamic objects through motion segmentation, eliminate or utilize dynamic features for pose estimation, and integrate the scene data into the reconstruction model. Currently, a popular approach is to use deep learning methods to leverage semantic information of the scene to segment dynamic objects, thereby improving the quality of scene reconstruction. Additionally, NeRF also provides a new direction for dynamic reconstruction. Because NeRF uses MLP to represent the scene more continuously, it can even fill in information that the camera has not observed.

Declaration of Competing Interest

The authors have no competing interests to declare that are relevant to the content of this article.
