Introduction
An indoor service robot assists human beings in indoor environments, for example with cooking, cleaning, shopping, and home maintenance in a house, or with giving guidance in an exhibition room. The demand for indoor service robots in public and private spaces has been rising rapidly. Compared with traditional industrial robots, service robots are expected to exhibit human-like intelligence. As a fundamental ability of mobile robots, autonomous navigation integrates a wide variety of processes, from low-level actuator control to high-level strategic decision making. One critical issue is to develop techniques that help service robots perceive their surroundings and localize themselves in space using different kinds of sensors, including LiDAR, radar, sonar, inertial measurement units (IMU), and cameras; this is known as the SLAM technique. In the past decades, vision-based SLAM has attracted many researchers; it takes video as input and computes camera positions and 3D maps as output. Vision offers three main advantages for SLAM [1]: 1) cameras are compact and power-saving and can be embedded in almost any robot; 2) vision enables a variety of essential robotic functionalities via visual perception techniques, such as obstacle detection, human tracking, and visual servoing; 3) vision makes it easy to implement relocalization and to bound the errors of the robot position estimates at low cost via loop detection and correction. Moreover, a dense map is not only rich in geometric structure information but also offers potential for detailed semantic scene understanding; it is usually produced by dense SLAM algorithms and is used for remote control, navigation, virtual reality, and so on. However, dense mapping consumes extensive computing and storage resources, and most recent map-centric dense visual SLAM algorithms even require a large and expensive GPU to run in real time. For a practical indoor service robot setup, if we distribute the vSLAM system in a cloud framework, the service robots can benefit from the stronger computing and storage ability of the cloud brain to produce precise indoor mapping and localization while keeping a small and compact form factor.
Several groups have recently made preliminary attempts and achieved partial progress in cloud-based visual SLAM. These works followed the tracking and mapping framework [16], which splits tracking and mapping into two separate tasks, so it seems easy to distribute the whole system in a 2-level cloud framework [2], [3]: the tracking thread runs on clients, while collaborative 3D mapping and map fusion are conducted in the cloud by exploiting multiple clients. The effect of network delay on the system was not taken into account in these works, since only keyframes were sent to the cloud to build the 3D map, so tracking and mapping were separated completely. The robot clients are only data collectors for the server, so mapping could even be done offline in their cases. However, when tracking and mapping are completely separated between threads, the tracking easily suffers from drift. As discussed in those papers, the crucial problem of cloud SLAM is that the procedure is very sensitive to network delay and network bandwidth.
As is well known, more precise mapping requires more precise localization, and conversely more precise relocalization requires more precise mapping, while SLAM algorithms are highly constrained by real-time requirements. Recently, one of the state-of-the-art vSLAM algorithms running on a single board computer, ORB-SLAM, was published; it benefits significantly from the fact that the tracking and mapping threads share a map model and optimized camera poses [11]. However, if this algorithm is transplanted into a cloud framework, the most difficult problem is to handle the network delay introduced into the real-time system; it is then very hard to obtain comparable results. So the problems we try to solve in this paper are: 1) when tracking is separated from mapping, how can we maintain the performance of tracking; 2) even if the optimized camera pose can easily be sent back to the robot, it will not arrive in real time, so how does this affect the position estimate of the robot.
We present a framework and approach that take advantage of powerful cloud computing and storage not only to reconstruct dense maps, but also to estimate robust visual odometry for indoor service robots. Our system runs real-time tracking on the CPU of robot clients, dense mapping on the local computing cloud with GPUs, and dense map sharing on the central computing center. The system is applied to a local area network (LAN). A LAN is a network that interconnects computers within a limited area such as a neighborhood, school, laboratory, office building, or house, and it usually has relatively low network delay. To the best of our knowledge, this is the first real-time cloud visual SLAM that produces visual odometry comparable to that of a system running on a single board computer. Beyond a cloud vSLAM, it generates high-precision dense maps in the local cloud, which can be shared through a private cloud. The main contributions of this paper are: a cloud framework for a visual SLAM system with sparse tracking and dense mapping, which runs robust tracking on the ARM cores of robot clients in real time and performs accurate mapping on the GPU cores of local computing servers. Keyframe-based camera pose optimization runs on the local computing servers in real time, and the optimized camera poses are sent back to correct the tracking poses of the robots via local optimization. Experimental results demonstrate that our system is able to tolerate the network delay of a local area network (LAN), for instance a domestic WiFi.
The remainder of this paper is organized as follows: in Section II, we provide a brief background on vision-based SLAM techniques and related applications. In Section III, we give the overview of our cloud vSLAM system. In Sections IV, V, and VI, we describe the details of the system with respect to the client side and the cloud side. Experimental results and extensive evaluations on the TUM datasets, the ICL-NUIM datasets, and our real scenes are reported in Section VII. Finally, we draw conclusions in Section VIII.
Related Work
Visual SLAM has attracted much attention in the computer vision and robotics communities over the past few decades. In the robotics field, visual odometry is the process of incrementally estimating the pose of a vehicle from image streams captured by onboard cameras, and it is fundamentally related to visual SLAM. Visual SLAM, however, aims not only at the camera trajectory but also at a globally consistent map. By now, visual SLAM systems have evolved into complete and complicated systems, usually combining different techniques, including 3D geometric reconstruction, loop detection and closure, and non-linear optimization, to improve the performance of the systems and algorithms.
From the early beginning, monocular video was the first input used in visual SLAM systems [8]. Davison and Murray used sparse Harris features and applied Extended Kalman Filtering to solve the visual SLAM problem. MonoSLAM was a real-time visual SLAM algorithm that took video from a single camera as input and created an online sparse but persistent map of features in a probabilistic framework [15], [38]. Planar patch features [10] and line features [9], [46] were considered in visual SLAM approaches besides keypoint features. With the development of sensors, stereo video [44] and RGB-D video [45] have been utilized to compensate for the weaknesses of monocular vision methods in different applications.
Klein and Murray [16] presented a seminal work in visual SLAM that splits the system into two parallel tasks, tracking and mapping. This framework has been widely applied to speed up real-time visual SLAM systems [11], [18], [19], [26], [45], and it was extended with computationally expensive optimization techniques. Grisetti et al. explained a graph SLAM approach in [17], which builds a graph that constrains connected camera poses from sensor measurements and optimizes it in a nonlinear framework.
Not only sparse feature extraction has been used in visual SLAM; dense pixels have also been used [5], [18], [19]. Newcombe et al. assumed a dense model of the scene and proposed dense tracking and mapping with whole-image alignment against that model [18]; the system also ran in real time. A dense RGB-D SLAM algorithm that minimizes both the photometric and the depth error over all pixels was proposed by Kerl et al. [5]; they selected keyframes and detected loop closures using an entropy-based similarity measure, and optimized the pose graph in the g2o framework. In the work of [20], Kerl et al. focused on the problems that rolling-shutter RGB-D cameras bring to dense SLAM; a continuous trajectory representation was used to compensate for the rolling-shutter effect and showed superior tracking and mapping quality. Engel et al. combined dense tracking, semi-dense map estimation, and map optimization for large-scale SLAM in a probabilistic method [19]. Whether for sparse or dense SLAM, loop closing has been demonstrated to improve the final SLAM results significantly and is utilized by most recent visual SLAM work [5], [6], [11], [14], [47]. Deep learning techniques have also been explored for visual SLAM [21], [22] in the most recent literature.
Cloud computing is a technique that provides shared computer processing resources and data on demand to computers and other devices through an Internet connection. Its most important benefit is the sharing of data and computing resources, so using cloud computing is a trend in applications such as social robots [23]. Visual odometry [25], vision-based 3D mapping [13], [26], [37], robot navigation [27], [36] and multi-robot collaboration [30], [31] are typical applications of cloud robotics. Limosani et al. presented a system in the cloud robotics paradigm to help autonomous robots navigate indoor environments; ARTags and QR codes were applied to mark rooms, corridors, entrances, and atriums for localization. The references [28], [29] provided different cloud computing frameworks for robotics. In this paper, our purpose is to develop a cloud visual SLAM system that not only offloads the expensive computing and storage, but also preserves performance without any loss for indoor service robots.
System Overview
In this section, we provide an overview of our cloud visual SLAM system. Our purpose is to present a cloud visual SLAM system that not only offloads the expensive computing and storage, but also keeps performance as high as, or even better than, a stand-alone system for indoor service robots. We follow the tracking and mapping framework [16]. We implement tracking on the CPU in order to keep the clients low-cost, and dense mapping on the local cloud computing center in the LAN using the NVIDIA Compute Unified Device Architecture (CUDA) and the Open Graphics Library (OpenGL) to obtain a highly precise map. The CUDA and OpenGL APIs enable us to implement temporal filtering and point fusion on a GPU to reach real-time performance. For indoor service robots, the tracking part provides the camera poses simultaneously. While real-time tracking runs on the client, keyframes are selected and sent to the cloud server; once the corresponding camera poses are optimized, they are sent back to the robot client to rectify the trajectory drift of the robot in the local optimization. Fig. 1 shows the chart of our proposed system. Our system is a 3-level framework, including a robot client, a local computing cloud in the LAN, and a private cloud on a Wide Area Network (WAN). The cloud in this paper consists of the local computing server and the private cloud.
When a frame is captured, we first extract ORB features, estimate the local camera pose initially after guided matching, and refine the camera pose in a local optimization. Keyframes are selected under specific conditions and sent to the local computing cloud together with the corresponding robot state (if we have good feature matching and a good camera pose estimate, visual tracking succeeds, denoted by 0; otherwise visual tracking fails, which means the robot is lost, denoted by 1). When a keyframe is received by the local computing cloud, keyframe tracking and dense map fusion are executed if the corresponding robot state is 0; otherwise visual relocalization is launched. When relocalization fails, a dense tracking based on depth images is used to estimate the camera pose, which enhances the robustness of the system. Loop detection and optimization are always carried out once a keyframe is received. When the dense map is built completely, it is compressed and uploaded to a private cloud for storage and sharing. A sketch of this server-side dispatch is given below.
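For illustration only, the following minimal sketch summarizes the dispatch logic just described for a keyframe arriving at the local computing cloud. The callables passed in are placeholders for the corresponding modules of Sections V-A to V-D, not the actual implementation.

```python
TRACKING_OK, TRACKING_LOST = 0, 1

def process_keyframe(keyframe, state, track, relocalize, depth_track, fuse, close_loops):
    """Dispatch one keyframe received from the robot client.

    `track`, `relocalize`, `depth_track`, `fuse`, `close_loops` are callables
    supplied by the mapping back end (placeholder names, not the paper's API).
    """
    if state == TRACKING_OK:
        pose = track(keyframe)            # feature-based keyframe tracking (Sec. V-A)
    else:
        pose = relocalize(keyframe)       # query the visual database
        if pose is None:
            pose = depth_track(keyframe)  # fall back to dense depth registration
    fuse(keyframe, pose)                  # incremental dense map fusion (Sec. V-D)
    close_loops(keyframe)                 # loop detection runs for every keyframe
    return pose                           # the optimized pose is sent back to the client
```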
Light-Weight Tracking on the Client
In our robot system, we use RGB-D cameras to capture images. To build light-weight tracking on the client side, we use Oriented FAST and Rotated BRIEF (ORB) features [12]. ORB is a binary descriptor based on FAST keypoint detection, which enables feature matching on low-power devices without GPU acceleration. We take the first 3D depth measurements
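As a minimal sketch of CPU-only ORB extraction and matching, the following uses OpenCV; the feature count and ratio-test threshold are illustrative assumptions, not the values used on our robot.

```python
import cv2

orb = cv2.ORB_create(nfeatures=1000)   # binary ORB descriptors, no GPU required

def match_orb(img_prev, img_curr, ratio=0.75):
    kp1, des1 = orb.detectAndCompute(img_prev, None)
    kp2, des2 = orb.detectAndCompute(img_curr, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)      # Hamming distance for binary descriptors
    matches = matcher.knnMatch(des1, des2, k=2)
    # Lowe's ratio test keeps only distinctive matches
    return [m for m, n in matches if m.distance < ratio * n.distance]
```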
A. Keyframe Selection
The frame rate of the stream from RGB-D cameras is about 25 frames per second. These frames contain a large amount of redundant information, so many visual SLAM algorithms are based on keyframes selected in an appropriate way. In our system, a frame selected as a keyframe is sent to the local computing server, so keyframe selection is a crucial step. If too many frames are selected, much pressure is put on the network transmission; if too few frames are selected, we cannot obtain an effective reconstruction for the whole system. So it is not a uniform selection in time, and we need to balance keyframe selection against network transmission. Under the constraint of limited network bandwidth, we select as many keyframes as possible by the following rules:
C_1: the ID of the current frame is more than δ_1 greater than the ID of the last keyframe;
C_2: the ratio between the number of inliers and the number of all ORB features in the current frame is less than δ_2;
C_3: the difference of the camera pose vectors between the current frame and its closest keyframe in Euclidean space is greater than δ_3;
C_4: tracking fails.
Here δ_1, δ_2, and δ_3 are preset thresholds. If the above conditions are satisfied, the current frame is selected as a keyframe and sent to the local computing cloud (a sketch of the selection rule follows below). From these conditions, two consecutive keyframes will not overlap too much, since the overlap is controlled by δ_2 and δ_3.
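The sketch below expresses the rules C_1 to C_4 as a predicate. The threshold values and the exact way the four conditions are combined are assumptions made for illustration; the paper's own combination rule and parameter settings may differ.

```python
import numpy as np

def is_keyframe(frame_id, last_kf_id, n_inliers, n_features,
                pose, closest_kf_pose, tracking_lost,
                delta1=20, delta2=0.5, delta3=0.2):
    """pose / closest_kf_pose: camera pose vectors compared in Euclidean space."""
    c1 = (frame_id - last_kf_id) > delta1                  # enough frames have elapsed
    c2 = (n_inliers / max(n_features, 1)) < delta2         # the view content has changed
    c3 = np.linalg.norm(pose - closest_kf_pose) > delta3   # the camera has moved enough
    c4 = tracking_lost                                     # tracking failure forces a keyframe
    # assumed combination: a failure always triggers a keyframe, otherwise
    # require both a minimum frame gap and a visible change
    return c4 or (c1 and (c2 or c3))
```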
B. Local Model Update
Our local model is maintained on the robot client; it consists of the local keyframes and local map points, organized in a local covisibility graph \mathscr{L}_G, and is updated as follows. When a new keyframe is determined, a counter C[n] is set for the keyframe (initially set to 0), where n is the index of the keyframe in \mathscr{L}_G. New local map points from the n-th keyframe are created by triangulating ORB features from the connected keyframes in the local covisibility graph [11]. Before doing LBA, let C[i]=C[i]+1, i=1,\ldots,n (the details of LBA are given in subsection IV-C). If the 3D points and keyframes optimized in LBA can be seen by K keyframes, and \mathbf{S} is the index set of these K keyframes, then after doing LBA the counters associated with the K keyframes are decremented, C[k]=C[k]-1 for all k in \mathbf{S}. When a counter C[r] is greater than a threshold δ_4, the r-th keyframe is culled from the covisibility graph, together with the map points that can only be seen in the r-th keyframe.
In fact, the counters measure how important the keyframes are to the current frame. A large value indicates the keyframe is less important, since it has not been involved in LBA for a while; once the counter exceeds the threshold δ_4, the keyframe is removed from the local model as described above. A small sketch of this bookkeeping is given below.
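The following is an illustrative sketch of the counter bookkeeping only; the data layout, the value of δ_4, and the culling call are assumptions, not the actual implementation.

```python
def update_keyframe_counters(counters, new_kf_id, lba_kf_ids, delta4=10):
    """counters: dict keyframe_id -> C value; returns the ids of keyframes to cull."""
    counters[new_kf_id] = 0                     # C[n] starts at 0 for the new keyframe
    for kf_id in counters:                      # before LBA: C[i] = C[i] + 1 for all i
        counters[kf_id] += 1
    for kf_id in lba_kf_ids:                    # after LBA: C[k] = C[k] - 1 for k in S
        if kf_id in counters:
            counters[kf_id] -= 1
    # keyframes that have not been touched by LBA for a while are culled
    return [k for k, c in counters.items() if c > delta4]
```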
C. Local Bundle Adjustment
When the optimized camera pose of a keyframe is received from the local computing cloud, it is inserted into the local model and used as a constraint in the local bundle adjustment, so that the accumulated drift of the on-board tracking is corrected. The local bundle adjustment jointly refines the camera poses of the keyframes in the local model and the local map points they observe.
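For reference, a standard local bundle adjustment objective takes the following form (generic notation; the exact formulation used in our system may differ): \begin{equation*} \{T_{i}^{*}, X_{j}^{*}\}=\mathop{\arg\min}_{T_{i}, X_{j}} \sum_{(i,j)} \rho\left(\left\| \mathbf{u}_{ij}-\pi(T_{i} X_{j}) \right\|^{2}\right), \end{equation*} where T_i are the camera poses of the keyframes in the local model, X_j the local 3D map points, \mathbf{u}_{ij} the observed ORB keypoint of point j in keyframe i, \pi(\cdot) the camera projection function, and \rho a robust Huber kernel.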
Dense Mapping on Local Computing Servers
A dense map is rich in appearance and geometric information beyond a sparse map, and it is meaningful for many applications, for example robot path planning and indoor navigation. As keyframes are continuously received from the robot clients, the dense map is incrementally built on the local computing cloud. In order to achieve real-time dense mapping, we apply CPU-GPU hybrid programming.
We maintain a sparse model on the local computing server, \begin{equation*} \mathscr {S}=\{\mathbb {S}_{p}, \mathbb {S}_{G}, \mathbb {S}_{c}\}.\tag{1}\end{equation*}
The dense model is used for dense mapping; it includes the global dense point map \mathbb{M}, the associated normal map \mathbb{N}, the per-point confidence values \mathbb{C}, and the stored depth keyframes \mathbb{D}_d: \begin{equation*} \mathscr {D}=\{\mathbb {M}, \mathbb {N}, \mathbb {C},\mathbb {D}_{d}\}.\tag{2}\end{equation*}
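Purely as an illustration of how the dense model of Eq. (2) can be arranged in memory, the container below mirrors its four components; the field meanings are inferred from the surrounding text and are assumptions, not the actual data structures.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class DenseModel:
    points: np.ndarray = field(default_factory=lambda: np.empty((0, 3)))    # global point map M
    normals: np.ndarray = field(default_factory=lambda: np.empty((0, 3)))   # per-point normals N
    confidence: np.ndarray = field(default_factory=lambda: np.empty((0,)))  # per-point confidence C
    depth_keyframes: list = field(default_factory=list)                     # stored depth keyframes D_d
```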
The visual database is built incrementally for relocalization and loop closing, and records where the robot client has gone [32]. It includes a vocabulary tree and an inverse index table for fast image retrieval according to visual words, where the ORB vocabulary is learned offline in advance.
A. Keyframe Tracking
Our keyframes are RGB-D frames, and we consider motion tracking on both the RGB and the depth part of a keyframe in this paper. When a keyframe is received by the local computing cloud and the attached flag is 0, features are first extracted from the RGB frame and matched with those of the previous three keyframes. Then the current camera pose is estimated by the EPnP algorithm [39].
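A hedged sketch of this pose estimation step using EPnP [39] through OpenCV is shown below; the RANSAC parameters are illustrative assumptions rather than the values used in our system.

```python
import cv2
import numpy as np

def estimate_pose_epnp(points_3d, points_2d, K):
    """points_3d: Nx3 matched map points, points_2d: Nx2 keypoints, K: 3x3 intrinsics."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float64), points_2d.astype(np.float64), K, None,
        flags=cv2.SOLVEPNP_EPNP, reprojectionError=3.0, iterationsCount=100)
    if not ok:
        return None                      # pose estimation failed
    R, _ = cv2.Rodrigues(rvec)           # rotation vector -> 3x3 rotation matrix
    return R, tvec, inliers
```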
If the attached flag is 1, the camera tracking on the robot client is lost. Further, if no reliable candidate can be found by visual relocalization, i.e., relocalization fails, depth tracking is employed to register the depth keyframe against the dense model. This happens when there are few visual features or the images contain only low-quality features. If we know the last camera pose, we may project the last depth frame from the global depth map. Assume the current depth frame is transformed into the global coordinate frame with a rigid transformation T; the point-to-plane registration error is \begin{equation*} E_{icp}=\sum _{\mathbf {u}}||(T\mathbb {M}_{i}(\mathbf {u})-\mathbb {M}^{g}_{i-1}(\mathbf {u}))\cdot \mathbb {N}^{g}_{i-1}(\mathbf {u})||^{2},\tag{3}\end{equation*} where \mathbb{M}_i(\mathbf{u}) is the vertex of the current depth frame at pixel \mathbf{u}, and \mathbb{M}^g_{i-1}(\mathbf{u}) and \mathbb{N}^g_{i-1}(\mathbf{u}) are the corresponding vertex and normal projected from the global map.
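The error in Eq. (3) can be evaluated directly as in the sketch below, assuming dense per-pixel vertex and normal maps with a validity mask (the GPU implementation in our system is not shown here).

```python
import numpy as np

def icp_point_to_plane_error(T, verts_cur, verts_prev_g, normals_prev_g, mask):
    """T: 4x4 rigid transform; verts/normals: HxWx3 maps; mask: HxW boolean validity map."""
    v = verts_cur[mask]                                      # current vertices M_i(u)
    v_h = np.c_[v, np.ones(len(v))]                          # homogeneous coordinates
    v_g = (T @ v_h.T).T[:, :3]                               # T * M_i(u)
    diff = v_g - verts_prev_g[mask]                          # T*M_i(u) - M^g_{i-1}(u)
    residual = np.sum(diff * normals_prev_g[mask], axis=1)   # projection onto N^g_{i-1}(u)
    return np.sum(residual ** 2)                             # E_icp of Eq. (3)
```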
B. Local Mapping
When a keyframe with a zero flag is received, a new node is created for it in the covisibility graph. In the local mapping part, we optimize the current keyframe together with its connected keyframes in the covisibility graph, and all 3D points observed in these keyframes, by local BA. To keep the 3D points in the sparse map as accurate as possible, we adopt strategies similar to those in [11] in terms of map point culling, new point creation, and keyframe culling. The optimized camera pose associated with the current keyframe is sent back to the robot client to correct the localization of the robot.
C. Loop Detection and Correction
Loop closing is always a critical step in SLAM, since it can effectively suppress trajectory drift with loop constraints. When a keyframe is received, loop detection is performed for it in parallel by DBoW2 [32]. If the keyframe has multiple consistent loop candidates, it is stably determined to close a loop and the loop correction is activated. Assume the keyframe is denoted by
D. Dense Map Fusion
In our dense map fusion part, depth images are incrementally fused into the dense map frame by frame with real-time performance. The fusion not only adds new points to the dense map, but also handles ghost points. We set a confidence number c_k for each point in the dense map, and when a new observation \mathbb{M}^c_k(\mathbf{u}) with normal \mathbb{N}^c_k is associated with the map point \mathbb{M}^p_k with normal \mathbb{N}^p_k, the point is updated with weight α as \begin{align*} \mathbb {M}^{p}_{k}=&\frac {c_{k}\mathbb {M}^{p}_{k}+\alpha \mathbb {M}^{c}_{k}(\mathbf {u})}{c_{k}+\alpha }, \\ ~ \mathbb {N}^{p}_{k}=&\frac {c_{k}\mathbb {N}^{p}_{k}+\alpha \mathbb {N}^{c}_{k}}{c_{k}+\alpha }, \\ c_{k}=&c_{k}+\alpha.\tag{4}\end{align*}
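The update of Eq. (4) for one associated point is a simple confidence-weighted average, as in the sketch below (per-point CPU version; the actual system performs this fusion on the GPU).

```python
import numpy as np

def fuse_point(p_map, n_map, c, p_obs, n_obs, alpha=1.0):
    """p_*, n_*: 3-vectors; c: current confidence; alpha: weight of the new observation."""
    p_new = (c * p_map + alpha * p_obs) / (c + alpha)   # weighted position update
    n_new = (c * n_map + alpha * n_obs) / (c + alpha)   # weighted normal update
    return p_new, n_new, c + alpha                      # confidence grows with each fusion
```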
E. Sparse and Dense Model Update
Since loop correction significantly improves the camera pose estimates, the camera poses in our sparse model are optimized with the global constraints whenever a loop is detected, and the sparse map is updated accordingly. When the optimization of the camera poses finally reaches a stable state, the dense map is rebuilt according to the final optimized camera poses and the depth keyframes in the dense model. We use the fusion method of subsection V-D to recreate a more precise dense map with the GPU in a single thread.
Dense 3D Map Compression and Sharing On Private Cloud
Once a 3D dense map reconstruction is finished on the local cloud, it is uploaded to the private cloud for storage and sharing. This has two benefits: 1) it reduces the storage consumption on the computing servers; 2) it makes it easy to share the 3D dense map with other clients. In fact, uploading and downloading of dense maps between the local computing cloud and the private cloud does not require real-time performance. In order to reduce the pressure on network communication and storage space, we compress the 3D maps with the Octree point cloud compression algorithm [49], and the compressed dense map is then uploaded to and stored on the private cloud. The Octree compression algorithm is a 0–1 coding algorithm based on an octree decomposition of space, and it achieves a high compression rate.
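The sketch below is not the octree codec of [49]; it only illustrates the underlying idea of spatial quantization, namely that points falling into the same cell at a chosen resolution are merged, which is what makes the encoding lossy but compact.

```python
import numpy as np

def quantize_point_cloud(points, resolution=0.01):
    """points: Nx3 float array in meters; returns the centers of occupied voxels."""
    keys = np.floor(points / resolution).astype(np.int64)   # voxel index per point
    unique_keys = np.unique(keys, axis=0)                    # keep occupied voxels only
    return (unique_keys + 0.5) * resolution                  # lossy reconstruction at the given resolution
```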
Experimental Results and Analysis
We evaluate our system on indoor datasets, and also test it in five different real indoor scenarios. We take the robot 'Hori' built by ourselves as the client; it carries a laptop with an Intel Core i5-3210 (2.5 GHz, 4 GB RAM, no GPU). Our local computing cloud is a server with an Intel Core i7-4790 (4 GHz, 16 GB RAM) and a stand-alone GPU (NVIDIA GTX970). The robot and the local cloud are connected via a 5G WiFi network. The whole system is built on ROS. We keep the parameters the same in all experiments.
A. Data Transmission Between the Robot and the Local Cloud
We first introduce the structure of the data transmitted between the client and the local cloud. The time stamp is used as the identification, followed by the color image and the depth image. A 7D vector represents the pre-estimated pose, consisting of a rotation quaternion and a translation. The last field is a flag representing the state of the robot. Fig. 2(a) and (b) show the data structures used in our system. As we can see, the network bandwidth is consumed mainly by the image data. Since we only send keyframes, the network burden is reduced significantly, as in the works [2], [3].
The structures defined for data transport between the client and the server. The network consumption of the data is mainly from image data uploading.
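The container below illustrates a possible layout for the uplink message of Fig. 2(a); field names and types are assumptions based on the description above, not the actual message definition used in our ROS implementation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class KeyframeMsg:
    stamp: float           # time stamp, used as the keyframe identifier
    rgb: np.ndarray         # color image
    depth: np.ndarray       # registered depth image
    pose: np.ndarray        # 7D pre-estimated pose: quaternion (qx, qy, qz, qw) + translation (x, y, z)
    flag: int               # robot state: 0 = tracking ok, 1 = tracking lost
```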
B. Visual Odometry Evaluation on the TUM RGB-D Dataset
The TUM dataset is an RGB-D SLAM benchmark provided by the Computer Vision Group of the Technical University of Munich [33]. It consists of RGB-D data captured by a Microsoft Kinect sensor (first generation) and ground truth obtained by a high-accuracy motion-capture system with eight high-speed tracking cameras, for evaluating visual odometry and visual SLAM systems. We first test the visual odometry of our system on the robot side.
ORB-SLAM2 is a real-time state-of-the-art system running on a single computer [11]. We compare our system with ORB-SLAM2 on 20 sequences of the TUM RGB-D benchmark to evaluate camera tracking. We also report the camera tracking on the robot without the optimized poses received from the computing cloud, since camera tracking without the optimized poses is what the existing cloud SLAM frameworks [2], [3] employ. Table 2 reports the root-mean-square error (RMSE) of the trajectory recovered on the robot client. In all sequences, the tracking in our system provides results comparable with ORB-SLAM2, and both ORB-SLAM2 and our system outperform the results obtained without the local cloud optimization, because the keyframe optimization on the local computing cloud controls the drift of the trajectories effectively.
We also report in Table 2 the camera localization for the keyframes of our system running on the local computing cloud server. From the table, we can see that our system achieves performance comparable with ORB-SLAM2. Although more keyframes are generated, this part of our system runs on the local computing cloud, so the processing speed is not a problem at all. Moreover, our algorithm requires less memory and less computation on the robot, so it can be implemented on embedded systems. We evaluate the memory requirement in subsection VII-E.
C. Map Evaluation on ICL-NUIM Dataset
We evaluate the 3D map reconstruction generated by our system against the DVO SLAM system [5], the RGB-D SLAM system [33] and ElasticFusion [6] on the ICL-NUIM dataset. The ICL-NUIM dataset is a recently released dataset by Handa et al. of Imperial College London and the National University of Ireland [42]. It provides ground truth not only for benchmarking visual odometry but also the 3D surface model needed to evaluate surface reconstruction accuracy. Because only the four sequences of the Living room scene have 3D surface ground truth, while the four sequences of the office room scene come without a 3D model, all four sequences of the living room scene are used to evaluate the mapping of our system. Table 3 summarizes the 3D map reconstruction results. The numbers in the table are the mean distances from each reconstructed point to the nearest point in the ground truth model. Our 3D reconstruction results are superior to all other systems on the four sequences. The 'lr kt3' sequence triggers a global loop closure in our system, so our results on it are significantly improved compared with the other methods. Note that the number of map points from our system might be smaller than that of ElasticFusion, because the dense mapping on the local cloud server runs on keyframes rather than on all sequential frames.
D. Performance on the Real Data
The five real-scene datasets captured by our robot are Exhibition room, Apartment 1, Apartment 2, Workspace, and Lab. The details of the datasets are reported in Table 4: the first row shows the duration of the sequences, the second row the total number of frames, and the last row the number of keyframes selected and sent to the local computing cloud by our system. Fig. 3 (a)(b)(c)(d)(e) shows some example images of the real data from the robot's point of view. Since we do not have ground truth, we only qualitatively compare the dense mapping results with ElasticFusion [6]. Fig. 4, Fig. 5 and Fig. 6 show the dense maps reconstructed by ElasticFusion and our proposed system in three different views. The first rows of Fig. 4(a)(b), Fig. 5(a)(b) and Fig. 6(a) show the results of ElasticFusion in three different views; the second rows are produced by our system in top view, horizontal side view and tilted view, respectively. From the details of the maps we can see that our method produces more accurate dense maps than ElasticFusion, as evidenced by flatter floors and more rectangular rooms in all five sequences. We also overlay the robot trajectories produced by our method on the dense maps in green.
3D dense maps generated by ElasticFusion and our proposed system on Exhibition Room and Apartment 1 in three different views. (a) shows the comparison results on Exhibition Room, (b) shows the comparison results on Apartment 1. The first row of (a)(b) shows the results of ElasticFusion in three different views. The second row of (a)(b) is produced by our system in top view, horizontal side view and tilted view, respectively. The green trajectories in the dense maps represent the camera poses generated by our system.
3D dense maps generated by ElasticFusion and our proposed system on Apartment 2 and Workspace in three different views. (a) shows the comparison results on Apartment 2, (b) shows the comparison results on Workspace. The first row of (a)(b) shows the results of ElasticFusion in three different views. The second row of (a)(b) is produced by our system in top view, horizontal side view and tilted view, respectively. The green trajectories in the dense maps represent the camera poses generated by our system.
3D dense map generated by ElasticFusion and our proposed system on Lab in three different views. The first row shows the results of ElasticFusion in three different views. The second row is produced by our system in top view, horizontal side view and tilted view, respectively. The green trajectory in the dense map represents the camera poses generated by our system.
E. Memory Evaluation on the Robot
One benefit of our system is that it requires only a very small memory footprint on the robot client, since we offload the computing and memory consumption to the local computing cloud. In this subsection, we analyze the memory load on the client side to demonstrate this advantage. We compare our memory consumption with the ORB-SLAM2 algorithm on the longest sequence of the real data, Apartment 2, which has 15,545 frames in total. Fig. 7(a) shows the memory load comparison. The x axis denotes the frame ID, and the y axis denotes the number of map points. The red line is from ORB-SLAM2, and the green line is generated by our system. As we can see, the memory load keeps growing as frames are processed in the ORB-SLAM2 algorithm, but it always stays within a limited size in our system, since we only maintain a limited local model with local constraints. Nevertheless, our system does not suffer from performance degradation, because the optimized camera poses from the local cloud server still exert their effect on the trajectories on the robot client.
F. Network Delay Analysis Between the Robot and the Local Cloud
One of the big problems for cloud visual SLAM is to handle the network delay between the robot and the local computing cloud server. In this subsection, we analyze the network delay of our system in detail for the application of indoor service robots. In our experimental setup, the robot client is connected to a 5G WiFi router, and the router is connected to the local computing cloud server by cable, which is a very common LAN configuration. We observe that most of the network delay comes from the wireless transmission between the robot and the router, and there is very little delay between the router and the local cloud server. Our system performs real-time tracking on the robot (around 25 fps), and the optimized camera poses associated with keyframes returned from the cloud are used to correct the tracking trajectory through the optimization in local BA. As long as the optimized camera pose corresponds to a keyframe still in the local sparse model, it exerts an effect on the tracking. The ratio between keyframes and frames in the five real datasets is reported in the last row of Table 4, which shows that nearly 2 keyframes are selected per second. Most of the time we maintain about 100 keyframes in the local model on the robot, which means an optimized camera pose can still take effect even if it belongs to a keyframe sent about 50 seconds earlier.
In order to demonstrate how our system tolerates the network delay of a LAN, we conduct the following experiment. We measure the keyframe processing time on the local computing cloud, denoted by
Network delay measurement and analysis.
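The back-of-the-envelope check below makes the above tolerance argument concrete; the numbers in the example call are assumed values, not measurements from our experiments.

```python
def pose_still_useful(t_process, t_network_rtt, local_model_size=100, keyframe_rate=2.0):
    """All times in seconds; keyframe_rate in keyframes per second.

    A returned pose is useful only while its keyframe is still in the local model,
    i.e. within roughly local_model_size / keyframe_rate seconds (~50 s here).
    """
    window = local_model_size / keyframe_rate
    return (t_process + t_network_rtt) < window

# e.g. pose_still_useful(0.2, 0.05) -> True, comfortably within a domestic WiFi LAN budget
```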
G. Map Compression and Transmission Between the Local Cloud and the Private Cloud
We apply the Octree point cloud compression algorithm before data transmission, and set three different compression resolution parameters for the algorithm.
Conclusion and Future Work
In this paper, we present an online cloud visual SLAM system suitable for LAN environments. Our system differs from ORB-SLAM2 in the following aspects: 1) ORB-SLAM2 is a real-time system on a single PC, while our system, taking the network delay into account, is a real-time system on a networked framework; it distributes the sparse tracking and the dense mapping separately onto the robot client and the local computing cloud. 2) ORB-SLAM2 does not output a dense map explicitly, but our system does. 3) We have different keyframe selection strategies: the more keyframes are sent to the local cloud, the more accurate the map is, but more bandwidth is required to maintain real-time performance; our keyframe selection is denser, and we balance the bandwidth volume against the 3D map reconstruction accuracy. 4) In the tracking on the robot client, we maintain a local model for local optimization with the updates from the cloud side to reduce the drift of the tracking part, and we provide a dynamic selection strategy for the local model. 5) In the dense mapping, we apply not only feature-based reconstruction, but also depth registration to enhance the keyframe tracking.
Moreover, the map-centric method ElasticFusion builds its dense map from consecutive depth frames, while we build the dense map only from depth keyframes, and the accuracy of our dense map is higher than that of ElasticFusion in the quantitative evaluation.
With a robust visual odometry running on the robot and an accurate dense map generated on the local computing cloud, our system can potentially reduce the cost of the robot client, and the dense map is easy to share with different clients through the private cloud. As shown above, our system is resistant to the network delay of a LAN, so it provides an effective vSLAM solution for indoor service robots. Our future work is to extend our cloud SLAM system to outdoor environments and make it applicable to larger-scale areas, where it could be used, for example, in intelligent vehicles.
ACKNOWLEDGMENT
(Yali Zheng and Shinan Chen contributed equally to this work.) Shinan Chen was with the School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China.