Introduction
Making a robotic system fully autonomous is an important problem for modern researchers [1], [2], [3], [4]. In order to operate autonomously, the system needs to know its position on the map (environment). In the case of an indoor or underground environment, global position estimation like GPS is unable to work, so a robotic system has to estimate its position using its own sensors. If the environment is unknown, a map also needs to be built. To solve this problem, Simultaneous Localization and Mapping (SLAM) techniques are needed.
Common SLAM methods like ORB-SLAM [5] build sparse maps because they only map the coordinates of the features extracted from input images. These sparse maps are inappropriate for path planning, as a planner may create a path that intersects unmapped obstacles.
Dense SLAM methods like RTAB-MAP [6] or Cartographer1 store maps as occupancy grids, point clouds, or OctoMaps [7]. In large environments, these maps require large amounts of memory for storage, which may cause the robot’s memory to overflow [8]. Additionally, as the map grows in size, the computational costs of map updating and path planning also increase. This can lead to delays in map updates, failures in loop closure, and the accumulation of errors.
One way to address this issue is to use topological structures to represent a map (see Figure 1 for an example). Examples of such topological structures include Voronoi graphs [9], graphs of rooms [10], or graphs of free space clusters [11]. These graphs enable high-speed path planning and significantly reduce memory consumption [10], [11]. An example of a topological and metric map built using the method [12] is shown in Figure 2, along with the sizes of both types of maps.
An example of metric map (gray) and topological map (red vertices and black edges).
Hierarchical metric-topological map built with Hydra [12], consisting of semantically annotated 3D mesh and four-level graph (places, objects, rooms, and buildings). The size of four-level topological graph is much smaller than the size of the 3D mesh.
In recent years, a large number of topological mapping methods have emerged. Some of them, such as [10] and [13], construct a graph-like representation of the environment in conjunction with a conventional metric map. Some other methods, like [14] and [15], build pure topological maps without relying on metric coordinates. Instead, they link places in the graph using images captured from those locations. And methods like [12] and [16] build hierarchical representations of the environment, incorporating semantically annotated 3D metric maps and multi-level topological graphs.
To ensure successful robot navigation, it is crucial to have connected topological graphs that accurately represent the environment’s topology. When using an unconnected topological graph, a planner is unable to generate a path from one area of the environment to another. Additionally, if the topology of the map is incorrect, such as a corridor bifurcation (Figure 3), a planner may create paths through non-existent passages in the real world, potentially resulting in collisions.
An example of incorrect topological mapping, which often occurs due to failed loop closure and robot positioning errors. A single corridor is mapped twice (green and violet arrows).
So, the topological graphs built for robot navigation should be evaluated taking into account connectivity and correctness of the topology (one-to-one correspondence between real-world areas and places in the topological graph). However, in most of the works related to topological mapping, these values are not taken into account. For example, in [10], only planning speed and path shortness are measured. In [14] and [15], only the number of vertices of the topological graph and navigation success are estimated. In [12], the evaluation metrics include object localization errors, object registration percent, room segmentation quality, number of loop closures, and performance values.
To bridge this gap, we propose a set of novel evaluation metrics for topological graphs. These metrics assess connectivity, consistency, and navigation efficiency within the given graph. These metrics are described in Section IV. With these metrics, we conduct an extensive experimental evaluation of five open-source topological mapping methods in terms of topological graph consistency. These methods are Hydra [12], S-graphs+ [16], IncrementalTopo [17], ETPNav [18], and TSGM [19]. For the evaluation, we use 15 scenes from the Matterport 3D dataset [20] in the Habitat simulator [21].
The rest of our paper is organized as follows. Section II presents a broad review of SLAM, mapping, and map evaluation methods. In Section III, a problem of topological mapping is formulated. In Section IV, we discuss the problem of topological map building and introduce the quality estimation metrics for it. Section V provides a detailed description of the methods Hydra, S-graphs+, and IncrementalTopo. In Section VI, we present an extensive experimental evaluation of the methods discussed in Section V, utilizing our novel graph evaluation metrics. Finally, in Section VII, we conclude our paper.
Related Work
A. Mapping
Map building is crucial for mobile robot navigation in unknown environments, and a significant amount of work is dedicated to developing mapping methods. One popular solution for map building is the utilization of localization methods along with mapping, which is known as Simultaneous Localization and Mapping (SLAM).
A common approach to mapping involves constructing a spatial representation of the environment in the form of a 2D or 3D metric map. Local metric maps can be obtained directly from the robot’s perception sensors and then merged into a global metric map. For instance, KinectFusion [22] builds a dense 3D map using the Microsoft Kinect RGB-D sensor. Open-source mapping methods such as GMapping2 and Hector-SLAM [23] construct 2D occupancy grid maps using 2D laser scans from the robot’s lidar. Methods like RTAB-MAP [6] and Cartographer [24] build both 2D occupancy grids and 3D point cloud maps from various sensor types (RGB-D camera, 2D or 3D lidar) while simultaneously performing correction of robot pose estimation and loop closure. The Voxblox method [25] generates a dense 3D map in the form of a Euclidean Signed Distance Field (ESDF) from input point clouds acquired from a 3D lidar or an RGB-D camera. Voxgraph [26], on the other hand, fuses small ESDF submaps constructed by Voxblox into a global ESDF map to mitigate odometry error accumulation and enhance loop closure efficiency.
However, while mapping a large environment over a long period of time, odometry error accumulation and map size growth may cause high memory consumption and inaccurate mapping. In work [8] two state-of-the-art metric mapping methods, RTAB-MAP and Voxgraph, were evaluated in a large indoor environment. After the robot traveled 1 km with an odometry drift, memory consumption for both methods exceeded 1 GB, and with RTAB-MAP, memory consumption exceeded 8 GB in some tests. Additionally, due to the accumulation of odometry drift, the maps were built incorrectly, with “bifurcated corridors” (see Figure 4).
Maps built in a large indoor environments under different degree of odometry drift, for RTAB-MAP (left) and Voxgraph (right) methods. With high odometry drift, both methods failed to close loops, which lead to corridor bifurcations.
Another popular approach to environmental representation is the construction of topological maps. Some methods, such as [11], [27], and [28], build a topological graph offline using a pre-built metric map. Another family of methods, including [10], [17], [29], [30], [31], [32], constructs a topological graph paired with a metric map from raw sensor data. In the method [10] a topological map represents a graph of rooms, with a metric map built for each room. In the method [29], the topological map represents a graph of small-size TSDF (Truncated Signed Distance Function) submaps. The methods [17], [30], [31], [32] represent the topological map as a graph of free space points, where an edge denotes a straight-line path between two points. The metric maps in the works [31], [32] are built locally for obstacle avoidance. In the work [17], a global 2D metric map is built online as a basis for topological map construction.
Another popular approach in topological mapping is the creation of a detailed hierarchical environment model with a 3D metric map and a multi-level topological graph. The method [12] builds a metric-semantic-topological scene representation, which includes semantically annotated 3D meshes and a graph of locations, objects, and rooms. The method [16] constructs a similar topological structure, but with a 3D scene point cloud and a graph consisting of locations, walls and rooms. The method [33] builds a hierarchical topological map with rooms, space regions, and free space clusters. Lastly, the method [34] builds a TSM (Topological Scene Map) with a graph of rooms, doors, and objects and a global 2D metric map.
In recent years, there has been significant progress in deep learning techniques, leading to the evolution of topological mapping methods based on learned features. For example, method [15] constructs a graph of locations with visibility-based edges to solve Object Navigation task. Each vertex in the graph is associated with a 360-degree image captured from its corresponding location, and vertex connectivity is computed using a ResNet-based neural network. In the work [35], a graph of keyframes is built as a topological map, with CNN-predicted features assigned to each keyframe. In the work [18], a topological map consisting of locations connected in case of direct visibility is built for solving the Visual-Language Navigation task. New nodes of the topological graph are predicted by a transformer-based network and are attributed with neural network-predicted features. And in work [14], a purely topological map is built for long-term visual navigation. Each location in the topological map is associated with the image captured from the location, and navigation is performed by a neural network that predicts the relative pose between two images. The topological graph is updated based on the reachability of locations. In the work [19], a purely topological map is built for image-goal navigation. Each location on this map is assigned a neural network-predicted embedding of the corresponding panoramic image. The method operates solely with panoramic images as input, ignoring the robot’s position information. Localization is performed using the panoramic image embeddings. Besides a graph of locations, an objects graph is also built using MaskRCNN [36] object detector. The graph of locations is updated using the objects graph, while the objects graph is also updated using the locations graph.
In our work, we focus on online topological mapping methods that build a graph of locations during robot motion. For our experimental evaluation, we chose five open-source methods: [12], [16], [17], [18], [19]. Each of these methods builds a graph of free space points, which are connected by an edge if there is a straight-line path between them. We assess these methods using multiple indoor scenes and conduct a comparative analysis of them.
B. Navigation
For mobile robot navigation in a mapped environment, the most common approach is the use of path planning algorithms to create a path to the target point and a local path follower to follow this path. In the case of global 2D grid maps, grid-based path searching algorithms like A* [37] or Theta* [38] are widely employed. In partially known 2D grid maps, dynamic algorithms like D* [39] or sampling-based algorithms like RRT [40] are typically used. These path planning algorithms give short and collision-free paths that are suitable for robot navigation.
However, in large environments where the grid size can reach thousands of cells, path planning times become excessively long and inefficient for real-time operation. For instance, path planning using the RRT algorithm took 2s in the work [11] and 67s in the work [10].
In contrast, topological maps simplify the task of path planning, as it can be efficiently solved using graph path search algorithms such as Dijkstra [41] or A* [37]. However, the challenge lies in effectively following the generated path. Some approaches address this issue by utilizing local metric maps for local motion planning and obstacle avoidance. For example, in works [31] and [42], a local 2D map is used, and in work [32] a local traversability map derived from an elevation map is used. In works 14]–[15, and [18], path following is performed by a neural network that predicts a relative pose between two locations. In the method [15], agent motion is also performing by a learned policy which infers actions (forward, left, right) from a visual observation and a relative pose to the next location.
C. Map Evaluation
Evaluation of Simultaneous Localization and Mapping (SLAM) methods is a well-studied research field, with a vast amount of datasets [43], [44], [45], [46] containing precisely reconstructed 3D maps, as well as benchmarks [47], [48], [49], [50] for SLAM testing. However, a significant part of these benchmarks is focused on trajectory error estimation without paying attention to map quality. The quality of SLAM-builded 3D maps is evaluated in some works, like [50] and [51] as an average distance from each point of the estimated map to the surface of the ground truth map. Also, in work [52], mapping error is estimated in a similar way, but taking into account the robot’s trajectory and comparing points on rays casted from poses of the estimated and ground truth robot’s trajectory.
Such metric map evaluation techniques can be easily applied to different kinds of 3D maps built from different sensors. However, for topological mapping methods, where a metric map plays a second role or is completely absent, these techniques are inappropriate. Some topology-based map evaluation metrics were introduced by Schwertfeger’s team [53], [54]. They evaluated the quality of metric maps by comparing generalized Voronoi graphs [55] of estimated and ground truth maps. However, this approach is not suitable for topological mapping methods without a global 3D metric map, since the ground truth Voronoi graph for maps built by such methods is unknown.
In a number of works devoted to topological mapping [12], [16], [56], [57], a graph of rooms is built and evaluated using standard metrics like precision and recall. For room graph evaluation, a ground-truth room segmentation is used. Room graph quality is important for topological navigation, however, a lot of methods build a graph of smaller scale, known as a graph of locations, with multiple locations within a single room. This type of graph is more suitable for navigation because of its density and straight-line traversability along the graph’s edges. Therefore, it is crucial to estimate the quality of these graphs for successful robot navigation. In this work, we introduce novel metrics for evaluating the quality of topological graphs, which are presented in Section IV.
Problem Formulation
Consider a robot equipped with a perception sensor (like an RGB-D camera or a 3D lidar) and a sensor for pose estimation (like an IMU or wheel encoders) in a large indoor environment. Its task is to build a topological graph of the environment which can be used for future navigation. The robot moves through the environment along a certain trajectory denoted as
The output of a topological graph construction algorithm
Topological mapping problem: robot receives a local point cloud
The final result of a graph construction algorithm
Evaluation Metrics
There exists a number of metrics of map quality estimation, as described in Section II. However, most of these metrics are primarily designed for evaluating metric or hybrid metric-topological maps, or assessing the quality of room graphs. In order to evaluate a graph of locations in terms of navigational efficiency, metrics such as CLS [58] or DTW [59] can be used. These metrics measure similarity between a robot’s path and some reference path. However, our task setup does not assume a clearly defined reference path. Additionally, using the shortest path between two nodes as a reference path is not suitable, as there may be multiple paths of similar length with edges that are widely dispersed from each other (see Figure 6). Therefore, for our experimental evaluation of topological mapping methods, we present a set of topological graph evaluation metrics that measure connectivity, consistency, and navigation efficiency.
Two shortest paths between nodes A and B which lie far from each other. DTW or CLS metrics show low similarity value between green and red paths whereas they are both proper for effective navigation.
A. Connectivity
The first parameter in our evaluation is topological graph connectivity. It is crucial for successful navigation because in an unconnected graph, a path planner fails to create a path. We evaluate connectivity using several criteria. Firstly, we measure the number of connected components in a topological graph. All the components with three vertices or fewer are considered outliers and are excluded from our evaluation.
Next, we calculate part of the free space area covered by the main connectivity component of the evaluated topological graph. This metric shows how much space is available for robot navigation within the graph. We consider an area covered if it is observed from a vertex of the topological graph within a distance of
An example of topological graph evaluation in terms of connectivity. Left graph is connected, and right graph is unconnected, so there is no path from the room in the left corner to the room in the right corner.
B. Graph Consistency
In the case of long-term SLAM operation in large environments, odometry error accumulation can lead to severe mapping errors. In some cases, a single location in the environment may correspond to two or more places on the map, such as the “corridor bifurcation” shown in Figure 3. Such inconsistencies can result in localization and navigation failures.
To track such mapping disorders, we introduce a graph consistency metric. Our metric is based on the following concept: all pairs of vertices representing neighboring locations in the environment should be close to each other in the topological graph. To evaluate this, we first determine the poses (positions) of the topological graph vertices in the ground truth map. Next, we calculate the average distance between pairs of topological graph vertices that have similar positions in the ground truth map. The formal description of the metric is provided below.
Assume we have a topological graph \begin{equation*} (v_{x}, v_{y}) = M_{p_{v_{t}}} M_{\tilde {p_{v_{t}}}}^{-1} (\tilde {v_{x}}, \tilde {v_{y}}) \tag{1}\end{equation*}
An example of ground truth robot pose estimation. Ground truth vertex cooridnates
After reconstruction of the ground truth vertex positions \begin{align*} d(u, v) &= \sqrt {(u_{x} - v_{x})^{2} + (u_{y} - v_{y})^{2}} \tag{2}\\ N &= |\{ (u, v): d(u, v) < \delta \}| \tag{3}\\ \text {consistency}_{\delta } &= \frac {1}{N} \sum \limits _{d(u,v) < \delta } \frac {\min (d(u, v), d(\tilde {u}, \tilde {v}))}{d(\tilde {u}, \tilde {v})} \tag{4}\end{align*}
Estimation of the consistency metric. Red dots on the top view show the source vertex positions in the topological graph, and blue dots on the bottom view show the corresponding ground truth vertex positions. The consistency value is the mean ratio of the distances between closely located ground truth vertex positions (
Also, in order to evaluate navigation safety in the topologlcal graphs, we estimate the percent of edges passing through an obstacle. We use the reconstructed ground truth poses of the topological nodes for this:\begin{align*} P_{\text {obstacle}} = \frac {|(u, v) \in E: \text {cross}_{-}\text{obstacle}((u_{x}, u_{y}), (v_{x}, v_{y}))|}{|E|} \tag{5}\end{align*}
C. Navigation Effectiveness
Providing collision-free and efficient navigation is crucial for mobile robots. Therefore, paths within a topological graph should be both obstacle-free and short. In our evaluation, we sampled start-goal pairs within the free space area of the ground-truth map. We then find the nearest nodes in the topological graph to the start and goal positions and check for an obstacle-free path between them. We consider navigation successful if such a path exists.
To evaluate the navigation efficiency, we estimate the average success rate over the sampled start-goal pairs. Additionally, we calculate the Success Weighted by Path Length (SPL) metric, which represents the ratio between the path length in the topological graph and the shortest path length in the ground-truth 2D map. An example illustrating the calculation of these metrics is shown in Figure 10, and the formal description of these metrics is given below.
An example of estimation of Success and SPL metrics in a topological graph with sampling two start-goal pairs: A-B and C-D. Shortest paths in the ground-truth map are marked with dashed red line, and shortest paths in the topological graph are marked with solid black line. Edges of the topological graph are marked with thin blue lines.
Consider that we sampled \begin{align*} \text {Success} &= \frac {1}{N} \sum \limits _{i: \exists \text {path}(u_{i}, v_{i}, G)} \text {is}_{-}\text{free} (\text {path}(u_{i}, v_{i}, G)) \tag{6}\\ \text {SPL}& = \frac {1}{N} \sum \limits _{i=1}^{N} \frac {|\text {path}(s_{i}, g_{i}, M)| \text {is}_{-}\text{free}(\text {path}(u_{i}, v_{i}, G))}{|u_{i} - s_{i}| + |\text {path}(u_{i}, v_{i}, G)| + |v_{i} - g_{i}|} \tag{7}\end{align*}
Evaluated Methods
A. Hydra
Hydra [12] is a real-time SLAM method that utilizes the concept of a 3D Scene Graph to represent the environment. The 3D scene graph is a hierarchical, multi-layered environment representation proposed in the work [60]. An example of a 3D scene graph is provided in Figure 11. It consists of five layers: metric-semantic mesh, objects and agents, places (locations in the free space) and structures (walls, floors, etc.), rooms, and buildings. Each layer represents a graph, where the nodes are connected with each other and connected to the corresponding nodes of the upper and lower layers. For instance, each object is connected to the corresponding set of points in the mesh, and each room is connected to all the places located within it. The graph of places in the third layer represents the topological graph of the environment. For Hydra, we evaluate this graph and also take into account connections in the rooms layer.
Hydra builds the metric-semantic mesh using the Kimera [61] algorithm. Additionally, Kimera tracks the robot’s pose using visual-inertial observations and constructs the agents layer of the 3D Scene Graph. The objects and places layers are built using an algorithm [9], which is based on Generalized Voronoi Diagram. The room graph is constructed using TSDF maps. By applying “obstacle inflation” (decrease of all the values in the TSDF), the map is divided into connected components, with each component marked as a room.
For loop closure, Hydra involves a two-level hierarchical scheme. Each vertex of the robot pose graph is assigned a hierarchical descriptor, which consists of a conventional DBoW2-based [62] descriptor of the place as well as descriptors of the nearest objects and places. At the stage of loop closure detection, matches are first searched using place descriptors, then using object and DBoW2 descriptors. Loop closure verification and pose shift estimation are performed from bottom to top, first using the pose graph and RANSAC [63] algorithm, then using the object graph and TEASER++ [64] method. Once the loop closure is found and verified and the pose shift is estimated, the entire graph is deformed using the algorithm [65].
Hydra constructs a detailed and well-structured hierarchical representation of the environment, which can be valuable for solving various tasks such as delivery and other services. However, it requires significant computational resources, and graph deformations during loop closures can take several seconds. Furthermore, at the time of writing this paper, there is no navigation system based on the Hydra-built 3D Scene Graph. This presents an opportunity for future work in computational optimization and the development of a navigation pipeline. In our study, we evaluated the third layer of the 3D Scene Graph, specifically the place layer, as a topological graph. We also considered connections in the room layer to enhance the graph’s connectivity.
B. S_graphs+
Method S_graphs+ [16] builds a hierarchical 3D scene representation. However, unlike Hydra, its representation consists of four layers: keyframes, walls, rooms, and floors. An example of such a representation is shown in Figure 12. The keyframes layer contains a graph of robot pose estimates attached to the corresponding keyframes. The walls layer contains a graph of wall planes, where connections denote adjacency between two walls. Additionally, each wall is connected to the keyframes from which it is observed. The rooms layer contains room nodes that are connected to each other in the case of door existence and connected to the walls they are bounded by. The nodes of the floors layer contain the center of the corresponding floor and are connected with all the rooms on the floor.
An example of a hierarchical map built by S_graphs method (taken from work [16]).
All the graph layers are created with constraints for graph optimization. The keyframes of the first layer are constrained using pairwise odometry measurements. The walls of the second layer are constrained using observations from the keyframe positions. The rooms of the third layer are constrained by the planes of the surrounding walls. The floors in the fourth layer add constraints for two-wall rooms (e.g., corridors) in the third layer.
Room segmentation is performed using wall planes and free space clustering. S_graphs+ builds a free space graph from an input ESDF map using the method described in [9]. The ESDF map is built using the Voxblox [25] method. The free space graph is divided into a set of clusters that form rooms. Clustering is performed by filtering out vertices of the free space graph whose distance to obstacles is less than a certain threshold - the connectivity components of the filtered graph become the clusters and form rooms together with the surrounding walls.
Similar to Hydra, the S-graphs+ method performs a two-level loop closure: “soft” (using room segmentation) and “hard” (using scan matching). It also builds a multi-level hierarchical representation of the environment, but without a semantically annotated mesh. Due to the absence of semantic information, S_graphs requires fewer computational resources than Hydra and performs graph optimization faster. The room graph built by S_graphs is suitable for high-level task planning, and the free space graph built during the room segmentation process can be used as a topological graph for robot navigation. We use this free space graph for our evaluation.
C. Incrementaltopo
IncrementalTopo [17] builds a 2D topological map from a 2D grid map. The 2D grid map is constructed by merging local 2D grids obtained from the robot’s perception sensors, such as a lidar or an RGB-D camera. It is then transformed into a distance map using a Gaussian kernel. The resulting distance map is further processed using a Laplacian filter and a thinning algorithm to obtain the skeleton graph. The process of building the maps in IncrementalTopo is illustrated in Figure 13.
Topological map construction by IncrementalTopo method: distance map extraction, edge search, skeletonization, and topological map creation. Taken from work [17].
For incremental map update, the built skeleton graph is compared with the local occupancy grid obtained from the current sensor measurement. During this comparison, the connectivity between the new skeleton part and the previous global skeleton is checked. The skeleton in the update area is updated keeping connectivity. For topological graph update, a local topological graph is constructed from the local skeleton in the update area. Its vertices near the update area frontiers are then stitched with the old graph vertices, and the vertices of degree 2 are removed with corresponding edge straightening.
IncrementalTopo is a simple and straightforward method for building a topological graph from laser or RGB-D scans. However, it does not include a pose correction technique. As a result, in the presence of noisy odometry, the resulting topological map may become disconnected. In Section VI, we thoroughly evaluated the topological maps built under different levels of noise.
D. TSGM
TSGM (Topological Semantic Graph Memory) is a topological mapping method designed for solving Image-Goal Navigation task. The scheme of this method is shown in Figure 14. The method takes RGB-D observations and the target image as inputs and constructs a graph that encompasses both images and objects in order to reach the goal depicted on the target image. The image and object graphs are fused together using a neural network-based cross graph mixer and are fed into an action policy network to generate an appropriate action.
For image graph construction, an embedding is generated for each observation using a neural network. Localization is performed measuring similarity of these embeddings and the embeddings of the nearby objects. If an image is successfully localized, it is used to update the corresponding graph node. Additionally, the corresponding graph node is linked with the last localized node. If localization fails, a new node corresponding to this image is added into the graph. This new node is connected to the previous image node. Objects are detected using a Mask R-CNN detector and are connected to their corresponding image node within the graph.
TSGM is a learning-based method that constructs a topological graph solely based on RGB-D observations, without any positional input. This approach may cause false localizations. Consequently, if the camera parameters deviate significantly from those present in the training set, the method may build graph with spurious edges. So, to evaluate TSGM accurately, we adopted the same camera parameters as those used in the training set.
E. ETPNav
ETPNav (Evolving Topological Planning for Vision-Language Navigation) [18] is a method designed for Visual-Language Navigation (VLN) task which builds a topological graph during operation. In the resulting topological graph, each node represents a viewpoint and is associated with its position and an embedding derived from the corresponding panoramic RGB-D image. Nodes are connected by edges if there exists a direct line of sight between them. The embeddings of the panoramic images are fused with the text embeddings of the task instruction and are then fed into an RL policy network which generates actions. A scheme of the topological mapping part of the ETPNav method is shown in Figure 15.
A scheme of the topological mapping part of the ETPNav method (taken from work [18]).
At each step of the navigation process, a transformer-based neural network is utilized to predict several nearby waypoints, which represent potentially accessible locations near the agent. These waypoints are stored as “ghost nodes” in the graph, an their positions are estimated using the current agent’s position, waypoints’ angles, and depths. Localization is performed using solely positional information. For localizing the current node, the graph node with minimal Euclidean distance is selected. To localize a ghost node, the nearest ghost node is chosen in the graph. This nearest node is linked with the localized node by an edge.
ETPNav is a straightforward topological mapping method that operates on raw sensor data, without utilizing metric maps to construct a graph. However, the topological mapping module in ETPNav is designed for solving short navigational tasks specified by a text instruction, not for long-term navigation with external robot motion algorithm. Furthermore, similar to IncrementalTopo, ETPNav does not incorporate a pose correction module, so its localization relies on raw position data only. Additionally, ETPNav employs a neural network for waypoint prediction that is sensitive to camera parameters, necessitating either network fine-tuning or input image transformations. Consequently, in our experiments, we adopt the same camera parameters as specified in the author’s configs.
Experiments
A. Experimental Setup
For our experimental evaluation, we utilized the Habitat simulator [21] and a sample of 15 scenes obtained from the Matterport3D dataset [20]. The maps, scans, and trajectories for two scenes from this sample are depicted in Figure 16. The key characteristics of these scenes are presented in Table 1.
Maps and panoramic RGB-D images taken from two scenes of our dataset sample. Robot’s trajectores are shown as blue lines on the maps.
At each scene, a virtual agent moved along a predefined trajectory. At each step, the agent received an observation consisting of a point cloud created from a panoramic RGB-D scan and a pose estimation. The motion speed was set to
We carried out two series of experiments. First, we assessed all the considered methods using ground truth position data as input. Next, to model real odometry sensors, we added Gaussian noise to the position data. The linear standard deviation of the noise was set to 0.0075, and the angular standard deviation of the noise was set to 0.015. Linear noise was added to the agent’s linear speed, and angular noise was added to the agent’s angle. Both linear and angular noises were updated every 10 seconds.
ETPNav and TSGM are based on neural networks, so they are sensitive to the input image parameters. To check the robustness to image parameters change, we ran TSGM on the first scene of our dataset with our experimental setup. We observed that locations positioned at opposite ends of the map were erroneously connected as proximate locations (see Figure 17. Consequently, we further evaluated the TSGM and ETPNav methods employing the camera parameters specified in the authors’ configs. As TSGM does not operate positional data, we attributed the nodes of the graphs built by TSGM by their ground truth positions.
An example of a topological graph built by TSGM method on our experimental setup (which is different from the author’s neural network training setup). Red dots are the graph nodes, blue lines are the graph edges, aligned with the ground-truth 2D map.
B. Height Thresholding for Hydra and S-Graphs+ Methods
Hydra and S-graphs+ construct topological graphs in 3D, so these graphs may contain high edges passing above obstacles. To address this, we perform 3D-to-2D projection of the generated graph using various height thresholds:
Height thresholding of the topological graph built by Hydra [12] method. Solid lines denote edges with both ends lower than 1m, and pale lines denote higher edges. Red lines denote the edges passing through an obstalce, and blue lines denote other edges.
For both Hydra and S-graphs, we evaluate connectivity, coverage, and percent of the edges crossing obstacles using all height thresholds. The metrics values for both methods with respect to the height threshold are depicted in Figure 19. A significant increase in the number of components and a drop in coverage values start with a threshold of
Impact of height threshold on Hydra (left) and S-graphs+ (right) maps connectivity, coverage, and part of the edges crossing obstacles.
C. Results
We evaluated three classic topological mapping methods (Hydra, S-graphs+, and IncrementalTopo) as well as two learning-based methods (ETPNav and TSGM) on a large indoor dataset with and without positional noise. An illustrative example showcasing the graphs generated by each method in the case of noisy positioning is shown in Figure 20. It is worth noting that both Hydra and S-graphs+ constructed dense graphs, while IncrementalTopo produced a sparse graph. However, nearly all the graphs built by these methods became disconnected. For learning-based methods, graph density and connectivity also differed significantly. ETPNav built dense but disconnected graph. Conversely, TSGM built sparse and connected graphs in all the scenes of our dataset.
Topological graphs built by Hydra (top-left), S-graphs+ (top-middle), IncrementalTopo (top-right), ETPNav (bottom-left), and TSGM (bottom-right) methods with noised position data. The height threshold for Hydra and S-graphs+ methods is set to
The quantitative results of the experiments both with and without positional noise are presented in Table 2, and the detailed results of the experiments are presented in Appendix A. Our experiments revealed that Hydra and S-graphs+ methods exhibited minimal changes in coverage with the addition of positional noise. Nonetheless, these methods produced disconnected graphs and a high percentage of edges crossing obstacles. IncrementalTopo method demonstrated high consistency and a lower percentage of edges crossing obstacles. Conversely, it was susceptible to pose noise, resulting in highly disconnected graphs when built from noisy data. ETPNav method showcased negligible changes in connectivity and navigational success. However, the percentage of edges crossing obstacles increased twofold with the addition of pose noise. TSGM method remained unaffected by positional noise and constructed connected graphs. Nevertheless, it had a high percentage of edges crossing obstacles, leading to lower navigational efficiency.
Thus, none of these five methods is suitable for successful collision-free navigation out of the box. However, we hope that our contribution to the topological map evaluation methodology will assist the researchers in finding appropriate methods and developing novel solutions in topological graph construction.
Conclusion and Future Work
In this paper, we address the problems of topological map construction and evaluation. We explore five methods for constructing topological graphs: Hydra, S-graphs+, IncrementalTopo, ETPNav, and TSGM. To assess their quality, we conduct extensive experiments using a large indoor simulated dataset. The dataset includes both ground truth and noisy position data as inputs.
For evaluation purposes, we introduce a novel consistency metric that measures the proximity of the topological graph vertices whose ground truth positions are close. Additionally, we analyze other metrics such as connectivity, coverage, the proportion of edges crossing obstacles, and the success rate of navigation within the constructed topological graphs.
The experimental results reveal that all the evaluated methods besides TSGM exhibit high consistency but low connectivity, with the graphs remaining disconnected even when provided with ground truth input data. Furthermore, Hydra and S-graphs+ methods tend to have a higher proportion of edges crossing obstacles due to their 3D graph construction approach. TSGM builds connected graph in all the cases, however, a high percent of edges in this graph crosses obstacles. The success rate of navigation within the constructed graphs only slightly exceeds 30%. Consequently, future work should focus on enhancing topological graph construction methods.
Our plans for future research encompass the development of topological graph consistency metrics and evolving our own topological mapping method based on the methods considered in this work for building connected and consistent topological graphs.
Appendix ADetailed Experimental Results
Detailed Experimental Results
The number of connectivity components for both experiments is presented in Table 3, and the coverage of the main connectivity component is displayed in Table 4. The tables reveal that only the TSGM method successfully constructed connected graphs for all scenes. Conversely, the remaining methods, even in the absence of noise, produced disconnected graphs for the majority of scenes. Only S-graphs and ETPNav methods created a connected graph (consisting of a single component) in several scenes, and the IncrementalTopo method created a connected graph in one scene. However, on scenes 9 and 11, the S-graphs’s coverage was less than 0.3, indicating that the S-graphs+ method failed to construct a graph encompassing the entire scene.
Hydra, S-graphs+ and ETPNav methods demonstrated robustness to positional noise in terms on coverage values. However, the mean number of connectivity components for the Hydra method increased significantly, from 6.5 to 9. Conversely, the connectivity of the IncrementalTopo method got much worse with pose noise addition - the mean number of connectivity components increased from 8 to 19, and the mean coverage of the main component decreased from 0.91 to 0.38. This decline can be attributed to the absence of a pose correction module in the IncrementalTopo method. TSGM did not use position data, so its results on noise-free and noisy experiments remained the same. It successfully constructed connected graphs for all scenes with an average coverage value surpassing 90%.
The values of topological graph consistency with and without noise for all methods are shown in Table 5. When using ground truth position data, the consistency values for all methods were close to 1. The IncrementalTopo and ETPNav methods exhibited consistency values exceeding 0.99 across all the scenes because vertices of these topological graphs have ground truth coordinates. However, for the Hydra and S-graphs methods, the consistency value was 0.97 on average due to internal pose optimization. With the addition of pose noise, the consistency values of Hydra, S-graphs+ and ETPNav methods slightly decreased, but they still remained larger than 0.8 on average. For the IncrementalTopo method, the consistency values remained close to 1. This can be attributed to severe graph discontinuity, wherein the most of the vertices were not added to the graph, resulting in only a few pairs of closely located vertices being present. For TSGM, the consistency values were equal to 1 because the ground truth node positions were used for evaluation.
Table 6 shows the percents of the edges that crossed obstacles. For the Hydra and S-graphs+ methods, the percents of the edges crossing an obstacle were close to 0.2 in both noise-free and noised experiments. This occurred due to the 3D graph construction and the presence of edges passing above obstacles. In contrast, the IncrementalTopo method, which operates in 2D, had an average percent of the edges crossing obstacles of 0.03 without position noise and 0.06 with position noise. For ETPNav, value of this metric also increased twice with pose noise addition. This happened because ETPNav and IncrementalTopo had no pose correction module and set raw noised positions to the topological graph nodes. For TSGM, the average percent of edges crossing obstacles was also close to 0.2. This was due to TSGM did not localize graph vertices using coordinates, and neural network-predicted features of the input images provide false similarities.
The values for navigation efficiency (Success and SPL) are shown in Tables 7 and 8. For Hydra and S-graphs+ methods, the navigation success almost did not change with the addition of pose noise and reached values near 0.3. However, for the IncrementalTopo method, due to severe graph non-connectivity, the mean navigation success rate reached 0.35 with ground truth position data and 0.05 with noised position data. As for the TSGM method, despite having connected graphs, the navigation efficiency results were slightly poorer compared to S-graphs+ and Hydra methods. This can be attributed to a high percentage of edges crossing obstacles. ETPNav method showed the best navigation success rate among all the methods with ground-truth positions, and with addition of the pose noise, its success rate reduced dramatically - from 0.44 to 0.19. Such decrease could occur due to a twofold increase of the part of the edges crossing an obstacle.
For the SPL metric, a similar pattern emerged. The SPL values for Hydra and S-graphs methods changed insignificantly with the addition of the pose noise, while for the IncrementalTopo and ETPNav methods, SPL values reduced dramatically. The SPL values for S-graphs and ETPNav methods closely aligned with the success values in case of ground-truth position data, primarily due to the higher topological graph density. On the other hand, since IncrementalTopo constructs sparse graphs, its SPL values were considerably lower than the success values (0.13 compared to 0.35 for the noise-free experiment, and 0.02 compared to 0.05 for the experiment with noised positions).