I. Introduction
In real-world environments, autonomous robots need to robustly perceive their surroundings and obtain information to guide subsequent tasks such as planning and scene interaction [1]–[7]. However, the existing literature on robot mapping and scene reconstruction often builds a semantic map of the scene [8]–[10], emphasizing mapping accuracy while neglecting instance-level segmentation. For autonomous robots in indoor environments, perception and reconstruction at the semantic level alone are not enough, because they prevent a robot from using the reconstructed scene to accomplish more complex and detailed tasks. Therefore, it is necessary to build an instance-level semantic map of the scene, in which each object can be distinguished by its own unique properties, even when it belongs to the same category as other objects. Such a map expresses the environment more accurately and richly, and also provides guidance for subsequent tasks.