I. Introduction
The underlying 3D structure of an indoor scene, such as the floor plan of a building, plays a crucial role in holistic robot perception [1]–[6]. Despite their simplicity, these high-level geometric abstractions can support challenging tasks such as obstacle avoidance [3], robot localization [4], path planning [5], and scene understanding [6]; hence, a convenient and direct estimation of floor plan geometry is our primary motivation.

Current state-of-the-art solutions for floor plan estimation [7]–[9] rely on dense point clouds as input, which require active sensors (e.g., LiDAR, depth cameras) for data collection as well as additional pre-processing steps; hence, direct estimation from a sequence of observations is not supported. On the other hand, approaches that rely only on imagery [10], [11] generally leverage structure-from-motion (SfM), multi-view stereo (MVS), and pixel-level semantic estimation to project geometric cues into 3D/2D space, from which the floor plan is inferred. However, these solutions heavily depend on the quality and sparsity of the estimated point cloud. Additionally, they require the entire data sequence in advance to successfully handle multiple rooms in a scene.