I. Introduction
Visual-inertial navigation systems (VINS), which fuse measurements from visual and inertial sensors, have attracted significant attention in the robotics community [1]–[18]. Unlike vision-only systems, a monocular VINS can recover the metric scale and also provides more accurate, high-bandwidth estimates of the translational and rotational velocities. Accordingly, monocular VINS have been widely deployed on robotic platforms such as ground vehicles and aerial robots [19]–[22]. However, designing a VINS that is simultaneously real-time, accurate, and robust remains extremely challenging, because existing frameworks have to compromise between accuracy and robustness [6]–[8], [11], [23], which limits their use in challenging environments. A real-time, accurate, and robust VINS therefore remains an open problem that warrants further exploration.