I. Introduction
Visual-inertial (VI) sensor fusion is an active research field in robotics. Cameras and inertial sensors are complementary [1], and combining them provides reliable and accurate state estimation. While the majority of research on VI fusion focuses on filter-based methods [2]–[4], nonlinear optimization has become increasingly popular within the last few years. Compared with filter-based methods, nonlinear-optimization-based methods suffer less from the accumulation of linearization errors. Their main drawback, high computational cost, has been mitigated by advances in both hardware and theory [5], [6]. Recent work [5]–[9] has shown impressive real-time VI state estimation results in challenging environments using nonlinear optimization.