I. INTRODUCTION
Owing to its widespread application in Simultaneous Localization and Mapping (SLAM), 3D reconstruction, and autonomous exploration, precise and robust state estimation in GNSS-denied environments has become an active research area. The combination of a monocular camera and an inertial measurement unit (IMU) constitutes the minimal sensor suite capable of recovering metric six-degree-of-freedom (6-DoF) motion. Given the low cost, low power consumption, and compact size of monocular cameras and IMUs, monocular visual-inertial odometry (VIO) has become a common solution for autonomous robot navigation and positioning [1].

Existing VIO frameworks are relatively mature in well-conditioned environments. However, the unpredictable nature of disaster zones, characterized by highly uneven lighting, dynamic illumination changes, and visual obscurants such as dust, fog, and smoke, poses significant challenges [2]. Infrared cameras, which sense the long-wave infrared (LWIR) portion of the electromagnetic spectrum, are largely unaffected by insufficient scene lighting or the visual obscurants common in such environments [3]. They also retain the advantages of visible-light cameras, such as light weight and energy efficiency. However, compared with visible-light images, infrared images typically contain less texture, exhibit lower contrast, and are more prone to noise, making it difficult to apply existing visible-light VIO frameworks directly to infrared imagery.