I. Introduction
Videos captured by a single sensor or under a single shooting setting describe the imaging scene from only a limited perspective, which restricts the ability of algorithms to understand scene semantics. Some works enhance feature representation by combining complementary multimodal images obtained from different sensors or shooting settings, and their effectiveness has been demonstrated. Among multimodal object detection scenarios, infrared and visible images are the two most commonly used signals. Infrared images capture the thermal radiation emitted by objects, which effectively highlights salient objects but lacks texture details. In contrast, visible images usually contain rich detail information but are easily corrupted by complex backgrounds, causing objects to be missed. These complementary characteristics help algorithms produce results that contain both rich texture details and accurate semantic information. Owing to this detection performance, infrared and visible object detection has been widely applied in video surveillance [1], vehicle inspection [2], [3], pedestrian detection [4], [5], autonomous driving [6], [7], and many other domains [8], [9].