I. Introduction
With the continuous development of sensor technology, researchers now collect many different types of data. Single-modal data often carries too little information to meet the accuracy demands of individual tasks. Multimodal fusion technology [1] addresses this limitation by fusing data about the same object from multiple sensors within a unified framework, thereby providing richer information for model training. In recent years, the fusion of visible and infrared images [2] has become a research hotspot in multimodal fusion. It characterizes the brightness and heat of objects from the visible and infrared modalities, respectively, and thus captures more information about both the object and the background. It therefore has broad application prospects in fields such as object detection, video surveillance, and location tracking [3], [4], [5].