I. Introduction
With the emergence of diverse sensor types, scene information captured from multiple perspectives can be exploited to enable real-time scene management. Image fusion technology has been gaining traction as a way to better exploit this advantage. Specifically, it consolidates the information captured by different sensors observing the same scene. The fused image thus highlights the unique advantages of each source image, enhancing the utilization of scene information and eliminating conflicts and redundancies among multiple sensors. It can therefore provide data support for high-level vision tasks such as tracking, object detection, and semantic segmentation, contributing to their performance [1], [2], [3], [4], [5], [6]. Generally, image fusion is divided into digital photography image fusion and multimodal image fusion (MMIF), according to the imaging mechanism [7]. Digital photography image fusion primarily addresses the problem that captured scene images suffer from imperfect exposure or cannot be all-in-focus [8], [9], [10], [11]. It fuses low-quality images captured under different camera settings to produce a properly exposed, fully focused digital photograph. MMIF comprises two subtasks: infrared and visible image fusion (IVIF) and medical image fusion (MIF) [3], [5], [12], [13]. Its primary objective is to generate a more comprehensive description of a scene by integrating the representations obtained from different types of sensors. This capability has garnered significant attention in various application areas, including surveillance, autonomous driving, and medical diagnosis [3], [14], [15], [16]. In particular, the fused image generated by IVIF can overcome the influence of extreme weather, offering significant advantages for subsequent intelligent processing, and it has been widely applied in civilian and military fields [1], [12], [17].