I. Introduction
RGB-based tracking, the main research branch of visual tracking, has advanced considerably in recent years and achieves excellent performance on many benchmarks. However, RGB-only tracking can struggle in challenging scenes, such as those with extreme illumination or occlusion, which limits its use in applications that require high tracking robustness. Multimodal fusion has received considerable attention in visual perception tasks such as segmentation [4], [5], [6], [7], detection [8], and image restoration [9]. In the tracking field, multimodal fusion draws on auxiliary modalities to supply information that complements the RGB input, enabling more comprehensive information extraction and integration for robust tracking.