I. Introduction
Multimodal learning has attracted an increasing interest in recent years. As computer vision and language models are advancing rapidly, multimodal models start to provide breakthrough improvements, which unlocks new applications that naturally involve two or more modalities. From the healthcare field where fusing medical images and electronic health records shows improvements in performances when compared to models that used only single modalities [1], [2], to autonomous driving where intelligent systems are built to process various signals in different modalities [3], [4].