I. Introduction
We perceive and communicate with the world through a multisensory system: we see objects, hear sounds, speak languages, and read and write text. Information from these varied sources is processed by different parts of the human brain [5], [20], [252]. For instance, the occipital lobe serves as the primary center for visual processing, interpreting the distance and location of objects, while the temporal lobe processes auditory information, helping us understand sounds. Language comprehension, facilitated by Wernicke's area in the posterior superior temporal lobe, is crucial for decoding both written and spoken words. Other sensory information, such as touch and movement, is handled by further distinct brain regions. These integrated yet specialized functions form a complex and harmonious human sensing system. This division of labor in human neural processing, which highlights both unique and shared characteristics across modalities, inspires us to examine the multimodal machine learning problem in this paper through the lens of data.