I. Introduction
Multimodal sentiment analysis (MSA) aims to predict emotional scores from audio, visual, and text features. MSA has become a popular research topic and has been widely applied in areas such as marketing management [1], [2], social media analysis [3], [4], and human-computer interaction [5], [6]. Although humans readily perceive the world by integrating comprehensive information acquired through multiple sensory organs [7], how to endow machines with analogous cognitive capabilities remains an open question. One of the main challenges is the heterogeneity gap in multimodal data [8]: feature vectors extracted from different modalities initially lie in distinct subspaces, so semantically similar elements can receive completely different vector representations. This hinders subsequent machine learning modules from fully exploiting multimodal data [9]. Researchers have made remarkable strides in designing multimodal feature fusion methods [10], [11], [12], [13], [14]. Nevertheless, limited consideration has been given to bridging the heterogeneity among multimodal features. Current MSA approaches largely fall into two categories: the first fuses features through geometric operations on feature vectors, while the second designs more complex fusion schemes based on transformer (attention) mechanisms.
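For illustration, the sketch below contrasts these two families of fusion operations under simplified assumptions: a geometric fusion that concatenates, or takes the outer product of, unimodal feature vectors, and an attention-based fusion in which one modality attends to another via scaled dot-product attention. All dimensions, tensor shapes, and variable names are hypothetical and do not correspond to any specific MSA model.

```python
# Illustrative sketch (not from the paper): two common styles of multimodal fusion.
# All feature sizes and names below are hypothetical placeholders.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_t, d_a, d_v = 8, 6, 4                                   # hypothetical text / audio / visual sizes
h_t, h_a, h_v = torch.randn(d_t), torch.randn(d_a), torch.randn(d_v)

# (1) Geometric fusion: combine unimodal vectors with simple vector operations,
#     e.g., concatenation or an outer product across modalities.
fused_concat = torch.cat([h_t, h_a, h_v])                 # shape: (d_t + d_a + d_v,)
fused_outer = torch.einsum('i,j,k->ijk', h_t, h_a, h_v)   # shape: (d_t, d_a, d_v)

# (2) Attention-based fusion: one modality queries another with
#     scaled dot-product attention over sequences of feature vectors.
T, d = 5, 16                                              # hypothetical sequence length / model size
text_seq = torch.randn(T, d)                              # queries from the text modality
audio_seq = torch.randn(T, d)                             # keys/values from the audio modality
scores = text_seq @ audio_seq.T / d ** 0.5                # (T, T) cross-modal affinities
attn = F.softmax(scores, dim=-1)
fused_attn = attn @ audio_seq                             # audio features re-weighted by text

print(fused_concat.shape, fused_outer.shape, fused_attn.shape)
```

The geometric route keeps the fusion operator fixed and parameter-free, whereas the attention route learns how strongly each cross-modal pair should contribute; in both cases the fusion acts directly on raw unimodal embeddings and therefore remains sensitive to the heterogeneity gap described above.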