I. Introduction
In recent years, sentiment analysis and emotion recognition have become increasingly active research areas. These methods draw information from multiple modalities, such as facial expressions, speech signals, and written language, to measure sentiment (positive, negative, or neutral). Applied to customer reviews, social media comments, and surveys, they provide a more comprehensive understanding of user emotions and opinions. Sentiment analysis has delivered substantial benefits and has been successfully employed in healthcare, financial market prediction, and, most prominently, customer analytics [1]. However, human emotions are complex and diverse, and describing sentiment features was a major difficulty for traditional approaches. Transformers can fuse data across multiple modalities at three notable levels: the initial (input) stage, the intermediate representation, and the prediction stage. For specialized modalities, cross-task and cross-lingual gaps remain major obstacles to transfer because of their unique characteristics, input-output workflows, and differing forms of reasoning. In transformer-based multimodal learning, transferability is therefore a major challenge: it is unclear how best to transfer representations across diverse applications. The high diversity of tasks pushes models toward a universality that can miss the uniqueness of individual modalities. Moreover, improving inference efficiency is a further challenge due to the high dimensionality of modality representations [2].
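The three fusion levels mentioned above can be illustrated with a minimal sketch. The feature shapes, the toy classifier, and the averaging scheme here are all illustrative assumptions, not any specific model from the literature:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-modality feature vectors (shapes chosen arbitrarily for illustration).
text_feat = rng.standard_normal(8)
audio_feat = rng.standard_normal(8)

def classify(vec):
    """Stand-in classifier: maps a feature vector to a binary sentiment label."""
    return int(vec.mean() > 0)

# 1) Initial-stage (input-level) fusion: concatenate raw features, then predict.
early_pred = classify(np.concatenate([text_feat, audio_feat]))

# 2) Intermediate-representation fusion: combine per-modality representations
#    (here simply averaged) before the prediction head.
intermediate_pred = classify((text_feat + audio_feat) / 2)

# 3) Prediction-level (late) fusion: combine per-modality predictions by voting.
late_pred = int((classify(text_feat) + classify(audio_feat)) / 2 >= 0.5)

print(early_pred, intermediate_pred, late_pred)
```

In a transformer, the same three choices correspond to concatenating token sequences at the input, exchanging hidden states via cross-attention, or ensembling per-modality outputs.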