I. Introduction
Human beings express a variety of sentiments every day, and these sentiments can reflect their physical and mental health. Sentiment analysis, which aims to understand people's sentiment from their words or deeds, has therefore become an important research direction in artificial intelligence (AI). For a long time, sentiment analysis was based on the text modality alone [1]–[3]. More recently, with the development of social media, multimodal learning systems [4]–[8] that also incorporate the speaker's facial expressions and tone of voice have become popular as a way to detect sentiment more accurately and comprehensively.