1. INTRODUCTION
The purpose of affective computing is to recognize and measure emotional expression, and sentiment analysis has been its main task in recent years [1]. Traditional sentiment analysis usually relies on a single modality [2], which in practical applications may suffer from issues such as data sparsity or ambiguity [3]. In contrast, multimodal data reflects emotional expression from different aspects. Therefore, existing research tends to learn joint multimodal representations from complex contexts to improve performance.