I. Introduction
Sentiment analysis has attracted considerable attention [1]–[4]. Previous research has mainly focused on building models that analyze text data [5]. However, text alone does not always reflect the speaker’s sentiment clearly. For example, the sentence ‘I go to work by bus’ by itself does not reveal the speaker’s feelings. Multimodal information is now widely used in research. ESE-FN [6] fuses multimodal features for elderly activity recognition. A multimodal interaction framework has been applied to image-text retrieval [7]. In NLP, multimodal fusion strategies are used in both sentiment analysis and dialog systems [8].