EMOTION IS PRECIOUS and useful for many applications, such as public emotion detection and psychological disease prediction, so emotion mining has attracted extensive attention. Many scholars employ multimodal information fusion to mine the potential relationship between low-dimensional data features and subjective emotions, and this line of work has achieved good results. With the development of machine learning technologies such as deep learning, some emotion recognition problems have been solved, but many challenges remain.
Emotion confusion is one of the main reasons for emotion recognition errors. Emotion confusion means that the emotion expressed in a sample is often a mixture of multiple emotions rather than a single, independent one.
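One common way to model such mixtures, sketched below under assumed class names and made-up annotator counts, is to replace the one-hot training target with a soft label, i.e., a probability distribution over emotion classes. Under a soft target, cross-entropy prefers a prediction that reproduces the mixture, while the one-hot target prefers a prediction concentrated on a single class:

```python
import numpy as np

# Hypothetical emotion classes; names and counts below are illustrative only.
CLASSES = ["happy", "sad", "angry", "neutral"]

def one_hot(index, n):
    """One-hot target: assumes the sample carries exactly one emotion."""
    v = np.zeros(n)
    v[index] = 1.0
    return v

def soft_label(votes, n):
    """Soft target built from annotator votes, preserving the emotion mixture.

    votes maps class index -> number of annotators who chose that class.
    """
    v = np.zeros(n)
    for idx, count in votes.items():
        v[idx] = count
    return v / v.sum()

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy of a predicted distribution against a (possibly soft) target."""
    return float(-np.sum(target * np.log(predicted + eps)))

# A sample that 6 of 10 annotators labeled "happy" and 4 labeled "neutral".
hard = one_hot(0, len(CLASSES))
soft = soft_label({0: 6, 3: 4}, len(CLASSES))

pred_mixed = np.array([0.55, 0.05, 0.05, 0.35])  # reproduces the mixture
pred_pure = np.array([0.90, 0.03, 0.03, 0.04])   # commits to one emotion

# The soft target rewards the mixed prediction; the one-hot target penalizes it.
assert cross_entropy(soft, pred_mixed) < cross_entropy(soft, pred_pure)
assert cross_entropy(hard, pred_pure) < cross_entropy(hard, pred_mixed)
```

This is only a minimal sketch of the idea; in practice the mixture would come from annotation statistics or a dedicated label-distribution learning method.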
Existing methods of multimodal feature fusion lack a theoretical basis and guidance from the emotion relationships in multimodal data. Multimodal data fusion mainly includes three approaches: early fusion, late fusion, and intermediate fusion. How to fuse the multimodal data, and how to define the fusion weights of the different modalities, are also problems that need to be considered.
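The first two approaches can be sketched as follows. This is an illustrative toy, assuming two modalities (text and audio) with made-up feature dimensions and classifier outputs: early fusion concatenates raw features before a single model sees them, while late fusion combines per-modality predictions with weights, and choosing those weights is exactly the open problem noted above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-modality feature vectors; dimensions are arbitrary assumptions.
text_feat = rng.standard_normal(16)
audio_feat = rng.standard_normal(8)

def early_fusion(feats):
    """Early fusion: concatenate raw features into one joint input vector."""
    return np.concatenate(feats)

def late_fusion(probs, weights):
    """Late fusion: weighted average of per-modality class probabilities."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the result is still a distribution
    return np.einsum("m,mc->c", w, np.asarray(probs))

fused = early_fusion([text_feat, audio_feat])  # shape (24,), fed to one model

# Hypothetical per-modality classifier outputs over 4 emotion classes.
text_probs = np.array([0.7, 0.1, 0.1, 0.1])
audio_probs = np.array([0.2, 0.5, 0.2, 0.1])

# The weights encode how much each modality is trusted.
combined = late_fusion([text_probs, audio_probs], weights=[0.6, 0.4])
# combined -> [0.5, 0.26, 0.14, 0.1]
```

Intermediate fusion, by contrast, merges learned hidden representations inside the model rather than raw features or final predictions, and is not shown here.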
Large-scale, high-quality datasets for emotion recognition are very rare. In classification tasks with objectively correct labels, manually labeled datasets are usually of high quality. However, emotion is subjective: people with different cultural backgrounds and social experiences may perceive different emotions in the same text or image. Multimodal datasets for emotion recognition therefore need a large number of annotators. This reduces the recognition error caused by individual human bias as much as possible, but it also means a high cost of data collection and annotation.
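A minimal sketch of how multiple annotations per sample are typically turned into a single training label, assuming a simple majority-vote scheme with a hypothetical agreement threshold (real annotation pipelines often use more elaborate aggregation):

```python
from collections import Counter

def aggregate(votes, min_agreement=0.6):
    """Majority-vote one sample's annotations; discard low-agreement samples.

    votes is the list of labels chosen by the annotators. Returns
    (label, agreement) when the top label reaches min_agreement of the
    votes, otherwise None, dropping samples where annotators disagree
    too much. The threshold value here is an arbitrary assumption.
    """
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]
    agreement = top / len(votes)
    return (label, agreement) if agreement >= min_agreement else None

# Four of five annotators agree: keep the sample.
assert aggregate(["sad", "sad", "sad", "neutral", "sad"]) == ("sad", 0.8)

# Annotators are split: discard the sample (or route it to re-annotation).
assert aggregate(["happy", "sad", "angry", "neutral", "happy"]) is None
```

The trade-off described above is visible here: more annotators per sample make the agreement estimate more reliable, but every additional vote multiplies the annotation cost.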