
Multimodal Transformer Fusion for Sentiment Analysis using Audio, Text, and Visual Cues


Abstract:

In the modern era, the rapid growth of social media platforms and of technologies such as Artificial Intelligence (AI) has drawn increasing interest toward multimodal sentiment analysis, which combines text, audio, and visual cues to extract useful insights. Although these systems are capable of analyzing sentiment, they face challenges in synchronizing multimodal inputs, fusing data, and remaining reliable across varied contexts. Hence, this research proposes an effective Multimodal Transformer Fusion (MTF) that combines the strengths of multiple modalities to recognize human emotions. Initially, the data are collected from the CMU-Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset. The data are then preprocessed with Empirical Mode Decomposition (EMD), min-max normalization, stop-word removal, and lemmatization to remove noise and improve the quality of sentiment analysis. Next, features are extracted for the different input modalities using Mel-Frequency Cepstral Coefficients (MFCC), a 3D Residual Network (ResNet 3D), and Distilled Bidirectional Encoder Representations from Transformers (DistilBERT). The extracted features are then fed into the proposed MTF to classify a customer's sentiment as positive, negative, or neutral. In the results, the proposed MTF achieved an F1 score of 80.76% and a Mean Absolute Error of 0.376, outperforming the existing Tensor Fusion Network (TFN).
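
The abstract's per-modality feature extraction can be illustrated with a minimal sketch (not the authors' code): MFCC for audio via librosa, a ResNet 3D backbone for video via torchvision, and DistilBERT [CLS] embeddings for text via Hugging Face transformers. Model checkpoints, sample rates, and output shapes below are assumptions for illustration only.

```python
# Hedged sketch of the feature extraction stage described in the abstract.
# Assumptions: 16 kHz audio, "distilbert-base-uncased" checkpoint, r3d_18 backbone.
import librosa
import torch
from torchvision.models.video import r3d_18
from transformers import DistilBertModel, DistilBertTokenizerFast

def audio_features(wav_path, n_mfcc=13):
    """MFCC features for one utterance, min-max normalized to [0, 1]."""
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    return (mfcc - mfcc.min()) / (mfcc.max() - mfcc.min() + 1e-8)

def visual_features(clip):
    """clip: float tensor (batch, 3, frames, H, W); returns (batch, 512) embeddings."""
    backbone = r3d_18(weights=None)     # ResNet 3D backbone
    backbone.fc = torch.nn.Identity()   # drop the classification head
    backbone.eval()
    with torch.no_grad():
        return backbone(clip)

def text_features(sentences):
    """[CLS]-token embeddings from DistilBERT, shape (batch, 768)."""
    tok = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
    bert = DistilBertModel.from_pretrained("distilbert-base-uncased").eval()
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return bert(**batch).last_hidden_state[:, 0]
```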
Date of Conference: 24-25 January 2025
Date Added to IEEE Xplore: 27 March 2025
Conference Location: Bidar, India

I. Introduction

In the past few years, sentiment analysis and emotion recognition have become popular and active research areas. They draw information from multiple sensory modalities, such as facial expressions, speech signals, and written language, to measure sentiment (positive, negative, or neutral). This provides a more comprehensive understanding of user sentiment in customer reviews, social media comments, and surveys, helping to identify emotions and opinions. Sentiment analysis has substantial benefits and has been successfully employed for health issues, financial market prediction, and, above all, customer analytics [1]. However, human emotions are complex and diverse, and describing sentiment features was difficult for traditional approaches. Transformers can fuse data across multiple modalities at three notable levels: the initial stage (input), the intermediate representation, and the prediction; a sketch of intermediate-level fusion is given below. For particular modalities, cross-task and cross-lingual gaps are major obstacles to transfer because of their uniqueness, input-output workflows, and different reasoning. In transformer-based multimodal learning, transferability is a major challenge, namely how to transfer representations across diverse applications: the high diversity of tasks pushes models toward a universality that misses the uniqueness of individual modalities. Moreover, improving inference efficiency is another challenge because of the high dimensionality of modality representations [2].
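
As referenced above, intermediate-level fusion can be sketched as follows: each modality's feature vector is projected into a shared dimension, the projected tokens are concatenated, and a small transformer encoder lets them attend to one another before a three-way sentiment head. This is a hedged, minimal illustration of the idea; the feature dimensions (13 for audio MFCC, 512 for ResNet 3D, 768 for DistilBERT), model width, and layer counts are assumptions, not the paper's configuration.

```python
# Minimal sketch of intermediate-level transformer fusion for 3-class sentiment.
import torch
import torch.nn as nn

class TransformerFusion(nn.Module):
    def __init__(self, dims=(13, 512, 768), d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        # one linear projection per modality (audio, visual, text)
        self.proj = nn.ModuleList(nn.Linear(d, d_model) for d in dims)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 3)  # positive / negative / neutral

    def forward(self, audio, visual, text):
        # each input: (batch, feature_dim) -> one token per modality
        tokens = torch.stack(
            [p(x) for p, x in zip(self.proj, (audio, visual, text))], dim=1)
        fused = self.encoder(tokens)         # cross-modal self-attention
        return self.head(fused.mean(dim=1))  # pooled sentiment logits

# Usage with random placeholder features:
logits = TransformerFusion()(torch.randn(2, 13), torch.randn(2, 512), torch.randn(2, 768))
```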

