
Multimodal Transformer Fusion for Sentiment Analysis using Audio, Text, and Visual Cues


Abstract:

In the modern era, the rapid growth of social media platforms and of technologies such as Artificial Intelligence (AI) has drawn increasing interest toward multimodal sentiment analysis, which combines text, audio, and visual cues to extract useful insights. Although these systems are capable of analyzing sentiment, they face challenges in synchronizing multimodal inputs, fusing data, and remaining reliable across varied contexts. Hence, this research proposes an effective Multimodal Transformer Fusion (MTF) that combines the strengths of multiple modalities to recognize human emotions. Initially, the data are collected from the CMU-Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset. The data are then preprocessed with Empirical Mode Decomposition (EMD), min-max normalization, stop-word removal, and lemmatization to remove noise and improve the quality of sentiment analysis. Next, features are extracted for the different input modalities using Mel-Frequency Cepstral Coefficients (MFCC), a 3D Residual Network (ResNet 3D), and Distilled Bidirectional Encoder Representations from Transformers (DistilBERT). The extracted features are then fed into the proposed MTF to classify a customer's sentiment as positive, negative, or neutral. In the results, the proposed MTF achieved an F1 score of 80.76% and a Mean Absolute Error of 0.376, outperforming the existing Tensor Fusion Network (TFN).
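
The abstract's per-modality feature extraction can be illustrated with a minimal sketch (not the authors' code): MFCC for audio via librosa, a ResNet 3D backbone for video via torchvision, and DistilBERT [CLS] embeddings for text via Hugging Face transformers. Model checkpoints, sample rates, and output shapes below are assumptions for illustration only.

```python
# Hedged sketch of the feature extraction stage described in the abstract.
# Assumptions: 16 kHz audio, "distilbert-base-uncased" checkpoint, r3d_18 backbone.
import librosa
import torch
from torchvision.models.video import r3d_18
from transformers import DistilBertModel, DistilBertTokenizerFast

def audio_features(wav_path, n_mfcc=13):
    """MFCC features for one utterance, min-max normalized to [0, 1]."""
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    return (mfcc - mfcc.min()) / (mfcc.max() - mfcc.min() + 1e-8)

def visual_features(clip):
    """clip: float tensor (batch, 3, frames, H, W); returns (batch, 512) embeddings."""
    backbone = r3d_18(weights=None)     # ResNet 3D backbone
    backbone.fc = torch.nn.Identity()   # drop the classification head
    backbone.eval()
    with torch.no_grad():
        return backbone(clip)

def text_features(sentences):
    """[CLS]-token embeddings from DistilBERT, shape (batch, 768)."""
    tok = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
    bert = DistilBertModel.from_pretrained("distilbert-base-uncased").eval()
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return bert(**batch).last_hidden_state[:, 0]
```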
Date of Conference: 24-25 January 2025
Date Added to IEEE Xplore: 27 March 2025
Conference Location: Bidar, India

I. Introduction

In the past few years, sentiment analysis and emotion recognition have become popular and active research areas. They draw information from multiple sensory modalities, such as facial expressions, speech signals, and written language, to measure sentiment (positive, negative, or neutral). This provides a more comprehensive understanding of user sentiment in customer reviews, social media comments, and surveys, helping to identify emotions and opinions. Sentiment analysis has substantial benefits and has been successfully employed for health issues, financial market prediction, and, above all, customer analytics [1]. However, human emotions are complex and diverse, and describing sentiment features was difficult for traditional approaches. Transformers can fuse data across multiple modalities at three notable levels: the initial stage (input), the intermediate representation, and the prediction; a sketch of intermediate-level fusion is given below. For particular modalities, cross-task and cross-lingual gaps are major obstacles to transfer because of their uniqueness, input-output workflows, and different reasoning. In transformer-based multimodal learning, transferability is a major challenge, namely how to transfer representations across diverse applications: the high diversity of tasks pushes models toward a universality that misses the uniqueness of individual modalities. Moreover, improving inference efficiency is another challenge because of the high dimensionality of modality representations [2].
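
As referenced above, intermediate-level fusion can be sketched as follows: each modality's feature vector is projected into a shared dimension, the projected tokens are concatenated, and a small transformer encoder lets them attend to one another before a three-way sentiment head. This is a hedged, minimal illustration of the idea; the feature dimensions (13 for audio MFCC, 512 for ResNet 3D, 768 for DistilBERT), model width, and layer counts are assumptions, not the paper's configuration.

```python
# Minimal sketch of intermediate-level transformer fusion for 3-class sentiment.
import torch
import torch.nn as nn

class TransformerFusion(nn.Module):
    def __init__(self, dims=(13, 512, 768), d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        # one linear projection per modality (audio, visual, text)
        self.proj = nn.ModuleList(nn.Linear(d, d_model) for d in dims)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 3)  # positive / negative / neutral

    def forward(self, audio, visual, text):
        # each input: (batch, feature_dim) -> one token per modality
        tokens = torch.stack(
            [p(x) for p, x in zip(self.proj, (audio, visual, text))], dim=1)
        fused = self.encoder(tokens)         # cross-modal self-attention
        return self.head(fused.mean(dim=1))  # pooled sentiment logits

# Usage with random placeholder features:
logits = TransformerFusion()(torch.randn(2, 13), torch.randn(2, 512), torch.randn(2, 768))
```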

