Fusion and Discrimination: A Multimodal Graph Contrastive Learning Framework for Multimodal Sarcasm Detection


Abstract:

Identifying sarcastic clues from both textual and visual information has become an important research issue, called Multimodal Sarcasm Detection. In this article, we investigate multimodal sarcasm detection from a novel perspective, where a multimodal graph contrastive learning strategy is proposed to fuse and distinguish the sarcastic clues for textual modality and visual modality. Specifically, we first utilize object detection to derive the crucial visual regions accompanied by their captions of the images, which allows better learning of the key visual regions of visual modality. In addition, to make full use of the semantic information of the visual modality, we employ optical character recognition to extract the textual content in the images. Then, based on image regions, the textual content of visual modality, and the context of the textual modality, we build a multimodal graph for each sample to model the intricate sarcastic relations between modalities. Furthermore, we devise a graph-oriented contrastive learning strategy to leverage the correlations in the same label and differences between different labels, so as to capture better multimodal representations for multimodal sarcasm detection. Extensive experiments show that our method outperforms the previous best baseline models (with a 2.47% improvement in Accuracy, a 1.99% improvement in F-score, and a 2.20% improvement in Macro F-score). The ablation study shows that both multimodal graph structure and graph-oriented contrastive learning are important to our framework. Further, the experiments of using different pre-trained methods show that the proposed multimodal graph contrastive learning framework can directly work with various pre-trained models and achieve outstanding performance in multimodal sarcasm detection.
Published in: IEEE Transactions on Affective Computing (Volume: 15, Issue: 4, Oct.-Dec. 2024)
Page(s): 1874 - 1888
Date of Publication: 21 March 2024
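
The abstract describes a graph-oriented contrastive learning strategy that exploits correlations among samples with the same label and differences between samples with different labels. The abstract does not give the exact objective, so the following is only a minimal sketch of a generic supervised contrastive loss over per-sample graph embeddings, not the authors' formulation. The function names, the `temperature` value, and the use of cosine similarity are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss (hypothetical stand-in for the paper's objective).

    embeddings: one pooled graph representation per sample (assumed input).
    labels:     1 for sarcastic, 0 for non-sarcastic.
    Each anchor is pulled toward samples sharing its label and pushed
    away from samples with the other label.
    """
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-label partner contributes nothing
        # Temperature-scaled similarities to every other sample in the batch.
        logits = {
            j: cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp for the softmax denominator.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(v - max_logit) for v in logits.values())
        )
        # Average negative log-probability assigned to the positives.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy batch: the first two embeddings are close together, as are the last two.
batch = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = supervised_contrastive_loss(batch, [1, 1, 0, 0])
misaligned = supervised_contrastive_loss(batch, [1, 0, 1, 0])
print(aligned < misaligned)
```

In this toy batch the loss is lower when the labels agree with the geometry of the embeddings than when they cut across it, which is the behavior such an objective is meant to reward.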

I. Introduction

Sarcasm is a peculiar and sophisticated linguistic phenomenon in which people express an ironic sentiment or intention that is opposite to the authentic or apparent intention [1], [2], [3], [4]. The Oxford English Dictionary defines sarcasm as “a sharp, bitter, or cutting expression or remark; a bitter gibe or taunt.” Nowadays, however, sarcasm is more generally used to mean a statement in which people “say the opposite of the truth, or the opposite of their true feelings in order to be funny or to make a point”, as defined on the BBC sarcasm webpage ([Online]. Available: http://www.bbc.co.uk/worldservice/) [5]. Sarcasm is popular on social media platforms and is closely related to irony: it is a form of irony that occurs when there is some discrepancy between the literal and intended meanings of an utterance [6], [7]. The figurative nature of sarcasm makes it an often-quoted challenge for sentiment analysis [8]. Sarcasm is generally characterized as ironic or satirical wit that is intended to insult, mock, or amuse. It can therefore be manifested in many different ways, and recognizing it is important for natural language processing to avoid misinterpreting sarcastic statements as literal [6], [9]. Sarcasm may carry a positive surface sentiment but a negative implied sentiment (for example, “Visiting dentists is so much fun!”), a negative surface sentiment but a positive implied sentiment (for example, “His performance in Olympics has been terrible anyway” as a response to the criticism of an Olympic medalist), or no surface sentiment (for example, the idiomatic expression “and I am the Queen of England” is used to express sarcasm) [5], [6]. Since sarcasm carries sentiment in some cases, detecting sarcastic expressions is a crucial strategy for improving the performance of sentiment analysis and opinion mining.
