I. Introduction
Multimodal Sentiment Analysis (MSA) has garnered significant attention in recent years. Unlike traditional emotion recognition approaches that rely solely on a single signal [1], MSA integrates multiple signals to achieve a more comprehensive understanding of human emotions [2]. Research has shown that leveraging complementary semantic information across signals helps produce more accurate multimodal representations [3]. While substantial progress has been made in MSA under the assumption that all modalities are fully available during both training and inference [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], this ideal scenario rarely holds in real-world applications. Modalities are frequently missing due to factors such as security constraints, environmental noise, or hardware limitations, and such incomplete multimodal data significantly degrades the performance of MSA systems.