I. Introduction
Multimodal representation learning aims to learn data representations by capturing correlations among various modalities, and has been applied to many tasks, e.g., visual grounding [1], [2], [3], video retrieval [4], [5], [6], [7], [8], video summarization [9], action recognition [10], [11], [12], image captioning [13], [14], [15], [16], [17], etc. Much real-world multimodal data (such as video, text, and audio) involves different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs. Different granularities of data represent local or global information with distinct semantics. Moreover, different modalities are characterized by different statistical properties, which leads to a heterogeneity gap among them. Intuitively, a video frame can be represented by pixel intensities, whereas text is typically represented by the outputs of feature extractors. Because of this heterogeneity gap, discovering the correlations among modalities is critical for representation learning. To this end, many multimodal representation learning methods [5], [6], [18], [19] model the interactions across different levels of granularity and different modalities to improve performance. However, multimodal representation learning still faces two challenges.