
Multi-Granularity Aggregation Transformer for Joint Video-Audio-Text Representation Learning


Abstract:

Many real-world video-text tasks involve different levels of granularity to represent local and global information with distinct semantics, such as frames and words, clips and sentences, or videos and paragraphs. Most existing multimodal representation learning methods suffer from two limitations: (i) adopting expert systems or manual design to extract finer-grained local information (such as objects and actions in a video frame) for supervision may lead to information asymmetry, since the corresponding information may not exist in every modality; (ii) neglecting the hierarchical nature of the data when aggregating different levels of information from different modalities yields insufficient representations. To alleviate these issues, in this paper we propose a Multi-Granularity Aggregation Transformer (MGAT) for joint video-audio-text representation learning. Specifically, for intra-modality modeling, we first design a multi-granularity transformer module that relieves information asymmetry by making full use of local and global information within a single modality from different perspectives. Then, for inter-modality modeling, we develop an attention-guided aggregation module to fuse audio and video information hierarchically. Finally, we align the aggregated information with the text information at different hierarchical levels via intra- and inter-modality consistency losses and a contrastive loss. With the help of multiple granularities of information, we obtain a representation model that performs well on a variety of tasks, e.g., video-paragraph retrieval and video captioning. Extensive experiments on two challenging benchmarks, i.e., ActivityNet-captions and Youcook2, demonstrate the superiority of our proposed method.
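To make the hierarchical alignment described above concrete, the following is a minimal, hypothetical sketch (not the authors' code) of contrastive alignment between video and text at the clip-sentence and video-paragraph levels. Mean pooling stands in for the paper's multi-granularity transformer and attention-guided aggregation modules; the audio stream, consistency losses, and padding masks are omitted, and all tensor shapes and function names are illustrative assumptions.

# Hypothetical sketch of hierarchical video-text contrastive alignment.
# Mean pooling replaces the paper's multi-granularity transformer and
# attention-guided aggregation modules; shapes are assumptions.
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss; row i of `a` is paired with row i of `b`."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (N, N) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def hierarchical_alignment_loss(frame_feats: torch.Tensor,
                                word_feats: torch.Tensor) -> torch.Tensor:
    """Align video and text at two granularity levels.

    frame_feats: (B, n_clips, n_frames, D) frame features, one clip per caption.
    word_feats:  (B, n_clips, n_words,  D) word features of the paired sentences.
    """
    # Local level: pool frames into clip embeddings and words into sentence embeddings.
    clip_emb = frame_feats.mean(dim=2)                    # (B, n_clips, D)
    sent_emb = word_feats.mean(dim=2)                     # (B, n_clips, D)

    # Global level: pool clips into a video embedding, sentences into a paragraph embedding.
    video_emb = clip_emb.mean(dim=1)                      # (B, D)
    para_emb = sent_emb.mean(dim=1)                       # (B, D)

    # Contrast clip vs. sentence pairs and video vs. paragraph pairs.
    local_loss = info_nce(clip_emb.flatten(0, 1), sent_emb.flatten(0, 1))
    global_loss = info_nce(video_emb, para_emb)
    return local_loss + global_loss


if __name__ == "__main__":
    B, n_clips, n_frames, n_words, D = 4, 3, 8, 12, 256
    frames = torch.randn(B, n_clips, n_frames, D)
    words = torch.randn(B, n_clips, n_words, D)
    print(hierarchical_alignment_loss(frames, words).item())

In this sketch the loss is applied at both granularity levels simultaneously, mirroring the idea of aligning aggregated video information with text at different hierarchical levels; in practice the pooling would be replaced by the learned aggregation modules described above.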
Page(s): 2990 - 3002
Date of Publication: 30 November 2022



I. Introduction

Multimodal representation learning aims to learn data representations by capturing correlations among various modalities, and it has been applied to many different tasks, e.g., visual grounding [1], [2], [3], video retrieval [4], [5], [6], [7], [8], video summarization [9], action recognition [10], [11], [12], and image captioning [13], [14], [15], [16], [17]. Many real-world modalities (such as video, text, and audio) involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs. Different granularities of data represent local or global information with distinct semantics. Moreover, different modalities are characterized by different statistical properties, which creates a heterogeneity gap between them. Intuitively, a video frame can be represented by pixel intensities, while text is typically represented by the outputs of feature extractors. Because of this heterogeneity gap, it is critical for representation learning to discover the correlations among different modalities. To this end, many multimodal representation learning methods [5], [6], [18], [19] model the interactions between different levels of granularity and different modalities to improve performance. However, multimodal representation learning still faces two challenges.
