
Multi-Granularity Aggregation Transformer for Joint Video-Audio-Text Representation Learning


Abstract:

Many real-world video-text tasks involve different levels of granularity to represent local and global information with distinct semantics, such as frames and words, clips and sentences, or videos and paragraphs. Most existing multimodal representation learning methods suffer from two limitations: (i) adopting expert systems or manual design to extract finer-grained local information (such as objects and actions in a video frame) for supervision may lead to information asymmetry, since the corresponding information may not exist in every modality; (ii) neglecting the hierarchical nature of the data when aggregating different levels of information from different modalities yields insufficient representations. To alleviate these issues, in this paper we propose a Multi-Granularity Aggregation Transformer (MGAT) for joint video-audio-text representation learning. Specifically, for intra-modality modeling, we first design a multi-granularity transformer module that relieves information asymmetry by making full use of local and global information within a single modality from different perspectives. Then, for inter-modality modeling, we develop an attention-guided aggregation module to fuse audio and video information hierarchically. Finally, we align the aggregated information with the text information at different hierarchical levels via intra- and inter-modality consistency losses and a contrastive loss. With the help of multiple granularities of information, we obtain a representation model that performs well on a variety of tasks, e.g., video-paragraph retrieval and video captioning. Extensive experiments on two challenging benchmarks, i.e., ActivityNet-captions and Youcook2, demonstrate the superiority of our proposed method.
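To make the hierarchical alignment described above concrete, the following is a minimal, hypothetical sketch (not the authors' code) of contrastive alignment between video and text at the clip-sentence and video-paragraph levels. Mean pooling stands in for the paper's multi-granularity transformer and attention-guided aggregation modules; the audio stream, consistency losses, and padding masks are omitted, and all tensor shapes and function names are illustrative assumptions.

# Hypothetical sketch of hierarchical video-text contrastive alignment.
# Mean pooling replaces the paper's multi-granularity transformer and
# attention-guided aggregation modules; shapes are assumptions.
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss; row i of `a` is paired with row i of `b`."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (N, N) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def hierarchical_alignment_loss(frame_feats: torch.Tensor,
                                word_feats: torch.Tensor) -> torch.Tensor:
    """Align video and text at two granularity levels.

    frame_feats: (B, n_clips, n_frames, D) frame features, one clip per caption.
    word_feats:  (B, n_clips, n_words,  D) word features of the paired sentences.
    """
    # Local level: pool frames into clip embeddings and words into sentence embeddings.
    clip_emb = frame_feats.mean(dim=2)                    # (B, n_clips, D)
    sent_emb = word_feats.mean(dim=2)                     # (B, n_clips, D)

    # Global level: pool clips into a video embedding, sentences into a paragraph embedding.
    video_emb = clip_emb.mean(dim=1)                      # (B, D)
    para_emb = sent_emb.mean(dim=1)                       # (B, D)

    # Contrast clip vs. sentence pairs and video vs. paragraph pairs.
    local_loss = info_nce(clip_emb.flatten(0, 1), sent_emb.flatten(0, 1))
    global_loss = info_nce(video_emb, para_emb)
    return local_loss + global_loss


if __name__ == "__main__":
    B, n_clips, n_frames, n_words, D = 4, 3, 8, 12, 256
    frames = torch.randn(B, n_clips, n_frames, D)
    words = torch.randn(B, n_clips, n_words, D)
    print(hierarchical_alignment_loss(frames, words).item())

In this sketch the loss is applied at both granularity levels simultaneously, mirroring the idea of aligning aggregated video information with text at different hierarchical levels; in practice the pooling would be replaced by the learned aggregation modules described above.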
Page(s): 2990 - 3002
Date of Publication: 30 November 2022



I. Introduction

Multimodal representation learning aims to learn data representations by capturing correlations among various modalities, and it has been applied to many different tasks, e.g., visual grounding [1], [2], [3], video retrieval [4], [5], [6], [7], [8], video summarization [9], action recognition [10], [11], [12], and image captioning [13], [14], [15], [16], [17]. Many real-world modalities (such as video, text, and audio) involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs. Different granularities of data represent local or global information with distinct semantics. Moreover, different modalities are characterized by different statistical properties, which creates a heterogeneity gap between them. Intuitively, a video frame can be represented by pixel intensities, while text is typically represented by the outputs of feature extractors. Because of this heterogeneity gap, it is critical for representation learning to discover the correlations among different modalities. To this end, many multimodal representation learning methods [5], [6], [18], [19] model the interactions between different levels of granularity and different modalities to improve performance. However, multimodal representation learning still faces two challenges.
