Deep Multimodal Sequence Fusion by Regularized Expressive Representation Distillation

Abstract:

Multimodal sequence learning aims to utilize information from different modalities to enhance overall performance. Mainstream works often follow an intermediate-fusion pipeline, which explores both modality-specific and modality-supplementary information for fusion. However, the unaligned and heterogeneously distributed multimodal sequences pose significant challenges to the fusion task: 1) to extract both effective unimodal and crossmodal representations and 2) to overcome the overfitting issue in joint multimodal sequence optimization. In this work, we propose regularized expressive representation distillation (RERD), which aims to seek effective multimodal representations and to enhance the generalization of fusion. First, to improve unimodal representation learning, unimodal representations are assigned to multi-head distillation encoders, where they are iteratively updated through distillation attention layers. Second, to alleviate the overfitting issue in joint crossmodal optimization, a multimodal Sinkhorn distance regularizer is proposed to reinforce the expressive representation extraction and to adaptively reduce the modality gap before fusion. Together, these representations provide a comprehensive view of the multimodal sequences and are used for downstream fusion tasks. Experimental results on several popular benchmarks demonstrate that the proposed method achieves state-of-the-art performance, compared with widely used baselines for deep multimodal sequence fusion, as shown in https://github.com/Redaimao/RERD.
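
For readers unfamiliar with the Sinkhorn distance mentioned in the abstract, the following is a minimal NumPy sketch of the general technique: an entropy-regularized optimal-transport cost between two sets of feature vectors. It is not the authors' regularizer. The function name `sinkhorn_distance`, the parameters `eps` and `n_iters`, the squared Euclidean cost, and the uniform marginals are all assumptions made for illustration.

```python
import numpy as np


def sinkhorn_distance(x, y, eps=0.5, n_iters=200):
    """Entropy-regularized transport cost between feature sets x (n, d) and y (m, d).

    Illustrative only: uniform marginals and a squared Euclidean cost are assumed.
    """
    n, m = x.shape[0], y.shape[0]
    # Pairwise squared Euclidean cost matrix, shape (n, m).
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    # Uniform mass on every element of each set.
    a = np.full(n, 1.0 / n)
    b = np.full(m, 1.0 / m)
    # Gibbs kernel; eps should be chosen relative to the scale of the costs.
    K = np.exp(-cost / eps)
    u = np.ones(n)
    v = np.ones(m)
    # Sinkhorn iterations: alternately rescale rows and columns to match a and b.
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    plan = u[:, None] * K * v[None, :]
    return float((plan * cost).sum()), plan


# Hypothetical toy features standing in for two modalities.
feats_a = np.array([[0.0, 0.0], [1.0, 0.0]])
feats_b = feats_a + np.array([3.0, 0.0])
d_same, _ = sinkhorn_distance(feats_a, feats_a)
d_shift, plan_shift = sinkhorn_distance(feats_a, feats_b)
```

In this toy setup the shifted set yields a larger cost than comparing a set with itself, which is the intuition behind using such a distance as a penalty on the gap between modality representations. A version usable as a training regularizer would need a differentiable implementation, typically in log space for numerical stability.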

Published in: IEEE Transactions on Multimedia ( Volume: 25)

Page(s): 2085 - 2096

Date of Publication: 13 January 2022

DOI: 10.1109/TMM.2022.3142448

I. Introduction

With advances in modality representation learning for language [1]–[4], audio [5]–[7], and vision [8]–[11], multimodal sequence learning, which aims to improve overall performance by fusing data from multiple sensory modalities, has recently drawn much attention. Multimodal sequence learning bridges the gaps between different modalities and is expected to provide a more reliable solution with high generalization ability when more modalities are involved.
