ACTUAL: Audio Captioning With Caption Feature Space Regularization

Abstract:

Audio captioning aims at describing the content of audio clips with human language. Due to the ambiguity of audio content, different people may perceive the same audio clip differently, resulting in caption disparities (i.e., the same audio clip may be described by several captions with diverse semantics). In the literature, the one-to-many strategy is often employed to train the audio captioning models, where a related caption is randomly selected as the optimization target for each audio clip at each training iteration. However, we observe that this can lead to significant variations during the optimization process and adversely affect the performance of the model. In this article, we address this issue by proposing an audio captioning method, named ACTUAL (Audio Captioning with capTion featUre spAce reguLarization). ACTUAL involves a two-stage training process: (i) in the first stage, we use contrastive learning to construct a proxy feature space where the similarities between captions at the audio level are explored, and (ii) in the second stage, the proxy feature space is utilized as additional supervision to improve the optimization of the model in a more stable direction. We conduct extensive experiments to demonstrate the effectiveness of the proposed ACTUAL method. The results show that proxy caption embedding can significantly improve the performance of the baseline model and the proposed ACTUAL method offers competitive performance on two datasets compared to state-of-the-art methods.
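
The two ideas named in the abstract, one-to-many target selection and contrastive learning over captions grouped at the audio level, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the loss, so the supervised-contrastive-style formulation below, the function names `sample_target` and `audio_level_contrastive_loss`, the cosine similarity, and the `temperature` value are all assumptions chosen for illustration. Plain Python lists stand in for caption embeddings.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sample_target(captions, rng):
    """One-to-many strategy: pick one reference caption at random
    as the optimization target for the current iteration."""
    return rng.choice(captions)


def audio_level_contrastive_loss(embeddings, audio_ids, temperature=0.1):
    """Hypothetical stage-one objective (an assumption, not the paper's loss).

    Captions that describe the same audio clip are treated as positives;
    every other caption in the batch is treated as a negative.
    """
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n)
                     if j != i and audio_ids[j] == audio_ids[i]]
        if not positives:
            continue  # an anchor without a positive contributes nothing
        # Temperature-scaled similarities to every other caption in the batch.
        logits = {j: cosine(embeddings[i], embeddings[j]) / temperature
                  for j in range(n) if j != i}
        log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
        # Average negative log-probability of the positives for this anchor.
        total += -sum(logits[j] - log_denominator
                      for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy batch: four caption embeddings belonging to two audio clips.
caption_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss = audio_level_contrastive_loss(caption_embeddings, [0, 0, 1, 1])
target = sample_target(["a dog barks", "barking in the distance"],
                       random.Random(0))
```

Under these assumptions, the loss is small when captions of the same clip already lie close together and grows when they are scattered, which is the property a proxy caption feature space would need in order to act as a stable additional training signal in the second stage.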

Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing (Volume: 31)

Page(s): 2643 - 2657

Date of Publication: 06 July 2023

DOI: 10.1109/TASLP.2023.3293015

I. Introduction

Audio captioning is a cross-modal translation task: features are extracted from an audio clip, and a language model describes the content of the clip based on these features [1], [2], [3], [4], [5]. Unlike automatic speech recognition, which transcribes speech to text [6], audio captioning focuses on identifying human-perceived information in general audio signals and expressing it in natural language. The generated caption may include descriptions of sound events, acoustic scenes, and other high-level semantic information such as concepts, physical properties, and high-level knowledge [2].
