ACTUAL: Audio Captioning With Caption Feature Space Regularization | IEEE Journals & Magazine | IEEE Xplore



Abstract:

Audio captioning aims at describing the content of audio clips with human language. Due to the ambiguity of audio content, different people may perceive the same audio clip differently, resulting in caption disparities (i.e., the same audio clip may be described by several captions with diverse semantics). In the literature, the one-to-many strategy is often employed to train audio captioning models, where a related caption is randomly selected as the optimization target for each audio clip at each training iteration. However, we observe that this can lead to significant variations during the optimization process and adversely affect the performance of the model. In this article, we address this issue by proposing an audio captioning method, named ACTUAL (Audio Captioning with capTion featUre spAce reguLarization). ACTUAL involves a two-stage training process: (i) in the first stage, we use contrastive learning to construct a proxy feature space where the similarities between captions at the audio level are explored, and (ii) in the second stage, the proxy feature space is utilized as additional supervision to guide the optimization of the model in a more stable direction. We conduct extensive experiments to demonstrate the effectiveness of the proposed ACTUAL method. The results show that the proxy caption embedding can significantly improve the performance of the baseline model, and the proposed ACTUAL method offers competitive performance on two datasets compared to state-of-the-art methods.
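The first training stage described above builds a proxy caption feature space with contrastive learning, where captions of the same audio clip are treated as semantically related. The following is a minimal, dependency-free sketch of such an objective (an InfoNCE-style loss over L2-normalized caption embeddings, with same-clip captions as positives); it is an illustrative assumption of the general technique, not the paper's actual implementation, and the function name and arguments are hypothetical.

```python
import math

def caption_contrastive_loss(embeddings, clip_ids, temperature=0.1):
    """InfoNCE-style contrastive loss over caption embeddings.

    embeddings: list of L2-normalized vectors (lists of floats),
                one per caption.
    clip_ids:   parallel list; captions sharing a clip id are
                positives (they describe the same audio clip).
    Returns the mean negative log-likelihood over positive pairs.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    n = len(embeddings)
    total, count = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # temperature-scaled exponentiated similarities to every other caption
        sims = [math.exp(dot(embeddings[i], embeddings[j]) / temperature)
                for j in others]
        denom = sum(sims)
        for k, j in enumerate(others):
            if clip_ids[j] == clip_ids[i]:  # positive: same audio clip
                total += -math.log(sims[k] / denom)
                count += 1
    return total / max(count, 1)
```

Minimizing this loss pulls captions of the same clip together and pushes captions of different clips apart, yielding the proxy feature space that the second stage can then use as an additional, more stable supervision signal.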
Page(s): 2643 - 2657
Date of Publication: 06 July 2023


I. Introduction

Audio captioning is a cross-modal translation task that requires extracting features from an audio clip and using a language model to describe the content of the audio clip based on these features [1], [2], [3], [4], [5]. Unlike automatic speech recognition, which transcribes speech to text [6], the audio captioning task focuses on identifying human-perceived information in general audio signals and expressing it in natural language. The generated caption may include descriptions of sound events, acoustic scenes, and other high-level semantic information such as concepts, physical properties, and high-level knowledge [2].
