ACTUAL: Audio Captioning With Caption Feature Space Regularization

Abstract:

Audio captioning aims at describing the content of audio clips with human language. Due to the ambiguity of audio content, different people may perceive the same audio clip differently, resulting in caption disparities (i.e., the same audio clip may be described by several captions with diverse semantics). In the literature, the one-to-many strategy is often employed to train the audio captioning models, where a related caption is randomly selected as the optimization target for each audio clip at each training iteration. However, we observe that this can lead to significant variations during the optimization process and adversely affect the performance of the model. In this article, we address this issue by proposing an audio captioning method, named ACTUAL (Audio Captioning with capTion featUre spAce reguLarization). ACTUAL involves a two-stage training process: (i) in the first stage, we use contrastive learning to construct a proxy feature space where the similarities between captions at the audio level are explored, and (ii) in the second stage, the proxy feature space is utilized as additional supervision to improve the optimization of the model in a more stable direction. We conduct extensive experiments to demonstrate the effectiveness of the proposed ACTUAL method. The results show that proxy caption embedding can significantly improve the performance of the baseline model and the proposed ACTUAL method offers competitive performance on two datasets compared to state-of-the-art methods.
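The first training stage described above can be illustrated with a small sketch. The abstract does not give the exact loss, so the following assumes a supervised-contrastive-style objective in which captions annotated for the same audio clip are treated as positives and all other captions in the batch as negatives. The function name `caption_contrastive_loss`, the `temperature` parameter, and the use of NumPy are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def caption_contrastive_loss(embeddings, clip_ids, temperature=0.1):
    """Illustrative contrastive loss over caption embeddings.

    Assumption (not taken from the paper): captions that share a clip id
    are positives for each other; every other caption is a negative.
    """
    z = np.asarray(embeddings, dtype=np.float64)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit-length rows
    sim = z @ z.T / temperature                       # scaled cosine similarity

    n = z.shape[0]
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)           # exclude self-pairs

    # Row-wise log-softmax, computed stably.
    row_max = sim.max(axis=1, keepdims=True)
    shifted = sim - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    ids = np.asarray(clip_ids)
    pos_mask = (ids[:, None] == ids[None, :]) & ~self_mask
    pos_counts = pos_mask.sum(axis=1)
    valid = pos_counts > 0                            # anchors with a positive

    # Average log-probability assigned to the positives of each anchor.
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / pos_counts[valid]).mean())


# Toy batch: two clips with two captions each (hypothetical 2-D embeddings).
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_grouped = caption_contrastive_loss(toy_embeddings, [0, 0, 1, 1])
loss_mismatched = caption_contrastive_loss(toy_embeddings, [0, 1, 0, 1])
```

In this toy batch the loss is lower when the clip ids match the geometry of the embeddings than when they do not, which is the behaviour that pulls captions of the same clip together in the proxy feature space.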

Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing (Volume: 31)

Page(s): 2643 - 2657

Date of Publication: 06 July 2023

DOI: 10.1109/TASLP.2023.3293015

Yiming Zhang

Pattern Recognition and Intelligent System Laboratory, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China

Yiming Zhang received the B.E. degree from Beijing University of Posts and Telecommunications (BUPT), Beijing, China, in 2018 and the M.S. degree from BUPT in 2021. He is currently working toward the Ph.D. degree. His research interests include audio captioning and audio generation.

Hong Yu

Department of Artificial Intelligence, School of Information and Electrical Engineering, Ludong University, Yantai, Shandong, China

Hong Yu received the B.Sc. and M.Sc. degrees in electronic information engineering from Shandong University, Jinan, China, in 2003 and 2006, respectively, and the Ph.D. degree in signal and information processing from the Beijing University of Posts and Telecommunications, Beijing, China, in 2018. Since 2006, he has been a Lecturer with Ludong University, Yantai, China. His research interests include pattern recognition an...

Ruoyi Du

Pattern Recognition and Intelligent System Laboratory, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China

Ruoyi Du received the B.E. degree in telecommunication in management from Beijing University of Posts and Telecommunications, Beijing, China, in 2020. He is currently working toward the Ph.D. degree. His research interests include pattern recognition and computer vision.

Zheng-Hua Tan

Department of Electronic Systems, Aalborg University, Aalborg, Denmark

Wenwu Wang

Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, U.K.

Zhanyu Ma

Pattern Recognition and Intelligent System Laboratory, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China

Yuan Dong

Pattern Recognition and Intelligent System Laboratory, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China

Yuan Dong received the Ph.D. degree from Shanghai Jiao Tong University, Shanghai, China, in 1999. He is currently a Professor with Beijing University of Posts and Telecommunications, Beijing, China. His research interests include text analysis, machine translation, and natural language processing.

I. Introduction

Audio captioning is a cross-modal translation task that requires extracting features from an audio clip and using a language model to describe the content of the clip based on these features [1], [2], [3], [4], [5]. Unlike automatic speech recognition, which transcribes speech to text [6], audio captioning focuses on identifying human-perceived information in general audio signals and expressing it in natural language. The generated caption may include descriptions of sound events, acoustic scenes, and other high-level semantic information such as concepts, physical properties, and high-level knowledge [2].

Zheng-Hua Tan (Senior Member, IEEE) is a Professor and a Co-head of the Centre for Acoustic Signal Processing Research (CASPR) with Aalborg University, Aalborg, Denmark. He was a Visiting Scientist with Massachusetts Institute of Technology, Cambridge, MA, USA, an Associate Professor with Shanghai Jiao Tong University, Shanghai, China, and a Postdoctoral Fellow with Korea Advanced Institute of Science and Technology, Daejeon, South Korea. His research interests include speech and speaker recognition, noise-robust speech processing, multimodal signal processing, social robotics, and machine learning. He has authored and coauthored more than 200 refereed publications. He was the Chair of the IEEE Signal Processing Society Machine Learning for Signal Processing Technical Committee (MLSP TC). He is an Associate Editor for the IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING. He was an Editorial Board Member for Computer Speech and Language and Guest Editor for the IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING and Neurocomputing. He was the General Chair of IEEE MLSP 2018 and a TPC Co-Chair of IEEE SLT 2016.

Wenwu Wang (Senior Member, IEEE) was born in Anhui, China. He received the B.Sc. degree in 1997, the M.E. degree in 2000, and the Ph.D. degree in 2002, all from Harbin Engineering University, Harbin, China. He was with King's College London, Cardiff University, Tao Group Ltd. (now Antix Labs Ltd.), and Creative Labs before joining the University of Surrey, Guildford, U.K., in May 2007, where he is currently a Professor of signal processing and machine learning and a Co-Director of the Machine Audition Lab within the Centre for Vision, Speech and Signal Processing. His research interests include signal processing, machine learning and perception, machine audition (listening), and statistical anomaly detection. He has coauthored more than 300 publications in these areas. He is the Elected Chair of IEEE Signal Processing Society Machine Learning for Signal Processing Technical Committee. He is currently the Senior Area Editor of IEEE Transactions on Signal Processing and an Associate Editor for IEEE/ACM Transactions on Audio, Speech, and Language Processing.

Zhanyu Ma (Senior Member, IEEE) received the Ph.D. degree in electrical engineering from KTH Royal Institute of Technology, Stockholm, Sweden, in 2011. From 2012 to 2013, he was a Postdoctoral Research Fellow with the School of Electrical Engineering, KTH. From 2014 to 2019, he was an Associate Professor with the Beijing University of Posts and Telecommunications, Beijing, China, where he has been a Professor since 2019. His research interests include pattern recognition and machine learning fundamentals with a focus on applications in computer vision and multimedia signal processing.
