ACTUAL: Audio Captioning With Caption Feature Space Regularization | IEEE Journals & Magazine | IEEE Xplore


Abstract:

Audio captioning aims at describing the content of audio clips with human language. Due to the ambiguity of audio content, different people may perceive the same audio clip differently, resulting in caption disparities (i.e., the same audio clip may be described by several captions with diverse semantics). In the literature, the one-to-many strategy is often employed to train the audio captioning models, where a related caption is randomly selected as the optimization target for each audio clip at each training iteration. However, we observe that this can lead to significant variations during the optimization process and adversely affect the performance of the model. In this article, we address this issue by proposing an audio captioning method, named ACTUAL (Audio Captioning with capTion featUre spAce reguLarization). ACTUAL involves a two-stage training process: (i) in the first stage, we use contrastive learning to construct a proxy feature space where the similarities between captions at the audio level are explored, and (ii) in the second stage, the proxy feature space is utilized as additional supervision to improve the optimization of the model in a more stable direction. We conduct extensive experiments to demonstrate the effectiveness of the proposed ACTUAL method. The results show that proxy caption embedding can significantly improve the performance of the baseline model and the proposed ACTUAL method offers competitive performance on two datasets compared to state-of-the-art methods.
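As a rough illustration of the two ideas named in the abstract, the sketch below shows (a) one-to-many target selection, where one reference caption is drawn at random per clip at each iteration, and (b) a generic supervised contrastive loss that pulls together the embeddings of captions belonging to the same audio clip. The abstract does not specify the loss, the encoders, or any hyperparameters, so the function names, the use of cosine similarity, the temperature value, and the loss form itself are assumptions chosen for illustration, not the authors' implementation.

```python
import math
import random


def sample_targets(captions_per_clip, rng):
    """One-to-many strategy: pick one reference caption per clip for this iteration."""
    return {clip: rng.choice(caps) for clip, caps in captions_per_clip.items()}


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def caption_contrastive_loss(embeddings, clip_ids, temperature=0.1):
    """Hypothetical supervised contrastive loss over caption embeddings.

    Captions that share a clip id are treated as positives; every other
    caption in the batch acts as a negative. Anchors without any positive
    are skipped. This is a generic formulation, assumed for illustration.
    """
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and clip_ids[j] == clip_ids[i]]
        if not positives:
            continue
        # Temperature-scaled similarities between the anchor and all other captions.
        logits = {
            j: cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(n)
            if j != i
        }
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        # Average negative log-probability assigned to the positives.
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0
```

With embeddings in which same-clip captions point in similar directions, this loss is small; when the clip labels are mismatched with the geometry, it grows. In a second stage, such an embedding space could serve as an extra supervision signal alongside the captioning loss; how the paper combines the two is not stated in this excerpt.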
Page(s): 2643 - 2657
Date of Publication: 06 July 2023


I. Introduction

Audio captioning is a cross-modal translation task that requires extracting features from an audio clip and using a language model to describe the content of the clip based on these features [1], [2], [3], [4], [5]. Unlike automatic speech recognition, which transcribes speech to text [6], audio captioning focuses on identifying human-perceived information in general audio signals and expressing it in natural language. The generated caption may include descriptions of sound events, acoustic scenes, and other high-level semantic information such as concepts, physical properties, and high-level knowledge [2].

Yiming Zhang
Pattern Recognition and Intelligent System Laboratory, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China
Yiming Zhang received the B.E. degree from Beijing University of Posts and Telecommunications (BUPT), Beijing, China, in 2018 and the M.S. degree from BUPT in 2021. He is working toward the Ph.D. degree. His research interests include audio captioning and audio generation.
Hong Yu
Department of Artificial Intelligence, School of Information and Electrical Engineering, Ludong University, Yantai, Shandong, China
Hong Yu received the B.Sc. and M.Sc. degrees in electronic information engineering from Shandong University, Jinan, China, in 2003 and 2006, respectively, and the Ph.D. degree in signal and information processing from the Beijing University of Posts and Telecommunications, Beijing, China, in 2018. Since 2006, he has been a Lecturer with Ludong University, Yantai, China. His research interests include pattern recognition and machine learning fundamentals with a focus on applications in speech processing, image processing, data mining, biomedical signal processing, and bioinformatics.
Ruoyi Du
Pattern Recognition and Intelligent System Laboratory, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China
Ruoyi Du received the B.E. degree in telecommunication in management from Beijing University of Posts and Telecommunications, Beijing, China, in 2020. He is currently working toward the Ph.D. degree. His research interests include pattern recognition and computer vision.
Zheng-Hua Tan
Department of Electronic Systems, Aalborg University, Aalborg, Denmark
Zheng-Hua Tan (Senior Member, IEEE) is a Professor and a Co-head of the Centre for Acoustic Signal Processing Research (CASPR) with Aalborg University, Aalborg, Denmark. He was a Visiting Scientist with Massachusetts Institute of Technology, Cambridge, MA, USA, an Associate Professor with Shanghai Jiao Tong University, Shanghai, China, and a Postdoctoral Fellow with Korea Advanced Institute of Science and Technology, Daejeon, South Korea. His research interests include speech and speaker recognition, noise-robust speech processing, multimodal signal processing, social robotics, and machine learning. He has authored and coauthored more than 200 refereed publications. He was the Chair of the IEEE Signal Processing Society Machine Learning for Signal Processing Technical Committee (MLSP TC). He is an Associate Editor for the IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING. He was an Editorial Board Member for Computer Speech and Language and Guest Editor for the IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING and Neurocomputing. He was the General Chair of IEEE MLSP 2018 and a TPC Co-Chair of IEEE SLT 2016.
Wenwu Wang
Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, U.K.
Wenwu Wang (Senior Member, IEEE) was born in Anhui, China. He received the B.Sc. degree in 1997, the M.E. degree in 2000, and the Ph.D. degree in 2002, all from Harbin Engineering University, Harbin, China. He was with King's College London, Cardiff University, Tao Group Ltd. (now Antix Labs Ltd.), and Creative Labs before joining the University of Surrey, Guildford, U.K., in May 2007, where he is currently a Professor of signal processing and machine learning and a Co-Director of the Machine Audition Lab within the Centre for Vision, Speech and Signal Processing. His research interests include signal processing, machine learning and perception, machine audition (listening), and statistical anomaly detection. He has coauthored more than 300 publications in these areas. He is the Elected Chair of the IEEE Signal Processing Society Machine Learning for Signal Processing Technical Committee. He is currently the Senior Area Editor of IEEE Transactions on Signal Processing and an Associate Editor for IEEE/ACM Transactions on Audio, Speech and Language Processing.
Zhanyu Ma
Pattern Recognition and Intelligent System Laboratory, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China
Zhanyu Ma (Senior Member, IEEE) received the Ph.D. degree in electrical engineering from KTH Royal Institute of Technology, Stockholm, Sweden, in 2011. From 2012 to 2013, he was a Postdoctoral Research Fellow with the School of Electrical Engineering, KTH. From 2014 to 2019, he was an Associate Professor with the Beijing University of Posts and Telecommunications, Beijing, China, where he has been a Professor since 2019. His research interests include pattern recognition and machine learning fundamentals with a focus on applications in computer vision and multimedia signal processing.
Yuan Dong
Pattern Recognition and Intelligent System Laboratory, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China
Yuan Dong received the Ph.D. degree from Shanghai Jiao Tong University, Shanghai, China, in 1999. He is currently a Professor with Beijing University of Posts and Telecommunications, Beijing, China. His research interests include text analysis, machine translation, and natural language processing.