I. Introduction
Audio captioning is a cross-modal translation task that requires extracting features from an audio clip and using a language model to describe the content of the clip based on these features [1], [2], [3], [4], [5]. Unlike automatic speech recognition, which transcribes speech to text [6], audio captioning focuses on identifying human-perceived information in general audio signals and expressing it in natural language. A generated caption may include descriptions of sound events, acoustic scenes, and other high-level semantic information such as concepts, physical properties, and high-level knowledge [2].