1. INTRODUCTION
For music generation, acquiring substantial music data paired with captions is essential. However, most music datasets with accompanying captions are closed-source due to license restrictions [1], [2], [3]. At present, the largest publicly available dataset serving this need is MusicCaps [4], which comprises approximately 28.52 hours of music with captions. Compared to datasets available for tasks such as audio classification or audio tagging, MusicCaps is relatively small. There is therefore a pressing need for a methodology that can generate text-music pairs at scale for public use.