1. INTRODUCTION
In recent years, deep generative models have made promising progress in the field of symbolic music generation [1]–[3]. In particular, the music inpainting task [4]–[7] has attracted considerable research attention due to its great practical value in human-computer music co-creation [8]. The general setting is that human composers create some parts of a piece, while the algorithm inpaints (or infills) the rest. However, long-term generation remains a challenging task: when the inpainting scope exceeds several beats, current methods cannot yet preserve a natural structure or the overall musicality.