I. Introduction
Dance is a performing art in which the dancer performs a series of motions to the music. It has important significance and implies rich feelings. With the growth of Artificial Intelligence (AI) technology, the computer-generated arts have attracted more and more researchers' interest [1], [2]. Deep cross-modal audio-visual learning has become a popular topic [3]. We focus on the cross-modal music-driven dance synthesis task that aims to generate dance motions conditioned on music as input. It plays a significant role in virtual reality, games, entertainment, choreography, etc. The music-driven dance synthesis is a challenging task. A good dance is not only consistent with the rhythm and style of the music but also natural and elegant. Moreover, dance is a creative performance art. People often have different feelings when dancing with different music, and even under the same music different performers move differently. The fact that Generative Adversarial Networks (GANs) can automatically learn to generate new examples using a generator and a discriminator has been shown to be effective in the cross-model generation tasks [2], [4], [5]. Recently, it has been successfully applied to address limited creativity in the music-driven dance synthesis [6]–[10]. To capture the coherence of dance motions and keep consistency between the generated dance motions and music, Lee et al. [6] and Ren et al. [8] design a global content discriminator for the alignment in latent space encoded from dance motions and music. The reconstruction loss with distance is also often combined with the adversarial loss as the loss function [6], [8], [9]. However, loss, as pointed out in [9], [11], will bring about the generated dance motion to be too restrictive and conservative in practice, which balances the errors of big-scale body motions (body-scale motions) and small-scale joint motions (joint-scale motions). 
Moreover, we observe that joints at different scales change differently under different movements, as illustrated in Fig. 1. For the hand-lift movements in the left column, only part of the joint-scale joints move. For the turning movements in the right column, the body-scale joints change while the joint-scale joints remain basically unchanged. Additionally, the dance motion joints in the dataset, derived from a pose extractor, are noisy.
Fig. 1. Examples of joint changes at different scales for different movements. The upper row shows body-scale motions and the lower row shows joint-scale motions.