1. Introduction
Text-to-speech (TTS) aims to synthesize natural-sounding human speech from text. In the past few years, deep learning based models such as Tacotron 2 [1], DeepVoice 3 [2], and FastSpeech 2 [3] have developed rapidly, and recent research shows that the quality and naturalness of their synthesized voices are comparable to real human speech. Despite these achievements in speaker-dependent TTS, creating expressive and controllable speaking styles in the multi-speaker setting still requires further research. Moreover, voice cloning models that rely on a speaker encoder to handle unseen speakers tend to synthesize neutral-sounding voices with poor similarity to the target speaker. Therefore, how to sufficiently extract speaker information from reference utterances becomes a significant problem.