I. Introduction
Text-to-speech (TTS) conversion is an important human-computer interaction task that aims to synthesize intelligible, fluent speech indistinguishable from human recordings, and it has attracted wide attention in the machine learning community. Conventional speech synthesis systems consist of three basic components: a text analysis module that converts text sequences into linguistic features, an acoustic model that generates acoustic features from the linguistic features, and a vocoder that synthesizes waveforms from the acoustic features. With the development of end-to-end technology, models can take text or annotated characters directly as input and model speech directly from text or text features, which reduces the reliance on separate text analysis and vocoder modules. Applied to languages such as Chinese and English, which have a large number of speakers and a wide range of applications, end-to-end models can generate more natural and expressive speech.