1. Introduction
Text-to-speech (TTS) aims to synthesize natural-sounding human speech from text. In the past few years, deep learning based models such as Tacotron 2 [1], DeepVoice 3 [2], and FastSpeech 2 [3] have developed rapidly, and recent research shows that the quality and naturalness of their synthesized voices are comparable to real human speech. Despite these achievements in speaker-dependent TTS, creating expressive and controllable speaking styles in the multi-speaker setting still requires further research. Moreover, voice cloning models that rely on a speaker encoder to handle unseen speakers tend to synthesize neutral-sounding voices with poor similarity to the target speaker. Therefore, how to sufficiently extract speaker information from reference utterances becomes a significant problem.