Conferences >ICASSP 2020 - 2020 IEEE Inter...

Mellotron: Multispeaker Expressive Voice Synthesis by Conditioning on Rhythm, Pitch and Global Style Tokens

Download PDF
Download References
Request Permissions
Save to
Alerts

Abstract:

Mellotron is a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data. By explicitly con...Show More

Metadata

Abstract:

Mellotron is a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data. By explicitly conditioning on rhythm and continuous pitch contours from an audio signal or music score, Mellotron is able to generate speech in a variety of styles ranging from read speech to expressive speech, from slow drawls to rap and from monotonous voice to singing voice. Unlike other methods, we train Mellotron using only read speech data without alignments between text and audio. We evaluate our models using the LJSpeech and LibriTTS datasets. We provide F0 Frame Errors and synthesized samples that include style transfer from other speakers, singers and styles not seen during training, procedural manipulation of rhythm and pitch and choir synthesis.

Published in: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Date of Conference: 04-08 May 2020

Date Added to IEEE Xplore: 09 April 2020

ISBN Information:

ISSN Information:

DOI: 10.1109/ICASSP40776.2020.9054556

Conference Location: Barcelona, Spain

Contents

1. INTRODUCTION

Speech synthesis is typically formulated as the conversion of text to speech (TTS). This formulation, however, leaves out control for all the aspects of speech not contained in the text. Here we approach the problem of expressive speech synthesis which includes not just text, but other characteristics such as pitch, rhythm and emphasis. There are formulations to expressive speech synthesis that require animated and emotive voice data. This is an inconvenient drawback given the limited access to such data. In our approach, we can make a voice emote and sing without any such data.

References is not available for this document.

Mellotron: Multispeaker Expressive Voice Synthesis by Conditioning on Rhythm, Pitch and Global Style Tokens

Abstract:

Metadata

Abstract:

ISSN Information:

1. INTRODUCTION

References

IEEE Account

Purchase Details

Profile Information

Need Help?

Mellotron: Multispeaker Expressive Voice Synthesis by Conditioning on Rhythm, Pitch and Global Style Tokens

Alerts

Abstract:

Metadata

Abstract:

ISSN Information:

1. INTRODUCTION

References