Variable-Length Speaker Conditioning in Flow-Based Text-to-Speech

Abstract:

In this letter, we propose a novel speaker conditioning technique that leverages a variable-length reference embedding sequence for flow-based text-to-speech (TTS) architectures in the context of zero-shot multi-speaker TTS (ZSM-TTS). Unlike conventional ZSM-TTS methods, which usually rely on a single fixed-dimensional vector to represent the entire reference speech, our approach extracts a variable-length embedding sequence for more flexible and efficient conditioning. We enhance the affine coupling function in flow-based TTS architectures by introducing attentive speaker conditioning, which allows the speaker conditioning to vary locally. Our experiments demonstrate the effectiveness of the proposed method, showing improvements in speaker similarity, speech naturalness, and speech intelligibility over the baseline methods.
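As a rough illustration of the idea summarized above, the following toy sketch shows an affine coupling step whose scale and shift are conditioned, frame by frame, on an attention summary of a variable-length reference sequence. It is not the authors' implementation: the function names (`attend`, `scale_and_shift`, `coupling_forward`, `coupling_inverse`), the parameter-free stand-in for the learned conditioning network, and the toy data are all assumptions made for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, reference):
    # Scaled dot-product attention: every frame (query) gets its own
    # weighted summary of the reference sequence, whatever its length.
    dim = len(reference[0])
    contexts = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, r)) / math.sqrt(dim)
                  for r in reference]
        weights = softmax(scores)
        contexts.append([sum(w * r[d] for w, r in zip(weights, reference))
                         for d in range(dim)])
    return contexts

def scale_and_shift(xa, reference):
    # Hypothetical, parameter-free stand-in for a learned network that maps
    # (untouched half, attended speaker context) to log-scale and shift.
    contexts = attend(xa, reference)
    log_s = [[math.tanh(a + c) for a, c in zip(fa, fc)]
             for fa, fc in zip(xa, contexts)]
    shift = [[a * c for a, c in zip(fa, fc)]
             for fa, fc in zip(xa, contexts)]
    return log_s, shift

def split(frames):
    # Split each frame's channels into two halves.
    half = len(frames[0]) // 2
    return [f[:half] for f in frames], [f[half:] for f in frames]

def coupling_forward(x, reference):
    # First half passes through unchanged; second half is affinely transformed.
    xa, xb = split(x)
    log_s, shift = scale_and_shift(xa, reference)
    yb = [[b * math.exp(s) + t for b, s, t in zip(fb, fs, ft)]
          for fb, fs, ft in zip(xb, log_s, shift)]
    return [a + b for a, b in zip(xa, yb)]

def coupling_inverse(y, reference):
    # The unchanged half reproduces the same scale and shift, so the
    # transform can be undone exactly (up to floating-point error).
    ya, yb = split(y)
    log_s, shift = scale_and_shift(ya, reference)
    xb = [[(b - t) * math.exp(-s) for b, s, t in zip(fb, fs, ft)]
          for fb, fs, ft in zip(yb, log_s, shift)]
    return [a + b for a, b in zip(ya, xb)]

# Toy data: 3 frames with 4 channels; reference embeddings with 2 channels.
frames = [[0.1, -0.4, 0.7, 0.2], [0.5, 0.3, -0.6, 0.9], [-0.2, 0.8, 0.4, -0.1]]
short_ref = [[0.3, -0.1], [0.6, 0.2]]
long_ref = [[0.3, -0.1], [0.6, 0.2], [-0.5, 0.4], [0.1, 0.9], [0.7, -0.3]]
```

Calling `coupling_inverse(coupling_forward(frames, long_ref), long_ref)` should recover `frames` up to rounding, and the same functions accept `short_ref` unchanged, which is the point of conditioning on a sequence rather than a single fixed-dimensional vector: each frame can weight the reference embeddings differently.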

Published in: IEEE Signal Processing Letters (Volume: 31)

Page(s): 899 - 903

Date of Publication: 18 March 2024

DOI: 10.1109/LSP.2024.3377588

I. Introduction

Speech synthesis has made remarkable strides in recent years, thanks to advances in deep learning and neural networks. Today, synthetic speech is often indistinguishable from human speech. Nevertheless, a challenging research problem remains: cloning the voice of a speaker who was unseen during training. This is difficult for text-to-speech (TTS) models because accurately mimicking a specific speaker's voice requires meticulously tuning speech factors, including timbre, accent, and other characteristics unique to that speaker.
