Variable-Length Speaker Conditioning in Flow-Based Text-to-Speech

Publisher: IEEE

Abstract:

In this letter, we propose a novel speaker conditioning technique that leverages a variable-length reference embedding sequence for flow-based text-to-speech (TTS) architectures in the context of zero-shot multi-speaker text-to-speech (ZSM-TTS). Unlike conventional ZSM-TTS methods, which usually rely on a single fixed-dimensional vector to represent the entire reference speech, our approach extracts a variable-length embedding sequence for more flexible and efficient conditioning. We enhance the affine coupling function used in flow-based TTS architectures by introducing attentive speaker conditioning, which allows the speaker conditioning to vary locally. Our experiments demonstrate the effectiveness of the proposed method, with improvements in speaker similarity, speech naturalness, and speech intelligibility over the baseline methods.
Published in: IEEE Signal Processing Letters (Volume: 31)
Page(s): 899 - 903
Date of Publication: 18 March 2024
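
The abstract's central idea, an affine coupling function whose speaker conditioning is obtained by attending over a variable-length reference embedding sequence, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-head dot-product attention, the tanh projection, the weight names (`w_q`, `w_out`), and the requirement that reference embeddings share the half-channel dimension are all assumptions made only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, ref_seq):
    # Each of the T frames attends over the L reference embeddings,
    # so every frame receives its own (local) speaker context.
    scores = queries @ ref_seq.T / np.sqrt(queries.shape[-1])  # (T, L)
    return softmax(scores, axis=-1) @ ref_seq                  # (T, d)


class AttentiveCoupling:
    """Hypothetical affine coupling layer with attentive speaker conditioning."""

    def __init__(self, half_dim, rng):
        # Illustrative random weights; a real model would learn these.
        self.w_q = rng.normal(scale=0.1, size=(half_dim, half_dim))
        self.w_out = rng.normal(scale=0.1, size=(2 * half_dim, 2 * half_dim))

    def _params(self, x_a, ref_seq):
        # Scale and shift depend only on the untouched half and the reference.
        ctx = attend(x_a @ self.w_q, ref_seq)
        h = np.tanh(np.concatenate([x_a, ctx], axis=-1) @ self.w_out)
        log_s, t = np.split(h, 2, axis=-1)
        return log_s, t

    def forward(self, x, ref_seq):
        x_a, x_b = np.split(x, 2, axis=-1)
        log_s, t = self._params(x_a, ref_seq)
        y_b = x_b * np.exp(log_s) + t
        # The log-determinant of the Jacobian is the sum of the log-scales.
        return np.concatenate([x_a, y_b], axis=-1), log_s.sum()

    def inverse(self, y, ref_seq):
        y_a, y_b = np.split(y, 2, axis=-1)
        log_s, t = self._params(y_a, ref_seq)
        x_b = (y_b - t) * np.exp(-log_s)
        return np.concatenate([y_a, x_b], axis=-1)


rng = np.random.default_rng(0)
layer = AttentiveCoupling(half_dim=4, rng=rng)
x = rng.normal(size=(10, 8))         # 10 frames, 8 channels
ref_short = rng.normal(size=(3, 4))  # reference sequences of different lengths
ref_long = rng.normal(size=(25, 4))
y_short, _ = layer.forward(x, ref_short)
y_long, _ = layer.forward(x, ref_long)
```

Because the attention output always has one row per frame, the same layer accepts reference sequences of any length, and the transform stays invertible as long as the same reference is supplied to `inverse`.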

I. Introduction

Speech synthesis has made remarkable strides in recent years, thanks to advancements in deep learning and neural networks. Today, we can generate synthetic speech that is often indistinguishable from human speech. Despite these achievements, one research problem remains challenging: cloning the voice of a speaker who was unseen during training. This is difficult for text-to-speech (TTS) models because accurately mimicking a specific speaker's voice requires meticulous tuning of speech factors, including timbre, accent, and other characteristics unique to that speaker.
