I. Introduction
Speech synthesis is the task of artificially generating human speech and is used in a wide range of fields. Among the speech synthesis technologies currently in practical use, text-to-speech synthesis (TTS) generates speech from arbitrary text. TTS has been studied for many years, has long been regarded as a very difficult task, and many methods have been proposed for it. With the recent development of deep learning, however, the accuracy of TTS has improved dramatically, and current systems can synthesize speech that is almost indistinguishable from human speech. Moreover, synthesized speech can be given a variety of expressions by adjusting parameters such as intonation and speaking rate, by using models trained on speech with specific emotions, and by post-processing the generated audio.

However, these approaches require manual tuning, and the expressiveness of the output largely depends on the input text alone, so we believe there is still room for further development. Indeed, multimodal speech synthesis methods that take information other than text as input have been proposed and are attracting attention for applications in entertainment and other fields. This study therefore aims to generate artificial speech with richer expressiveness from multimodal information, namely text and face images, and we propose a model to achieve this. In particular, we focus on lip information in face images and aim to synthesize speech corresponding to images of lip movements.