Lip-to-Speech Synthesis in the Wild with Multi-Task Learning

Abstract:

Recent studies have shown impressive performance in lip-to-speech synthesis, which aims to reconstruct speech from visual information alone. However, they struggle to synthesize accurate speech in the wild because there is insufficient supervision to guide the model toward the correct content. In contrast to previous methods, in this paper we develop a powerful Lip2Speech method that can reconstruct speech with correct content from the input lip movements, even in a wild environment. To this end, we design multi-task learning that guides the model using multimodal supervision, i.e., text and audio, to complement the insufficient word representations of the acoustic feature reconstruction loss. Thus, the proposed framework can synthesize speech containing the right content for multiple speakers with unconstrained sentences. We verify the effectiveness of the proposed method on the LRS2, LRS3, and LRW datasets.
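
As a rough illustration of the multi-task idea described in the abstract, the sketch below combines an acoustic feature reconstruction term with one auxiliary text-supervision term. It is a minimal sketch under stated assumptions, not the authors' formulation: the L1 reconstruction loss, the per-step cross-entropy standing in for text supervision, the `text_weight` parameter, and all function names are hypothetical, since this excerpt does not specify the actual loss functions or weights. The audio-side supervision mentioned in the abstract is omitted for brevity.

```python
import math

def l1_reconstruction_loss(pred_frames, target_frames):
    """Mean absolute error between predicted and target acoustic frames.

    Assumption: frames are lists of floats (e.g. mel-spectrogram bins).
    """
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred_frames, target_frames):
        for p, t in zip(pred_frame, target_frame):
            total += abs(p - t)
            count += 1
    return total / count

def text_cross_entropy(token_logits, target_tokens):
    """Mean cross-entropy of the target token ids under per-step logits.

    Assumption: a stand-in for text supervision; the paper's actual
    text loss is not given in this excerpt.
    """
    total = 0.0
    for logits, target in zip(token_logits, target_tokens):
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[target]
    return total / len(target_tokens)

def multitask_loss(pred_frames, target_frames, token_logits, target_tokens,
                   text_weight=0.5):
    """Weighted sum of the reconstruction and text terms (hypothetical weight)."""
    recon = l1_reconstruction_loss(pred_frames, target_frames)
    text = text_cross_entropy(token_logits, target_tokens)
    return recon + text_weight * text
```

In this toy form, the text term penalizes predictions whose content does not match the transcript even when the acoustic features are numerically close, which is the kind of complementary signal the abstract attributes to multimodal supervision.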

Published in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Date of Conference: 04-10 June 2023

Date Added to IEEE Xplore: 05 May 2023

DOI: 10.1109/ICASSP49357.2023.10095582

Conference Location: Rhodes Island, Greece

1. INTRODUCTION

With the recent development of Artificial Intelligence (AI) technology, there is growing interest in connecting AI and humans to solve problems and improve human life. Such technology is also needed in everyday human-to-human conversation, especially as virtual meetings and video conferencing grow in importance. Among the many problems in human-to-human conversation, there is an increasing need for technologies that can accurately recognize a conversation when the voice signal is barely available. Such technology is promising because it can help people follow a conversation in situations such as a crowded shopping mall, a party with many people and loud music, or a silent video conference.
