Lip-to-Speech Synthesis in the Wild with Multi-Task Learning | IEEE Conference Publication | IEEE Xplore

Lip-to-Speech Synthesis in the Wild with Multi-Task Learning

Publisher: IEEE

Abstract:

Recent studies have shown impressive performance in lip-to-speech synthesis, which aims to reconstruct speech from visual information alone. However, they struggle to synthesize accurate speech in the wild, due to insufficient supervision for guiding the model to infer the correct content. Distinct from previous methods, in this paper we develop a powerful Lip2Speech method that can reconstruct speech with the correct content from input lip movements, even in a wild environment. To this end, we design multi-task learning that guides the model using multimodal supervision, i.e., text and audio, to complement the insufficient word representations of the acoustic feature reconstruction loss. The proposed framework thus brings the advantage of synthesizing speech with the correct content for multiple speakers and unconstrained sentences. We verify the effectiveness of the proposed method on the LRS2, LRS3, and LRW datasets.
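The multi-task objective described above can be sketched as a weighted sum of an acoustic reconstruction term and auxiliary text and audio supervision terms. The sketch below is a minimal illustration, not the paper's implementation: the weight names (`w_text`, `w_audio`) and the use of an L1 mel-spectrogram loss are assumptions for illustration, and the auxiliary losses are passed in as precomputed scalars.

```python
import numpy as np

def multitask_lip2speech_loss(mel_pred, mel_true, text_loss, audio_loss,
                              w_text=0.5, w_audio=0.5):
    """Hypothetical multi-task objective: L1 mel-spectrogram reconstruction
    plus weighted auxiliary text and audio supervision terms.
    `text_loss` and `audio_loss` are assumed to be precomputed scalars
    (e.g., a text-prediction loss and an audio-representation loss)."""
    # Acoustic feature reconstruction loss (mean absolute error over mels)
    recon = np.abs(mel_pred - mel_true).mean()
    # Auxiliary supervision complements the reconstruction signal
    return recon + w_text * text_loss + w_audio * audio_loss

# Toy usage with dummy mel-spectrograms (2 frames x 3 mel bins)
mel_pred = np.ones((2, 3))
mel_true = np.zeros((2, 3))
total = multitask_lip2speech_loss(mel_pred, mel_true,
                                  text_loss=2.0, audio_loss=4.0)
```

The key design choice this illustrates is that no single loss carries the full burden of inferring content: the text and audio terms inject linguistic supervision that the acoustic reconstruction loss alone does not provide.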
Date of Conference: 04-10 June 2023
Date Added to IEEE Xplore: 05 May 2023
Conference Location: Rhodes Island, Greece


1. INTRODUCTION

With the recent development of Artificial Intelligence (AI) technology, there is growing interest in connecting AI with humans to improve daily life. Such technology is also needed in everyday human-to-human conversation, especially as virtual meetings and video conferencing grow in importance. Among the many problems in human-to-human conversation, the need for technologies that recognize conversation accurately when voice signals are hardly available has been increasing. This technology is promising because it can help people understand conversation in situations such as a crowded shopping mall, a party with many people and loud music, or a silent video conference.
