I. Introduction
Speech synthesis is the task of artificially generating human speech and is used in a wide range of fields. Among the speech synthesis technologies currently in practical use, text-to-speech synthesis (TTS) generates speech from arbitrary text. TTS has been studied for many years, has long been regarded as a very difficult task, and many methods have been proposed for it. With the recent development of deep learning, however, the accuracy of TTS has improved dramatically, and current systems can synthesize speech that is almost indistinguishable from human speech. Moreover, synthesized speech can be given a variety of expressions by adjusting parameters such as intonation and speaking rate, by using models trained on speech with specific emotions, and by post-processing the generated audio.

However, these approaches require manual tuning, and the expressiveness of the output largely depends on the input text alone, so we believe there is still room for further development. Indeed, multimodal speech synthesis methods that take information other than text as input have been proposed and are attracting attention for applications in entertainment and other fields. This study therefore aims to generate artificial speech with richer expressiveness from multimodal information, namely text and face images, and we propose a model to achieve this. In particular, we focus on lip information in face images and aim to synthesize speech corresponding to images of lip movements.