1. Introduction
We propose the first talking-face generation network that can lip-sync any identity at ultra-high resolutions such as 4K. Our model captures fine-grained details of the lip region, including color, texture, and essential features such as teeth. While the current state-of-the-art model, Wav2Lip [16], generates faces at 96 × 96 pixels (left part), our proposed method synthesizes 64 times as many pixels (8× along each dimension), rendering realistic, high-quality results at 768 × 768 pixels.