DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation

Abstract:

Talking head synthesis is a promising approach for the video production industry. Recently, considerable effort has been devoted to improving generation quality or enhancing model generalization. However, few works address both issues simultaneously, which is essential for practical applications. To this end, we turn to the emerging and powerful Latent Diffusion Models and model talking head generation as an audio-driven, temporally coherent denoising process (DiffTalk). More specifically, instead of employing audio signals as the single driving factor, we investigate the control mechanism of the talking face and incorporate reference face images and landmarks as conditions for personality-aware generalized synthesis. In this way, the proposed DiffTalk can produce high-quality talking head videos synchronized with the source audio and, more importantly, generalizes naturally across different identities without further fine-tuning. Additionally, DiffTalk can be gracefully tailored for higher-resolution synthesis with negligible extra computational cost. Extensive experiments show that the proposed DiffTalk efficiently synthesizes high-fidelity audio-driven talking head videos for generalized novel identities. For more video results, please refer to https://sstzal.github.io/DiffTalk/.

Published in: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Date of Conference: 17-24 June 2023

Date Added to IEEE Xplore: 22 August 2023

DOI: 10.1109/CVPR52729.2023.00197

Conference Location: Vancouver, BC, Canada

1. Introduction

Talking head synthesis is a challenging and promising research topic that aims to generate video portraits from given audio. This technique is widely applied in practical scenarios including animation, virtual avatars, online education, and video conferencing [4], [45], [48], [51], [54].
