DiTalker

DiTalker: A Unified DiT-based Framework for High-Quality and Speaking Styles Controllable Portrait Animation

Paper | Video | Supplementary | arXiv | Poster | Code

Abstract

Portrait animation aims to synthesize talking videos from a static reference face, conditioned on audio and style frame cues (e.g., emotion and head poses), while ensuring precise lip synchronization and faithful reproduction of speaking styles. Existing diffusion-based portrait animation methods mainly focus on lip synchronization or static emotion transformation, often overlooking dynamic styles such as head movements. Moreover, most of these methods rely on a dual U-Net architecture, which preserves identity consistency but incurs additional computational overhead. To this end, we propose DiTalker, a unified DiT-based framework for speaking style controllable portrait animation. We design a Style-Emotion Encoding Module with two separate branches: a style branch that extracts style embeddings for head poses and movements, and an emotion branch that extracts emotion features. We further introduce an Audio-Style Fusion Module that decouples audio and speaking styles via two parallel cross-attention layers and uses these features to guide the animation process. To enhance the quality of the results, we introduce two additional optimization constraints: one to improve lip synchronization and the other to preserve fine-grained identity details. Extensive experiments demonstrate the superiority of DiTalker in terms of lip synchronization and speaking style controllability.
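For readers who want a concrete picture of the two-branch design described above, the following is a minimal PyTorch sketch of a Style-Emotion Encoding Module. The transformer-based branches, feature dimensions, and module names are assumptions made for illustration only and do not reflect the released implementation.

```python
import torch
import torch.nn as nn


def _branch(dim: int, num_layers: int = 2) -> nn.TransformerEncoder:
    # A small transformer encoder used as a stand-in for each branch (assumption).
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)


class StyleEmotionEncodingSketch(nn.Module):
    """Hypothetical two-branch encoder: a style branch for head poses and
    movements, and an emotion branch for emotion features (sketch only)."""

    def __init__(self, frame_dim: int = 512, dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(frame_dim, dim)   # lift style-frame features to model width
        self.style_branch = _branch(dim)        # -> style embeddings (c_s)
        self.emotion_branch = _branch(dim)      # -> emotion features (c_emo)

    def forward(self, style_frame_feats: torch.Tensor):
        # style_frame_feats: (B, T, frame_dim) per-frame features of the style frames
        x = self.proj(style_frame_feats)
        return self.style_branch(x), self.emotion_branch(x)


# Example: encode 16 style frames for a batch of 2 clips
c_s, c_emo = StyleEmotionEncodingSketch()(torch.randn(2, 16, 512))
```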

TL;DR: We propose DiTalker, a DiT-based model for portrait animation that achieves precise lip sync and dynamic style control through dedicated audio-style fusion, outperforming existing methods.

Method Overview


Overview of our proposed DiTalker, which consists of a Style-Emotion Encoding Module (SEEM), an Audio-Style Fusion Module (ASFM), and a DiT generation backbone. SEEM takes the style frames Vs and phonemes extracted from the driving audio a as inputs and extracts style features cs and emotion features cemo. ASFM takes cs and ca (extracted by the Audio Encoder) as inputs, injects this information into the DiT backbone through two cross-attention layers, and scales the outputs of the two attentions with sϕ and sα produced by the Scale Adapter in SEEM. The DiT generation backbone animates the Ref Image x (or Video Frames) based on the features provided by SEEM and ASFM. An Emotion Cross Attention layer is inserted after ASFM to enhance emotion control, with cemo serving as keys and values.
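To make the injection path in the figure concrete, here is a hypothetical PyTorch sketch of one DiT block: the two ASFM cross-attention outputs are scaled by sϕ and sα from the Scale Adapter, and an Emotion Cross Attention over cemo follows. The residual wiring, layer choices, and which scale pairs with which branch are our assumptions rather than the paper's code.

```python
import torch
import torch.nn as nn


class DiTBlockWithASFMSketch(nn.Module):
    """Assumed wiring of ASFM and Emotion Cross Attention inside one DiT block."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # ASFM: audio branch
        self.style_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # ASFM: style branch
        self.emo_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)    # Emotion Cross Attention
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, c_a, c_s, c_emo, s_phi, s_alpha):
        # x: (B, N, dim) image/video tokens; c_a, c_s, c_emo: conditioning sequences;
        # s_phi, s_alpha: (B, 1, 1) scales from the Scale Adapter in SEEM
        x = x + self.self_attn(x, x, x)[0]
        # ASFM: two parallel cross-attentions whose outputs are scaled before re-injection
        x = x + s_phi * self.audio_attn(x, c_a, c_a)[0] + s_alpha * self.style_attn(x, c_s, c_s)[0]
        # Emotion Cross Attention after ASFM, with c_emo as keys and values
        x = x + self.emo_attn(x, c_emo, c_emo)[0]
        return x + self.mlp(x)
```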

Video Results

Qualitative comparison with baseline methods on the HDTF test set.

Qualitative comparison with baseline methods on the Mix Emotion test set.

Qualitative results under background noise (e.g., crowd noise, street noise, and heavy rain).

Qualitative comparison results under accented driving audio (strong English and Chinese accents).

Qualitative results under highly emotional driving audio.

Homepage Template

If you want to use this fully responsive and easy-to-adapt homepage template, you can download it from the GitHub repository.