Portrait animation aims to synthesize talking videos from a static reference face, conditioned on audio and style frame cues (\eg, emotion and head poses), while ensuring precise lip synchronization and faithful reproduction of speaking styles. Existing diffusion-based portrait animation methods mainly focus on lip synchronization or static emotion transformation, often overlooking dynamic styles such as head movements. Moreover, most of these methods rely on a dual U-Net architecture, which preserves identity consistency but incurs additional computational overhead. To address these limitations, we propose DiTalker, a unified DiT-based framework for speaking-style-controllable portrait animation. We design a Style-Emotion Encoding Module with two separate branches: a style branch that extracts style embeddings for head poses and movements, and an emotion branch that extracts emotion features. We further introduce an Audio-Style Fusion Module that decouples audio and speaking styles via two parallel cross-attention layers, using these features to guide the animation process. To enhance generation quality, we additionally impose two optimization constraints: one improves lip synchronization and the other preserves fine-grained identity details. Extensive experiments demonstrate the superiority of DiTalker in terms of lip synchronization and speaking style controllability.
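For intuition, a minimal PyTorch sketch of a two-branch encoder in the spirit of the Style-Emotion Encoding Module is shown below: one branch maps the style frames to a style embedding for head poses and movements, the other to an emotion embedding. The class name, layer choices, and feature dimensions are simplified placeholders, and the sketch omits details of the full module such as the phoneme input and the Scale Adapter described in the overview below.

```python
# A simplified two-branch style/emotion encoder sketch. Layer choices,
# dimensions, and the class name are illustrative placeholders only.
import torch.nn as nn

def conv3d_branch(dim):
    # Small 3D-conv encoder that pools a video clip into a single embedding.
    return nn.Sequential(
        nn.Conv3d(3, 64, kernel_size=3, stride=2, padding=1),
        nn.GELU(),
        nn.Conv3d(64, dim, kernel_size=3, stride=2, padding=1),
        nn.AdaptiveAvgPool3d(1),
        nn.Flatten(),
    )

class StyleEmotionEncoderSketch(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.style_branch = conv3d_branch(dim)    # -> style embedding c_s
        self.emotion_branch = conv3d_branch(dim)  # -> emotion embedding c_emo

    def forward(self, style_frames):
        # style_frames: (B, 3, T, H, W) style video clip V_s
        c_s = self.style_branch(style_frames)
        c_emo = self.emotion_branch(style_frames)
        return c_s, c_emo
```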
TL;DR: We propose DiTalker, a DiT-based model for portrait animation that achieves precise lip sync and dynamic style control through dedicated audio-style fusion, outperforming existing methods.
Overview of our proposed DiTalker, which consists of a Style-Emotion Encoding Module (SEEM), an Audio-Style Fusion Module (ASFM), and a DiT generation backbone. SEEM takes style frames $V_s$ and phonemes extracted from the driving audio $a$ as inputs, and extracts style features $c_s$ and emotion features $c_{emo}$. ASFM takes $c_s$ and $c_a$ (extracted by the Audio Encoder) as inputs, injects them into the DiT backbone through two parallel cross-attention layers, and scales the outputs of the two attentions with $s_\phi$ and $s_\alpha$ produced by the Scale Adapter in SEEM. The DiT backbone animates the reference image $x$ (or video frames) based on the features provided by SEEM and ASFM. An Emotion Cross Attention layer is inserted after ASFM to enhance emotion control, with $c_{emo}$ serving as keys and values.
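The sketch below illustrates, in PyTorch, how this conditioning could be injected into a single DiT block: two parallel cross-attention layers over the audio features $c_a$ and style features $c_s$, their outputs scaled by $s_\phi$ and $s_\alpha$, followed by an emotion cross-attention that uses $c_{emo}$ as keys and values. The class name, feature dimensions, normalization placement, and the use of `nn.MultiheadAttention` are simplifications for illustration and do not reflect the full implementation.

```python
# Simplified ASFM-style conditioning inside one DiT block:
# parallel audio/style cross-attentions with learned scales, then
# an emotion cross-attention. All hyperparameters are placeholders.
import torch.nn as nn

class AudioStyleFusionBlockSketch(nn.Module):
    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.style_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.emo_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, c_a, c_s, c_emo, s_phi, s_alpha):
        # x: (B, N, dim) DiT tokens
        # c_a, c_s, c_emo: (B, L, dim) audio / style / emotion condition tokens
        # s_phi, s_alpha: (B, 1, 1) scales from the Scale Adapter in SEEM
        h = self.norm1(x)
        audio_out, _ = self.audio_attn(h, c_a, c_a)   # audio cross-attention
        style_out, _ = self.style_attn(h, c_s, c_s)   # style cross-attention
        x = x + s_phi * audio_out + s_alpha * style_out
        emo_out, _ = self.emo_attn(self.norm2(x), c_emo, c_emo)  # emotion cross-attention
        return x + emo_out
```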