FlashPortrait: 6× Faster Infinite Portrait Animation

This lightning talk explores FlashPortrait, a breakthrough in portrait animation that achieves 6× faster inference while maintaining perfect identity consistency across infinite-length videos. We examine the core innovation of adaptive latent prediction, which skips multiple denoising steps by predicting future states, and the normalized facial expression block that ensures facial features remain stable throughout extended animations. The presentation demonstrates how this approach transforms diffusion-based animation from a computationally expensive research curiosity into a practical tool for film production and virtual assistants.
Script
Current diffusion models can animate a portrait beautifully for a few seconds, but try to extend that to a full conversation and watch the person's face morph into someone else entirely. FlashPortrait solves this identity crisis while running 6 times faster than existing methods.
The problem runs deeper than just speed. When you generate a 30-second animation with current diffusion models, the person's face gradually drifts away from the original identity. Acceleration techniques like caching help with speed but make the identity problem even worse, because they skip the very steps that keep facial features consistent.
FlashPortrait attacks both problems simultaneously with a unified approach.
The normalized facial expression block embeds facial features directly into the diffusion process, creating a stable anchor that prevents identity drift. Meanwhile, adaptive latent prediction uses calculus to jump ahead in the denoising process. By computing higher-order derivatives of how latents evolve, the model predicts future states without running expensive diffusion steps, essentially fast-forwarding through the denoising trajectory.
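To make the idea of jumping ahead concrete, here is a minimal sketch of Taylor-style extrapolation from cached latents. It is an illustration under stated assumptions, not FlashPortrait's actual procedure: the function name `predict_latent`, the use of backward finite differences as stand-ins for derivatives, and the evenly spaced steps are all hypothetical choices made for this example.

```python
from math import factorial

import numpy as np


def predict_latent(history, steps_ahead=1, order=2):
    """Extrapolate a future latent from cached past latents.

    history: list of equally shaped arrays from evenly spaced
             denoising steps, oldest first (hypothetical cache layout).
    steps_ahead: how many steps to skip forward.
    order: highest difference order used in the expansion.
    """
    if len(history) < order + 1:
        raise ValueError("need at least order + 1 cached latents")

    # Keep only the most recent states needed for the requested order.
    recent = np.stack(history[-(order + 1):]).astype(float)
    prediction = recent[-1].copy()

    diffs = recent
    for k in range(1, order + 1):
        # k-th backward difference approximates the k-th derivative
        # (step size 1); it is an estimate, not an exact derivative.
        diffs = np.diff(diffs, axis=0)
        prediction += (steps_ahead ** k / factorial(k)) * diffs[-1]
    return prediction


# Usage: three cached latents on a straight-line trajectory.
cached = [np.full((2, 2), float(t)) for t in range(3)]
skipped = predict_latent(cached, steps_ahead=2, order=2)
```

In this sketch the extrapolation is exact only when the latents move along a straight line; for curved trajectories the higher-order terms reduce the error but do not remove it, which is why a real system would need some rule for deciding how far it is safe to skip.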
The architecture reveals how these pieces fit together. Facial features from the reference image flow through specialized encoders into each block of the diffusion transformer. The sliding window approach processes overlapping segments, but here is the key innovation: instead of running full denoising for each new frame, the system predicts future latents from cached historical states. A weighted blending strategy smooths transitions between windows, ensuring that even as the model skips ahead, identity features remain locked in place.
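The weighted blending between overlapping windows can be pictured with a simple cross-fade. The sketch below assumes a linear ramp over the overlapping frames; the talk does not say which weighting FlashPortrait uses, so the ramp, the function name `blend_windows`, and the `overlap` parameter are illustrative assumptions only.

```python
import numpy as np


def blend_windows(windows, overlap):
    """Stitch windows of shape (frames, ...) into one sequence.

    Consecutive windows are assumed to share `overlap` frames, which
    are mixed with a linear cross-fade (a hypothetical weighting).
    """
    if overlap < 1:
        raise ValueError("overlap must be at least 1 frame")

    out = np.asarray(windows[0], dtype=float)
    # Weights for the incoming window, strictly between 0 and 1.
    ramp = np.linspace(0.0, 1.0, overlap + 2)[1:-1]

    for nxt in windows[1:]:
        nxt = np.asarray(nxt, dtype=float)
        if len(nxt) <= overlap or len(out) < overlap:
            raise ValueError("each window must be longer than the overlap")
        # Reshape the ramp so it broadcasts over the non-frame axes.
        w = ramp.reshape(-1, *([1] * (nxt.ndim - 1)))
        mixed = (1.0 - w) * out[-overlap:] + w * nxt[:overlap]
        out = np.concatenate([out[:-overlap], mixed, nxt[overlap:]], axis=0)
    return out


# Usage: two 4-frame windows sharing 2 frames yield 6 blended frames.
first = np.zeros((4, 1))
second = np.ones((4, 1))
video = blend_windows([first, second], overlap=2)
```

The older window dominates at the start of the overlap and the newer one at the end, so the handoff between windows has no abrupt jump even when the two windows disagree slightly.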
The results are striking. FlashPortrait generates animations that maintain perfect facial identity across arbitrarily long sequences while running at speeds that make real-time interaction feasible. Benchmark comparisons show it outperforming models like Wan-Animate and FantasyPortrait on both consistency metrics and inference speed. This transforms portrait animation from a computationally expensive research demonstration into something you could actually deploy in production.
By predicting the future instead of computing it step by step, FlashPortrait proves that you do not have to choose between speed and quality. Visit EmergentMind.com to explore this paper in depth and create your own research videos.