OmniTransfer: Unified Spatio-temporal Video Transfer
An overview of OmniTransfer, a framework that unifies appearance and temporal video transfer tasks using reference videos. It introduces novel mechanisms like Task-aware Positional Bias and Reference-decoupled Causal Learning to achieve high consistency without external priors.
Script
Have you ever tried to generate a video that captures *both* the specific artistic style of one clip and the complex camera motion of another? Usually, AI models force you to pick just one. Today we are exploring a paper that breaks this compromise by treating video transfer as a single, unified problem.
Getting there requires solving two distinct problems that usually fight each other. On one side, appearance transfer often relies on static images, missing the multi-angle consistency needed for video. On the other, temporal transfer usually demands rigid constraints like skeletons or expensive retraining for every new effect.
The researchers propose a clever workaround: they convert these temporal problems into spatial ones. Their key insight is that diffusion models maintain consistency better when they process reference frames as if they were spatially adjacent context, rather than just sequential history. This allows the model to reposition reference data using embeddings to suit the specific task.
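To make the repositioning idea concrete, here is a minimal sketch of how reference tokens could be given shifted position indices depending on the task. Everything in it is an illustrative assumption rather than the paper's implementation: the function names `grid_positions` and `reference_offset`, the (frame, row, column) index layout, and in particular the choice of which axis is shifted for which task.

```python
from typing import List, Tuple

Position = Tuple[int, int, int]  # (frame, row, column) index of one token


def grid_positions(frames: int, height: int, width: int,
                   offset: Position = (0, 0, 0)) -> List[Position]:
    """Enumerate token positions of a video grid, shifted by `offset`."""
    dt, dh, dw = offset
    return [(t + dt, h + dh, w + dw)
            for t in range(frames)
            for h in range(height)
            for w in range(width)]


def reference_offset(task: str, frames: int, width: int) -> Position:
    """Pick a positional shift for the reference video (hypothetical rule).

    Assumption: for temporal tasks the reference sits beside the target
    (same frame indices, columns shifted past the target), so it acts as
    spatially adjacent context. For appearance tasks it is shifted along
    the time axis instead.
    """
    if task == "temporal":
        return (0, 0, width)
    if task == "appearance":
        return (frames, 0, 0)
    raise ValueError(f"unknown task: {task}")


# Target and reference clips of 2 frames, 1 row, 2 columns each.
target = grid_positions(2, 1, 2)
ref_temporal = grid_positions(2, 1, 2, reference_offset("temporal", 2, 2))
ref_appearance = grid_positions(2, 1, 2, reference_offset("appearance", 2, 2))
```

In this toy layout, the temporal-task reference shares frame indices with the target but occupies different columns, while the appearance-task reference occupies later frame indices. In a real model these indices would feed a positional embedding; that step is omitted here.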
To implement this, the authors introduce three architectural choices. First, they decouple the reference and target streams to stop the model from simply copy-pasting pixels. Then, they use a Task-aware Positional Bias to tell the model whether to focus on style or motion. Finally, they use a multimodal language model to align the semantics across inputs.
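One way to picture the decoupling of the two streams is as a one-directional attention mask. The sketch below assumes that reference tokens attend only to other reference tokens, while target tokens may attend to both. That reading of "decoupled" and "causal", the helper name `decoupled_mask`, and the token ordering (reference first, then target) are assumptions made for illustration, not details confirmed by the paper.

```python
from typing import List


def decoupled_mask(num_ref: int, num_tgt: int) -> List[List[bool]]:
    """Build a boolean attention mask; mask[q][k] is True if query q may see key k.

    Tokens 0..num_ref-1 are reference tokens, the rest are target tokens.
    Assumed rule: reference queries see only reference keys, target
    queries see everything.
    """
    total = num_ref + num_tgt
    mask = []
    for q in range(total):
        q_is_ref = q < num_ref
        row = []
        for k in range(total):
            k_is_ref = k < num_ref
            # Reference keys are visible to everyone; target keys
            # are visible only to target queries.
            row.append(k_is_ref or not q_is_ref)
        mask.append(row)
    return mask


mask = decoupled_mask(num_ref=2, num_tgt=3)
allowed_pairs = sum(sum(row) for row in mask)
```

With 2 reference and 3 target tokens, this mask permits 19 of the 25 query-key pairs that full attention would compute, because the reference stream never looks at the target.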
You can see the impact of these components in this ablation comparison. While a standard baseline struggles with motion blur and identity drift, adding the positional bias stabilizes the movement, and the decoupled learning mechanism cleans up the artifacts, resulting in high-fidelity transfer without the computational cost of full attention.
By unifying these tasks into one framework, OmniTransfer achieves state-of-the-art results in ID, style, and motion transfer simultaneously, without needing external control signals. For a deeper look at their custom dataset and causal attention mechanisms, check out the full paper on EmergentMind.com.