End-to-End Training for Unified Tokenization and Latent Denoising

This presentation explores UNITE, a groundbreaking framework that challenges conventional multi-stage generative modeling by unifying tokenization and latent denoising into a single-stage architecture. Through weight-shared Generative Encoders trained from scratch without external supervision, UNITE achieves competitive performance with state-of-the-art diffusion models while dramatically simplifying the training pipeline and eliminating dependence on pretrained encoders. We examine the architectural innovation, empirical results across images and molecules, and the surprising representational alignment that emerges when reconstruction and generation objectives jointly shape a latent space.

Script

For decades, we've trained generative models in stages: first build a tokenizer, freeze it, then train a generator on top. But what if this separation is holding us back? What if tokenization and generation, trained together from scratch, could discover better latent spaces than either could alone?

The standard approach trains an autoencoder to compress data, freezes those latent representations, and only then fits a diffusion model. This means the generator never gets to influence what makes a good latent space. Worse, many methods rely on pretrained encoders requiring over 1000 times the compute just to initialize the pipeline.

UNITE eliminates this staged approach entirely through a weight-shared architecture.

The Generative Encoder operates in dual modes using the same parameters. As a tokenizer, it distills images into latents. As a generator, it refines noisy latents into clean ones. Both objectives sculpt the same latent space simultaneously, with gradients from reconstruction and denoising converging on representations that excel at both tasks.
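The dual-mode idea can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the UNITE implementation: the layer sizes, the separate input projections (`in_img`, `in_lat`), the linear decoder, and the linear noise interpolation are all hypothetical choices made only to show one set of trunk parameters serving both the tokenizer and the denoiser.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
D_IMG, D_LAT, D_HID = 48, 8, 32


def init_params():
    """Random parameters; 'trunk' and 'out_lat' are shared by both modes."""
    s = 0.1
    return {
        "in_img": rng.normal(0, s, (D_IMG, D_HID)),      # image input projection
        "in_lat": rng.normal(0, s, (D_LAT + 1, D_HID)),  # noisy latent + time projection
        "trunk": rng.normal(0, s, (D_HID, D_HID)),       # shared trunk weights
        "out_lat": rng.normal(0, s, (D_HID, D_LAT)),     # shared latent head
        "dec": rng.normal(0, s, (D_LAT, D_IMG)),         # toy linear decoder
    }


def shared_trunk(params, h):
    """The weight-shared body used in both tokenizer and generator mode."""
    h = np.tanh(h)
    h = np.tanh(h @ params["trunk"])
    return h @ params["out_lat"]


def tokenize(params, x):
    """Tokenizer mode: image -> clean latent."""
    return shared_trunk(params, x @ params["in_img"])


def denoise(params, z_noisy, t):
    """Generator mode: noisy latent (plus noise level t) -> clean latent estimate."""
    inp = np.concatenate([z_noisy, t[:, None]], axis=1)
    return shared_trunk(params, inp @ params["in_lat"])


def joint_loss(params, x):
    """Reconstruction loss plus denoising loss, computed in a single pass."""
    z = tokenize(params, x)
    loss_rec = np.mean((z @ params["dec"] - x) ** 2)

    # Assumed corruption: linear interpolation between latent and Gaussian noise.
    t = rng.uniform(0.0, 1.0, size=x.shape[0])
    eps = rng.normal(size=z.shape)
    z_noisy = (1.0 - t)[:, None] * z + t[:, None] * eps

    # The clean latent is used as a fixed target. NumPy has no autodiff, so the
    # copy only marks where a stop-gradient would sit in a real framework.
    z_target = z.copy()
    loss_den = np.mean((denoise(params, z_noisy, t) - z_target) ** 2)
    return loss_rec + loss_den, loss_rec, loss_den


params = init_params()
images = rng.normal(size=(4, D_IMG))
total, rec, den = joint_loss(params, images)
print(f"total={total:.4f} reconstruction={rec:.4f} denoising={den:.4f}")
```

In an autodiff framework, both loss terms would backpropagate into `trunk` and `out_lat`, which is the sense in which the two objectives shape the same latent space.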

Analyzing the model's internal representations reveals something remarkable: the activations produced by the tokenization pathway align almost perfectly with those produced during generation, particularly in deeper layers. When the model encodes an image and then denoises a corrupted version of that latent, the intermediate activations track each other closely. This alignment emerges naturally from joint training and proves essential for stable, high-quality generation. Crucially, removing the stop-gradient on clean latents disrupts this alignment, degrading both representation quality and denoising trajectories.
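One common way to quantify how closely two sets of activations track each other is linear centered kernel alignment (CKA). The script does not say which metric the authors used, so the sketch below is an assumption: it shows how such a layer-wise comparison could be computed, with random arrays standing in for real tokenizer-mode and denoiser-mode activations.

```python
import numpy as np


def linear_cka(a, b):
    """Linear CKA between two activation matrices of shape (n_samples, n_features).

    Returns a value in [0, 1]; 1 means the representations match up to
    rotation and isotropic scaling.
    """
    a = a - a.mean(axis=0, keepdims=True)  # center each feature
    b = b - b.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(a.T @ b, "fro") ** 2
    norm_a = np.linalg.norm(a.T @ a, "fro")
    norm_b = np.linalg.norm(b.T @ b, "fro")
    return cross / (norm_a * norm_b)


# Stand-in data: in practice these would be activations from the same layer,
# recorded once in tokenizer mode and once in denoiser mode.
rng = np.random.default_rng(0)
acts_tokenize = rng.normal(size=(200, 16))
acts_denoise = acts_tokenize + 0.1 * rng.normal(size=(200, 16))
print(linear_cka(acts_tokenize, acts_denoise))
```

Computing this score per layer, with and without the stop-gradient on clean latents, would be one way to reproduce the kind of comparison described above.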

UNITE matches state-of-the-art two-stage diffusion models on ImageNet generation while using fewer parameters and a single training run. The tokenizer achieves a reconstruction FID below 1 without any adversarial objectives or external supervision. Beyond images, the same architecture excels at molecular generation, reconstructing complex 3D molecular structures almost perfectly and outperforming methods that use separate variational autoencoders. Taken together, the image and molecule results suggest the approach is not tied to a single domain.

UNITE proves that the decades-old separation between tokenization and generation was never necessary. By letting both objectives shape a shared latent space from scratch, we can build simpler, more efficient generative models that work across domains without the baggage of pretrained encoders. Visit EmergentMind.com to learn more and create your own research videos.