vLLM-Omni: Disaggregated Any-to-Any Multimodal Serving

This presentation explores vLLM-Omni, a serving system that handles any-to-any multimodal models through fully disaggregated stage execution. We'll examine how it represents complex inference pipelines as stage graphs, so that each component can be optimized independently, and how it reduces job completion time by up to 91% compared to traditional serving approaches.

Script

What happens when you try to serve a model that can take speech, generate text, create images, and synthesize audio all in one pipeline? Current serving systems buckle under this complexity, forcing developers into painful manual orchestration that kills performance.

The core challenge emerges from a fundamental mismatch. These sophisticated models combine autoregressive language model stages with diffusion transformers and other specialized components, but existing serving frameworks are built for step-centric, single-modality workflows.

The authors introduce a radically different approach called vLLM-Omni that embraces this complexity rather than fighting it.

Instead of forcing everything through a single serving paradigm, vLLM-Omni represents the entire inference pipeline as a stage graph where each node can be independently optimized and scaled.
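To make the stage-graph idea concrete, here is a minimal sketch in Python. It is not vLLM-Omni's actual API: the `Stage` class, its fields, the engine labels, and the stage names are all hypothetical and chosen only for illustration. It shows a pipeline described as nodes with dependencies, each carrying its own engine type and replica count, plus a valid execution order derived from the graph.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Stage:
    """One node of the stage graph (hypothetical structure, for illustration)."""
    name: str
    engine: str                                       # assumed label, e.g. "autoregressive"
    inputs: list[str] = field(default_factory=list)   # names of upstream stages
    replicas: int = 1                                 # each stage can be scaled on its own


def execution_order(stages: list[Stage]) -> list[str]:
    """Return stage names so that every stage appears after its inputs."""
    graph = {stage.name: set(stage.inputs) for stage in stages}
    return list(TopologicalSorter(graph).static_order())


# An illustrative speech-in, audio-out pipeline (stage names are assumptions).
PIPELINE = [
    Stage("audio_encoder", "encoder"),
    Stage("thinker", "autoregressive", inputs=["audio_encoder"]),
    Stage("talker", "autoregressive", inputs=["thinker"], replicas=2),
    Stage("vocoder", "diffusion", inputs=["talker"]),
]

ORDER = execution_order(PIPELINE)
print(ORDER)
```

Because each node records its own engine and replica count, one stage can be swapped or scaled without touching the rest of the graph.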

This disaggregated approach fundamentally changes how we think about serving. Rather than cramming heterogeneous computation into a single framework, each stage gets the execution engine that fits its specific needs.

The architecture reveals how elegant this abstraction becomes in practice. An orchestrator coordinates the entire pipeline while independent engines handle autoregressive language model stages and diffusion transformer stages separately, connected through a unified data transfer mechanism that can handle everything from embeddings to audio waveforms.
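The division of labor between the orchestrator and the engines can be sketched roughly as follows. This is a toy model under stated assumptions, not the system's real implementation: the engine functions, the `ENGINES` registry, and the payload format are invented for illustration. The point is only that the orchestrator routes each stage to the engine registered for its type and hands the output to the next stage.

```python
def autoregressive_engine(payload: list[str]) -> list[str]:
    # Stand-in for a token-by-token language model stage (hypothetical).
    return payload + ["text_tokens"]


def diffusion_engine(payload: list[str]) -> list[str]:
    # Stand-in for an iterative denoising stage (hypothetical).
    return payload + ["audio_waveform"]


# Assumed registry: one independent engine per stage type.
ENGINES = {
    "autoregressive": autoregressive_engine,
    "diffusion": diffusion_engine,
}


def orchestrate(pipeline: list[tuple[str, str]], request: list[str]):
    """Run stages in order, dispatching each to the engine for its type."""
    payload = list(request)
    trace = []
    for stage_name, engine_kind in pipeline:
        payload = ENGINES[engine_kind](payload)
        trace.append(f"{stage_name}:{engine_kind}")
    return payload, trace


OUTPUT, TRACE = orchestrate(
    [("thinker", "autoregressive"), ("vocoder", "diffusion")],
    ["speech_input"],
)
print(OUTPUT, TRACE)
```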

Let's examine how the authors made this vision concrete.

Each stage gets its own scheduler, memory manager, and execution engine optimized for its specific computational pattern. Autoregressive stages leverage vLLM's proven token-by-token generation optimizations, while diffusion stages get specialized denoising acceleration.

The unified connector handles the complex choreography of moving intermediate data between stages, from small control messages to large tensor transfers, with optimizations for both single-node shared memory and distributed deployments.
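One way to picture the connector's job is as a routing decision per transfer. The sketch below is an assumption-laden illustration, not the connector's real logic: the 64 KiB threshold, the `Transfer` type, and the transport names are all made up. It sends small control messages inline, uses shared memory for large payloads on the same node, and falls back to the network across nodes.

```python
from dataclasses import dataclass

# Assumed cutoff between "control message" and "large tensor" (illustrative only).
SMALL_PAYLOAD_BYTES = 64 * 1024


@dataclass(frozen=True)
class Transfer:
    """Description of one stage-to-stage handoff (hypothetical type)."""
    src_node: str
    dst_node: str
    nbytes: int


def choose_transport(transfer: Transfer) -> str:
    """Pick a transport label for a transfer under the assumptions above."""
    if transfer.nbytes <= SMALL_PAYLOAD_BYTES:
        return "inline_message"      # small payloads travel with the control message
    if transfer.src_node == transfer.dst_node:
        return "shared_memory"       # large payload, both stages on one node
    return "network"                 # large payload, stages on different nodes


print(choose_transport(Transfer("node0", "node0", 1024)))
print(choose_transport(Transfer("node0", "node0", 10 * 1024 * 1024)))
print(choose_transport(Transfer("node0", "node1", 10 * 1024 * 1024)))
```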

The system integrates cutting-edge optimizations throughout the pipeline. Diffusion stages benefit from attention acceleration and intelligent caching, while streaming enables downstream stages to begin processing before upstream stages complete.

The performance improvements prove this approach delivers on its promise.

The authors evaluated across multiple any-to-any models and datasets, comparing against the standard implementations that developers would naturally reach for when building these systems.

For Qwen2.5-Omni, the improvements are substantial, but they are only the beginning of what disaggregated serving enables.

The Qwen3-Omni results show the full potential of this approach, with job completion time reduced by more than 90% and speedups approaching an order of magnitude for individual stages.
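A percentage reduction and a speedup factor describe the same change in different units, and the conversion is simple arithmetic. The helper below is just that arithmetic, not anything taken from the authors' evaluation code; the function name is ours. If completion time drops by a fraction r, the new time is (1 - r) of the old one, so the speedup is 1 / (1 - r).

```python
def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional time reduction (0 <= r < 1) into a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


# The up-to-91% reduction in job completion time quoted for vLLM-Omni.
END_TO_END = speedup_from_reduction(0.91)
print(f"{END_TO_END:.1f}x")
```

Under this conversion, a 91% reduction corresponds to roughly an 11x end-to-end speedup, while a 50% reduction would be only 2x.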

The benefits extend across modalities, with particularly large gains for audio synthesis, where the specialized optimizations compound.

Crucially, the data transfer overhead remains minimal compared to the tens of seconds required for full pipeline execution, validating that disaggregation doesn't introduce prohibitive communication costs.

Several key technical insights emerge from this work.

The fundamental insight is that different stages of any-to-any models have radically different computational patterns, and trying to serve them uniformly creates inevitable compromises that hurt overall performance.

The streaming capabilities unlock pipeline-level parallelism where downstream stages can begin processing partial outputs, dramatically reducing the perceived latency for interactive applications.
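The streaming handoff can be illustrated with plain Python generators. This is a single-threaded toy, not the system's streaming mechanism, and the function names and chunk format are invented. It shows only the ordering property that matters here: the downstream stage starts consuming the first chunk before the upstream stage has produced the last one. In a real deployment the two stages would also run concurrently on separate engines.

```python
from typing import Iterable, Iterator


def upstream(n_chunks: int, log: list[str]) -> Iterator[str]:
    """Hypothetical upstream stage that emits partial outputs one at a time."""
    for i in range(n_chunks):
        log.append(f"produce {i}")
        yield f"chunk-{i}"


def downstream(chunks: Iterable[str], log: list[str]) -> list[str]:
    """Hypothetical downstream stage that processes each chunk on arrival."""
    results = []
    for i, chunk in enumerate(chunks):
        log.append(f"consume {i}")
        results.append(chunk.upper())
    return results


LOG: list[str] = []
RESULT = downstream(upstream(3, LOG), LOG)
print(LOG)
```

The log interleaves production and consumption, so the first downstream result is available after one chunk rather than after the whole upstream output.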

This work establishes a new paradigm for serving complex AI systems that extends far beyond current any-to-any models to future architectures that combine diverse computational patterns.

vLLM-Omni demonstrates that the future of AI serving lies not in forcing complex pipelines into simple abstractions, but in embracing their complexity through intelligent disaggregation. To explore more cutting-edge research in AI systems and optimization, visit EmergentMind.com.