Papers
Topics
Authors
Recent
Search
2000 character limit reached

CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism

Published 16 Apr 2026 in cs.DC | (2604.14561v2)

Abstract: Diffusion Transformers (DiTs) are increasingly adopted in scientific computing, yet growing model sizes and resolutions make distributed multi-GPU inference essential. Ulysses sequence parallelism scales DiT inference but introduces frequent all-to-all collectives that dominate latency. Overlapping these with computation is difficult due to tight data dependencies, large message volumes, and asymmetric interconnect bandwidths. We introduce CoCoDiff, a distributed DiT inference engine exploiting two observations: (1) V requires only linear projection while Q/K need additional normalization and RoPE, creating opportunities to overlap V's communication with Q/K computation; (2) adjacent denoising steps produce similar tensors, yielding temporal redundancy. CoCoDiff introduces three mechanisms: Tile-Aware Parallel All-to-all (TAPA) decomposes collectives into topology-aligned phases; V-First scheduling hides V's communication behind Q/K computation; and V-Major selective communication transmits only active projections on slow interconnects. On the Aurora supercomputer with four DiT models across 1-8 nodes (up to 96 Intel GPU tiles), CoCoDiff achieves an average speedup of 3.6x, peaking at 8.4x.

Summary

  • The paper introduces CoCoDiff, which co-optimizes communication scheduling and application semantics to address performance bottlenecks in distributed Diffusion Transformer inference.
  • It details novel mechanisms including TAPA, V-First Scheduling, and V-Major Selective Communication that leverage hardware heterogeneity and temporal redundancy.
  • Empirical results on models like Qwen-Image demonstrate an average speedup of 3.6× and up to 8.4× at high-resolution, while preserving output quality.

Optimizing Collective Communication for Distributed Diffusion Transformer Inference: A Technical Analysis of CoCoDiff

Introduction and Motivation

Diffusion Transformers (DiTs) have become standard in high-resolution generative modeling, both in industrial and scientific domains. Scaling DiT inference to larger resolutions and models necessitates distributed multi-GPU execution. Ulysses sequence parallelism is the de facto approach for distributing the token sequence among compute devices; however, this paradigm introduces a critical performance bottleneck: frequent and high-volume all-to-all collective communications for the Q/K/V tensors in attention layers. Profiling large models such as Qwen-Image (20B) on the Aurora supercomputer reveals that all-to-all collectives can account for 64–96% of total inference time, especially as image resolution increases. Figure 1

Figure 1: Communication overhead in Qwen-Image (20B) inference on Aurora, showing all-to-all collectives dominate inference time as resolution scales.

Conventional strategies to overlap communication with computation are hampered by tight intra-layer and inter-layer dependencies, asynchronous compute/comm progression mismatches, and constrained memory hierarchies. These challenges motivate the design of CoCoDiff, a distributed inference engine for DiTs that explicitly co-optimizes communication scheduling and application semantics to minimize latency under Ulysses sequence parallelism (2604.14561).

CoCoDiff Design: Principles and Mechanisms

CoCoDiff’s approach is predicated on two domain-specific insights:

  1. QKV Asymmetry in Attention: Only the value projection (V) is a pure linear mapping, while the queries (Q) and keys (K) incur extra computational steps (normalization and rotary position encoding). This enables early initiation and possible overlapping of V communication with Q/K computation.
  2. Temporal Redundancy in Diffusion Denoising: Consecutive denoising steps often yield similar intermediate representations, creating opportunities for selective communication—only transmitting those projections that have significantly changed.

Building on these, CoCoDiff implements three core mechanisms:

Tile-Aware Parallel All-to-All (TAPA)

Standard collective libraries fail to exploit the non-uniform, hierarchical bandwidths of contemporary supercomputers (e.g., Intel Aurora’s tile, GPU, and interconnect bandwidths). TAPA decomposes the all-to-all into two explicit phases: (i) intra-GPU (high bandwidth, fully parallel tile exchanges), and (ii) inter-GPU (lower bandwidth, parallel group communications aligned with the physical topology). Figure 2

Figure 2: Tile-aware parallel all-to-all: Phase separation exploits Aurora’s bandwidth hierarchy for maximal throughput.

V-First Scheduling

CoCoDiff leverages the QKV computation asymmetry to commence communication of V as soon as its projection is complete, overlapping V’s Phase 1 communication (intra-GPU) with the subsequent Q/K normalization and RoPE operations. The scheduling is proved beneficial whenever the Q/K computation exceeds the intra-GPU comm time—a property that holds across current DiT architectures and working batch sizes. Figure 3

Figure 3: QKV processing in attention; only Q/K undergo extra normalization and RoPE.

Figure 4

Figure 4: Execution timeline comparison—(c) V-first scheduling allows hiding of Phase~1 comm within Q/K’s normalization and RoPE.

V-Major Selective Communication

Exploiting temporal redundancy, V-Major identifies stable tokens by comparing value projections sequentially across denoising steps. Only “active” projections (determined using the L1L_1 norm difference and a dynamic cache ratio) are communicated in the inter-GPU phase; the rest are cached and reused. This selective strategy drastically reduces bandwidth requirements without introducing material accuracy regression. Figure 5

Figure 5: CoCoDiff overview, showing joint design of TAPA, V-First, and V-Major components.

Evaluation and Strong Numerical Results

Empirical analysis on the Aurora supercomputer using FLUX, SD3.5, Qwen-Image, and FLUX.2-dev models demonstrates that CoCoDiff outperforms flat baselines—delivering a mean end-to-end speedup of 3.6×, with a maximum of 8.4× at high-resolution, large-node configurations. TAPA alone yields 2.3×–2.5×, but the incremental benefit of V-First and V-Major only manifests at multi-node scales due to increased relative all-to-all overhead. Figure 6

Figure 6

Figure 6: End-to-end speedups of TAPA and CoCoDiff for various models and resolutions. Substantial and scalable gains inherent to CoCoDiff’s selectivity at scale.

Figure 7

Figure 7: Component ablation on SD3.5: gain from each CoCoDiff subsystem (TAPA, V-First, V-Major) at different scales.

Crucially, the quality of model outputs (e.g., inpainting PSNR and SSIM) is statistically indistinguishable between the flat, TAPA, and CoCoDiff variants. This is further visualized on large-scale medical and scientific datasets, confirming negligible degradation from projection caching. Figure 8

Figure 8: Example inpainting outputs from FLUX.1-dev, showing original, masked, restored, and error-mapped images.

Trade-off curves further illustrate that adaptive, time-varying cache ratios strictly outperform fixed policies in communication/quality space: aggressive caching can be safely scheduled in late denoising steps to maximize speedup without observable accuracy loss. Figure 9

Figure 9

Figure 9: Inpainting quality (PSNR/SSIM) vs. speedup: time-varying cache ratio dominates the Pareto frontier.

Communication Microbenchmarks

TAPA’s hierarchical scheduling alone reduces all-to-all latency over oneCCL or flat baselines by up to 3.9× for large messages. Hierarchy-aligned execution consistently realizes higher observed effective bandwidths, especially as token sequence lengths scale quadratically with resolution. Figure 10

Figure 10: All-to-all communication latency per TAPA phase versus standard oneCCL/flat collectives on a 12-tile Aurora node.

Implications and Prospects

CoCoDiff marks a shift from protocol-agnostic collectives to application-semantics-aware distributed inference. Its explicit co-design of scheduling, selection, and hardware topology is particularly salient for hardware with significant interconnect heterogeneity; similar optimizations could be extended to other hierarchical-GPU clusters (e.g., upcoming datacenter fabrics with CXL, NVLink).

The selective caching principle, proven effective for inference, may generalize to other iterative, temporally coherent workloads (e.g., video diffusion, sequential RL). The demonstrated decoupling of runtime communication costs from full model size removes key barriers to scaling next-generation DiT-based synthesis and analysis pipelines in science and industry.

Theoretical Grounding

  • Communication complexity under Ulysses/SP increases as O(N2)O(N^2), where NN is the token sequence length; computation grows only linearly. CoCoDiff’s selective step transforms this to O(rN2)O(rN^2), where rr is the cache ratio, yielding order-of-magnitude theoretical bandwidth savings in late-stage denoising.
  • The approach leverages cross-step representation similarity, orthogonal to single-device compute caching [ma2024deepcache, chen2024deltadit], yet uniquely adapted to reduce inter-device traffic under global layout constraints.

Conclusion

CoCoDiff establishes that coordinated enhancements across the communication stack and inference schedule can fundamentally shift the cost profiles of distributed diffusion model inference under sequence parallelism. The strong speedups, coupled with preserved output fidelity and hardware-agnostic adaptability, point to its viability as a foundational layer for future extreme-scale generative AI deployments.

Reference: (2604.14561)

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.