Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation

Published 1 May 2026 in cs.SD and eess.AS | (2605.00329v1)

Abstract: Autoregressive (AR) models with diffusion heads have recently achieved strong text-to-audio performance, yet their iterative decoding and multi-step sampling process introduce high-latency issues. To address this bottleneck, we propose a one-step sampling framework that combines an energy-distance training objective with representation-level distillation. An energy-scoring head maps Gaussian noise directly to audio latents in one step, eliminating the need for a costly recursive diffusion sampling process, while distillation from a masked autoregressive (MAR) text-to-audio model preserves the strong conditioning learned during diffusion training. On the AudioCaps benchmark, our method consistently outperforms prior one-step baselines such as ConsistencyTTA, SoundCTM, AudioLCM and AudioTurbo, on both objective and subjective metrics, while substantially narrowing the quality gap to AR diffusion systems with multi-step sampling. Compared to the state-of-the-art AR diffusion system, IMPACT, our approach achieves up to $8.5$x faster batch inference with highly competitive audio quality. These results demonstrate that combining energy-distance training with representation-level distillation provides an effective recipe for fast, high-quality text-to-audio synthesis.


Authors (11)

Summary

The paper introduces AUDIODEAR, which achieves one-step latent generation for text-to-audio synthesis using an energy-scoring objective and representation-level distillation from a diffusion-trained teacher.
It replaces iterative diffusion sampling with an energy-scoring head, trained under an energy-distance objective, that maps noise directly to audio latents, sharply reducing inference latency.
Empirical results on the AudioCaps and WavCaps benchmarks show up to 8.5x faster batch inference with only minor degradation in audio quality metrics.

Fast One-Step Text-to-Audio Generation via Energy-Scoring and Representation Distillation

Overview and Motivation

Autoregressive (AR) models with diffusion-based sampling have achieved state-of-the-art performance for text-to-audio (TTA) generation. However, the generative quality comes at the cost of substantial inference latency due to iterative decoding and multi-step sampling procedures. This latency bottleneck is especially problematic for real-time applications. The paper "Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation" (2605.00329) introduces AUDIODEAR, a framework that enables efficient, high-fidelity TTA synthesis with one-step latent generation and strong text conditioning.

AUDIODEAR integrates an energy-distance-based training objective with representation-level distillation from a diffusion-trained masked autoregressive (MAR) teacher model. The approach eliminates the need for recursive diffusion sampling and leverages contextual knowledge transfer to preserve semantic alignment and audio quality.

Methodology

Energy-Scoring Objective

The core innovation replaces the diffusion loss with an energy-distance objective, a statistical measure that quantifies the discrepancy between two distributions via expected pairwise Euclidean distances. The energy-scoring head maps Gaussian noise directly to audio latents, conditioned on a context embedding, so latent generation takes a single forward pass.

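Mathematically, the energy distance $D(P, Q)$ between the model distribution $P$ and the ground-truth distribution $Q$ is minimized to learn the latent mapping:

$D(P, Q) = 2\mathbb{E}\lVert X-Y\rVert - \mathbb{E}\lVert X-X'\rVert - \mathbb{E}\lVert Y-Y'\rVert$

where $X, X' \sim P$ and $Y, Y' \sim Q$ are independent samples and $\lVert\cdot\rVert$ is the Euclidean norm.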

Training estimates this objective with Monte Carlo sampling. At inference, the head generates each latent from the contextual representation and a single draw of Gaussian noise.
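The Monte Carlo estimate can be illustrated with a small NumPy sketch. It is not the paper's code: the function names are hypothetical, `energy_distance` is the plain plug-in estimator of the formula above, and `two_sample_energy_loss` assumes the common practice of drawing two model samples per target and dropping the data-only term, which does not depend on the model.

```python
import numpy as np

def mean_pairwise_distance(a, b):
    """Mean Euclidean distance over all pairs (a[i], b[j])."""
    diff = a[:, None, :] - b[None, :, :]          # shape (n, m, d)
    return float(np.linalg.norm(diff, axis=-1).mean())

def energy_distance(x, y):
    """Plug-in estimate of D(P, Q) from samples x ~ P and y ~ Q."""
    return (2.0 * mean_pairwise_distance(x, y)
            - mean_pairwise_distance(x, x)
            - mean_pairwise_distance(y, y))

def two_sample_energy_loss(x1, x2, y):
    """Training-style loss from two model samples per target (assumed form).

    x1, x2: independent model outputs for the same context, shape (batch, d)
    y:      ground-truth latents, shape (batch, d)
    The term E||Y - Y'|| is constant in the model parameters and is dropped.
    """
    fit = np.linalg.norm(x1 - y, axis=-1) + np.linalg.norm(x2 - y, axis=-1)
    spread = np.linalg.norm(x1 - x2, axis=-1)
    return float((fit - spread).mean())

rng = np.random.default_rng(0)
same_a = rng.standard_normal((200, 4))
same_b = rng.standard_normal((200, 4))
shifted = rng.standard_normal((200, 4)) + 3.0
# The estimate is small for matching distributions and large for shifted ones.
print(energy_distance(same_a, same_b), energy_distance(same_a, shifted))
```

The `spread` term is what separates this objective from a plain regression loss: it rewards the two model samples for differing, which discourages collapse onto a single output.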

Representation-Level Distillation

To further close the quality gap to multi-step diffusion models, AUDIODEAR adopts auxiliary distillation from a fixed diffusion-trained transformer (IMPACT). The student transformer, trained under the energy-distance objective, minimizes mean squared error between its layer-wise hidden representations and those of the teacher. This knowledge transfer enforces strong conditioning and semantic alignment.

The combined loss is:

$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{energy}} + \lambda \mathcal{L}_{\text{distill}}$

where $\mathcal{L}_{\text{energy}}$ is the energy-distance loss, $\mathcal{L}_{\text{distill}}$ is the contextual representation alignment loss, and $\lambda$ weights the distillation term.
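As a rough illustration of how the two terms combine, the sketch below computes a layer-wise MSE between student and teacher hidden states and adds it to an energy loss value. The function names, and the choice to average the MSE uniformly over layers, are assumptions made for illustration. The default weight of 1000 is the value the ablations report as best.

```python
import numpy as np

def distill_loss(student_hiddens, teacher_hiddens):
    """Average over layers of the MSE between matching hidden states."""
    per_layer = [np.mean((s - t) ** 2)
                 for s, t in zip(student_hiddens, teacher_hiddens)]
    return float(np.mean(per_layer))

def total_loss(energy_loss, student_hiddens, teacher_hiddens, lam=1000.0):
    """L_total = L_energy + lambda * L_distill."""
    return energy_loss + lam * distill_loss(student_hiddens, teacher_hiddens)

# Toy hidden states: 3 layers, 5 positions, hidden size 8.
teacher = [np.ones((5, 8)) for _ in range(3)]
student = [np.zeros((5, 8)) for _ in range(3)]
print(total_loss(0.5, student, teacher))   # 0.5 + 1000 * 1.0
```

In a real training loop the teacher states would come from the frozen teacher and carry no gradient; only the student is updated.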

Iterative Parallel Decoding

Inference uses iterative parallel decoding in an AR manner: at each decoding iteration, a subset of masked latents is replaced with predicted values, each generated in a single sampling step per position. This preserves the autoregressive structure of the sequence, which supports alignment and consistency, while avoiding the inner loop of diffusion steps at each decoding iteration.
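The decoding loop can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: `backbone` and `head` are hypothetical stand-ins for the transformer and the energy-scoring head, and the even split of positions across iterations with a random unmasking order is a simplification of whatever schedule the actual system uses.

```python
import numpy as np

def parallel_decode(num_positions, latent_dim, num_iters, backbone, head, rng):
    """Fill masked latents over num_iters iterations, one head call per position."""
    latents = np.zeros((num_positions, latent_dim))
    masked = np.ones(num_positions, dtype=bool)
    for it in range(num_iters):
        remaining = np.flatnonzero(masked)
        if remaining.size == 0:
            break
        context = backbone(latents, masked)            # one backbone pass per iteration
        # Split the remaining positions evenly over the remaining iterations.
        n_fill = int(np.ceil(remaining.size / (num_iters - it)))
        chosen = rng.choice(remaining, size=n_fill, replace=False)
        noise = rng.standard_normal((n_fill, latent_dim))
        latents[chosen] = head(context[chosen], noise)  # single step, no inner loop
        masked[chosen] = False
    return latents

def toy_backbone(latents, masked):
    """Stand-in context: current latents plus a mask flag per position."""
    flag = masked[:, None].astype(float)
    return np.concatenate([latents, flag], axis=1)

def toy_head(context, noise):
    """Stand-in one-step head: context-dependent offset plus scaled noise."""
    return context[:, :noise.shape[1]] + 0.1 * noise

out = parallel_decode(10, 4, 4, toy_backbone, toy_head, np.random.default_rng(0))
print(out.shape)
```

The cost per iteration is one backbone pass plus one head pass. A multi-step diffusion head would instead run its sampler many times at the point where `head` is called here.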

Architectural Details

Latent Representation: Mel spectrograms encoded by a VAE, then patched and augmented with text embeddings (Flan-T5 and, optionally, CLAP).
Backbone: IMPACT-Base 24-layer transformer (hidden dimension 768), with the energy-scoring head implemented as residual MLP blocks with adaptive layer normalization.
Distillation Teacher: IMPACT diffusion-based transformer, fixed during student training.
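To make the head description concrete, here is a NumPy sketch of one residual MLP block with adaptive layer normalization. Everything in it is an assumption: the shift/scale/gate modulation follows a common adaptive-LayerNorm pattern, and the class name, dimensions, and initialization are hypothetical. Noise is the main input and the context embedding is the conditioning signal, the configuration the ablations report as working best.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """LayerNorm over the last axis, without learned affine parameters."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def silu(x):
    return x / (1.0 + np.exp(-x))

class AdaLNResBlock:
    """Residual MLP block whose normalization is modulated by a condition."""

    def __init__(self, dim, cond_dim, rng):
        self.w_mod = rng.standard_normal((cond_dim, 3 * dim)) * 0.02
        self.w1 = rng.standard_normal((dim, dim)) * 0.02
        self.w2 = rng.standard_normal((dim, dim)) * 0.02

    def __call__(self, h, cond):
        # The condition produces a per-feature shift, scale, and residual gate.
        shift, scale, gate = np.split(silu(cond) @ self.w_mod, 3, axis=-1)
        y = layer_norm(h) * (1.0 + scale) + shift
        y = silu(y @ self.w1) @ self.w2
        return h + gate * y

rng = np.random.default_rng(0)
block = AdaLNResBlock(dim=8, cond_dim=16, rng=rng)
noise = rng.standard_normal((5, 8))      # main input
context = rng.standard_normal((5, 16))   # conditioning from the backbone
print(block(noise, context).shape)
```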

Empirical Evaluation

Performance and Latency

AUDIODEAR was evaluated on AudioCaps and WavCaps benchmarks, with a unified training set including AudioSet. The system consistently outperformed prior one-step baselines (ConsistencyTTA, SoundCTM, AudioLCM, AudioTurbo) on both objective metrics (Fréchet Distance, Fréchet Audio Distance, KL divergence, Inception Score, CLAP similarity) and subjective measures (text relevance, overall audio quality).

Compared to IMPACT, the state-of-the-art multi-step AR diffusion TTA system, AUDIODEAR achieves:

Up to 8.5x faster batch inference for 10-second audio generation
Only minor degradations in IS (≤8.6%) and CLAP (≤10.2%)
Significant narrowing of quality gap between one-step and multi-step models

Ablations

Increasing the distillation weight improves objective scores, which peak at $\lambda=1000$.
Two-sample estimation of the energy distance strikes the best balance; larger sample counts bring marginal fidelity gains but degrade semantic and diversity scores.
A Classifier-Free Guidance (CFG) scale of 4.0 gives the best semantic alignment and perceptual quality.
Configuring the energy-scoring head with noise as its main input consistently yields better performance.
Representation-level CFG is uniquely effective in one-step models because no error accumulates across sampling steps.
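One plausible reading of representation-level CFG is that the standard guidance extrapolation is applied to the context representations before the single head pass, rather than to the output of every sampling step. The sketch below shows that reading only. The function name and the toy head are hypothetical, and the paper may apply guidance differently.

```python
import numpy as np

def guided_context(z_cond, z_uncond, scale=4.0):
    """Standard CFG extrapolation, applied to context representations."""
    return z_uncond + scale * (z_cond - z_uncond)

def toy_head(context, noise):
    """Stand-in for the one-step energy-scoring head."""
    return context + 0.1 * noise

rng = np.random.default_rng(0)
z_cond = rng.standard_normal((5, 8))     # context computed with the text prompt
z_uncond = rng.standard_normal((5, 8))   # context computed with the prompt dropped
noise = rng.standard_normal((5, 8))

# Guidance is applied once, then the head runs a single time.
latents = toy_head(guided_context(z_cond, z_uncond, scale=4.0), noise)
print(latents.shape)
```

With `scale=1.0` the guided context equals the conditional one, and larger scales push further away from the unconditional context.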

Comparative Analysis with Alternative One-Step Methods

Toy experiments and benchmark comparisons indicate that AUDIODEAR's energy-scoring mechanism yields tighter alignment (lower MMD/WSD) to the data manifold than the competing Shortcut and MeanFlow approaches. The method also recovers the full spread of the target distributions, mitigating the coverage collapse common in one-step mappings.
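For readers unfamiliar with the metric, maximum mean discrepancy (MMD) between two sample sets can be estimated as below. This is a generic sketch with a Gaussian kernel and a fixed bandwidth, both chosen for illustration. The kernel and settings used in the paper's toy experiments are not given here.

```python
import numpy as np

def rbf_kernel_mean(a, b, sigma=1.0):
    """Mean Gaussian-kernel value over all pairs (a[i], b[j])."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return float(np.exp(-sq_dist / (2.0 * sigma ** 2)).mean())

def mmd_squared(x, y, sigma=1.0):
    """Biased estimate of squared MMD between samples x and y."""
    return (rbf_kernel_mean(x, x, sigma)
            + rbf_kernel_mean(y, y, sigma)
            - 2.0 * rbf_kernel_mean(x, y, sigma))

rng = np.random.default_rng(0)
target = rng.standard_normal((300, 2))
good_model = rng.standard_normal((300, 2))           # matches the target spread
collapsed_model = 0.1 * rng.standard_normal((300, 2))  # covers too little
print(mmd_squared(target, good_model), mmd_squared(target, collapsed_model))
```

A model whose samples collapse toward the center scores a clearly larger MMD than one that reproduces the target spread, which is the kind of coverage failure the comparison is meant to detect.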

Practical and Theoretical Implications

Practically, AUDIODEAR makes real-time and large-batch TTA synthesis more feasible by sharply reducing latency with only minor quality loss. The efficient one-step sampling can be deployed in interactive media, accessibility tools, and multimodal agents, where responsiveness is critical.

Theoretically, combining energy-distance minimization with contextual representation distillation advances the paradigm for rapid generative modeling across continuous modalities, creating new pathways to reconcile the speed-quality trade-off in AR-diffusion frameworks. This technique is transferable to other domains such as image, speech, and video, with potential for broader multimodal generative architectures.

Future Directions

The research identifies promising avenues for further latency reduction, particularly via minimizing AR decoding steps while maintaining perceptual and semantic fidelity. Exploration of more expressive conditional representations or enhanced distillation mechanisms may push one-step approaches closer to oracle multi-step diffusion.

Conclusion

AUDIODEAR demonstrates that one-step latent generation, empowered by energy-scoring and auxiliary distillation, provides an effective recipe for low-latency, high-quality TTA synthesis. The empirical results validate substantial efficiency gains and near-parity in quality to multi-step diffusion, offering practical deployment opportunities and theoretical advances for continuous generative modeling (2605.00329).
