- The paper introduces AUDIODEAR, which achieves one-step latent generation for text-to-audio synthesis by combining an energy-scoring objective with representation-level distillation from a diffusion-trained teacher.
- It replaces iterative diffusion sampling with an energy-scoring head, trained under an energy-distance objective, that maps noise directly to audio latents and so sharply reduces inference latency.
- Empirical results on the AudioCaps and WavCaps benchmarks show up to 8.5x faster inference than the multi-step IMPACT baseline, with IS and CLAP scores degrading by at most 8.6% and 10.2%, respectively.
Fast One-Step Text-to-Audio Generation via Energy-Scoring and Representation Distillation
Overview and Motivation
Autoregressive (AR) models with diffusion-based sampling have achieved state-of-the-art performance for text-to-audio (TTA) generation. However, the generative quality comes at the cost of substantial inference latency due to iterative decoding and multi-step sampling procedures. This latency bottleneck is especially problematic for real-time applications. The paper "Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation" (2605.00329) introduces AUDIODEAR, a framework that enables efficient, high-fidelity TTA synthesis with one-step latent generation and strong text conditioning.
AUDIODEAR integrates an energy-distance-based training objective with representation-level distillation from a diffusion-trained masked autoregressive (MAR) teacher model. The approach eliminates the need for recursive diffusion sampling and leverages contextual knowledge transfer to preserve semantic alignment and audio quality.
Methodology
Energy-Scoring Objective
The core innovation is replacing the diffusion loss with an energy-distance objective, a statistical measure that quantifies the discrepancy between two distributions through expected pairwise Euclidean distances. The energy-scoring head maps Gaussian noise directly to audio latents, conditioned on a context embedding, so each latent is generated in a single forward pass.
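Formally, training minimizes the energy distance D(P, Q) between the model distribution P and the ground-truth latent distribution Q:
$$D(P, Q) = 2\,\mathbb{E}\lVert X - Y \rVert - \mathbb{E}\lVert X - X' \rVert - \mathbb{E}\lVert Y - Y' \rVert$$
where $X, X'$ are independent samples from P, $Y, Y'$ are independent samples from Q, and $\lVert \cdot \rVert$ is the Euclidean norm.
Training estimates this objective with Monte Carlo sampling; during inference, the head generates each latent from the contextual representation and a noise sample.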
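To make the Monte Carlo estimate concrete, the following is a minimal sketch rather than the paper's implementation. The function names are hypothetical. `energy_distance` is a plug-in estimate of D(P, Q) from two sample sets, and `energy_loss` assumes a common per-target form of the training loss: the term involving only ground-truth samples is dropped because it does not depend on the model, and the remainder is halved.

```python
import math


def mean_pairwise_distance(a, b):
    """Average Euclidean distance over all pairs (one point from a, one from b)."""
    total = sum(math.dist(p, q) for p in a for q in b)
    return total / (len(a) * len(b))


def energy_distance(xs, ys):
    """Plug-in estimate of 2 E|X-Y| - E|X-X'| - E|Y-Y'| from two sample sets."""
    return (2.0 * mean_pairwise_distance(xs, ys)
            - mean_pairwise_distance(xs, xs)
            - mean_pairwise_distance(ys, ys))


def energy_loss(model_samples, target):
    """Assumed per-target loss: mean distance to the target minus half the
    mean distance between distinct model samples (needs at least 2 samples)."""
    m = len(model_samples)
    attract = sum(math.dist(s, target) for s in model_samples) / m
    spread = sum(math.dist(model_samples[i], model_samples[j])
                 for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
    return attract - 0.5 * spread


if __name__ == "__main__":
    # Two model samples drawn for one ground-truth latent (toy 2-D values).
    print(energy_loss([[0.0, 0.0], [6.0, 8.0]], [3.0, 4.0]))
```

The first term pulls samples toward the target, while the subtracted spread term rewards diversity among samples, which is what discourages the head from collapsing to a single output.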
Representation-Level Distillation
To further close the quality gap to multi-step diffusion models, AUDIODEAR adopts auxiliary distillation from a fixed diffusion-trained transformer (IMPACT). The student transformer, trained under the energy-distance objective, minimizes mean squared error between its layer-wise hidden representations and those of the teacher. This knowledge transfer enforces strong conditioning and semantic alignment.
The combined loss is:
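$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{energy}} + \lambda\,\mathcal{L}_{\text{distill}}$$
where $\mathcal{L}_{\text{energy}}$ is the energy-distance loss, $\mathcal{L}_{\text{distill}}$ is the contextual representation alignment loss, and $\lambda$ is the distillation weight.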
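As a rough illustration of how the two terms combine, the sketch below computes a layer-wise mean squared error between student and teacher hidden states and adds it to the energy loss. The function names are hypothetical, hidden states are flattened to plain lists, and averaging the per-layer errors is an assumption, since this summary does not say how layers are aggregated.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distill_loss(student_layers, teacher_layers):
    """Average of per-layer MSEs between student and (frozen) teacher states."""
    per_layer = [mse(s, t) for s, t in zip(student_layers, teacher_layers)]
    return sum(per_layer) / len(per_layer)


def total_loss(energy, distill, lam):
    """Combined objective: energy term plus weighted distillation term."""
    return energy + lam * distill


if __name__ == "__main__":
    student = [[1.0, 2.0], [0.0, 0.0]]  # toy hidden states, two layers
    teacher = [[1.0, 4.0], [0.0, 0.0]]
    # lam=1000 matches the best-performing weight reported in the ablations.
    print(total_loss(0.5, distill_loss(student, teacher), lam=1000.0))
```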
Iterative Parallel Decoding
Inference uses iterative parallel decoding over the masked sequence: masked latents are progressively replaced with predicted values, each generated in a single sampling step. This preserves the autoregressive structure of the sequence for alignment and consistency while avoiding a multi-step diffusion loop inside each decoding iteration.
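The decoding loop can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's procedure: latents are scalars, positions are chosen at random, the cosine unmasking schedule is borrowed from common masked-generation practice, and `one_step_head` is a hypothetical stand-in for the energy-scoring head.

```python
import math
import random


def cosine_unmask_counts(num_positions, num_iters):
    """Number of positions to reveal per iteration (cosine schedule, assumed)."""
    counts, revealed = [], 0
    for step in range(1, num_iters + 1):
        # Fraction of positions that should still be masked after this step.
        mask_ratio = math.cos(math.pi / 2 * step / num_iters)
        target = num_positions - math.floor(num_positions * mask_ratio)
        if step == num_iters:
            target = num_positions  # the final step reveals everything left
        counts.append(target - revealed)
        revealed = target
    return counts


def decode(num_positions, num_iters, one_step_head, seed=0):
    """Fill all masked positions; each latent costs exactly one head call."""
    rng = random.Random(seed)
    latents = [None] * num_positions  # None marks a masked position
    for count in cosine_unmask_counts(num_positions, num_iters):
        masked = [i for i, z in enumerate(latents) if z is None]
        chosen = rng.sample(masked, count)
        context = list(latents)  # positions in one iteration share a context
        for pos in chosen:
            noise = rng.gauss(0.0, 1.0)
            latents[pos] = one_step_head(context, pos, noise)
    return latents


if __name__ == "__main__":
    # Placeholder head: ignores the context and returns scaled noise.
    print(decode(8, 3, lambda context, pos, noise: 0.1 * noise))
```

The point of the sketch is the cost structure: the number of head calls equals the number of latent positions, with no inner sampling loop per position.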
Architectural Details
- Latent Representation: Mel spectrograms are encoded by a VAE, then patched and augmented with text embeddings (Flan-T5 and, optionally, CLAP).
- Backbone: IMPACT-Base 24-layer transformer (hidden dim 768); the energy-scoring head is implemented as residual MLP blocks with adaptive layer normalization.
- Distillation Teacher: IMPACT diffusion-based transformer, fixed during student training.
Empirical Evaluation
AUDIODEAR was evaluated on AudioCaps and WavCaps benchmarks, with a unified training set including AudioSet. The system consistently outperformed prior one-step baselines (ConsistencyTTA, SoundCTM, AudioLCM, AudioTurbo) on both objective metrics (Fréchet Distance, Fréchet Audio Distance, KL divergence, Inception Score, CLAP similarity) and subjective measures (text relevance, overall audio quality).
Compared to IMPACT, the state-of-the-art multi-step diffusion TTA model, AUDIODEAR achieves:
- Up to 8.5x faster batch inference for 10-second audio generation
- Only minor degradations in IS (≤8.6%) and CLAP (≤10.2%)
- Significant narrowing of the quality gap between one-step and multi-step models
Ablations
- Increasing distillation weight improves objective scores, peaking at λ=1000.
- Two-sample estimation of the energy distance strikes the best balance; larger sample counts bring marginal fidelity gains but degrade semantic and diversity scores.
- A Classifier-Free Guidance (CFG) scale of 4.0 gives the best semantic alignment and perceptual quality.
- Configuring the energy-scoring head with noise as its main input consistently yields better performance.
- Representation-level CFG is uniquely effective in one-step models due to lack of error accumulation across sampling steps.
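For readers unfamiliar with CFG, the sketch below shows the standard guidance extrapolation applied to a pair of representation vectors. It is an assumption that AUDIODEAR's representation-level CFG takes this form, since this summary does not give the formula, and `cfg_combine` is a hypothetical name.

```python
def cfg_combine(cond, uncond, scale):
    """Standard CFG extrapolation: uncond + scale * (cond - uncond).

    scale=0 returns the unconditional vector, scale=1 the conditional one,
    and scale>1 pushes further along the text-conditioned direction.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    cond = [1.0, 2.0]    # toy representation with the text prompt
    uncond = [0.0, 1.0]  # toy representation with the prompt dropped
    # Scale 4.0 is the setting the ablation reports as best.
    print(cfg_combine(cond, uncond, 4.0))
```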
Comparative Analysis with Alternative One-Step Methods
Toy experiments and benchmark comparisons show that AUDIODEAR's energy-scoring mechanism aligns more tightly with the data manifold (lower MMD/WSD) than the competing Shortcut and MeanFlow approaches. The method also recovers the full spread of the target distributions, mitigating the coverage collapse common in one-step mappings.
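To show how a metric such as MMD can expose coverage collapse, here is a small self-contained sketch. The Gaussian (RBF) kernel, the bandwidth, and the toy 1-D data are all assumptions for illustration; they are not the settings or results of the paper's experiments.

```python
import math


def rbf(p, q, bandwidth):
    """Gaussian kernel between two points."""
    return math.exp(-math.dist(p, q) ** 2 / (2.0 * bandwidth ** 2))


def mean_kernel(a, b, bandwidth):
    """Average kernel value over all pairs (one point from a, one from b)."""
    return sum(rbf(p, q, bandwidth) for p in a for q in b) / (len(a) * len(b))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between two sample sets."""
    return (mean_kernel(xs, xs, bandwidth) + mean_kernel(ys, ys, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth))


if __name__ == "__main__":
    target = [[-2.0], [-1.0], [0.0], [1.0], [2.0]]  # spread-out target samples
    collapsed = [[0.0]] * 5                         # generator stuck at the mean
    covering = [list(p) for p in target]            # generator matching the spread
    print("collapsed:", mmd_squared(collapsed, target))
    print("covering: ", mmd_squared(covering, target))
```

A generator that outputs only the mean of the target scores a clearly positive MMD, while one that reproduces the spread scores zero, which is the kind of difference the toy comparison measures.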
Practical and Theoretical Implications
Practically, AUDIODEAR enables real-time or batch synthesis in TTA systems by sharply reducing latency at the cost of only minor quality degradation. The efficient one-step sampling can be deployed in interactive media, accessibility tools, and multimodal agents, where responsiveness is critical.
Theoretically, combining energy-distance minimization with contextual representation distillation advances the paradigm for rapid generative modeling across continuous modalities, creating new pathways to reconcile the speed-quality trade-off in AR-diffusion frameworks. This technique is transferable to other domains such as image, speech, and video, with potential for broader multimodal generative architectures.
Future Directions
The research identifies promising avenues for further latency reduction, particularly via minimizing AR decoding steps while maintaining perceptual and semantic fidelity. Exploration of more expressive conditional representations or enhanced distillation mechanisms may push one-step approaches closer to oracle multi-step diffusion.
Conclusion
AUDIODEAR demonstrates that one-step latent generation, empowered by energy-scoring and auxiliary distillation, provides an effective recipe for low-latency, high-quality TTA synthesis. The empirical results validate substantial efficiency gains and near-parity in quality to multi-step diffusion, offering practical deployment opportunities and theoretical advances for continuous generative modeling (2605.00329).