
AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs

Published 1 May 2026 in cs.CL and cs.DC | (2605.00539v1)

Abstract: Quantization is a key method for reducing the GPU memory requirement of training LLMs. Yet, current approaches are ineffective for 4-bit activations and 8-bit gradients, which would easily cause slow convergence or accuracy loss. To address this, we introduce AGoQ, incorporating two new techniques: 1) a layer-aware activation quantization algorithm that allocates appropriate bit-widths for activations of various layers based on their types and pipeline stages to achieve near 4-bit activation storage, and 2) a gradient quantization algorithm that reduces memory usage and shortens communication time by employing 8-bit gradient storage and precision-preserving 8-bit All-Reduce communication. We conduct extensive experiments using different sizes of LLMs on two GPU clusters (up to 64 GPUs), and the experimental results show that our AGoQ reduces the memory by up to 52% and achieves up to 1.34× improvement of training speed compared to state-of-the-art training systems Megatron-LM (w/ or w/o ZeRO), COAT and DeepSpeed with 8B to 32B LLaMA models, while achieving convergence loss on pretraining and comparable accuracy on downstream tasks with LLaMA architectures.

Summary

  • The paper introduces a quantization framework (AGoQ) that applies layer-aware 4-bit activation quantization (excluding attention layers) and 8-bit gradient quantization to reduce memory and communication overhead.
  • It employs dynamic bit-width assignment based on theoretical error bounds along with fused CUDA kernels to achieve up to 1.34× throughput speedup while preserving training accuracy.
  • Experimental results demonstrate up to 52% reduction in memory footprint and robust convergence on LLaMA variants, making distributed training scalable for large LLMs.

Memory-Efficient LLM Training via Layer-Aware Quantization: The AGoQ Framework

Introduction

LLMs have driven substantial advances in natural language processing, but their training imposes severe memory and communication demands on distributed GPU clusters. While techniques like mixed-precision training, activation recomputation, optimizer state quantization, and communication optimizations have incrementally improved efficiency, further progress is often constrained by memory consumption from activations and gradients—particularly as model, sequence, and batch sizes scale. AGoQ ("Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs") (2605.00539) presents a quantization framework addressing these core bottlenecks using a joint, theoretically grounded approach to activation and gradient quantization, tailored for practical, scalable LLM training with competitive accuracy.

Core Contributions

AGoQ introduces two synergistic innovations for reducing memory and communication overhead in distributed LLM training:

  1. Layer-Aware Activation Quantization (LAAQ): Recognizing heterogeneous sensitivity to quantization across layer types and pipeline stages, AGoQ adaptively quantizes activations down to approximately 4 bits per element, based on per-layer theoretical error bounds. Crucially, it excludes Attention activations from quantization due to their adverse impact on gradient stability, establishing a principled lower bound for quantization granularity.
  2. Precision-Preserved 8-bit Gradient Quantization (QuanGrad): Gradients are stored and communicated in 8 bits, using a blockwise quantization scheme combined with communication patterns that avoid the overflow and accumulation issues endemic to fixed-point collective operations. This includes a decomposition of All-Reduce into All-to-All and All-Gather steps with local dequantization and accumulation in higher precision, followed by requantization.

These advances are tightly integrated within a distributed pipeline-parallel and tensor-parallel LLM training paradigm (e.g., Megatron-LM/DeepSpeed), supporting adaptive recomputation, kernel-level fusion of quantization/dequantization with GEMM, and compatibility with existing 8-bit optimizer quantization.

Quantization Methodology and Theoretical Analysis

Activation Quantization

AGoQ's analysis demonstrates that quantizing activations in layers such as RMSNorm, SiLU, and non-attention MLPs introduces bounded gradient errors that remain asymptotically smaller (or comparable) if intermediate values are recomputed from quantized inputs rather than cached, given appropriate norm inequalities. In contrast, gradient errors for Attention activations grow rapidly with sequence length and embedding dimension, precluding robust 4-bit quantization of Attention projections.
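To make "4 bits per element" concrete, the following is a minimal NumPy sketch of blockwise symmetric 4-bit quantization with two codes packed into each byte. It is an illustration under assumptions, not the paper's kernel: the function names, the block size of 64, the signed-integer code range, and the absmax scaling rule are all hypothetical choices, and the summary does not say which 4-bit format AGoQ uses.

```python
import numpy as np

def quantize_4bit(x, block=64):
    """Blockwise symmetric 4-bit quantization (hypothetical helper).

    Returns packed uint8 codes (two 4-bit codes per byte) and one
    float32 scale per block.
    """
    flat = x.astype(np.float32).ravel()
    assert block % 2 == 0 and flat.size % block == 0
    blocks = flat.reshape(-1, block)
    # One scale per block maps the largest magnitude to the code 7.
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0  # avoid division by zero for all-zero blocks
    codes = np.clip(np.rint(blocks / scale), -7, 7).astype(np.int8)
    # Shift codes to the unsigned range [1, 15] and pack pairs into bytes.
    u = (codes + 8).astype(np.uint8).reshape(-1, 2)
    packed = (u[:, 0] << 4) | u[:, 1]
    return packed, scale

def dequantize_4bit(packed, scale, shape, block=64):
    """Unpack the 4-bit codes and rescale them to float32."""
    hi = (packed >> 4).astype(np.int8) - 8
    lo = (packed & 0x0F).astype(np.int8) - 8
    codes = np.stack([hi, lo], axis=1).reshape(-1, block)
    return (codes * scale).reshape(shape)

rng = np.random.default_rng(0)
act = rng.standard_normal((8, 128)).astype(np.float32)  # stand-in activation
packed, scale = quantize_4bit(act)
restored = dequantize_4bit(packed, scale, act.shape)
print(packed.nbytes, "bytes for", act.size, "elements")
print("max abs error:", float(np.abs(act - restored).max()))
```

With this scheme the per-element error is bounded by half the block scale, which is why outliers inside a block matter: one large value raises the error for every element sharing its scale.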

The strategy further exploits interleaved pipeline parallelism to apply dynamic bit-width assignment: devices with lower peak activation mini-batch counts (and thus more available memory) may use higher bit-widths for sensitive activations, maximizing accuracy without exceeding memory budgets. Uniform 4-bit quantization across all layers is empirically demonstrated not to converge, reinforcing the necessity of nuanced, layer- and stage-aware bit allocation.
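The stage-aware idea can be sketched as a simple budget check. The code below is a simplified illustration, not the paper's algorithm: `assign_bit_widths`, its parameters, and the candidate widths (4, 8, 16) are assumptions, and it ignores the per-layer-type sensitivity that AGoQ also considers. Each stage receives the widest bit-width whose stored activations, summed over the micro-batches that stage holds at its peak, still fit within a per-device budget.

```python
def assign_bit_widths(in_flight, elems_per_microbatch, budget_bytes,
                      candidates=(4, 8, 16)):
    """Pick, per pipeline stage, the widest bit-width that fits the budget.

    in_flight: peak number of micro-batches whose activations each stage holds.
    elems_per_microbatch: stored activation elements per micro-batch.
    budget_bytes: activation memory budget per device.
    All names and values here are hypothetical.
    """
    widths = []
    for n in in_flight:
        fitting = [b for b in sorted(candidates)
                   if n * elems_per_microbatch * b / 8 <= budget_bytes]
        # Fall back to the narrowest width if no candidate fits.
        widths.append(fitting[-1] if fitting else min(candidates))
    return widths

# Earlier stages hold more in-flight micro-batches, so they get fewer bits.
print(assign_bit_widths([4, 3, 2, 1], 1_000_000, 2_000_000))  # [4, 4, 8, 16]
```

The example input assumes that earlier stages hold more micro-batches at their peak than later ones; the actual counts depend on the pipeline schedule in use.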

Gradient Quantization

Blockwise FP8 quantization is used for gradients, with special handling to avoid summation overflow during local accumulation and distributed All-Reduce. By locally dequantizing accumulated gradients to FP16/BF16, summing in high precision, and requantizing, AGoQ avoids catastrophic precision loss. All-Reduce is decomposed into separate All-to-All and All-Gather steps, which reduces communication overhead and is more robust to the reduced representational capacity of 8-bit values.
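The decomposition can be simulated in a single process to show where the higher-precision accumulation happens. This is a sketch under stated assumptions rather than the paper's implementation: workers are modeled as a Python list, blockwise int8 with absmax scaling stands in for the FP8 format (NumPy has no FP8 type), float32 stands in for FP16/BF16 accumulation, and `BLOCK`, `quantize`, `dequantize`, and `quantized_all_reduce` are hypothetical names.

```python
import numpy as np

BLOCK = 32  # hypothetical block size

def quantize(x):
    """Blockwise int8 quantization (a stand-in for FP8 in this sketch)."""
    blocks = x.reshape(-1, BLOCK)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0
    codes = np.rint(blocks / scale).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    return (codes.astype(np.float32) * scale).ravel()

def quantized_all_reduce(grads):
    """Simulate All-Reduce as All-to-All, local reduce, then All-Gather."""
    world = len(grads)
    # Each worker splits its gradient into `world` chunks and quantizes each.
    sent = [[quantize(c) for c in np.split(g, world)] for g in grads]
    reduced = []
    for r in range(world):
        # All-to-All: worker r receives chunk r from every worker, in 8 bits.
        received = [sent[w][r] for w in range(world)]
        # Dequantize first, then sum in float32 so 8-bit values never overflow.
        total = sum(dequantize(c, s) for c, s in received)
        # Requantize the reduced chunk before sending it out again.
        reduced.append(quantize(total))
    # All-Gather: every worker collects all reduced chunks, again in 8 bits.
    return np.concatenate([dequantize(c, s) for c, s in reduced])

rng = np.random.default_rng(0)
grads = [rng.standard_normal(256).astype(np.float32) for _ in range(4)]
approx = quantized_all_reduce(grads)
exact = sum(grads)
print("relative error:", float(np.linalg.norm(approx - exact) / np.linalg.norm(exact)))
```

Only 8-bit codes and their block scales cross the simulated network in both steps, while each sum is formed once, locally, in higher precision. Summing the 8-bit codes directly would not work, both because the sum can exceed the 8-bit range and because each worker's blocks carry different scales.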

The system incorporates fused CUDA kernels for quantization/dequantization and GEMM, introducing negligible additional latency relative to standard computation while providing significant peak memory and throughput improvements.

Experimental Results

AGoQ is evaluated on two GPU clusters of up to 64 GPUs, with InfiniBand interconnect on the 64-GPU cluster, using LLaMA variants (7B, 8B, 13B, 34B), OLMo-1B, and CodeLLaMA-34B. Its performance is benchmarked against Megatron-LM (with or without ZeRO), COAT, DeepSpeed, and FP8-LM.

Memory and Throughput Efficiency

  • Memory Reduction: Up to 52% reduction in training memory footprint versus baseline, with per-component memory savings of 30% for activations (over COAT) and 75% for gradients.
  • Throughput Speedup: Up to 1.34× faster training for long sequences and large models (e.g., LLaMA2-13B at 80K tokens) compared to Megatron-LM/ZeRO-1. The speedup holds across scaling regimes (number of GPUs, model size, pipeline- and tensor-parallel degree), averaging up to 1.23× across diverse tasks.
  • Out-of-Memory (OOM) Avoidance: Fewer OOM runs compared to Megatron-LM/ZeRO under aggressive scaling or long sequence configurations.
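The 75% figure for gradients matches what a move from 32-bit to 8-bit storage would give, although the summary does not state the baseline precision, so the 32-bit baseline here is an assumption. The short calculation below also shows how one scale per block slightly reduces the saving; `storage_saving`, the block size of 128, and the 32-bit scale are hypothetical.

```python
def storage_saving(baseline_bits, quant_bits, block=None, scale_bits=32):
    """Fraction of storage saved, counting one scale per block if given."""
    overhead = scale_bits / block if block else 0.0  # extra bits per element
    return 1.0 - (quant_bits + overhead) / baseline_bits

print(storage_saving(32, 8))             # 0.75
print(storage_saving(32, 8, block=128))  # 0.7421875
```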

Accuracy and Convergence

  • Convergence Robustness: Training curves for LLaMA2-7B and LLaMA3-8B track full-precision baselines, with no observable degradation in pretraining or downstream (zero-shot) task accuracy (mean differences < 0.02 on ARC, PIQA, HellaSwag, SciQ, Winogrande, etc.).
  • Ablations: Uniform 4-bit quantization fails to converge, underscoring the necessity of layer-aware allocation. Each quantization module (activations, gradients, optimizer state) contributes to aggregate memory saving.
  • Gradient Error: Empirically, the largest normalized L2 errors appear in Attention (which remains full-precision); quantized layers exhibit minimal error, in agreement with theoretical predictions.

Distributed Communication and Practicality

  • Kernel Fusion: Fusing quantization with GEMM achieves an additional 1.07× kernel speedup, reflected in end-to-end speed improvements.
  • Bandwidth Adaptivity: Communication speedups (up to 3.7× over All-Reduce) persist under commodity bandwidth conditions (10 Gbps), broadening applicability to less specialized infrastructure.

Practical and Theoretical Implications

AGoQ demonstrates that fine-grained, layer-sensitive quantization—grounded in error norm and layer-type analysis—can yield substantial efficiency gains in LLM training without loss of accuracy. Its approach contrasts with prior uniform quantization methods by exploiting architectural specifics, memory usage heterogeneity in pipeline stages, and tailored gradient communication, making it suitable for next-generation models with high sequence or embedding dimensionality.

Practically, the method increases effective model and sequence size for infrastructure-limited clusters, facilitates rapid experimentation, and supports longer-context models within fixed resource envelopes. Its compatibility with optimizer quantization and integration into mainstream frameworks (e.g., Megatron-LM) supports adoption by practitioners.

Theoretically, the approach establishes a methodology for quantization error analysis in deep distributed settings, suggesting future quantization schemes should be similarly adaptive and analytically justified. For gradient compression, the scheme illustrates the importance of communication-aware accumulation and non-uniform collective operation design.

Future Directions

Potential future work includes:

  • Extension to Other Architectures: Adapting LAAQ for models with extensive cross-layer dependencies or new normalization/attention variants.
  • Mixed-Precision Adaptation: Population-based or learned bit-width allocation as a function of training dynamics and error gradients.
  • Robustness and Compression Beyond LLMs: Extension to vision or multimodal architectures, or integration with post-training quantization and inference acceleration pipelines.

Conclusion

AGoQ provides an evidence-based, highly effective framework for memory- and communication-efficient LLM training, combining selective 4-bit activation quantization and precision-preserving 8-bit gradient quantization tailored by layer type and pipeline stage (2605.00539). It achieves strong reductions in memory footprint and improves throughput, all while maintaining convergence parity with full-precision training, establishing a practical template for the scalable training of next-generation foundation models.
