- The paper presents a comprehensive taxonomy of LLM optimizers, clarifying designs from AdamW to matrix-based methods.
- It analyzes memory, computation, and convergence trade-offs, emphasizing the impact of optimizer state on large-scale training.
- The study highlights system-level challenges and outlines future research directions for scalable, efficient LLM optimization.
Navigating the LLM Optimizer Landscape: From AdamW to Memory-Efficient and Matrix-Based Methods
Introduction and Problem Context
LLM training, dominated by autoregressive Transformers at trillion-scale parameters and tokens, exposes significant optimization, memory, and systems constraints. AdamW is entrenched as the de facto optimizer in this domain because of its robust convergence, adaptive scaling, and decoupled regularization. Its per-parameter moment states, however, make optimizer memory grow linearly with model size, and that cost has catalyzed a proliferation of alternative optimizers. This survey, "Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers" (2605.09176), delivers a structured, systems- and optimization-centric synthesis of the evolving optimizer landscape, mapping the design patterns, trade-offs, and research frontiers that matter for scalable, efficient LLM training.
Taxonomy of Optimizer Families for LLMs
This work develops a comprehensive taxonomy, organizing optimizers along five axes: update geometry, state memory, structural assumptions, target regime, and evaluation criteria. The principal families are as follows:
- Classical First-Order: SGD, momentum, and variants serve as low-memory baselines but lack robustness to gradient heterogeneity observed in Transformer blocks.
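- Adaptive Diagonal: Adam, AdamW, Adafactor. These achieve coordinate-wise learning rates, but Adam and AdamW carry high memory demands (first and second moments for each parameter); Adafactor factors the second moment and therefore also appears under the memory-efficient family.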
- Large-Batch and Distributed: LAMB and related methods stabilize optimization at high batch size via trust ratio mechanisms, addressing throughput on parallel hardware.
- Memory-Efficient: Adafactor (factored moments), 8-bit optimizers (quantized states), Adam-mini (grouped adaptivity), LOMO (fused fine-tuning), GaLore (low-rank gradient projection), APOLLO-like (SGD-level memory with AdamW-level performance).
- Sign-Based and Discovered: signSGD and Lion (symbolically discovered) forego per-parameter scale adaptation for sign-based updates, reducing memory/complexity but changing learning-rate sensitivity.
- Curvature-Aware and Second-Order: Shampoo, Sophia, and SOAP-like optimizers use scalable diagonal/matrix approximations of the Hessian or Fisher, trading memory for better conditioning.
- Low-Rank/Projection-Based: GaLore and related methods project gradients into low-rank subspaces, updating full parameter sets at a fraction of the memory cost.
- Matrix-Based and Orthogonalized: Muon orthogonalizes momentum updates using Newton–Schulz iterations, leveraging the matrix structure of large weight tensors.
- Quasi-Newton: L-BFGS, mL-BFGS, and related stochastic quasi-Newton variants enable curvature-inspired updates with bounded history rather than full Hessian storage.
The paper emphasizes that optimizer family selection must be examined through the systems lens: peak memory, wall-clock throughput, numerical stability, implementation overhead, and architectural compatibility are as critical as convergence per token.
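To make the sign-based family in the list above concrete, here is a minimal sketch of a Lion-style update for one scalar parameter. The function name `lion_step` and the default hyperparameters are illustrative assumptions, not values taken from the survey. It shows why the step size is independent of gradient magnitude and why only one state buffer is needed.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(theta, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion-style update for a scalar parameter; returns (theta, m).

    Hypothetical helper for illustration: only a single momentum value `m`
    is kept per parameter, versus two moments for AdamW.
    """
    # The update direction is the sign of an interpolation of momentum and gradient.
    direction = sign(beta1 * m + (1.0 - beta1) * grad)
    # Every coordinate moves by exactly lr (plus decoupled weight decay).
    theta = theta - lr * (direction + weight_decay * theta)
    # The momentum is updated with a second, slower interpolation factor.
    m = beta2 * m + (1.0 - beta2) * grad
    return theta, m
```

Because the step has fixed magnitude, the learning rate plays a different role than in AdamW, which is one way to read the remark about changed learning-rate sensitivity.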
Analysis of Classical, Adaptive, and Memory-Efficient Optimizers
Classical and Adaptive Baselines
SGD and momentum, while computationally and memory efficient, are typically suboptimal for massive LLMs because they cannot account for parameter-specific gradient scales and heterogeneity. AdamW addresses these issues by combining momentum and diagonal adaptation [kingma2015adam] with decoupled weight decay [loshchilov2019decoupled], and it empirically dominates LLM pretraining regimes.
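The three ingredients named above (momentum, diagonal adaptation, decoupled weight decay) can be sketched for a single scalar parameter as follows. The function name `adamw_step` and the default hyperparameters are illustrative assumptions; this is a sketch of the standard update rule, not the survey's own code.

```python
import math


def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter; returns (theta, m, v).

    `t` is the 1-based step count used for bias correction.
    """
    m = beta1 * m + (1.0 - beta1) * grad          # first moment (momentum)
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second moment (per-coordinate scale)
    m_hat = m / (1.0 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)                # bias-corrected second moment
    # Decoupled weight decay: applied to the parameter directly,
    # not folded into the gradient and therefore not rescaled by v_hat.
    theta = theta - lr * weight_decay * theta
    # Adaptive step: momentum divided by the root of the second moment.
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```

Note that `m` and `v` must both be stored for every parameter, which is the source of the memory cost discussed next.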
Architectural Challenges
The survey details how optimizer-state memory (for AdamW, typically two fp32 values per parameter, that is, two state tensors each the size of the model) limits feasible model size, context length, and batch size. The bottleneck is especially pronounced in mixed-precision and distributed regimes [shazeer2018adafactor, rajbhandari2020zero]. Distributed strategies such as optimizer sharding (e.g., ZeRO, FSDP) provide partial relief by dividing the state across devices, but they do not remove the inherent linear scaling of optimizer state with parameter count.
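A back-of-the-envelope calculation makes the linear scaling visible. The sketch below assumes two fp32 state values per parameter and perfectly even sharding; the function names are hypothetical, and real systems add padding, master weights, and fragmentation that are not modeled here.

```python
def adamw_state_bytes(num_params, bytes_per_value=4, num_state_tensors=2):
    """Bytes of optimizer state: one value per parameter per state tensor."""
    return num_params * bytes_per_value * num_state_tensors


def per_rank_state_bytes(num_params, num_ranks, bytes_per_value=4,
                         num_state_tensors=2):
    """Per-device bytes when the state is sharded evenly (assumed) across ranks."""
    total = adamw_state_bytes(num_params, bytes_per_value, num_state_tensors)
    return -(-total // num_ranks)  # ceiling division
```

For an illustrative 7-billion-parameter model, `adamw_state_bytes(7_000_000_000)` gives 56,000,000,000 bytes (56 GB) of moment state, and `per_rank_state_bytes(7_000_000_000, 8)` gives 7 GB per device. Sharding divides the total but the total still grows in proportion to the parameter count.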
Memory-Efficient Systems
Adafactor exploits the matrix shape of Transformer parameters, using factorized second-moment statistics (O(m+n) instead of O(mn) values for each m×n matrix), which shrinks the second-moment state substantially for large matrices [shazeer2018adafactor]. 8-bit optimizers use blockwise quantization to shrink state further, with reportedly negligible impact on convergence [dettmers2022optimizers]. Adam-mini reduces adaptive state by grouping parameters, maintaining only a fraction of the per-parameter learning rates [zhang2025adammini]. LOMO fuses gradient computation with parameter updates, targeting memory-constrained full-parameter fine-tuning [lv2024lomo]. APOLLO is reported to reach SGD-like memory costs with AdamW-level convergence [han2024apollo], empirically challenging the necessity of two moment tensors.
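The factored second moment can be illustrated with a small sketch that keeps only row and column sums of the squared gradient and rebuilds a rank-1 estimate from them. This omits the exponential moving average and the epsilon regularization a real Adafactor implementation would use; the function names are hypothetical.

```python
def factored_second_moment(grad):
    """Return (row_stats, col_stats) for an m x n gradient given as a list of rows.

    Only m + n numbers are stored instead of m * n.
    """
    squared = [[g * g for g in row] for row in grad]
    row_stats = [sum(row) for row in squared]         # length m
    col_stats = [sum(col) for col in zip(*squared)]   # length n
    return row_stats, col_stats


def reconstruct_second_moment(row_stats, col_stats):
    """Rank-1 estimate: entry (i, j) is row_i * col_j / sum(row_stats).

    Assumes the gradient is not all zeros, so the total is positive.
    """
    total = sum(row_stats)
    return [[r * c / total for c in col_stats] for r in row_stats]
```

When the squared gradient is itself rank 1, the reconstruction is exact; otherwise it is an approximation, which is the price paid for the reduced state.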
Beyond Coordinate-Wise Adaptivity: Structural and Matrix-Based Advances
Low-Rank and Projection Approaches
GaLore [zhao2024galore] exemplifies the low-rank family, maintaining optimizer state within a dynamic or fixed-rank subspace for large matrices. This provides substantial memory relief as long as the effective rank of the gradients remains low. Low effective rank is an empirical feature of Transformer gradients, although the rank required varies with training phase and scale.
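The memory relief can be estimated with simple accounting. The sketch below assumes one common layout for an m×n matrix with m ≤ n: an m×r projector is stored alongside two r×n moment buffers, compared with two full m×n buffers for AdamW. The function names and the layout are assumptions made for illustration.

```python
def full_adam_state_values(m, n):
    """Two full moment buffers, each m x n."""
    return 2 * m * n


def low_rank_state_values(m, n, rank):
    """One m x rank projector plus two rank x n moment buffers (assumed layout)."""
    return m * rank + 2 * rank * n


def state_reduction_factor(m, n, rank):
    """How many times smaller the low-rank state is than the full state."""
    return full_adam_state_values(m, n) / low_rank_state_values(m, n, rank)
```

For a hypothetical 4096×11008 matrix at rank 128, the full state holds 90,177,536 values and the low-rank state 3,342,336, roughly 27 times fewer. The saving shrinks as the required rank grows, which is why the rank behavior of the gradients matters.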
Matrix-Based and Orthogonalized Updates
Muon [jordan2024muon, liu2025muon] posits that matrix regularization and geometric structure should be exploited directly in optimizer updates, not via post-hoc normalization or parameter-specific learning rates. Orthogonalizing update matrices with Newton–Schulz iterations is computationally tractable on modern hardware, since it needs only matrix multiplications, and it aligns more closely with the linear-algebraic structure of Transformer layers. Empirical results in subsequent work [liu2025muon] indicate that Muon scales to LLM pretraining, enabling stable and efficient optimization with lower optimizer memory than AdamW on the matrix parameters it handles, since it keeps a single momentum buffer rather than two moment tensors.
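The orthogonalization step can be sketched with the textbook cubic Newton–Schulz iteration, X ← 1.5·X − 0.5·X·Xᵀ·X, applied after scaling the input so its singular values lie in (0, 1]. This is a stand-in: Muon implementations are understood to use a tuned higher-order polynomial with only a few steps, so the iteration, step count, and helper names below are assumptions for illustration. In Muon the input would be the momentum matrix of a weight tensor.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(mat, steps=15):
    """Approximate the orthogonal polar factor of a square, full-rank matrix.

    Uses only matrix multiplications, which is why the approach maps well
    onto accelerator hardware.
    """
    # Scale by the Frobenius norm so every singular value is at most 1.
    fro = math.sqrt(sum(v * v for row in mat for v in row))
    x = [[v / fro for v in row] for row in mat]
    for _ in range(steps):
        x_xt_x = matmul(matmul(x, transpose(x)), x)
        # Cubic update pushes each singular value toward 1.
        x = [[1.5 * a - 0.5 * b for a, b in zip(row_x, row_c)]
             for row_x, row_c in zip(x, x_xt_x)]
    return x
```

After enough steps the result `q` satisfies `q qᵀ ≈ I`, meaning the update keeps the directions of the input matrix but equalizes its singular values.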
Curvature-Aware and Second-Order Techniques
Full Newton updates are intractable at LLM scale, so scalable methods approximate them. Shampoo maintains Kronecker-factored preconditioners along tensor dimensions, balancing fidelity and memory [gupta2018shampoo, anil2021scalable]. Sophia approximates diagonal Hessian information for preconditioning stochastic gradients and demonstrates improved convergence in language-model pretraining with minimal overhead [liu2024sophia]. Practical implementations of such methods must handle mixed precision, distributed computation, and update-frequency trade-offs.
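A diagonal curvature-preconditioned step in the spirit of Sophia can be sketched as follows. The update form used here (momentum divided by a scaled curvature estimate, then clipped elementwise to [-1, 1]) is an assumption about Sophia's rule rather than something the survey text spells out, and the function name, defaults, and the way `h` is obtained are hypothetical.

```python
def sophia_style_step(theta, m, h, lr=1e-4, gamma=0.01, eps=1e-12):
    """Elementwise clipped, curvature-preconditioned update; returns new theta.

    theta, m, h are equal-length lists: parameters, momentum, and a
    nonnegative diagonal curvature estimate (assumed to be given).
    """
    new_theta = []
    for p, mom, curv in zip(theta, m, h):
        # Precondition by curvature; eps guards against division by zero.
        ratio = mom / max(gamma * curv, eps)
        # Clipping bounds the step where the curvature estimate is tiny or noisy.
        ratio = max(-1.0, min(1.0, ratio))
        new_theta.append(p - lr * ratio)
    return new_theta
```

Where the curvature estimate is near zero the clipped step behaves like a sign update of size `lr`, and where curvature is large the step shrinks accordingly.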
Benchmarking, Methodology, and Fair Evaluation
The review identifies rigorous methodology as critical for optimizer assessment, highlighting pitfalls such as under-tuned AdamW baselines, early-curve overclaiming, and incomplete memory reporting. The recommended evaluation protocol emphasizes:
- Comparison on identical model, batch, token budget, and hardware setups
- Multi-objective reporting: validation loss per token, wall-clock time, peak memory, downstream task transfer, and resource-constrained regimes
- Explicit reporting of optimizer configuration, state precision, parameter grouping, and distributed/sharding support
- Ablations for rank, projection intervals, and orthogonalization steps, as applicable
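One way to operationalize the protocol above is a small record type that forces each run to report the same fields, plus a check that two runs are actually comparable. The class, field names, and helper below are hypothetical and only mirror the items listed; they are not part of the survey.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunReport:
    """Hypothetical per-run record covering the reporting items listed above."""
    optimizer: str
    model: str
    batch_size: int
    token_budget: int
    hardware: str
    state_precision: str      # e.g. "fp32" or "int8"
    val_loss: float
    wall_clock_hours: float
    peak_memory_gb: float


def is_fair_pair(a, b):
    """True when two runs share model, batch size, token budget, and hardware."""
    return (a.model, a.batch_size, a.token_budget, a.hardware) == (
        b.model, b.batch_size, b.token_budget, b.hardware)
```

A comparison script could refuse to rank two optimizers unless `is_fair_pair` holds, and then report validation loss, wall-clock time, and peak memory side by side rather than a single headline metric.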
Recent empirical studies [zhao2025deconstructing, semenov2025benchmarking] underscore that optimizer advantages often compress under rigorous, scale-aware tuning and that early-stage gains may wash out at scale.
Open Problems and Future Research Trajectories
The survey outlines several critical open problems:
- Optimizer Scaling Laws: Comprehensive understanding of optimizer-token-parameter scaling relationships is lacking; small-scale gains are not reliably predictive at LLM scales.
- Memory–Compute–Convergence Frontiers: Quantifying multi-resource trade-offs (memory per parameter, token efficiency, step overhead) is necessary for principled optimizer selection.
- Parameter-Group Specific and Adaptive Assignment: Heterogeneous groups (e.g., MLP, attention, embeddings, normalization) may have distinct optimizer requirements, necessitating hybrid or learned allocation of optimizer strategies.
- Matrix-Theoretic and Numerical Analysis: The precise optimization principles underpinning matrix-based (e.g., Muon) and low-rank update geometries are not fully understood. Theoretical progress here would enable more robust adaptation and scaling.
- Long-Context, Fine-Tuning, and Post-Training Regimes: Optimizer requirements in continual pretraining, domain adaptation, and RLHF/post-training diverge from bulk pretraining and require systematized study.
The field is explicitly moving toward system- and architecture-aware optimizer design, encompassing hybrid strategies, dynamic adaptation, and hardware–algorithm co-design.
Conclusion
The LLM optimizer landscape has evolved beyond a single dominant rule. AdamW remains foundational due to its empirical stability, but memory, architecture, and systems-level constraints have motivated the exploration of memory-efficient, sign-based, matrix-aware, curvature-informed, and projection-based variants. These alternatives involve significant trade-offs among memory, wall-clock efficiency, implementation complexity, and convergence quality, particularly as model and data scales continue to rise. A mature, scalable optimization science for LLMs will require unified progress across algorithm design, theoretical characterization, benchmarking methodology, and distributed systems engineering, with a continued emphasis on fair, multi-metric, scale-aware comparison.
The primary implication is that optimizer research is now a multi-objective, system-embedded field. Future advances will be incremental, context-specific, and tightly integrated with evolving LLM training infrastructure and theoretical understanding, rather than radical leaps from a single replacement for AdamW.