Accelerating LMO-Based Optimization via Implicit Gradient Transport

Published 7 May 2026 in cs.LG and cs.AI | (2605.05577v1)

Abstract: Recent optimizers such as Lion and Muon have demonstrated strong empirical performance by normalizing gradient momentum via linear minimization oracles (LMOs). While variance reduction has been explored to accelerate LMO-based methods, it typically incurs substantial computational overhead due to additional gradient evaluations. At the same time, the theoretical understanding of LMO-based methods remains fragmented across unconstrained and constrained formulations. Motivated by these limitations, we propose \emph{LMO-IGT}, a new class of stochastic LMO-based methods leveraging implicit gradient transport (IGT). We further introduce a unified framework for stochastic LMO-based optimization together with a new stationarity measure, the \emph{regularized support function} (RSF), which bridges gradient-norm and Frank--Wolfe-gap notions within a common framework. By evaluating stochastic gradients at transported points, LMO-IGT accelerates convergence while retaining the single-gradient-per-iteration structure of standard stochastic LMO. Our analysis establishes that stochastic LMO achieves an iteration complexity of $\mathcal{O}(\varepsilon^{-4})$, variance-reduced LMO achieves $\mathcal{O}(\varepsilon^{-3})$ at the cost of additional gradient evaluations, and LMO-IGT achieves $\mathcal{O}(\varepsilon^{-3.5})$ using only a single stochastic gradient per iteration. Empirically, LMO-IGT consistently improves over stochastic LMO counterparts with negligible overhead. Among its instantiations, Muon-IGT achieves the strongest overall performance across evaluated settings, demonstrating that IGT provides an effective and practical acceleration mechanism for modern LMO-based optimization.

Abstract PDF Upgrade to Chat

Authors (2)

Summary

The paper introduces LMO-IGT, a novel method that leverages implicit gradient transport to accelerate convergence without extra gradient evaluations.
It establishes a unified RSF-based framework bridging unconstrained and constrained optimization via normalized and sign-based momentum updates.
Empirical evaluations on ResNet-18/CIFAR-10 show that Muon-IGT outperforms standard LMO and variance-reduced methods in both accuracy and computational efficiency.

Accelerating LMO-Based Optimization via Implicit Gradient Transport

Introduction and Problem Formulation

This paper addresses the optimization of large-scale, stochastic, and potentially nonconvex objectives as encountered in deep learning. The focus is on LMO-based (Linear Minimization Oracle) algorithms—especially those typified by Lion and Muon—that have gained recent empirical traction due to their ability to normalize gradient momentum and exploit structured update geometries. Despite progress, accelerated convergence typically necessitates variance reduction, which doubles gradient evaluation cost per iteration and is thus computationally burdensome. Moreover, theoretical analyses for unconstrained (Euclidean) and constrained (Frank–Wolfe-type) formulations are fragmented, impeding a unified understanding.

The key questions posed are:

Can LMO-based optimizers achieve acceleration beyond the $O(\epsilon^{-4})$ stochastic rate without the full cost of variance reduction?
Can one design and analyze these methods in a unified framework applicable to both unconstrained and constrained settings?

Unified Framework and Regularized Support Function

The authors present a unified framework for stochastic LMO-based optimization, parameterizing methods using a generic LMO update class that encapsulates normalized SGD, sign-based momentum, and matrix-structured updates. They introduce the Regularized Support Function (RSF) as a stationarity measure bridging the dual norm (for unconstrained problems) and the Frank–Wolfe gap (for constrained optimization), enabling direct comparison across algorithmic and geometric variants.

Given a convex set $\mathcal{C}$ and regularization $\lambda \geq 0$ , the RSF is defined as:

$\Psi_{\mathcal{C},\lambda}(w) := \sup_{v\in\mathcal{C}} \langle -\nabla F(w), v - \lambda w \rangle.$

This interpolates between the support function $h_{\mathcal{C}}(-\nabla F(w))$ (unconstrained) and $\lambda$ times the Frank–Wolfe gap $G_{\mathcal{P}}(w)$ (constrained with $\mathcal{P} = \lambda^{-1} \mathcal{C}$ ).

LMO-IGT: Implicit Gradient Transport for LMO-Based Methods

The principal algorithmic contribution is the LMO-IGT method, which adapts implicit gradient transport (IGT) to the LMO-based paradigm. The classical IGT technique mitigates momentum lag in normalized SGD by transporting gradients to a lookahead point, aligning the momentum buffer with the actual optimization trajectory. The novel insight here is that this mechanism generalizes to arbitrary LMO directions induced by $\mathcal{C}$ , e.g., for sign-based Lion, matrix-based Muon, or others.

IGT is implemented by updating momenta using gradients at an extrapolated point $x_t$ , which is a convex combination of the prior iterate(s) and the current LMO direction parameterized according to the momentum schedule. This preserves single stochastic gradient evaluation per iteration.

Figure 1: Stochastic momentum (left) and IGT (right); IGT aligns the momentum buffer to the optimization trajectory via extrapolated point evaluation.

Theoretical Results: Unified Stochastic Rates

Detailed non-asymptotic convergence rates under the RSF are established for three algorithmic classes—plain stochastic LMO, variance-reduced LMO (LMO-VR), and the new LMO-IGT:

Stochastic LMO achieves an $\mathcal{C}$ 0 iteration complexity, i.e., to reach expected RSF $\mathcal{C}$ 1, $\mathcal{C}$ 2 gradient oracle accesses suffice.
LMO-VR leverages STORM-like corrections, reducing complexity to $\mathcal{C}$ 3, at the price of doubling the gradient cost per iteration.
LMO-IGT attains $\mathcal{C}$ 4 complexity, matching much of the benefit of VR without extra gradient evaluations. This acceleration is enabled by exploiting second-order smoothness via gradient evaluation at transported points.

These results are obtained under standard $\mathcal{C}$ 5-smoothness, bounded variance, and when appropriate, $\mathcal{C}$ 6-Lipschitz Hessian assumptions. The analysis is precise, with exact dependencies on momentum parameters and learning rates. Importantly, all are cast in terms of the RSF, allowing direct cross-geometry comparison.

Empirical Evaluation

Experiments on ResNet-18 / CIFAR-10 (as well as more extensive setups in the appendix) compare AdamW, NIGT, Lion, Muon, their VR variants, and IGT-enabled versions. Hyperparameter grids are carefully searched for each method to ensure fair comparison.

Muon-IGT consistently yields the highest test accuracy and lowest test loss, outperforming both NIGT (Euclidean IGT) and Muon (operator-norm LMO) baselines. Notably, VR variants (e.g., Muon-VR, Lion-VR) sometimes fail to beat plain IGT or stochastic LMO, despite their better asymptotic rates, due to wall-clock time and additional computation. This suggests the advantage of LMO-IGT is not solely theoretical but represents a computationally efficient acceleration mechanism in practice.

Figure 2: Test accuracy over epochs for AdamW, NIGT, Lion, Muon, and IGT/VR variants on CIFAR-10.

Implications and Future Directions

This work clarifies the relationship between normalized, sign-based, and matrix-aware LMO optimizers and their acceleration regimes. The RSF framework eliminates analytical inconsistencies between gradient-norm and FW-gap stationarity, making direct comparison theoretically sound. The main implication is that implicit gradient transport provides a pragmatic path to accelerate LMO-based methods without incurring the costly overheads of variance reduction, often central in modern large-scale model training.

Potential future directions include:

Extending the unified RSF-based analysis to biased or approximate LMOs.
Generalizing the approach to non-stochastic (deterministic) settings, or establishing high-probability convergence guarantees.
Translating the method to even more structured parameter spaces, such as those emerging in very deep models or models with adaptive network architecture.
Further empirical study of the limits of IGT acceleration in the presence of data heterogeneity (e.g., in federated learning) or severe nonconvexity.

Conclusion

The paper delivers a comprehensive theoretical and empirical study of low-overhead acceleration for LMO-based optimizers. By introducing implicit gradient transport to this family, a new tradeoff frontier is opened: practitioners gain a method that nearly matches variance reduction’s improved convergence rates but with the computational efficiency of single-gradient-per-iteration optimizers. The unified theory, operationalized via the regularized support function, offers a broad analytical lens for future development and benchmarking of optimization algorithms in deep learning.

Markdown Report Issue