Muon Does Not Converge on Convex Lipschitz Functions

Published 9 May 2026 in cs.LG, math.OC, and stat.ML | (2605.08980v1)

Abstract: Muon and its variants have shown strong empirical performance in a variety of deep learning tasks. Existing convergence analyses of Muon rely on smoothness assumptions, though arguably the most successful function class for developing deep learning methods (such as AdaGrad, Shampoo, Schedule-Free and more) has been the class of convex and Lipschitz functions. In this paper we question whether the classical convex Lipschitz model is a useful one for understanding Muon. Our answer is no. We show that Muon does not converge on the class of convex and Lipschitz functions, regardless of the choice of learning rate schedule. We also show that error feedback restores convergence of Muon and all the non-Euclidean subgradient methods with momentum. However, this theoretical fix using error feedback degrades the performance of Muon in two representative settings for image classification (CIFAR-10) and language modeling (nanoGPT on FineWeb-Edu 10B). Our conclusion is that convex Lipschitz theory, despite having a prominent role in the design of practical methods for deep learning, is not the most suited one for Muon. This suggests that Muon's success must come from structure absent from this model, most plausibly related to smoothness.


Authors (5)

Summary

The paper shows that Muon does not achieve anytime convergence on convex Lipschitz objectives, under any learning rate schedule, because of its non-Euclidean update mechanism.
It establishes that Muon reduces to signed momentum on diagonal objectives, enabling tractable analysis via counterexamples.
The proposed EF-Muon repair restores theoretical convergence but degrades performance in practical deep learning tasks.

Muon Optimization and Convergence Limitations on Convex Lipschitz Functions

Introduction

Muon and its algorithmic variants have demonstrated strong empirical success in training deep neural networks, particularly in large-scale language modeling, diffusion models, and various high-capacity architectures. Despite these practical achievements, the theoretical understanding of Muon, especially its convergence properties, remains incomplete. While foundational optimization methods (e.g., AdaGrad, Shampoo, Schedule-Free) were developed with convergence guarantees in the convex Lipschitz setting, the current work establishes that Muon does not admit anytime convergence for convex Lipschitz (possibly nonsmooth) objectives, regardless of the learning rate schedule. This result exposes a gap in the theoretical lens typically used to explain practical optimizer behavior in deep learning, and the paper presents both diagnostic tools and algorithmic modifications relevant for current and future analyses.

Convergence Failure of Muon on Convex Lipschitz Functions

Reduction to Signed Momentum

A principal contribution is the identification that, on diagonal matrix objectives, Muon reduces to the signed momentum algorithm—an update rule previously studied in the context of Adam and related optimizers. The paper presents a precise equivalence: for any function depending solely on the diagonal entries of the parameter matrix, Muon's iterates are exactly those of signed momentum. This reduction enables tractable analysis through well-understood lower-dimensional counterexamples.
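The reduction can be checked numerically. The following is a minimal sketch, not the paper's code: it assumes an exponential-moving-average form of momentum, computes the polar factor by an exact SVD (rather than the iterative approximation used in practice), and uses a hypothetical diagonal objective $f(X) = \sum_i a_i |X_{ii} - c_i|$ with the subgradient convention $\mathrm{sign}(0) = 0$. All function names are illustrative.

```python
import numpy as np

def polar_factor(M, tol=1e-12):
    """Polar factor U V^T of M, restricted to singular values above tol."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    keep = s > tol
    return U[:, keep] @ Vt[keep, :]

def diag_subgrad(x, a, c):
    """Subgradient of sum_i a_i * |x_i - c_i| (np.sign(0) == 0)."""
    return a * np.sign(x - c)

def muon_on_diagonal_objective(X0, a, c, beta, lrs):
    """Muon-style update (assumed EMA momentum) on an objective of diag(X) only."""
    X, M = X0.copy(), np.zeros_like(X0)
    for lr in lrs:
        G = np.diag(diag_subgrad(np.diag(X), a, c))  # the subgradient is a diagonal matrix
        M = beta * M + (1.0 - beta) * G
        X = X - lr * polar_factor(M)
    return X

def signed_momentum(x0, a, c, beta, lrs, tol=1e-12):
    """Coordinatewise signed momentum on the diagonal entries alone."""
    x, m = x0.copy(), np.zeros_like(x0)
    for lr in lrs:
        m = beta * m + (1.0 - beta) * diag_subgrad(x, a, c)
        x = x - lr * np.where(np.abs(m) > tol, np.sign(m), 0.0)
    return x

rng = np.random.default_rng(0)
X0 = rng.normal(size=(3, 3))
a = np.array([1.0, 0.5, 2.0])      # illustrative weights
c = np.array([0.3, -0.7, 1.1])     # illustrative targets
lrs = 0.1 / np.sqrt(np.arange(1, 51))

X = muon_on_diagonal_objective(X0, a, c, beta=0.9, lrs=lrs)
x = signed_momentum(np.diag(X0).copy(), a, c, beta=0.9, lrs=lrs)
print("max gap between diag(Muon iterate) and signed momentum:", np.max(np.abs(np.diag(X) - x)))
```

The key step is that the momentum matrix stays diagonal, and the polar factor of a diagonal matrix is the diagonal matrix of the signs of its entries, so the off-diagonal entries of the parameter are never touched.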

Constructing Counterexamples

The authors construct explicit convex, Lipschitz objectives over matrices on which Muon, initialized within a specific set or outside of a measure-zero set, fails to converge. This non-convergence holds both for stepsize schedules fixed in advance (offline, vanishing) and for adaptive stepsizes, over the full range of momentum parameters typically deployed in practice ($\beta \in [0, 1)$ or $[0, 2)$, depending on the case). The failure is inherent to the compression-like nature of the non-Euclidean update, as in Muon's spectral (operator norm–based) step, and it persists under the usual subgradient selection convention (namely, returning zero when differentiating the absolute value at zero).
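To illustrate the kind of failure involved, here is a minimal sketch using a classic two-dimensional counterexample from the sign-based optimization literature, $f(x) = \epsilon |x_1 + x_2| + |x_1 - x_2|$. This is not claimed to be the paper's construction, and it covers only the momentum-free case $\beta = 0$; the value of `EPS`, the starting point, and the stepsize schedule are illustrative assumptions. Through the reduction to signed updates, $x_1$ and $x_2$ can be read as two diagonal entries of the parameter matrix.

```python
import numpy as np

EPS = 0.1  # weight on the |x1 + x2| term (illustrative choice)

def f(x):
    """Convex, Lipschitz objective with minimum value 0 at the origin."""
    return EPS * abs(x[0] + x[1]) + abs(x[0] - x[1])

def subgrad(x):
    """A subgradient of f, returning 0 for the absolute value at 0."""
    s, d = np.sign(x[0] + x[1]), np.sign(x[0] - x[1])
    return EPS * s * np.array([1.0, 1.0]) + d * np.array([1.0, -1.0])

def sign_descent(x0, steps, lr0=0.5):
    """Sign subgradient method (beta = 0) with vanishing stepsize lr0 / sqrt(t + 1)."""
    x = np.array(x0, dtype=float)
    values = [f(x)]
    for t in range(steps):
        x = x - lr0 / np.sqrt(t + 1) * np.sign(subgrad(x))
        values.append(f(x))
    return x, values

x_final, values = sign_descent([1.0, 0.3], steps=2000)
print("x1 + x2 at the end:", x_final.sum())
print("best objective value seen:", min(values))
```

Whenever $x_1 \neq x_2$, the sign of the subgradient is $\pm(1, -1)$, so each step moves along the anti-diagonal and leaves $x_1 + x_2$ unchanged. The objective therefore stays at or above $\epsilon |x_1 + x_2|$ evaluated at the starting point, whatever the stepsizes.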

A notable result is the extension of classic counterexamples for sign-based methods to include momentum, showing that introducing momentum does not fix the fundamental non-convergence. This contrasts sharply with the smooth or strongly convex setting, where suitable rates are typically achieved.

Limitations of Standard Convergence Theory

These results place Muon alongside Adam, for which convergence failures on convex Lipschitz objectives are known (see [RKK18]). However, for Adam, algorithmic modifications such as AMSGrad restore convergence guarantees, while for Muon, the necessary fixes are more complex and, as detailed below, not always practically beneficial.

Error Feedback as a Theoretical Repair

EF-Muon as a Convergent Alternative

Motivated by analogies to compressed/stochastic optimization, the authors recast the Muon update as a compression operator and introduce an error feedback (EF) mechanism, leading to the EF-Muon algorithm. This structure accumulates and reapplies the distortion error introduced by compression (in this case, the spectral step), restoring anytime convergence guarantees for all non-Euclidean (including Muon-like) subgradient methods with momentum on the entire convex Lipschitz class.

Specifically, EF-Muon achieves optimal non-smooth rates up to problem-dependent constants, and the error feedback technique is shown to be generic across choices of compressor and norm, extending the theoretical toolkit for momentum-based non-Euclidean methods.
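The error feedback idea can be sketched in its generic form from compressed optimization. The code below is not the paper's EF-Muon: it assumes a scaled sign compressor, no momentum, a constant stepsize, and the illustrative objective $f(x) = \epsilon |x_1 + x_2| + |x_1 - x_2|$, on which a plain sign step started at $(1.0, 0.3)$ can never bring the objective below $\epsilon \cdot 1.3 = 0.13$. All names and constants are assumptions made for this example.

```python
import numpy as np

EPS = 0.1  # weight on the |x1 + x2| term (illustrative choice)

def f(x):
    """Convex, Lipschitz objective with minimum value 0 at the origin."""
    return EPS * abs(x[0] + x[1]) + abs(x[0] - x[1])

def subgrad(x):
    """A subgradient of f, returning 0 for the absolute value at 0."""
    s, d = np.sign(x[0] + x[1]), np.sign(x[0] - x[1])
    return EPS * s * np.array([1.0, 1.0]) + d * np.array([1.0, -1.0])

def scaled_sign(p):
    """Scaled sign compressor: keep the signs, rescale by the mean absolute value."""
    return np.mean(np.abs(p)) * np.sign(p)

def ef_sign_descent(x0, steps, lr):
    """Generic error feedback: apply the compressed step, carry the residual forward."""
    x = np.array(x0, dtype=float)
    e = np.zeros_like(x)
    best = f(x)
    for _ in range(steps):
        p = lr * subgrad(x) + e   # intended step plus accumulated error
        delta = scaled_sign(p)    # the step that is actually applied
        x = x - delta
        e = p - delta             # what compression discarded this round
        best = min(best, f(x))
    return x, best

x_ef, best_ef = ef_sign_descent([1.0, 0.3], steps=20000, lr=0.005)
print("best objective value with error feedback:", best_ef)
```

The design point is that the quantity `x - e` follows an ordinary subgradient step, so the discarded component along $(1, 1)$ accumulates in `e` until it is large enough to survive compression, instead of being lost at every iteration.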

Empirical Performance Does Not Align With Theoretical Repair

Despite restored convergence in the worst-case regime, the introduction of error feedback (tested as EF-Muon and EF-MuonMax) degrades performance in representative practical regimes—namely, image classification with WideResNet-28-10 on CIFAR-10 and language modeling with nanoGPT on FineWeb-Edu. For both settings, unmodified Muon variants outperform their error-feedback-corrected counterparts, with the validation cross-entropy of EF-MuonMax significantly worse than that of MuonMax or vanilla Muon. This negative empirical result highlights a misalignment between theoretical repairs for pathological cases and the algorithm's success on realistic, non-pathological deep learning losses.

Broader Implications and Theoretical Outlook

The findings demonstrate that the standard convex Lipschitz framework, which has successfully explained the behavior of AdaGrad, Shampoo, and their derivatives, is inadequate for Muon without further restrictions or algorithmic modification. Thus, the practical efficacy of Muon must be attributed to function classes with additional structure (e.g., smoothness, growth conditions, or alternatives), and future theoretical analysis should focus on such regimes.

The use of operator-norm compression (via the polar factor) and the resulting equivalence to coordinatewise sign methods underscores the importance of considering the geometry of both the loss landscape and the optimizer when proposing analytic frameworks.

Additionally, these results provide a diagnostic tool: any analysis of Muon on diagonal functions must be compatible with the theory for signed momentum. Any proposed extension, modification, or justification of Muon's empirical efficacy must circumvent or resolve the proven pathologies in the general convex Lipschitz setting.

Conclusion

This work rigorously delineates a limitation of Muon and analogous spectral momentum methods: their inability to guarantee convergence on the class of convex Lipschitz (possibly nonsmooth) objectives. Although error feedback mechanisms can provide theoretical convergence for this class, such modifications do not help and even harm empirical performance in canonical deep learning tasks, indicating that convex Lipschitz theory is an unsuitable model for understanding Muon's empirical success. These results direct theoretical efforts towards more refined geometric assumptions or fundamentally different analytic paradigms for the evolving landscape of non-Euclidean optimizers.
