Resolve the cause of the mismatch between empirical learning-rate trends and the LMO-bound-based predictions under joint batch and token scaling

Determine whether the gap between empirical learning-rate scaling trends and the predictions obtained by minimizing the LMO-based nonconvex convergence bound used in this paper, a gap most visible when batch size and token budget increase together, is driven primarily by (i) protocol constraints, (ii) violations of the proxy’s assumptions (such as constant learning rate, matched initialization, and bounded-variance noise), or (iii) looseness of the convergence bound in large-model training regimes.

Background

The paper derives hyperparameter scaling laws by minimizing recent convergence bounds for methods based on a linear minimization oracle (LMO), such as normalized SGD, signSGD (as an approximation of Adam), and Muon. Several empirical trends are recovered, but the authors note persistent discrepancies with some modern training protocols, in particular reports that the optimal learning rate increases as the token horizon grows when batch size is scaled at the same time.

The conclusion explicitly frames an open question about the source of this mismatch, listing three non-exclusive possibilities: constraints in practical protocols, mismatches with modeling assumptions (such as constant learning rates, matched initialization, and bounded-variance mini-batch noise), or looseness of the theoretical bound when applied to large-scale models.
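
One way to make the empirical side of this comparison concrete is to estimate the exponent with which the best learning rate scales with the token horizon, and then compare its sign and size with whatever the proxy predicts. The sketch below is a minimal illustration, not the paper's procedure: the function name `fit_power_law_exponent`, the power-law form, and all numbers are assumptions, and the sweep values are synthetic placeholders generated from an assumed exponent rather than measured results.

```python
import numpy as np

def fit_power_law_exponent(x, y):
    """Fit y ~ c * x**p by least squares in log-log space; return (p, c)."""
    slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)
    return slope, np.exp(intercept)

# Synthetic placeholder sweep (NOT data from the paper): token horizons and
# the learning rate a hypothetical sweep found best at each horizon.
tokens = np.array([1e9, 2e9, 4e9, 8e9, 1.6e10])
assumed_exponent = 0.25  # assumption, used only to generate the toy data
best_lr = 3e-4 * (tokens / tokens[0]) ** assumed_exponent

exponent, prefactor = fit_power_law_exponent(tokens, best_lr)

# A positive exponent means the best learning rate grows with the token
# horizon; a negative one means it shrinks.
print(f"fitted exponent: {exponent:.3f}")
```

With real sweeps, repeating the fit separately for fixed-batch runs and for runs where batch size grows with the token budget would show whether the trend changes sign between the two protocols.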

References

At the same time, our results highlight several limitations and open questions. More broadly, some modern empirical protocols report learning-rate trends that are not fully captured by our proxy, especially when both batch size and token budget increase. Resolving whether this gap is driven by protocol constraints, assumption mismatch, or looseness of the bound in modern large-model regimes remains an important direction for future work.

Deriving Hyperparameter Scaling Laws via Modern Optimization Theory (2603.15958 - Shulgin et al., 16 Mar 2026) in Conclusion