Resolve the cause of the mismatch between empirical learning-rate trends and the LMO-bound-based predictions under joint batch and token scaling
Determine whether the gap between empirically observed learning-rate scaling trends, especially when batch size and token budget increase together, and the predictions obtained by minimizing the LMO-based nonconvex convergence bound used in this paper is driven primarily by (i) protocol constraints, (ii) violations of the proxy’s assumptions (such as constant learning rate, matched initialization, and bounded-variance noise), or (iii) looseness of the convergence bound in large-model training regimes.
References
At the same time, our results highlight several limitations and open questions. More broadly, some modern empirical protocols report learning-rate trends that are not fully captured by our proxy, especially when both batch size and token budget increase. Resolving whether this gap is driven by protocol constraints, assumption mismatch, or looseness of the bound in modern large-model regimes remains an important direction for future work.