Determine how optimal schedules interact with model size and parametrization (μP) and with problem smoothness and noise properties

Determine how the optimal power-law schedules for learning rate, batch size, and momentum—derived from minimizing the LMO-based proxy bound—interact with model size N, effective smoothness, gradient noise characteristics, and parametrization choices such as μP.

Background

All theoretical results in the paper are obtained at a fixed model size and under simplifying assumptions, including a constant learning rate, matched initialization, and a specific noise model. The authors explicitly note that how the optimal schedules interact with model size and parametrization (e.g., μP), and with effective smoothness and gradient noise properties, remains open.

These interactions matter for turning the scaling prescriptions into practical large-scale training recipes, in which model size grows with compute and data, and in which the choice of parametrization affects both optimization geometry and noise structure.
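
One way the dependence on model size could be probed empirically is sketched below. This is only an illustration under stated assumptions, not a method from the paper: it supposes that a tuned hyperparameter (here the learning rate) has been found at several model sizes N, and fits a power law of the form c * N**p by least squares in log-log space. The function name fit_power_law, the listed model sizes, and the constants 3.0 and -0.5 are hypothetical and chosen purely for demonstration.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = c * x**p by ordinary least squares on (log x, log y).

    Returns the pair (c, p). Hypothetical helper for illustration only.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    p = sxy / sxx                        # slope in log-log space = exponent
    c = math.exp(mean_y - p * mean_x)    # intercept in log-log space = log(c)
    return c, p


# Synthetic, noise-free values (NOT data from the paper): the "tuned" learning
# rates are generated from an assumed law 3.0 * N**-0.5, so the fit should
# recover roughly that prefactor and exponent.
model_sizes = [1e6, 1e7, 1e8, 1e9]
best_lrs = [3.0 * n ** -0.5 for n in model_sizes]

c, p = fit_power_law(model_sizes, best_lrs)
print(f"fitted prefactor c = {c:.3g}, fitted exponent p = {p:.3f}")
```

With real sweeps, the same fit could be repeated under different parametrizations (for example, standard versus μP) to see whether the fitted exponent changes. Whether a single power law in N is an adequate description at all is part of what remains open.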

References

Another important limitation is that we hold model size fixed, leaving open how the optimal schedules interact with N, effective smoothness, gradient noise, and parametrization choices such as μP.

Deriving Hyperparameter Scaling Laws via Modern Optimization Theory  (2603.15958 - Shulgin et al., 16 Mar 2026) in Conclusion — Limitations and future work