Determine how optimal schedules interact with model size, parametrization (μP), problem smoothness, and noise properties
Determine how the optimal power-law schedules for learning rate, batch size, and momentum, derived by minimizing the proxy bound based on the linear minimization oracle (LMO), interact with model size N, effective smoothness, gradient noise characteristics, and parametrization choices such as maximal update parametrization (μP).
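To make the term "power-law schedule" concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption: the functional form `base * (step + 1) ** exponent`, the parametrization of momentum as one minus a decaying gap, the function names, and every default value are hypothetical and are not taken from the paper. The open question above is precisely how such prefactors and exponents should change with N, smoothness, noise, and parametrization.

```python
def power_law(base: float, exponent: float, step: int) -> float:
    """Generic power law in the step count: base * (step + 1) ** exponent."""
    return base * (step + 1) ** exponent


def schedules(
    step: int,
    lr0: float = 0.1,       # hypothetical initial learning rate
    lr_exp: float = -0.5,   # hypothetical decay exponent
    batch0: int = 32,       # hypothetical initial batch size
    batch_exp: float = 0.5, # hypothetical growth exponent
    gap0: float = 0.1,      # hypothetical initial value of (1 - momentum)
    gap_exp: float = -0.5,  # hypothetical decay exponent for (1 - momentum)
) -> tuple[float, int, float]:
    """Return (learning rate, batch size, momentum) at a given step."""
    lr = power_law(lr0, lr_exp, step)
    # Batch sizes must be positive integers, so round and clamp at 1.
    batch = max(1, round(power_law(batch0, batch_exp, step)))
    # Momentum is written as 1 - gap so that it stays in [0, 1].
    momentum = 1.0 - min(1.0, power_law(gap0, gap_exp, step))
    return lr, batch, momentum


if __name__ == "__main__":
    for t in (0, 3, 99):
        print(t, schedules(t))
```

In this sketch, studying the interaction with model size would amount to letting the prefactors and exponents depend on N instead of holding them fixed.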
References
Another important limitation is that we hold model size fixed, leaving open how the optimal schedules interact with N, effective smoothness, gradient noise, and parametrization choices such as μP.
— Deriving Hyperparameter Scaling Laws via Modern Optimization Theory
(2603.15958 - Shulgin et al., 16 Mar 2026) in Conclusion — Limitations and future work