Generalization of Low-Rank RLVR Update Structure Beyond GRPO/Qwen-Math
Determine whether the low-rank structure of parameter update trajectories, in which per-tensor parameter deltas concentrate in a dominant rank-1 subspace, generalizes beyond the setting where it was observed. That setting is GRPO-based reinforcement learning with verifiable rewards (RLVR) on mathematical reasoning tasks for Qwen-family models. It is open whether the same structure also holds for other reinforcement learning algorithms such as PPO, for other task domains such as code generation, and for other model families such as Llama.
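One way to make "concentrates in a dominant rank-1 subspace" concrete is to measure, for a single weight tensor, how much of the update's squared Frobenius norm is carried by its top singular value. The sketch below is illustrative only: the metric sigma_1^2 / ||delta||_F^2, the helper names `weight_delta` and `rank1_energy_fraction`, and the toy matrices are assumptions made here, not the definition or code used by Wei et al. It uses power iteration in plain Python so that it runs without dependencies.

```python
import math
import random


def weight_delta(w_before, w_after):
    """Element-wise difference of two equally shaped matrices (lists of rows)."""
    return [[a - b for a, b in zip(row_after, row_before)]
            for row_before, row_after in zip(w_before, w_after)]


def rank1_energy_fraction(delta, iters=200, seed=0):
    """Estimate sigma_1^2 / ||delta||_F^2 with power iteration.

    A value near 1.0 means the update is almost exactly rank 1.
    This metric is an assumption of this sketch, not the paper's definition.
    """
    rows, cols = len(delta), len(delta[0])
    frob_sq = sum(x * x for row in delta for x in row)
    if frob_sq == 0.0:
        return 0.0  # no update at all; the ratio is undefined, report 0

    # Random unit-norm starting vector in the column space.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(cols)]
    v_norm = math.sqrt(sum(x * x for x in v))
    v = [x / v_norm for x in v]

    for _ in range(iters):
        # One step of power iteration on delta^T delta.
        u = [sum(row[j] * v[j] for j in range(cols)) for row in delta]
        w = [sum(delta[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        w_norm = math.sqrt(sum(x * x for x in w))
        if w_norm == 0.0:
            break
        v = [x / w_norm for x in w]

    # ||delta v||^2 approximates the top squared singular value.
    u = [sum(row[j] * v[j] for j in range(cols)) for row in delta]
    sigma1_sq = sum(x * x for x in u)
    return sigma1_sq / frob_sq


# Toy check: an exactly rank-1 update versus a full-rank one.
w0 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
w1 = [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [3.0, 0.0, -3.0]]  # outer([1,2,3],[1,0,-1])
rank1_score = rank1_energy_fraction(weight_delta(w0, w1))
identity_score = rank1_energy_fraction([[1.0, 0.0], [0.0, 1.0]])
```

Under these assumptions, testing the open question would amount to computing this score per tensor for checkpoints trained with a different algorithm, task domain, or model family and comparing the resulting distributions. For real model tensors a library SVD routine would be the practical choice; power iteration is used here only to keep the sketch dependency-free.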
References
Whether similar low-rank structure holds for other RL algorithms (e.g., PPO), other task forms (e.g., code generation), or other model families (e.g., Llama) remains open.
— You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories
(2605.21468 - Wei et al., 20 May 2026) in Limitations, Section 7 (Discussion)