Prove applicability of the SFM mechanism to SGD-trained Transformers

Establish whether the grokking-like sharp crossover and the complexity-minimization mechanism implemented by the Singular Feature Machine (SFM), specifically its free-energy objective with an Occam Gate–style sparsity constraint, also govern grokking in Transformers trained with stochastic gradient descent (SGD) on modular arithmetic tasks. Doing so requires deriving or validating a formal connection between SGD dynamics and an equivalent free-energy/sparsity mechanism.

Background

The paper introduces the Singular Feature Machine (SFM) as a tractable surrogate that makes explicit a free-energy-like objective coupled with a sparsity-inducing "Occam Gate". This surrogate reproduces grokking-like delayed generalization and admits analytical proxies for model complexity and the real log canonical threshold (RLCT).
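
For orientation, the kind of free-energy trade-off referred to above can be illustrated with the standard asymptotic expansion of the Bayesian free energy from singular learning theory. This is a textbook result offered only as a sketch, not the paper's Eq. (4); the SFM objective and its Occam Gate term may take a different form. The symbols are assumptions introduced here: n is the number of training samples, L_n the empirical negative log-likelihood per sample, w_0 an optimal parameter, \lambda the RLCT, and m its multiplicity.

```latex
F_n = n L_n(w_0) + \lambda \log n - (m - 1) \log\log n + O_p(1)
```

For a regular model with d parameters, \lambda = d/2 and m = 1, which recovers the familiar BIC penalty. Under this reading, two solutions that fit the training data equally well are ranked by \lambda, so the one with the smaller RLCT has the lower free energy at large n. Whether SGD implicitly performs such a ranking in Transformers is the open question posed here.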

However, the authors stress that the SFM is not derived from the actual SGD dynamics of Transformers and that the proposed mechanism is presently a hypothesis rather than a proven equivalence. Clarifying whether SGD-trained Transformers undergo an analogous implicit complexity-minimization process is needed to elevate the surrogate explanation to a validated account of real training dynamics.

References

Any extrapolation of this mechanism to SGD-trained Transformers is a qualitative hypothesis consistent with observed trends, but it is not proven here.

Grokking From Abstraction to Intelligence (2603.29262 - Zhang et al., 31 Mar 2026) in Section 5, Thermodynamics of Learning: Singular Projection Dynamics (after Eq. (4))