Prove applicability of the SFM mechanism to SGD-trained Transformers
Establish whether the grokking-like sharp crossover and complexity-minimization mechanism implemented by the Singular Feature Machine (SFM), specifically its free-energy objective with an Occam Gate–style sparsity constraint, also governs grokking in Transformers trained with stochastic gradient descent (SGD) on modular arithmetic tasks. Doing so requires deriving or validating a formal connection between SGD dynamics and an equivalent free-energy/sparsity mechanism.
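For concreteness, the kind of modular arithmetic task referred to here can be sketched as follows. This is a minimal illustration only: the choice of addition as the operation, the modulus p = 97, the 30% training fraction, and the function name modular_addition_dataset are assumptions made for the example and are not taken from the cited paper.

```python
import random


def modular_addition_dataset(p=97, train_fraction=0.3, seed=0):
    """Enumerate every pair (a, b) with label (a + b) mod p, then split.

    All defaults are illustrative assumptions, not values from the paper.
    """
    # Full table of p * p labelled examples: ((a, b), (a + b) mod p).
    pairs = [((a, b), (a + b) % p) for a in range(p) for b in range(p)]

    # Shuffle deterministically so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(pairs)

    # Hold out the remainder as the test set used to detect generalization.
    n_train = int(train_fraction * len(pairs))
    return pairs[:n_train], pairs[n_train:]


train, test = modular_addition_dataset()
print(len(train), len(test))
```

In a setup like this, grokking would show up as test accuracy on the held-out pairs rising sharply long after training accuracy has saturated; whether that crossover follows the SFM free-energy mechanism is the open question.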
References
Any extrapolation of this mechanism to SGD-trained Transformers is a qualitative hypothesis consistent with observed trends, but it is not proven here.
— Grokking From Abstraction to Intelligence
(2603.29262 - Zhang et al., 31 Mar 2026) in Section 5, Thermodynamics of Learning: Singular Projection Dynamics (after Eq. (4))