- The paper introduces SSF, a subspace optimization method that corrects heterogeneity and matches full-dimensional SCAFFOLD's convergence while reducing communication and memory costs.
- It employs subspace projection with residual backfill and dynamic projector refresh to stabilize client updates in non-IID data settings.
- Empirical evaluations on matrix regression and CIFAR-100 demonstrate SSF's robust convergence and efficiency gains despite a moderate accuracy gap compared to full-dimensional methods.
Subspace Optimization for Efficient Federated Learning under Heterogeneous Data
Federated Learning (FL) in large-model regimes faces endemic systems constraints: communication, computation, and memory bottlenecks are ubiquitous, especially with edge deployments of models containing O(10^8) or more parameters. These issues are severely exacerbated by client data heterogeneity: local objectives are non-IID, which causes client drift and degrades both optimization and generalization. Full-dimensional heterogeneity correction methods such as SCAFFOLD [karimireddy2020scaffold] are theoretically robust but introduce prohibitive resource overheads, since they maintain, communicate, and update O(d)-sized control variates per client and retain high-dimensional error buffers. Prior efforts to reduce systems costs via compression, quantization, or subspace parameterizations (e.g., GaLore [Zhao2024GaLoreML]) often fail to robustly address drift in non-IID regimes: memory and bandwidth decrease, but convergence stability is sacrificed, or a high signal-to-noise ratio (SNR) is required to avoid information loss. No method in the literature natively integrates heterogeneity correction within the evolving geometry of low-dimensional subspace updates.
Method: Subspace SCAFFOLD (SSF)
SSF closes this gap by performing heterogeneity-corrected FL in a low-dimensional subspace. Each round t uses a shared random orthonormal projector P_t ∈ R^{r×d}. All local and global optimization steps, and all communicated tensors, are represented in this r-dimensional subspace. Crucially, SSF maintains a full-dimensional view of the control variates, but only projects and updates the active subspace components each round. The historical orthogonal residuals are preserved ("backfilled") and reincorporated when the subspace changes, preventing catastrophic information loss or divergence during projector rotations.
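The projection and backfill idea can be illustrated with a minimal sketch. It assumes the projector is built by Gram-Schmidt on Gaussian vectors and that an update overwrites only the in-subspace part of a full-dimensional control variate; the function names (`random_orthonormal_rows`, `backfill_update`) and the toy sizes are illustrative assumptions, not the paper's implementation.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def random_orthonormal_rows(r, d, rng):
    """Return r orthonormal rows in R^d (Gram-Schmidt on Gaussian draws)."""
    rows = []
    while len(rows) < r:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for q in rows:                       # remove components along earlier rows
            coeff = dot(q, v)
            v = [a - coeff * b for a, b in zip(v, q)]
        norm = dot(v, v) ** 0.5
        if norm > 1e-10:                     # skip degenerate draws
            rows.append([a / norm for a in v])
    return rows

def project(P, x):
    """Compress x in R^d to P x in R^r."""
    return [dot(row, x) for row in P]

def lift(P, z):
    """Map z in R^r back to P^T z in R^d."""
    return [sum(P[k][j] * z[k] for k in range(len(P))) for j in range(len(P[0]))]

def backfill_update(P, c_full, new_proj):
    """Set the in-subspace part of c_full to new_proj; keep the orthogonal residual."""
    old_proj = project(P, c_full)
    delta = lift(P, [n - o for n, o in zip(new_proj, old_proj)])
    return [c + dl for c, dl in zip(c_full, delta)]

def residual(P, c):
    """Component of c orthogonal to the row space of P."""
    return [a - b for a, b in zip(c, lift(P, project(P, c)))]

rng = random.Random(0)
d, r = 8, 3                                  # toy sizes, chosen for illustration
P = random_orthonormal_rows(r, d, rng)
c_full = [rng.gauss(0.0, 1.0) for _ in range(d)]
residual_before = residual(P, c_full)

new_proj = [1.0, -2.0, 0.5]                  # hypothetical updated subspace state
c_full = backfill_update(P, c_full, new_proj)
residual_after = residual(P, c_full)
```

Because P P^T is the r×r identity, the updated state projects exactly onto `new_proj`, while the part of the control variate outside the active subspace is left untouched and remains available after the next projector change.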
Local updates mimic SCAFFOLD: each client i performs K local steps of heterogeneity-corrected SGD in the subspace, using the corrected direction P_t g_i^k − c_{i,proj}^t + c_{proj}^t, where g_i^k is the stochastic gradient at local step k and c_{i,proj}^t and c_{proj}^t are the projected client and global control variates. After aggregation, the global model is reconstructed in the ambient space. The communication payload is now O(r) per vector, and per-client memory transfer can be similarly restricted, as the ambient-space control state can reside in slow (e.g., CPU) memory and only projected slices are required for active computation.
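A minimal sketch of one such round on a toy problem follows. It assumes quadratic client objectives f_i(x) = 0.5 ||x − b_i||^2, full (noise-free) gradients, and control variates initialized to the exact client gradients at the round's starting point; these choices, the helper names, and the step size are illustrative assumptions, and the control-variate update step is omitted.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def random_orthonormal_rows(r, d, rng):
    rows = []
    while len(rows) < r:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for q in rows:
            coeff = dot(q, v)
            v = [a - coeff * b for a, b in zip(v, q)]
        norm = dot(v, v) ** 0.5
        if norm > 1e-10:
            rows.append([a / norm for a in v])
    return rows

def project(P, x):
    return [dot(row, x) for row in P]

def lift(P, z):
    return [sum(P[k][j] * z[k] for k in range(len(P))) for j in range(len(P[0]))]

def local_round(P, x, b_i, c_i_proj, c_proj, K, lr):
    """K corrected steps on an r-dimensional displacement z; returns z."""
    z = [0.0] * len(P)
    for _ in range(K):
        x_local = [a + b for a, b in zip(x, lift(P, z))]
        g = [a - b for a, b in zip(x_local, b_i)]      # gradient of 0.5*||x - b_i||^2
        pg = project(P, g)
        # corrected direction: P g - c_i_proj + c_proj
        direction = [pg[k] - c_i_proj[k] + c_proj[k] for k in range(len(z))]
        z = [zk - lr * dk for zk, dk in zip(z, direction)]
    return z

def global_loss(x, targets):
    return sum(0.5 * sum((a - b) ** 2 for a, b in zip(x, t)) for t in targets) / len(targets)

rng = random.Random(0)
d, r, n_clients, K, lr = 10, 4, 3, 5, 0.1
P = random_orthonormal_rows(r, d, rng)
targets = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_clients)]
x = [0.0] * d

# Assumption: control variates equal the exact client gradients at x.
c_clients = [[a - b for a, b in zip(x, t)] for t in targets]
c_global = [sum(c[j] for c in c_clients) / n_clients for j in range(d)]
c_proj = project(P, c_global)

# Each client uploads only r numbers.
uploads = [local_round(P, x, t, project(P, c_i), c_proj, K, lr)
           for t, c_i in zip(targets, c_clients)]
z_mean = [sum(z[k] for z in uploads) / n_clients for k in range(r)]
x_new = [a + b for a, b in zip(x, lift(P, z_mean))]   # reconstruct in ambient space
```

In this idealized setting the correction cancels the client-specific part of each gradient, so every client produces the same subspace displacement despite having a different target b_i, and the aggregated step decreases the global loss.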
Key mechanisms:
- Subspace projection: All primal updates and dual corrections are performed with low-dimensional projections.
- Residual backfill: The full-space dual state is preserved, mitigating instability under subspace rotation, unlike pure subspace-dual variants (e.g., FedSub).
- Projector refresh: The active subspace changes each round to maximize expressiveness and stationarity.
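The interaction between projector refresh and residual backfill can be sketched as follows. The example assumes that server and clients derive P_t from a shared base seed and the round index, so the projector itself is never transmitted; this seeding scheme is a hypothetical choice, not something the source specifies. It also contrasts a full-dimensional state with a hypothetical truncated state that keeps only its in-subspace part after every refresh; that truncated variant is an illustration and is not claimed to be FedSub's actual algorithm.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def projector_for_round(base_seed, t, r, d):
    """Derive P_t deterministically from (base_seed, t); hypothetical scheme."""
    rng = random.Random(base_seed * 1_000_003 + t)
    rows = []
    while len(rows) < r:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for q in rows:
            coeff = dot(q, v)
            v = [a - coeff * b for a, b in zip(v, q)]
        norm = dot(v, v) ** 0.5
        if norm > 1e-10:
            rows.append([a / norm for a in v])
    return rows

def keep_in_subspace(P, c):
    """Return P^T P c: what survives if the orthogonal residual is dropped."""
    coeffs = [dot(row, c) for row in P]
    return [sum(P[k][j] * coeffs[k] for k in range(len(P))) for j in range(len(c))]

d, r, rounds, base_seed = 12, 3, 5, 7        # toy sizes, chosen for illustration
rng = random.Random(1)
c_full = [rng.gauss(0.0, 1.0) for _ in range(d)]   # full-dimensional state, kept intact
c_truncated = list(c_full)                          # residual discarded each round
norms_truncated = []

for t in range(rounds):
    P_server = projector_for_round(base_seed, t, r, d)
    P_client = projector_for_round(base_seed, t, r, d)
    assert P_server == P_client              # both sides agree without sending P_t
    c_truncated = keep_in_subspace(P_server, c_truncated)
    norms_truncated.append(dot(c_truncated, c_truncated) ** 0.5)

full_norm = dot(c_full, c_full) ** 0.5
```

Each refresh projects the truncated state onto a new random subspace, so its norm can only shrink from round to round, whereas the full-dimensional state retains everything accumulated outside the currently active subspace.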
Theoretical Analysis
SSF's convergence theory is developed under standard L-smoothness and bounded-variance assumptions, with client sampling and per-round random subspaces. The analysis yields several salient propositions:
- Non-asymptotic convergence: SSF matches the nonconvex convergence rate of full-dimensional SCAFFOLD up to a factor determined by the subspace ratio, with matching stochastic terms, linear speedup in the number of clients and local steps, and no additive heterogeneity penalties.
- Drift control: By preserving the orthogonal complement of the dual, SSF robustly bounds client drift per round even as the subspace evolves; the bound also controls the projected stochastic variance (see Lemma: Client Drift Bound).
- Control variate contraction: At each step, the control error contracts in expectation, up to noise and drift effects, yielding geometric decay of the dual error under the lemma's condition (see Lemma: Control Variate Contraction).
- Optimal stepsizes: Harmonic stepsize tuning yields tight non-asymptotic rates with no requirement for strong convexity, bounded gradients, or projector-noise independence (cf. the Corollary).
Compared to subspace-only correction (FedSub), SSF is provably more stable; FedSub's analysis requires much stronger assumptions and cannot guarantee the same linear speedup or robustness even in moderate dimensions.
Empirical Evaluation
Experiments validate SSF on two axes: controlled matrix regression and CIFAR-100 with ResNet-110.
Matrix Regression: Heterogeneity and Subspace Scaling
At moderate subspace ratio (O(d)7) SSF consistently matches the performance of full-dimensional SCAFFOLD and outperforms both FedAvg and FedSub, irrespective of heterogeneity (see Table~\ref{tab:toy_heterogeneity_r20}). As heterogeneity and subspace dimension vary, SSF demonstrates monotonic improvement as O(d)8 increases, whereas FedSub is unstable and diverges at large O(d)9:


Figure 2: Relative error trajectories at t0; SSF closely tracks SCAFFOLD, while FedSub and FedAvg are less robust.
At t1, FedSub is consistently unstable (divergence or NaN values), whereas SSF remains well controlled and within a small multiplicative gap of the full-dimensional baselines.
Figure 1: Convergence behavior at medium heterogeneity and high t2 (t3, t4): SSF remains stable; FedSub diverges.
Deep Learning: CIFAR-100 with ResNet-110
On a non-trivial ResNet-110 FL deployment, SSF trails full-dimensional SCAFFOLD by 7-8% in accuracy but significantly outperforms FedAvg and FedSub (45.4% versus 34.4% and 23.2% test accuracy, respectively, at epoch 99).
Figure 3: Test accuracy on CIFAR-100: full-dimensional SCAFFOLD performs best; SSF is second, with a substantial gap over FedAvg and FedSub.
The empirical findings corroborate the theoretical predictions: SSF consistently delivers accuracy-efficiency trade-offs unattainable by FedAvg or projection-only FL, and avoids the instability endemic to subspace-only control approaches.
Implications and Future Directions
The SSF methodology is significant for scalable FL in large-model non-IID regimes. Theoretically, it shows that robust heterogeneity correction does not necessitate full-dimensional auxiliary states; dimension reduction and statistical stability can co-exist. Practically, SSF matches SCAFFOLD's convergence rates while reducing all core system costs (computation, memory, and communication).
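To give a sense of scale for the communication saving, the following back-of-the-envelope sketch compares per-round upload sizes. The model size d = 10^8 matches the regime mentioned earlier; the subspace ratio r/d = 0.2, the float32 encoding, and the assumption that each client uploads two vectors per round (a model update and a control-variate update, as in SCAFFOLD) are illustrative assumptions rather than figures from the paper.

```python
# Hypothetical sizes for illustration only.
d = 10 ** 8                 # ambient model dimension
r = d // 5                  # assumed subspace ratio r/d = 0.2
bytes_per_float = 4         # float32
vectors_per_upload = 2      # model update + control-variate update (assumption)

full_bytes = vectors_per_upload * bytes_per_float * d      # 800,000,000 bytes
subspace_bytes = vectors_per_upload * bytes_per_float * r  # 160,000,000 bytes
saving_factor = full_bytes // subspace_bytes               # 5
```

Under these assumptions the per-client upload shrinks from 800 MB to 160 MB per round, a reduction equal to d/r.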
Potential future directions include:
- Adaptive subspace scheduling: Learning or dynamically selecting subspaces based on client feedback or observed drift.
- Integration with quantization/sparsification: Synergistic systems-efficient design layers with subspace-native corrections.
- Extension to decentralized and hierarchical FL, where communication topology and local constraints further complicate correction protocols.
Conclusion
SSF represents a synthesis of variance reduction, subspace optimization, and communication-efficient FL. It provides the first native integration of heterogeneity correction and low-dimensional optimization geometry, achieving robust and efficient FL in the large-model, heterogeneous setting. Both theoretical and experimental evidence substantiate its superiority over pure projection or compression alternatives for practical FL deployments.
References
For complete bibliographic context and related works, refer to (2604.25467).