Dynamic Distillation and Gradient Consistency for Robust Long-Tailed Incremental Learning

Published 5 May 2026 in cs.CV | (2605.03364v1)

Abstract: The task of Long-tailed Class Incremental Learning (LT-CIL) addresses the sequential learning of new classes from datasets with imbalanced class distributions. This scenario intensifies the fundamental problem of catastrophic forgetting, inherent to continual learning, with the dual challenges of under-learning minority classes and overfitting majority classes. To tackle these combined issues, this paper proposes two main techniques. First, we introduce gradient consistency regularization, which leverages the moving average of gradients to suppress abrupt fluctuations and stabilize the training process. Second, we dynamically adjust the weight of the distillation loss by measuring the degree of class imbalance with normalized entropy. This adaptive weighting establishes an optimal balance between retaining old knowledge and acquiring new information. Experiments on the CIFAR-100-LT, ImageNetSubset-LT, and Food101-LT benchmarks show that our method achieves consistent accuracy improvements of up to 5.0%. Furthermore, we demonstrate dramatic gains in the challenging 'In-ordered' setting, where tasks progress from majority to minority classes, highlighting our method's robustness in mitigating forgetting under unfavorable learning dynamics. This enhanced performance is achieved without a significant increase in computational overhead, demonstrating the practicality of our framework.


Authors (2)

Summary

The paper presents a novel framework that dynamically adjusts distillation strength using entropy-aware modulation to tackle catastrophic forgetting in LT-CIL.
It integrates a time-based scheduler with entropy scaling to balance new class adaptation and knowledge retention, yielding up to 5.0% accuracy improvements over strong baselines.
The method stabilizes gradient updates at task boundaries through Gradient Consistency Regularization, refining minority class features with minimal computational overhead.


Problem Formulation and Motivation

Long-tailed class-incremental learning (LT-CIL) poses unique challenges at the intersection of continual learning and imbalanced classification. In contrast to standard class-incremental learning (CIL), LT-CIL exacerbates catastrophic forgetting by introducing highly skewed class distributions, causing under-learning of minority classes and overfitting to majority classes. The paper addresses these dual difficulties by reframing the knowledge retention versus plasticity trade-off as a distribution-sensitive optimization, incorporating techniques to explicitly regularize gradient stability and dynamically adapt distillation strength.

Existing approaches for CIL, such as replay buffers or prototype-based storage, are often infeasible under LT-CIL constraints due to their memory and privacy demands. Furthermore, direct application of classical class-imbalance remedies—such as loss reweighting or synthetic oversampling—can lead to increased interference between classes across tasks, exacerbating forgetting. Prior state-of-the-art methods, including Gradient Reweighting (GR), improve minority class learning by dynamically scaling gradients but suffer from instability due to uncompensated gradient shocks when tasks change.

Proposed Methodology

The framework integrates two principal innovations: Gradient Consistency Regularization (GCR) and entropy-aware dynamic distillation.

Entropy-Aware Dynamic Distillation

Knowledge distillation, a standard tool for mitigating forgetting, often accentuates imbalance by disproportionately preserving majority class knowledge. The paper introduces a dynamic distillation coefficient, $\lambda$, which is defined as the product of two terms: a time-based scheduler $\lambda_\mathrm{time}$ and an entropy-based scaling $\lambda_\mathrm{entropy}$. The time-based scheduler smoothly transitions the focus from new class adaptation (early epochs) to knowledge retention (later epochs) within each task using a sigmoid schedule. The entropy-based term uses normalized entropy of the cumulative class distribution to modulate the overall strength of distillation; higher class imbalance suppresses distillation strength, preventing overdominance of majority classes.

Figure 1: Conceptual diagram of the entropy-aware, dynamic distillation coefficient $\lambda$, integrating time-based and entropy-based scheduling to adaptively regulate knowledge retention during incremental training.

This formulation ensures that, under strong class imbalance, distillation does not overwhelm the parameter updates needed for minority class adaptation, enabling distribution-aware plasticity.
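The product form $\lambda = \lambda_\mathrm{time} \cdot \lambda_\mathrm{entropy}$ can be illustrated with a minimal sketch. The summary does not give the exact formulas, so several details here are assumptions made only for illustration: the sigmoid's steepness and midpoint, the use of normalized Shannon entropy directly as $\lambda_\mathrm{entropy}$, the `base` scale, and all function names.

```python
import math


def normalized_entropy(class_counts):
    """Shannon entropy of the class distribution divided by log(K), in [0, 1]."""
    total = sum(class_counts)
    k = len(class_counts)
    if k < 2 or total == 0:
        # Assumption: treat degenerate inputs as perfectly balanced.
        return 1.0
    probs = [c / total for c in class_counts if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(k)


def lambda_time(epoch, num_epochs, steepness=10.0, midpoint=0.5):
    """Sigmoid ramp: near 0 in early epochs of a task, near 1 in late epochs.

    `steepness` and `midpoint` are hypothetical values, not from the paper.
    """
    progress = epoch / max(num_epochs - 1, 1)
    return 1.0 / (1.0 + math.exp(-steepness * (progress - midpoint)))


def distillation_weight(epoch, num_epochs, cumulative_counts, base=1.0):
    """lambda = base * lambda_time * lambda_entropy (base is an assumed scale)."""
    lam_entropy = normalized_entropy(cumulative_counts)
    return base * lambda_time(epoch, num_epochs) * lam_entropy


# Usage: the same epoch yields a smaller weight under a more skewed distribution.
balanced = distillation_weight(80, 100, [100, 100, 100, 100])
skewed = distillation_weight(80, 100, [1000, 10, 10, 10])
print(balanced, skewed)
```

Under this reading, a balanced cumulative distribution has normalized entropy 1 and leaves the time schedule unchanged, while a skewed one scales the distillation weight down, which matches the behavior described above.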

Gradient Consistency Regularization (GCR)

To counteract instability inherent to GR, GCR penalizes abrupt deviations of the current gradient from an exponential moving average of historical gradients. At each iteration, the updated gradient $\mathbf{g}_t'$ is computed as:

$\mathbf{g}_t' = \mathbf{g}_t + \lambda_{\mathrm{GCR}} (\mathbf{g}_t - \bar{\mathbf{g}}_{t-1})$

with the moving average,

$\bar{\mathbf{g}}_t = \beta \bar{\mathbf{g}}_{t-1} + (1 - \beta) \mathbf{g}_t$

where $\lambda_{\mathrm{GCR}}$ and $\beta$ are hyperparameters. This regularization smooths transitions at task boundaries, leading to stable representation updates crucial for minority class feature refinement.
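The two update rules can be sketched on a flat list of gradient values. This is a minimal illustration of the formulas exactly as written above, not the authors' implementation: the default values of `lam_gcr` and `beta`, the initialization of the moving average with the first gradient, and the function name are assumptions. Note that, with a positive $\lambda_{\mathrm{GCR}}$, the correction term as written points away from the moving average; the sketch keeps the sign as stated rather than assuming a different convention.

```python
def gcr_step(grad, ema, lam_gcr=0.1, beta=0.9):
    """Apply one GCR update to a flat gradient.

    grad: current gradient g_t as a list of floats
    ema:  previous moving average g_bar_{t-1}, or None on the first step
    Returns (g'_t, g_bar_t).
    """
    if ema is None:
        # Assumption: start the moving average at the first gradient,
        # so the first step has no correction.
        ema = list(grad)
    # g'_t = g_t + lam_gcr * (g_t - g_bar_{t-1})
    adjusted = [g + lam_gcr * (g - m) for g, m in zip(grad, ema)]
    # g_bar_t = beta * g_bar_{t-1} + (1 - beta) * g_t
    new_ema = [beta * m + (1.0 - beta) * g for g, m in zip(grad, ema)]
    return adjusted, new_ema


# Usage: carry the moving average across iterations (and across task boundaries).
ema = None
for grad in ([1.0, 2.0], [3.0, 2.0], [0.5, 2.5]):
    adjusted, ema = gcr_step(grad, ema)
```

The only extra state is one gradient-sized buffer for the moving average, and the cost per step is a few elementwise operations.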

Experimental Protocol

Empirical validation is conducted on CIFAR-100-LT, ImageNetSubset-LT, and Food101-LT, using severe imbalance ratios (up to 100:1) and incremental acquisition of 10 or 20 tasks, with balanced test splits for unbiased evaluation. The methodology is assessed under two class orderings:

Shuffled: Randomly mixed class order across tasks, modeling unpredictable task difficulties.
In-ordered: Tasks sequenced from majority to minority classes, increasing task difficulty over time and exacerbating overfitting to initial classes.

Figure 2: Visualization of Shuffled (top) and In-ordered (bottom) task orderings, highlighting varied degrees of task difficulty and class imbalance exposure.
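The two orderings can be made concrete with a small sketch. The exponentially decaying class-count profile, the `max_count` of 500, the equal-sized task split, and all function names are assumptions for illustration; the summary specifies only the imbalance ratio (up to 100:1) and the number of tasks (10 or 20).

```python
import random


def long_tailed_counts(num_classes=100, max_count=500, imbalance_ratio=100.0):
    """Per-class sample counts decaying exponentially from head to tail.

    Assumed profile: the largest class has `imbalance_ratio` times as many
    samples as the smallest one.
    """
    return [
        max(1, round(max_count * imbalance_ratio ** (-i / (num_classes - 1))))
        for i in range(num_classes)
    ]


def split_into_tasks(class_ids, num_tasks):
    """Cut an ordered class list into equal-sized tasks.

    Assumes len(class_ids) is divisible by num_tasks.
    """
    size = len(class_ids) // num_tasks
    return [class_ids[t * size:(t + 1) * size] for t in range(num_tasks)]


def in_ordered_tasks(counts, num_tasks):
    """Majority classes first, minority classes last."""
    order = sorted(range(len(counts)), key=lambda c: -counts[c])
    return split_into_tasks(order, num_tasks)


def shuffled_tasks(counts, num_tasks, seed=0):
    """Random class order, fixed by a seed for reproducibility."""
    order = list(range(len(counts)))
    random.Random(seed).shuffle(order)
    return split_into_tasks(order, num_tasks)


counts = long_tailed_counts()
ordered = in_ordered_tasks(counts, num_tasks=10)
mixed = shuffled_tasks(counts, num_tasks=10)
```

In the In-ordered split, each successive task contains classes with fewer samples than the one before, so later tasks are both harder to learn and weaker sources of gradient signal.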

Both "From Scratch" (random initialization) and "From Half" (pretraining on a subset) protocols are tested. Baselines include iCaRL and PODNet, with and without integration of the new framework, and GR.

Results and Analysis

The proposed framework consistently improves accuracy by up to 5.0% over strong baselines across datasets and protocols, with the largest gains in the In-ordered LT-CIL setting. Integration with GR (GR+Ours) yields the highest absolute improvements, particularly on minority groups, supporting the claimed benefits of distribution-aware distillation and gradient stabilization.

Ablation studies demonstrate the effect of dynamic distillation scheduling. Entropy-aware modulation of the distillation coefficient significantly enhances learning in scenarios with heightened class skew, outperforming fixed or purely temporal scheduling.

Gradient norm analysis reveals that the proposed method reduces the amplitude and abruptness of gradient changes at task boundaries, especially when new classes are first introduced—indicative of improved optimization stability.

Figure 3: Training evolution of average gradient norms, showing that gradient consistency regularization (orange) suppresses spikes at incremental task transitions relative to the standard GR baseline (blue), implying improved optimization smoothness and feature retention.
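One way to quantify the spikes visible in such a plot is to compare the peak gradient norm just after a task boundary with the average norm just before it. The sketch below is a hypothetical diagnostic, not a metric reported in the paper; the function names, the window size, and the ratio itself are assumptions.

```python
import math


def global_grad_norm(grads):
    """L2 norm over a list of per-parameter gradient lists."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))


def boundary_spike_ratio(norms, boundary, window=5):
    """Peak norm in the `window` steps starting at a task boundary,
    divided by the mean norm over the `window` steps before it.

    norms:    per-iteration gradient norms
    boundary: index of the first iteration of the new task
    """
    before = norms[max(0, boundary - window):boundary]
    after = norms[boundary:boundary + window]
    if not before or not after:
        raise ValueError("boundary needs iterations on both sides")
    baseline = sum(before) / len(before)
    if baseline == 0:
        raise ValueError("mean norm before the boundary is zero")
    return max(after) / baseline
```

A ratio close to 1 would indicate a smooth transition, while a large ratio would correspond to the kind of spike the figure attributes to the GR baseline.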

Efficiency benchmarks confirm that the method adds negligible computational overhead (1.3% training time increase, no inference penalty) and does not require access to stored data between tasks.

Implications and Future Directions

Replacing a fixed distillation schedule with a distribution-adaptive rule allows for robust performance under extreme task and class imbalance, moving LT-CIL closer to realistic and stringent deployment scenarios. Stabilizing gradients via GCR is particularly impactful for minority class retention, addressing a primary pain point for real-world continual learning systems.

Potential future work includes automatic adjustment of GCR hyperparameters, integration with transformer-based backbones, and extension of entropy-aware loss modulation to other forms of self-regularization in incremental learning. The paper’s results suggest that distribution- and history-aware loss reweighting mechanisms are key to overcoming the performance ceiling observed in prior LT-CIL approaches.

Conclusion

The study systematically addresses the intersection of catastrophic forgetting and class imbalance in long-tailed class-incremental learning. By coupling entropy-aware distillation with gradient consistency regularization, it provides a scalable and computationally efficient framework that achieves strong, stable performance across several challenging benchmarks, particularly improving accuracy on minority classes without external replay or a large memory footprint (2605.03364).
