- The paper identifies an inherent 'information saturation bottleneck' where models pretrained on mixed data can fail to capture all essential features for targeted tasks.
- The paper demonstrates through theoretical analysis and empirical examples that sparsity bias during pretraining causes permanent loss of key features.
- The paper proposes rich representation methods, including the efficient 'Time-Cat' approach, to recover lost features and significantly improve transfer performance.
This paper investigates a fundamental limitation in supervised pretraining for transfer learning, where models pretrained on a broad mix of data may fail to transfer effectively to new tasks, even if those tasks are components of the original pretraining mixture (2506.18221). The core issue identified is an "information saturation bottleneck": deep learning models, due to an inherent sparsity bias, tend to encode a subset of features sufficient for the pretraining task and then struggle to learn new, potentially competing, features. This can lead to the permanent loss of critical features necessary for optimal performance on specific downstream tasks or sub-distributions within the pretraining data.
The authors formally define the problem by comparing two training paradigms for a target sub-distribution $P_i$:
- Direct Training: A model is trained directly on $P_i$, learning both feature extractors $\varphi(X; \theta_i)$ and classifier weights $\gamma_i$ optimized for $P_i$.
- Transfer Learning (via Linear Probing): A model is pretrained on a mixture distribution $P = \sum_j \lambda_j P_j(X, Y)$ (where $P_i$ is one of the $P_j$). The feature parameters $\theta$ are frozen, and only the linear classifier weights $\gamma$ are fine-tuned on $P_i$.
The central question is whether features pretrained using $P$ can perform as well as features learned directly for a target $P_i$. The paper argues that this is often not the case, even under favorable conditions where $P_i$ is part of $P$.
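The linear-probing setup described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the one-layer `features` function, the ridge-regularized least-squares head in `linear_probe`, and the synthetic data are all hypothetical stand-ins, not the paper's models or training procedure. The point it shows is the division of labor: the feature parameters stay frozen and only the linear head is fitted on the target sub-distribution.

```python
import numpy as np

def features(X, theta):
    """Frozen feature extractor phi(X; theta): a single tanh layer (assumed)."""
    return np.tanh(X @ theta)

def linear_probe(Phi, y, ridge=1e-3):
    """Fit only the linear head gamma by ridge-regularized least squares."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + ridge * np.eye(d), Phi.T @ y)

rng = np.random.default_rng(0)
theta = rng.normal(size=(5, 8))   # stands in for pretrained parameters
theta_before = theta.copy()       # kept only to check that theta is never updated

X_i = rng.normal(size=(200, 5))   # synthetic samples standing in for the target P_i
y_i = np.sign(X_i[:, 0])          # labels in {-1, +1}

Phi_i = features(X_i, theta)
gamma = linear_probe(Phi_i, y_i)  # the only parameters that are trained
train_acc = float(np.mean(np.sign(Phi_i @ gamma) == y_i))
```

Direct training would instead update `theta` and `gamma` together on the target data, which is the comparison the paper draws.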
A key theoretical illustration is provided through a simple counterexample. Imagine a feature extractor capable of learning two features, $\varphi_1(X)$ and $\varphi_2(X)$. The setup involves four sub-distributions (P[1] to P[4]), each optimally classifiable by one of the two features (e.g., P[1] and P[2] by $\varphi_1(X)$; P[3] and P[4] by $\varphi_2(X)$). When these are combined into a mixture $P$, an optimal classifier for $P$ (which cannot linearly separate all points) will misclassify the least-weighted point. Critically, this optimal classifier for $P$ can often be sparsely represented using only one of the features ($\varphi_1$ or $\varphi_2$), depending on the mixture coefficients $\lambda_i$. If a deep network with a sparsity bias learns only $\varphi_1$ from the mixture $P$, it will subsequently be unable to optimally classify P[3] and P[4] upon transfer, as it has "lost" the feature $\varphi_2$.
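The consequence of losing a feature can be illustrated with a deliberately simplified toy example. It is not the paper's four-point construction: the grid of points, the choice of the first and second coordinates as the two features, and the least-squares head in `probe_accuracy` are all assumptions made for illustration. It shows that a linear probe on an extractor that kept only the first feature cannot beat chance on a target whose label depends on the second feature, while a probe on both features fits it exactly.

```python
import numpy as np

# Four equally weighted points; phi1 is the first coordinate, phi2 the second.
X = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
y_needs_phi2 = X[:, 1]  # target sub-distribution: the label equals phi2

def probe_accuracy(Phi, y):
    """Fit a least-squares linear head (with bias) on frozen features."""
    A = np.hstack([Phi, np.ones((len(Phi), 1))])
    gamma, *_ = np.linalg.lstsq(A, y, rcond=None)
    pred = np.where(A @ gamma >= 0, 1.0, -1.0)
    return float(np.mean(pred == y))

sparse_features = X[:, :1]  # extractor that kept only phi1
rich_features = X           # extractor that kept both phi1 and phi2

acc_sparse = probe_accuracy(sparse_features, y_needs_phi2)  # chance level
acc_rich = probe_accuracy(rich_features, y_needs_phi2)      # perfect fit
```

With only the first feature, points that share a feature value carry opposite labels, so no linear head on the frozen features can classify more than half of them correctly.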
The paper posits that this "information saturation bottleneck" is pervasive in practice and offers several pieces of empirical evidence from existing literature:
- Spurious Features: Work by Pezeshki et al. (2021) showed that models learning spurious features (correlated but not causal) can hinder the learning of "core" features. Kirichenko et al. (2023) attempted to recover core features by transferring to a class-balanced subdistribution, but a performance gap remained compared to direct training, suggesting some core features were permanently lost during pretraining on the biased mixture.
- Genomic Foundation Models (GFMs): Xu et al. (2025) found that simpler supervised models trained directly on specific genomic tasks often outperformed large GFMs pretrained on broad mixtures, indicating that these GFMs might be missing crucial task-specific features despite their scale.
To address this feature loss, the paper discusses "rich representations" as a potential solution, primarily drawing from Zhang et al. (2022, 2023). The idea is to combine features from multiple models. For instance, concatenating the feature representations of four ResNet50 models (ResNetCat4), each trained independently on ImageNet1K with different random seeds, showed comparable performance on ImageNet1K but significantly improved transfer performance on new datasets (e.g., +9% on iNat18) compared to a single, wider ResNet model of similar parameter count (ResNet50W2). This suggests that individual models learn different subsets of useful features due to factors like initialization or data shuffling, and combining them recovers a richer feature set.
Building on this, the authors propose a novel method for constructing rich representations called "Time-Cat," which aims to improve transfer performance without additional pretraining compute. Instead of concatenating full-length trained models, they concatenate ResNet50 models pretrained on ImageNet1K for shorter durations, ensuring the total pretraining steps remain constant.
For example:
- Baseline (cat1): 1 ResNet50 trained for 450k steps.
- Time-Cat (cat2): 2 ResNet50s, each trained for 200k steps (total 400k steps).
- Time-Cat (cat4): 4 ResNet50s, each trained for 100k steps (total 400k steps).
- Time-Cat (cat5): 5 ResNet50s, each trained for 80k steps (total 400k steps).
The results showed that Time-Cat models achieved significant improvements in transfer accuracy on datasets like iNat18 (up to +12.5% for cat5 vs. cat1) and CIFAR-100 (up to +6% for cat5 vs. cat1), while maintaining similar performance on ImageNet1K and using no more pretraining compute than the baseline (400k total steps for cat2, cat4, and cat5 versus 450k steps for cat1).
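The per-model step counts listed above follow from dividing a fixed total budget evenly across the models. A minimal sketch of that arithmetic, where the helper name `time_cat_schedule` and its remainder handling are assumptions introduced here rather than anything specified by the paper:

```python
def time_cat_schedule(total_steps: int, num_models: int) -> list[int]:
    """Split a fixed step budget evenly across independently trained models."""
    if num_models < 1:
        raise ValueError("num_models must be at least 1")
    base, extra = divmod(total_steps, num_models)
    # Hand out any remainder one step at a time so the total is preserved.
    return [base + (1 if i < extra else 0) for i in range(num_models)]

# The 400k-step budgets from the list above.
schedules = {n: time_cat_schedule(400_000, n) for n in (2, 4, 5)}
```

For the configurations above this yields 200k, 100k, and 80k steps per model, each summing to 400k.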
Implementation Considerations and Practical Implications:
- Feature Saturation: Be aware that pretraining on diverse mixtures doesn't guarantee all useful features for all sub-components are learned. Models might "saturate" and discard features that seem redundant for the mixed task but are vital for specific sub-tasks.
- Sparsity Bias: The tendency of SGD and regularization techniques to find sparse solutions can exacerbate this issue. While often beneficial, it can lead to the loss of less dominant but still important features.
- Rethinking Foundation Models: Relying solely on scaling up foundation models might not overcome this bottleneck. Task-specific data and training may still be crucial for optimal performance.
- Rich Representations as a Strategy:
  - Ensembling/Concatenation: Concatenating features from multiple models trained with different initializations or data orders can be a practical way to create richer representations.

    ```python
    # Pseudocode for feature concatenation.
    # N, data, target_task_data, train_model, concatenate, and
    # train_linear_probe are placeholders, not a real library API.
    models = [train_model(data, seed=s) for s in range(1, N + 1)]

    def get_features(model, input_data):
        # Drop the final classification layer: use the penultimate layer's output
        feature_extractor = model.penultimate_layer
        return feature_extractor(input_data)

    def combined_features(input_data):
        # Stack every model's features along the feature axis
        features = [get_features(m, input_data) for m in models]
        return concatenate(features, axis=-1)

    # Train a new linear classifier on top of the frozen, concatenated features
    final_classifier = train_linear_probe(combined_features, target_task_data)
    ```
  - Time-Cat Approach: The paper's "Time-Cat" method offers a compute-efficient way to build rich representations. Instead of training N models for T steps each, train N models for T/N steps each and concatenate their features. This could be particularly useful when pretraining budgets are constrained.

    ```python
    # Pseudocode for Time-Cat.
    # initialize_resnet50, train_model_for_steps, and imagenet_data are
    # placeholders, not a real library API.
    total_steps = 400_000
    num_models = 4
    steps_per_model = total_steps // num_models  # 100k steps per model

    models = []
    for i in range(num_models):
        model = initialize_resnet50()
        train_model_for_steps(model, imagenet_data, steps=steps_per_model, seed=i)
        models.append(model)

    # Feature extraction and concatenation as above
    # Transfer to downstream tasks
    ```
- Evaluation on Subgroups: When pretraining on mixed data, evaluate performance not just on the overall mixture but also on distinct sub-distributions to identify if critical features for certain components are being missed.
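The subgroup evaluation suggested in the last item can be sketched as a small reporting helper. The function name `per_group_accuracy`, the group labels, and the example predictions are hypothetical and chosen only to show how an acceptable overall score can hide a failing sub-distribution.

```python
import numpy as np

def per_group_accuracy(y_true, y_pred, groups):
    """Report accuracy overall and separately for each sub-distribution."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    report = {"overall": float(np.mean(y_true == y_pred))}
    for g in np.unique(groups):
        mask = groups == g
        report[str(g)] = float(np.mean(y_true[mask] == y_pred[mask]))
    return report

# Made-up predictions: every error falls in the smaller group "P2".
report = per_group_accuracy(
    y_true=[1, 1, 0, 0, 1, 0],
    y_pred=[1, 1, 0, 0, 0, 1],
    groups=["P1", "P1", "P1", "P1", "P2", "P2"],
)
```

In this made-up case the overall accuracy is about 0.67, yet every example in `P2` is misclassified, which is the kind of gap a mixture-level metric alone would not reveal.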
The paper concludes that supervised transfer methods, while valuable, can permanently lose essential features, limiting generalization. Factors like dataset imbalance or even random seeds can influence which features are learned. The "information saturation bottleneck" suggests that simply scaling models might not be a panacea. The proposed "rich representation" strategies, particularly the compute-efficient "Time-Cat" method, offer a promising direction for recovering lost features and improving transfer learning efficacy.