These Are Not All the Features You Are Looking For: A Fundamental Bottleneck in Supervised Pretraining

Published 23 Jun 2025 in cs.LG, cs.AI, and stat.ML | (2506.18221v2)

Abstract: Transfer learning is a cornerstone of modern machine learning, promising a way to adapt models pretrained on a broad mix of data to new tasks with minimal new data. However, a significant challenge remains in ensuring that transferred features are sufficient to handle unseen datasets, amplified by the difficulty of quantifying whether two tasks are "related". To address these challenges, we evaluate model transfer from a pretraining mixture to each of its component tasks, assessing whether pretrained features can match the performance of task-specific direct training. We identify a fundamental limitation in deep learning models -- an "information saturation bottleneck" -- where networks fail to learn new features once they encode similar competing features during training. When restricted to learning only a subset of key features during pretraining, models will permanently lose critical features for transfer and perform inconsistently on data distributions, even components of the training mixture. Empirical evidence from published studies suggests that this phenomenon is pervasive in deep learning architectures -- factors such as data distribution or ordering affect the features that current representation learning methods can learn over time. This study suggests that relying solely on large-scale networks may not be as effective as focusing on task-specific training, when available. We propose richer feature representations as a potential solution to better generalize across new datasets and, specifically, present existing methods alongside a novel approach, the initial steps towards addressing this challenge.


Authors (3)

Summary

The paper identifies an inherent 'information saturation bottleneck' where models pretrained on mixed data fail to capture all essential features for targeted tasks.
The paper demonstrates through theoretical analysis and empirical examples that sparsity bias during pretraining causes permanent loss of key features.
The paper proposes rich representation methods, including the efficient 'Time-Cat' approach, to recover lost features and significantly improve transfer performance.

This paper investigates a fundamental limitation in supervised pretraining for transfer learning, where models pretrained on a broad mix of data may fail to transfer effectively to new tasks, even if those tasks are components of the original pretraining mixture (2506.18221). The core issue identified is an "information saturation bottleneck": deep learning models, due to an inherent sparsity bias, tend to encode a subset of features sufficient for the pretraining task and then struggle to learn new, potentially competing, features. This can lead to the permanent loss of critical features necessary for optimal performance on specific downstream tasks or sub-distributions within the pretraining data.

The authors formally define the problem by comparing two training paradigms for a target sub-distribution $P_i$:

Direct Training: A model is trained directly on $P_i$, learning both feature extractors $\varphi(X;\theta_i)$ and classifier weights $\gamma_i$ optimized for $P_i$.
Transfer Learning (via Linear Probing): A model is pretrained on a mixture distribution $P = \sum_j \lambda_j P_j(X,Y)$ (where $P_i$ is one of the $P_j$). The feature parameters $\theta$ are frozen, and only the linear classifier weights $\gamma$ are fine-tuned on $P_i$.

The central question is whether features pretrained using $P$ can perform as well as features learned directly for a target $P_i$. The paper argues that this is often not the case, even under favorable conditions where $P_i$ is part of $P$.
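
In symbols, the comparison can be sketched as three optimization problems. This is a hedged restatement rather than the paper's exact formulation: the loss $\ell$, the $\star$ and hat notation, and writing the classifier as an inner product $\gamma^\top \varphi$ are assumptions made here for illustration.

```latex
\begin{align*}
\text{Direct training:}\quad
  & (\theta_i^\star, \gamma_i^\star) \in \arg\min_{\theta_i,\,\gamma_i}\;
    \mathbb{E}_{(X,Y)\sim P_i}\big[\ell\big(\gamma_i^\top \varphi(X;\theta_i),\, Y\big)\big] \\
\text{Pretraining:}\quad
  & (\theta^\star, \gamma^\star) \in \arg\min_{\theta,\,\gamma}\;
    \mathbb{E}_{(X,Y)\sim P}\big[\ell\big(\gamma^\top \varphi(X;\theta),\, Y\big)\big],
    \qquad P = \sum_j \lambda_j P_j \\
\text{Linear probing:}\quad
  & \hat{\gamma}_i \in \arg\min_{\gamma}\;
    \mathbb{E}_{(X,Y)\sim P_i}\big[\ell\big(\gamma^\top \varphi(X;\theta^\star),\, Y\big)\big]
\end{align*}
```

The question is then whether the risk on $P_i$ at $(\theta^\star, \hat{\gamma}_i)$ matches the risk at $(\theta_i^\star, \gamma_i^\star)$. Only $\gamma$ is free in the last problem, so any feature missing from $\varphi(\,\cdot\,;\theta^\star)$ cannot be recovered by the probe.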

A key theoretical illustration is provided through a simple counterexample. Imagine a feature extractor capable of learning two features, $\varphi_1(X)$ and $\varphi_2(X)$. The setup involves four sub-distributions ($P^{[1]}$ to $P^{[4]}$), each optimally classifiable by one of the two features (e.g., $P^{[1]}$ and $P^{[2]}$ by $\varphi_1(X)$; $P^{[3]}$ and $P^{[4]}$ by $\varphi_2(X)$). When these are combined into a mixture $P$, the points can no longer all be linearly separated, so an optimal classifier for $P$ misclassifies the least-weighted point. Critically, this optimal classifier for $P$ can often be sparsely represented using only one of the features ($\varphi_1$ or $\varphi_2$), depending on the mixture coefficients $\lambda_i$. If a deep network with a sparsity bias learns only $\varphi_1$ from the mixture $P$, it will subsequently be unable to optimally classify $P^{[3]}$ and $P^{[4]}$ upon transfer, as it has "lost" the feature $\varphi_2$.
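
The last step of that argument can be checked on a toy example. The sketch below is not the paper's construction: the two-point datasets, the helper names (`phi_sparse`, `phi_rich`, `best_linear_probe`), and the small weight grid are all assumptions chosen for illustration. It only shows that once $\varphi_2$ is missing from the frozen features, no linear probe can do better than chance on a sub-distribution whose label depends on $\varphi_2$.

```python
from itertools import product

# Hypothetical toy data: each sample is ((x1, x2), label).
P1 = [((+1.0, 0.0), +1), ((-1.0, 0.0), -1)]  # label follows x1 (needs phi_1)
P3 = [((0.0, +1.0), +1), ((0.0, -1.0), -1)]  # label follows x2 (needs phi_2)

def phi_sparse(x):
    """Frozen extractor that kept only phi_1."""
    return (x[0],)

def phi_rich(x):
    """Extractor that exposes both phi_1 and phi_2."""
    return (x[0], x[1])

def accuracy(weights, bias, feats, data):
    """Accuracy of the linear classifier sign(w . feats(x) + b)."""
    correct = 0
    for x, y in data:
        score = sum(w * f for w, f in zip(weights, feats(x))) + bias
        pred = 1 if score > 0 else -1
        correct += int(pred == y)
    return correct / len(data)

def best_linear_probe(feats, data, dim):
    """Best accuracy over a small grid of linear probes on frozen features."""
    grid = (-1.0, 0.0, 1.0)
    return max(
        accuracy(w, b, feats, data)
        for w in product(grid, repeat=dim)
        for b in grid
    )

acc_sparse_p1 = best_linear_probe(phi_sparse, P1, dim=1)
acc_sparse_p3 = best_linear_probe(phi_sparse, P3, dim=1)
acc_rich_p3 = best_linear_probe(phi_rich, P3, dim=2)
print(acc_sparse_p1, acc_sparse_p3, acc_rich_p3)
```

With `phi_sparse`, both points of `P3` map to the same feature value, so every linear probe predicts the same label for both and accuracy is stuck at 0.5, while the same probe family reaches 1.0 when $\varphi_2$ is available.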

The paper posits that this "information saturation bottleneck" is pervasive in practice and offers several pieces of empirical evidence from existing literature:

Spurious Features: Work by Pezeshki et al. (2021) showed that learning spurious features (correlated with the label but not causal) can hinder a model from learning the "core" features. Kirichenko et al. (2023) attempted to recover core features by transferring to a class-balanced subdistribution, but a performance gap remained compared to direct training, suggesting some core features were permanently lost during pretraining on the biased mixture.
Genomic Foundation Models (GFMs): Xu et al. (2025) found that simpler supervised models trained directly on specific genomic tasks often outperformed large GFMs pretrained on broad mixtures, indicating that these GFMs might be missing crucial task-specific features despite their scale.

To address this feature loss, the paper discusses "rich representations" as a potential solution, primarily drawing from Zhang et al. (2022, 2023). The idea is to combine features from multiple models. For instance, concatenating the feature representations of four ResNet50 models (ResNetCat4), each trained independently on ImageNet1K with different random seeds, showed comparable performance on ImageNet1K but significantly improved transfer performance on new datasets (e.g., +9% on iNat18) compared to a single, wider ResNet model of similar parameter count (ResNet50W2). This suggests that individual models learn different subsets of useful features due to factors like initialization or data shuffling, and combining them recovers a richer feature set.

Building on this, the authors propose a novel method for constructing rich representations called "Time-Cat," which aims to improve transfer performance without additional pretraining compute. Instead of concatenating models that were each trained for the full schedule, they concatenate ResNet50 models pretrained on ImageNet1K for shorter durations, so that the total number of pretraining steps across the concatenated models stays fixed and does not exceed the baseline's. For example:

Baseline (cat1): 1 ResNet50 trained for 450k steps.
Time-Cat (cat2): 2 ResNet50s, each trained for 200k steps (total 400k steps).
Time-Cat (cat4): 4 ResNet50s, each trained for 100k steps (total 400k steps).
Time-Cat (cat5): 5 ResNet50s, each trained for 80k steps (total 400k steps).
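
The budget arithmetic behind these configurations can be tabulated with a few lines of Python. The step counts are the ones listed above; the feature width of 2048 per model is an assumption based on the pooled output of a standard ResNet50, and `summarize` is a hypothetical helper, not something from the paper.

```python
# Step budgets as listed in the text: name -> (number of models, steps per model).
CONFIGS = {
    "cat1": (1, 450_000),
    "cat2": (2, 200_000),
    "cat4": (4, 100_000),
    "cat5": (5, 80_000),
}

# Assumption: 2048 is the pooled feature width of a standard ResNet50.
FEATURE_DIM = 2048

def summarize(configs, baseline="cat1"):
    """Total steps, share of the baseline budget, and concatenated width."""
    base_models, base_steps = configs[baseline]
    base_total = base_models * base_steps
    rows = {}
    for name, (num_models, steps_each) in configs.items():
        total = num_models * steps_each
        rows[name] = {
            "total_steps": total,
            "fraction_of_baseline": total / base_total,
            "concat_dim": num_models * FEATURE_DIM,
        }
    return rows

summary = summarize(CONFIGS)
for name, row in summary.items():
    print(
        f"{name}: {row['total_steps']:,} steps "
        f"({row['fraction_of_baseline']:.1%} of baseline), "
        f"{row['concat_dim']} concatenated features"
    )
```

The Time-Cat variants all come out at 400k total steps, about 89% of the 450k-step baseline, while the linear probe's input width grows in proportion to the number of concatenated models.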

The results showed that Time-Cat models achieved significant improvements in transfer accuracy on datasets like iNat18 (up to +12.5% for cat5 vs. cat1) and CIFAR-100 (up to +6% for cat5 vs. cat1) while maintaining similar performance on ImageNet1K. They did so without extra pretraining compute: cat2, cat4, and cat5 each use 400k total steps, slightly fewer than the 450k-step baseline.

Implementation Considerations and Practical Implications:

Feature Saturation: Be aware that pretraining on diverse mixtures doesn't guarantee all useful features for all sub-components are learned. Models might "saturate" and discard features that seem redundant for the mixed task but are vital for specific sub-tasks.
Sparsity Bias: The tendency of SGD and regularization techniques to find sparse solutions can exacerbate this issue. While often beneficial, it can lead to the loss of less dominant but still important features.
Rethinking Foundation Models: Relying solely on scaling up foundation models might not overcome this bottleneck. Task-specific data and training may still be crucial for optimal performance.

Rich Representations as a Strategy:

Ensembling/Concatenation: Concatenating features from multiple models trained with different initializations or data orders can be a practical way to create richer representations.

# Pseudocode for feature concatenation (NumPy-style).
# train_model and train_linear_probe are placeholders for your own routines.
import numpy as np

N = 4  # number of independently trained models (e.g. 4, as in ResNetCat4)
models = [train_model(data, seed=s) for s in range(1, N + 1)]

def get_features(model, input_data):
    # Use the penultimate layer's output, i.e. skip the final classification layer
    feature_extractor = model.penultimate_layer
    return feature_extractor(input_data)

def combined_features(input_data):
    # Concatenate every model's features along the feature axis
    features = [get_features(m, input_data) for m in models]
    return np.concatenate(features, axis=-1)

# Train a new linear classifier on top of the frozen, concatenated features
final_classifier = train_linear_probe(combined_features, target_task_data)

Time-Cat Approach: The paper's "Time-Cat" method offers a compute-efficient way to build rich representations. Instead of training N models for T epochs each, train N models for T/N epochs each and concatenate their features. This could be particularly useful when pretraining budgets are constrained.

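# Pseudocode for Time-Cat
# initialize_resnet50 and train_model_for_steps are placeholders, not library functions.
total_steps = 400_000
num_models = 4
steps_per_model = total_steps // num_models  # 100k steps per model

models = []
for i in range(num_models):
    # Seed both initialization and training so each model differs in init and data order
    model = initialize_resnet50(seed=i)
    train_model_for_steps(model, imagenet_data, steps=steps_per_model, seed=i)
    models.append(model)

# Feature extraction and concatenation as above,
# then linear probing on each downstream task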

Evaluation on Subgroups: When pretraining on mixed data, evaluate performance not just on the overall mixture but also on distinct sub-distributions to identify if critical features for certain components are being missed.
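
A minimal sketch of such a per-subgroup evaluation follows. The function name `accuracy_by_group`, the group labels, and the example predictions are hypothetical; the point is only that a healthy overall accuracy can hide a sub-distribution on which the pretrained features perform poorly.

```python
from collections import defaultdict

def accuracy_by_group(predictions, labels, groups):
    """Return (overall_accuracy, {group: accuracy}) for parallel sequences."""
    if not (len(predictions) == len(labels) == len(groups)):
        raise ValueError("predictions, labels and groups must have equal length")
    if len(labels) == 0:
        raise ValueError("cannot compute accuracy on empty input")
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, label, group in zip(predictions, labels, groups):
        total[group] += 1
        correct[group] += int(pred == label)
    per_group = {g: correct[g] / total[g] for g in total}
    overall = sum(correct.values()) / len(labels)
    return overall, per_group

# Hypothetical predictions on a mixture of two sub-distributions.
preds = [1, 0, 1, 1, 0, 1, 0, 1]
labels = [1, 0, 1, 1, 0, 0, 1, 1]
groups = ["P1", "P1", "P1", "P1", "P1", "P3", "P3", "P3"]

overall, per_group = accuracy_by_group(preds, labels, groups)
print(f"overall: {overall:.2f}")
for group, acc in sorted(per_group.items()):
    print(f"{group}: {acc:.2f}")
```

In this made-up example the overall accuracy is 0.75, yet the minority group `P3` sits at one in three, which is the kind of gap that would be worth comparing against a model trained directly on that sub-distribution.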

The paper concludes that supervised transfer methods, while valuable, can permanently lose essential features, limiting generalization. Factors like dataset imbalance or even random seeds can influence which features are learned. The "information saturation bottleneck" suggests that simply scaling models might not be a panacea. The proposed "rich representation" strategies, particularly the compute-efficient "Time-Cat" method, offer a promising direction for recovering lost features and improving transfer learning efficacy.
