Sample Complexity of Transfer Learning: An Optimal Transport Approach

Published 19 May 2026 in stat.ML and cs.LG | (2605.20545v1)

Abstract: Transfer learning is an essential technique for many machine learning/AI models of complex structures such as LLMs and generative AI. The essence of transfer learning is to leverage knowledge from resolved source tasks for a new target task, especially when the sample size $m$ of the training data for the latter is low. In this work, we rigorously analyze the potential benefit of transfer learning in terms of sample efficiency. Specifically, taking an optimal transport viewpoint of transfer learning, we find that when the data dimension $d$ is higher than $3$, the sample complexity for transfer learning is $O(m^{-(\alpha+1)/d})$, with $\alpha$ indicating the smoothness of the data distribution, as opposed to the $O(m^{-p/d})$ sample complexity for direct learning with $p$ indicating the smoothness of the optimal target model. Our finding theoretically supports a better sample efficiency for transfer learning, when the target task is optimizing over a family of not-so-smooth models (i.e., highly complex networks with the possible use of non-smooth activation functions). Using image classification as an example, we numerically demonstrate the sample efficiency for transfer learning, that is, in the data hungry regime, the model performance can be significantly improved by transfer learning.


Authors (4)

Summary

The paper introduces an optimal transport-based formulation that quantifies transfer learning’s sample complexity benefits over direct learning in high-dimensional settings.
It establishes explicit L2 estimation error bounds, showing that smoother input distributions can mitigate the challenges posed by non-smooth regressors.
Empirical tests in image classification and medical diagnosis validate that transfer learning significantly enhances performance when labeled target data is scarce.

Sample Complexity of Transfer Learning: An Optimal Transport Perspective

Introduction and Motivation

The paper "Sample Complexity of Transfer Learning: An Optimal Transport Approach" (2605.20545) introduces a rigorous framework for quantifying the impact of transfer learning on sample complexity in supervised settings, with a particular focus on high-dimensional tasks and modern deep learning scenarios. Traditional approaches to sample complexity analysis focus on direct learning rates, emphasizing the role of the smoothness of the regression function. This work takes a different path by leveraging optimal transport (OT) theory to analyze transfer learning, thus providing explicit minimax rates and revealing hitherto under-explored dependencies between distributional smoothness, model complexity, and dimension.

Transfer learning is studied in settings where the source and target tasks are related, specifically when the source task offers generic or domain-specific feature extractors pretrained on large datasets. Real-world applications in image classification and medical diagnosis, domains where data labeling is expensive, motivate a theoretically grounded understanding of the sample efficiency gains afforded by transfer learning.

Problem Setup and Optimal Transport Formulation

The authors frame supervised transfer learning as an optimal transport problem. The target is to estimate the regression function $f_T(x) = \mathbb{E}[Y_T|X_T=x]$ for target data $(X_T, Y_T)$ from limited labeled examples. Transfer learning leverages a pretrained source model $f_S$ built on source data $(X_S, Y_S)$ with typically much greater coverage.

The transfer process is formalized using transfer maps $T_{X}^{S}$ (input space) and $T_{Y}^{S \to T}$ (output space), interpreted as optimal transport maps (i.e., Brenier maps), which minimize quadratic costs and align distributions across domains. The learning objective becomes:

$\min_{T_{X}^{S},\, T_{Y}^{S \to T}}\, \mathbb{E}\left[ \ell \left( T_{Y}^{S \to T} \circ f_S \circ T_{X}^{S}(X_T), Y_T \right) \right]$

For quadratic loss, convex optimal transport theory ensures existence, uniqueness (almost everywhere), and regularity of such transfer mappings under common smoothness and log-concavity assumptions.
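To make the notion of a Brenier map concrete, here is a minimal one-dimensional sketch. It relies on a standard fact rather than on anything specific to the paper: in one dimension, the quadratic-cost optimal transport map between two continuous distributions is the monotone map that matches quantiles. The function name `empirical_ot_map_1d`, the Gaussian samples, and the sample sizes are illustrative assumptions; the paper's transfer maps act on high-dimensional inputs and are not estimated this way.

```python
import numpy as np


def empirical_ot_map_1d(source, target):
    """Return x -> T(x), an empirical 1-D quadratic-cost OT map.

    Illustrative helper (not from the paper): in 1-D the optimal map
    sends the q-th source quantile to the q-th target quantile.
    """
    src_sorted = np.sort(np.asarray(source, dtype=float))
    tgt = np.asarray(target, dtype=float)
    # Quantile level attached to each sorted source sample.
    levels = (np.arange(src_sorted.size) + 0.5) / src_sorted.size
    # Target quantiles at those same levels.
    tgt_quantiles = np.quantile(tgt, levels)

    def transport(x):
        # Monotone piecewise-linear interpolation between matched quantiles.
        return np.interp(x, src_sorted, tgt_quantiles)

    return transport


rng = np.random.default_rng(0)
source = rng.normal(loc=0.0, scale=1.0, size=5000)  # assumed source law
target = rng.normal(loc=3.0, scale=2.0, size=5000)  # assumed target law

T = empirical_ot_map_1d(source, target)
pushed = T(source)  # source samples transported toward the target law
print(pushed.mean(), pushed.std())
```

For these two Gaussians the exact map is the affine function $x \mapsto 3 + 2x$, so the transported samples should have mean close to 3 and standard deviation close to 2.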

By contrast, direct learning estimates $f_T$ nonparametrically from $m$ samples of $(X_T, Y_T)$, and its sample complexity is dictated by the smoothness $p$ of $f_T$ and the data dimension $d$.

Theoretical Results: Sample Complexity Bounds

The core contribution is the derivation of upper bounds on the $L^2$ estimation error for the transfer learning estimator, expressed in terms of the smoothness $\alpha$ of the joint source and target data distributions and the ambient dimension $d$. The main findings are:

Direct Learning: Minimizing risk w.r.t. $f_T$ using $m$ samples yields error $O(m^{-p/d})$, i.e., the minimax nonparametric rate set by the regression function's smoothness $p$ and dimensionality $d$.
Transfer Learning (OT-based): With the OT formalism, the error for the transfer estimator behaves as $O(m^{-(\alpha+1)/d})$ (for $d > 3$), where $\alpha$ measures the (distributional) smoothness of the data; crucially, the regression function's smoothness $p$ drops out.

The optimality and tightness of these rates draw on recent statistical OT theory, with detailed dependence on transfer mapping regularity and log-concavity. In high dimensions ($d > 3$), the rate separation between $O(m^{-p/d})$ and $O(m^{-(\alpha+1)/d})$ can yield substantial improvements when $\alpha + 1 > p$, a common situation when the underlying data is smooth but the regression function is non-smooth (e.g., due to non-smooth activations in deep models).
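One way to read the two rates is to invert them for the number of samples needed to reach a target error $\varepsilon \in (0, 1)$. The derivation below is a sketch that ignores constants and logarithmic factors, and the numerical values of $d$, $p$, $\alpha$, and $\varepsilon$ are illustrative assumptions, not figures from the paper.

```latex
\begin{align*}
\text{Direct learning:}\quad
  m^{-p/d} \le \varepsilon
  &\iff m \ge \varepsilon^{-d/p}, \\
\text{Transfer learning:}\quad
  m^{-(\alpha+1)/d} \le \varepsilon
  &\iff m \ge \varepsilon^{-d/(\alpha+1)}, \\
\text{Ratio of required sample sizes:}\quad
  \frac{\varepsilon^{-d/p}}{\varepsilon^{-d/(\alpha+1)}}
  &= \varepsilon^{-d\left(\frac{1}{p} - \frac{1}{\alpha+1}\right)}.
\end{align*}
% Illustrative values: d = 10, p = 1, \alpha = 1, \varepsilon = 0.1 give
% \varepsilon^{-d/p} = 10^{10} and \varepsilon^{-d/(\alpha+1)} = 10^{5}.
```

The ratio exceeds one exactly when $\alpha + 1 > p$, and it grows as $\varepsilon \to 0$ or as $d$ increases, which is the regime in which the summary describes transfer learning as most beneficial.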

Theoretical implications are summarized as follows:

Scenario	Error Rate	Key Smoothness Parameter
Direct Learning	$O(m^{-p/d})$	$p$ (regressor smoothness)
Transfer Learning	$O(m^{-(\alpha+1)/d})$	$\alpha$ (distributional)

When the input distributions are smooth (e.g., Gaussian mixtures), and the regressor $f_T$ is non-smooth (e.g., deep ReLU nets), transfer learning achieves superior sample efficiency.
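The gap between the two error rates at a fixed sample size can be tabulated numerically. The sketch below evaluates the rate expressions with all constants set to one; the function names and the values chosen for `d`, `p`, and `alpha` are hypothetical and serve only to show the shape of the comparison.

```python
def direct_rate(m, p, d):
    """Direct-learning rate m^(-p/d), with the constant taken to be 1."""
    return m ** (-p / d)


def transfer_rate(m, alpha, d):
    """Transfer-learning rate m^(-(alpha+1)/d), stated for d > 3."""
    if d <= 3:
        raise ValueError("the transfer rate is stated only for d > 3")
    return m ** (-(alpha + 1) / d)


# Hypothetical parameters: a non-smooth regressor over smooth data.
d, p, alpha = 8, 1.0, 2.0

rows = []
for m in (100, 1_000, 10_000, 100_000):
    rows.append((m, direct_rate(m, p, d), transfer_rate(m, alpha, d)))

print(f"{'m':>8} {'direct':>10} {'transfer':>10}")
for m, direct, transfer in rows:
    print(f"{m:>8} {direct:>10.4f} {transfer:>10.4f}")
```

Because the constants are ignored, only the trend across rows is meaningful: the transfer column shrinks faster than the direct column whenever `alpha + 1 > p`.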

Numerical Experiments

Two experimental setups are used to validate the theoretical claims:

Image Classification (Office-31)

Transfer learning via ResNet-50 pretrained on ImageNet or source domains is compared with direct learning from randomly initialized weights, across varying fractions of the training data. In low-sample regimes (as little as 10% of the data), AUROC, accuracy, precision, and sensitivity all favor transfer learning, with relative improvements of over 100% in precision and sensitivity in the smallest data regime.

Medical Diagnosis: Retinopathy of Prematurity (ROP)

A second evaluation transfers from diabetic retinopathy (DR) diagnosis (large dataset) to retinopathy of prematurity (data-scarce, high-stakes). Transfer-learned classifiers substantially outperform direct learning at all data scales and exceed 0.9 in both AUROC and accuracy using only ~10% of the available data. In the extreme low-data regime (1%), transfer learning yields sensitivity and precision gains exceeding 46% relative to direct learning.

These empirical findings confirm that transfer learning, interpreted through the OT framework, achieves significant improvements when target data is scarce and the regression function is non-smooth.
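The relative improvements quoted for both experiments compare a metric under transfer learning with the same metric under direct learning. A minimal sketch of that arithmetic follows. The confusion-matrix counts are invented for illustration and are not results from the paper, and the helper names are hypothetical.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are true positives."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def sensitivity(tp, fn):
    """Fraction of actual positives that are detected (recall)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def relative_gain(new, base):
    """Relative improvement of `new` over `base`; 1.0 means +100%."""
    if base == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (new - base) / base


# Made-up counts for one class in a low-data run (NOT from the paper).
direct = {"tp": 12, "fp": 28, "fn": 38}
transfer = {"tp": 36, "fp": 9, "fn": 14}

prec_gain = relative_gain(
    precision(transfer["tp"], transfer["fp"]),
    precision(direct["tp"], direct["fp"]),
)
sens_gain = relative_gain(
    sensitivity(transfer["tp"], transfer["fn"]),
    sensitivity(direct["tp"], direct["fn"]),
)
print(f"precision gain: {prec_gain:.0%}, sensitivity gain: {sens_gain:.0%}")
```

A relative gain above 1.0 corresponds to an improvement of over 100%, meaning the metric more than doubled against the baseline.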

Implications and Future Directions

This work provides a precise statistical characterization of when and why transfer learning can yield marked sample complexity reductions. The results are most robust in high-dimensional settings where data distributions are smooth and the regression function is highly complex or non-smooth (deep models with ReLU activations, etc.).

Practical Implications

Model selection: The findings offer formal guidance for practitioners: transfer learning is most advantageous for high-complexity models over smooth domains with scarce labeled data.
Architecture design: Non-smooth activation functions, common in deep learning, align with the work's assumptions, justifying existing empirical heuristics.
Medical and scientific AI: In high-stakes, data-limited settings—e.g., medical image diagnosis—the OT framework can justify transfer-learning-based pipelines that maximize sample efficiency.

Theoretical Implications and Open Problems

OT map regularity: The impact of non-Lipschitz or non-convex transport maps (beyond log-concave distributions) remains an open problem.
Distributional shift: Extension to negative transfer or unreliable source tasks can leverage OT-based ambiguity measures.
Generalization beyond quadratic loss: The framework extends to other strictly convex costs, with implications for domain adaptation and unsupervised pretraining.
Scalable computation of high-dimensional OT maps: Continued progress in fast, scalable OT (e.g., entropic or sliced variants) will facilitate broader practical adoption.

Conclusion

The paper provides a rigorous statistical foundation, grounded in optimal transport, for the sample efficiency of transfer learning, identifying precise conditions under which transfer significantly outperforms direct learning. The results have substantial consequences for the design and deployment of learning systems in data-constrained, high-dimensional applications. The OT-centric approach opens promising avenues for further advances in statistical learning theory and the principled development of data-efficient AI.
