Sample Average Approximation for Distributionally Robust Optimization with $φ$-divergences

Published 12 Apr 2026 in math.OC and math.ST | (2604.10855v1)

Abstract: It is well known that estimating the expectation of any given bounded random variable with values in $[-B, B]$ has a sample complexity of $\mathcal{O}(B^{2/ε^2)$} that is independent of the underlying probability measure. We show that this property can no longer hold when evaluating the worst-case expectation of the random variable, when the probability measures defining the expectation belong to the $φ$-divergence ball centered at some nominal measure $P$. Specifically, the sample complexity and its dependence on the nominal measure can be completely characterized by the growth of the divergence function. When the divergence function $φ$ exhibits superlinear growth, then a $P$-independent sample complexity of $\mathcal{O}(M_{φ, τ}(ε) / ε^2)$ can be obtained by sample average approximation. Here $M_{φ, τ}(\cdot) $ is a function that only depends on the growth of $φ$ and the radius $τ$ of the divergence ball. On the other hand, when superlinear growth does not hold for $φ$, we show that for any estimation method, evaluating the worst-case expectation has a $P$-dependent sample complexity lower bound that can be made arbitrarily large by changing $P$.

Abstract PDF Upgrade to Chat

Authors (1)

Yan Li

Summary

The paper identifies a dichotomy in sample complexity based on φ’s growth, with superlinear divergences providing P-independent bounds.
It employs a refined SAA method with truncation and dual regularization to control concentration errors across empirical measures.
The study informs practitioners that superlinear divergences, such as KL-divergence, yield robust and scalable risk estimates in high-dimensional settings.

Sample Average Approximation for Distributionally Robust Optimization with $φ$ -divergences: Sample Complexity Characterization

Introduction and Problem Formulation

This paper investigates the sample complexity of Sample Average Approximation (SAA) for the estimation of worst-case expectations under a class of distributionally robust optimization (DRO) models defined via $φ$ -divergence ambiguity sets. Specifically, for a bounded measurable function $X: \Omega \to [-B,B]$ and a nominal measure $P$ over $(\Omega, \mathcal{F})$ , the DRO functional is

$\mathcal{R}(X) = \sup_{\zeta \in \mathfrak{D}, \int_\Omega \phi(\zeta) dP \le \tau} \int_\Omega X(\omega)\zeta(\omega)dP(\omega)$

where $\phi$ is a convex divergence function with $\phi(1) = 0$ and domain restriction to nonnegative arguments, and $\mathfrak{D}$ is the set of probability densities with respect to $P$ . The SAA is defined analogously using the empirical distribution $φ$ 0 based on $φ$ 1 IID samples.

Whereas in the risk-neutral expectation case the sample complexity is always $φ$ 2 and independent of $φ$ 3, the authors show the situation is fundamentally altered in the DRO context. The main contributions are a sharp characterization—dependent only on the growth rate of $φ$ 4—of when the sample complexity remains nominal-measure-independent, and explicit lower and upper bounds in the respective regimes.

Main Results

Sublinear Versus Superlinear Growth Dichotomy

The analysis hinges on the asymptotic growth property of the divergence function $φ$ 5:

Sublinear Growth: $φ$ 6 has sublinear growth if $φ$ 7.
Superlinear Growth: $φ$ 8 has superlinear growth if $φ$ 9.

Sublinear Growth: Necessarily Nominal-Dependent and Unbounded Sample Complexity

For divergence functions with sublinear growth (e.g., Total Variation), the authors construct instances where, for arbitrary small probability mass assigned by $X: \Omega \to [-B,B]$ 0 to regions where $X: \Omega \to [-B,B]$ 1 is non-zero, the worst-case expectation $X: \Omega \to [-B,B]$ 2 remains bounded below, but the SAA estimator can be zero with substantial probability if the support is not sampled. This implies a fundamental lower bound:

For any estimator, the sample complexity for $X: \Omega \to [-B,B]$ 3-accuracy depends on the “rarity” of events under $X: \Omega \to [-B,B]$ 4 and can be made arbitrarily large, scaling inversely with the nominal probability of sampling these events. This is formalized both for SAA and information-theoretic estimators via Le Cam’s method.

Thus, any estimator for the DRO risk $X: \Omega \to [-B,B]$ 5 cannot, in general, achieve nominal-independence if $X: \Omega \to [-B,B]$ 6 is sublinear.

Superlinear Growth: P-independent Upper Bounds

When $X: \Omega \to [-B,B]$ 7 exhibits superlinear growth (e.g., KL-divergence, $X: \Omega \to [-B,B]$ 8-divergence for $X: \Omega \to [-B,B]$ 9), the authors establish strong P-independent sample complexity upper bounds. Technically, this property ensures that excessively large likelihood ratios are penalized in the ambiguity set, which, in turn, prevents singular concentration to rare events under SAA.

Key elements of the analysis:

The DRO problem is regularized by truncating $P$ 0 to a bounded domain, after which both primal and dual formulations are controlled.
Explicit Lipschitz modulus bounds for the dual functional are established, enabling uniform concentration results with respect to the empirical measure.
For a superlinear divergence $P$ 1, the sample complexity bound for SAA is $P$ 2, where $P$ 3 depends only on the growth of $P$ 4 and the ambiguity radius $P$ 5, independent of $P$ 6.

This result recovers and generalizes prior bounds for specific divergences but provides a comprehensive growth-rate-based criterion.

Technical Approach and Concentration Analysis

The approach includes a delicately justified truncation of the divergence function, yielding a uniformly tight surrogate DRO functional.
Lipschitz continuity of the dual form over bounded regions is derived, which enables the deployment of uniform high-probability deviations for empirical processes.
By carefully balancing truncation error, uniform covering number bounds, and empirical deviations, the overall sample complexity is minimized and specified precisely.
Lower bounds for the sublinear case are shown using explicit two-point constructions.

Theoretical and Practical Implications

These results fully resolve, for general $P$ 7-divergences, the question of when robust risk optimization can be estimated from data with fixed accuracy and confidence independently of the underlying measure. The dichotomy identifies exactly for which classes of divergences this is possible and quantifies the role of tail penalization properties of $P$ 8.

Practically, this means that superlinearly growing divergences are preferable in DRO settings where the underlying support or tail behavior of $P$ 9 cannot be well controlled or is expected to be high-dimensional. For example, KL-robust DRO and square divergence (when $(\Omega, \mathcal{F})$ 0) are sample-efficient even in large-data regimes and continuous support, while total variation DRO is not.

The methods proposed—truncation and dual regularization—suggest algorithmically implementable SAA approaches, with explicit guidance for parameter selection to balance estimation error and computational tractability.

Directions for Future Work

The paper outlines several avenues:

Extension to optimization where the risk envelope is parameterized by decision variables (i.e., stochastic programming or DRO-MDPs), necessitating uniformity over policy classes.
Development of instance-dependent bounds (incorporating $(\Omega, \mathcal{F})$ 1 when available).
Analysis of asymptotic behavior and normality of the SAA estimators, comparing to the risk-neutral regime.

Such work could lead to more scalable and theoretically principled algorithms for large-scale robust control and learning problems, as well as better understanding of the statistical tradeoffs in model risk quantification.

Conclusion

This research provides a rigorous and fine-grained analysis of the sample complexity of SAA for $(\Omega, \mathcal{F})$ 2-divergence-based DRO, revealing an exact growth-rate-based dichotomy for divergence functions. For superlinearly growing divergences, SAA achieves $(\Omega, \mathcal{F})$ 3-independent polynomial sample complexity, while for sublinear divergences, estimation is always fundamentally nominal-dependent. These insights have significant implications for both the theoretical understanding of DRO and the practice of robust data-driven optimization and learning.