Epistemic Robust Offline Reinforcement Learning

Published 8 Apr 2026 in cs.LG | (2604.07072v1)

Abstract: Offline reinforcement learning learns policies from fixed datasets without further environment interaction. A key challenge in this setting is epistemic uncertainty, arising from limited or biased data coverage, particularly when the behavior policy systematically avoids certain actions. This can lead to inaccurate value estimates and unreliable generalization. Ensemble-based methods like SAC-N mitigate this by conservatively estimating Q-values using the ensemble minimum, but they require large ensembles and often conflate epistemic with aleatoric uncertainty. To address these limitations, we propose a unified and generalizable framework that replaces discrete ensembles with compact uncertainty sets over Q-values. We further introduce an Epinet-based model that directly shapes the uncertainty sets to optimize the cumulative reward under the robust Bellman objective without relying on ensembles. We also introduce a benchmark for evaluating offline RL algorithms under risk-sensitive behavior policies, and demonstrate that our method achieves improved robustness and generalization over ensemble-based baselines across both tabular and continuous state domains.


Authors (2)

Summary

The paper’s main contribution is the introduction of ERSAC, which replaces ensemble minima with compact uncertainty sets to robustly model epistemic uncertainty in offline RL.
It details the use of structured uncertainty set geometries, including box, convex hull, and ellipsoidal forms, to improve policy safety and efficiency under distributional shift.
Empirical results show significant performance gains, such as up to 75% improvement in normalized returns and nearly six-fold runtime reduction, validating the method across diverse benchmarks.


Problem Setting and Challenges in Offline RL

Offline RL aims to infer effective policies from fixed datasets without further environment interaction. The primary challenge is epistemic uncertainty resulting from incomplete or biased state-action coverage, which is especially pronounced when the behavior policy systematically avoids certain actions. Standard deep RL algorithms, when naively applied to offline data, tend to overestimate values in out-of-distribution regions, causing unreliable policy improvement and unsafe extrapolation. Ensemble-based methods, such as SAC-N, address this with pessimism: they take the minimum over multiple Q-functions. However, they require large ensembles for stable uncertainty quantification and conflate epistemic with aleatoric uncertainty, which limits how finely they can support safe decision-making. This matters most in domains where data acquisition is costly or risky, such as healthcare and industrial control.

Unified Uncertainty Set Approach: ERSAC

The paper introduces Epistemic Robust Soft Actor-Critic (ERSAC), which generalizes ensemble-based conservative value estimation via structured uncertainty sets over Q-values. Discrete ensembles are replaced with compact uncertainty set operators applied to action-wise Q-value vectors, enabling richer modeling of epistemic uncertainty. ERSAC encompasses box, convex hull, and ellipsoidal set geometries, which can be constructed from ensemble samples or directly parameterized using epistemic neural networks (Epinet). The robust Bellman target is defined as a worst-case expectation over the uncertainty set, and the policy update uses the corresponding adversarial Q-vector, enhancing robustness under distributional shift and poor coverage.

ERSAC reduces computational overhead and avoids the coarse pessimism typical of ensembles. Under certain representations (e.g., coordinate-wise box), SAC-N is recovered as a special case. Ellipsoidal uncertainty sets admit closed-form adversarial Q vectors under linear stochastic Epinet heads, further improving efficiency.
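The three set geometries can be compared on a single state with a small NumPy sketch. It assumes a discrete action space, a policy vector `pi` over actions, and an array `q_samples` of shape (samples, actions); the function names, the ellipsoid parameterization by a mean `mu`, covariance `sigma`, and `radius`, and the closed form used here (the standard minimizer of a linear function over an ellipsoid) are illustrative assumptions, not the paper's code.

```python
import numpy as np

def worst_case_box(q_samples, pi):
    """Box set: every action takes its own minimum over the samples."""
    q_adv = q_samples.min(axis=0)
    return float(pi @ q_adv), q_adv

def worst_case_hull(q_samples, pi):
    """Convex hull: a linear objective attains its minimum at a vertex,
    so it is enough to check each sampled Q-vector."""
    idx = int(np.argmin(q_samples @ pi))
    q_adv = q_samples[idx]
    return float(pi @ q_adv), q_adv

def worst_case_ellipsoid(mu, sigma, pi, radius):
    """Ellipsoid {q : (q - mu)^T sigma^{-1} (q - mu) <= radius^2}.
    The minimizer of pi^T q is mu - radius * sigma pi / sqrt(pi^T sigma pi)."""
    direction = sigma @ pi
    scale = np.sqrt(pi @ direction)
    if scale < 1e-12:  # no spread along pi: the set is flat in this direction
        return float(pi @ mu), mu.copy()
    q_adv = mu - radius * direction / scale
    return float(pi @ q_adv), q_adv
```

With `q_samples = [[1, 0], [0, 1]]` and a uniform policy, the box value is 0.0 while the hull value is 0.5: the box pairs the worst entry of every sample, which no single sampled Q-vector exhibits, and this is one way to see the coarser pessimism of the coordinate-wise minimum.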

Policy Optimization and Algorithmic Details

ERSAC operates within the entropy-regularized RL paradigm, optimizing

$J(\pi) = \mathbb{E}_\pi\left[\sum \gamma^t \big(r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot|s_t))\big)\right]$

with both the critic update and the policy improvement step using robust targets.

For the critic, the Bellman backup uses

$y(s, a) = r + \gamma \min_{q \in U(s')} \mathbb{E}_{a' \sim \pi(\cdot|s')} [q(a') - \alpha \log \pi(a'|s')]$

where $U(s')$ is the uncertainty set of action-wise Q-value vectors $q$ at the next state $s'$, and $q(a')$ is the entry of $q$ for action $a'$.
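As a minimal sketch of this backup for one transition, assume discrete actions and take $U(s')$ to be the convex hull of a handful of sampled Q-vectors, so the minimum is attained at one of the samples. The entropy term does not depend on $q$, so it can be moved outside the minimum. The function name and arguments are hypothetical.

```python
import numpy as np

def robust_soft_target(reward, gamma, alpha, pi_next, q_samples_next):
    """y = r + gamma * ( min_q E_pi[q(a')] + alpha * H(pi(.|s')) ).

    pi_next:        policy probabilities at s', shape (actions,)
    q_samples_next: sampled Q-vectors at s', shape (samples, actions);
                    their convex hull plays the role of U(s') (assumption).
    """
    pi_next = np.asarray(pi_next, dtype=float)
    q_samples_next = np.asarray(q_samples_next, dtype=float)
    # E_pi[-log pi] is the entropy; clip to avoid log(0) for zero-probability actions.
    entropy = -np.sum(pi_next * np.log(np.clip(pi_next, 1e-12, 1.0)))
    # Worst-case expected Q-value over the sampled vertices of the hull.
    worst_q = np.min(q_samples_next @ pi_next)
    return float(reward + gamma * (worst_q + alpha * entropy))
```

Swapping the `worst_q` line for a coordinate-wise minimum or an ellipsoidal closed form changes the set geometry without touching the rest of the target.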

For the actor, the policy gradient is evaluated against the worst-case adversarial Q-vector induced by the uncertainty set, ensuring policy robustness to epistemic variability.
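One way to read this actor step, sketched below for a softmax policy over discrete actions: find the adversarial Q-vector for the current policy, hold it fixed, and differentiate the entropy-regularized objective with respect to the logits. Holding the minimizer fixed is justified when it is unique (an envelope-style argument). The convex-hull set, the softmax parameterization, and all names are assumptions made for illustration.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)  # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def robust_actor_objective(logits, q_samples, alpha):
    """min over sampled Q-vectors of pi^T q, plus alpha times policy entropy."""
    pi = softmax(logits)
    worst = np.min(q_samples @ pi)
    entropy = -np.sum(pi * np.log(pi))
    return worst + alpha * entropy

def robust_actor_gradient(logits, q_samples, alpha):
    """Gradient w.r.t. logits with the adversarial Q-vector held fixed."""
    pi = softmax(logits)
    q_adv = q_samples[np.argmin(q_samples @ pi)]  # adversarial Q-vector
    f = q_adv - alpha * np.log(pi)                # soft action values
    return pi * (f - pi @ f)                      # softmax policy gradient
```

A gradient-ascent step is then `logits += lr * robust_actor_gradient(logits, q_samples, alpha)`, with `q_adv` recomputed at every step because the worst case moves with the policy.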

Epinet-based representations define the distribution over Q-vectors parametrically, allowing principled sampling or analytic computation of uncertainty sets, particularly ellipsoids calibrated to cover a proportion of the mass. The stochastic heads encode mean and covariance features, and bootstrapped losses are used for critic optimization.
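The idea of an ellipsoid calibrated to cover a proportion of the mass can be illustrated with sampled Q-vectors. The sketch below fits a mean and covariance, then sets the radius to an empirical quantile of the Mahalanobis distances. This empirical calibration, the `ridge` regularizer, and the function name are assumptions; the paper's Epinet heads may compute the set analytically instead.

```python
import numpy as np

def fit_ellipsoid(q_samples, coverage=0.9, ridge=1e-6):
    """Fit {q : (q - mu)^T sigma^{-1} (q - mu) <= radius^2} so that roughly
    a `coverage` fraction of the sampled Q-vectors lies inside."""
    q_samples = np.asarray(q_samples, dtype=float)
    mu = q_samples.mean(axis=0)
    centered = q_samples - mu
    sigma = centered.T @ centered / max(len(q_samples) - 1, 1)
    sigma = sigma + ridge * np.eye(sigma.shape[0])  # keep sigma invertible
    # Squared Mahalanobis distance of every sample from the mean.
    d2 = np.einsum("ij,ij->i", centered, np.linalg.solve(sigma, centered.T).T)
    radius = float(np.sqrt(np.quantile(d2, coverage)))
    return mu, sigma, radius
```

Raising `coverage` enlarges the ellipsoid and makes the worst-case value more pessimistic, so it acts as a direct knob on conservativeness.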

Experimental Evaluation and Benchmark Design

A risk-aware offline RL benchmark is developed, systematically varying data coverage via dynamic expectile-based behavior policies spanning risk-seeking through risk-averse configurations. This exposes how behavioral bias shapes epistemic uncertainty and, consequently, downstream policy performance.
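For readers unfamiliar with expectiles, the statistic itself can be computed with a short fixed-point iteration. The sketch below shows only the expectile of a sample of returns; how the benchmark turns expectiles into a behavior policy is not described in this summary, so that mapping is left out. In the usual convention, `tau > 0.5` weights outcomes above the expectile more heavily (optimistic) and `tau < 0.5` weights those below more heavily (pessimistic).

```python
import numpy as np

def expectile(values, tau, iters=100, tol=1e-10):
    """tau-expectile e, solving tau * E[(X - e)+] = (1 - tau) * E[(e - X)+]."""
    values = np.asarray(values, dtype=float)
    e = values.mean()  # tau = 0.5 recovers the mean
    for _ in range(iters):
        # Asymmetric weights: tau above the current estimate, 1 - tau below.
        w = np.where(values > e, tau, 1.0 - tau)
        new_e = np.sum(w * values) / np.sum(w)
        if abs(new_e - e) < tol:
            return float(new_e)
        e = new_e
    return float(e)
```

For the two-point sample `[0, 10]`, `tau = 0.9` gives 9.0 and `tau = 0.5` gives the mean, 5.0.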

Comprehensive experiments are conducted across:

Tabular MDPs (Machine Replacement, Riverswim)
Classic control domains (CartPole, LunarLander)
High-dimensional Atari environments

Strong empirical results are achieved:

In tabular domains, structured uncertainty sets (convex hull, ellipsoidal) yield up to 75% higher normalized returns than box-based ensembles in low-data regimes, and they remain robust to the coverage gaps induced by risk-averse behavior policies.
In Gym environments, ellipsoidal variants (with both ensemble and Epinet construction) outperform ensemble minima consistently under data scarcity and distributional shift, and converge faster to optimality as coverage improves.
The Epinet-based ellipsoidal variant matches ensemble-based performance across settings but reduces runtime nearly six-fold in complex domains, demonstrating scalability and efficiency.
In Atari, the Epinet ellipsoidal model ranks among the top three across all environments, reliably handling unsupported state-action pairs and achieving high episodic returns in both reward-sparse and predictable games.

Policy entropy analyses reveal that box-based ensemble minima yield prematurely deterministic, less exploratory policies, while convex and ellipsoidal sets maintain stochasticity, promoting better generalization and avoidance of spurious value extrapolation.

Theoretical Implications and Future Directions

ERSAC offers a principled framework for epistemic robustness in offline RL, generalizing existing ensemble-based methods through the geometry of the uncertainty set. Explicitly modeling uncertainty sets, particularly those sensitive to the policy, gives finer control over conservativeness, avoids excessive pessimism in well-supported regions, and adapts as coverage improves.

The construction bridges recent advances in robust optimization and distributionally robust RL, suggesting avenues for integrating richer ambiguity sets and risk-sensitive objectives. Potential future directions include:

Extension to multi-agent and hierarchical settings with epistemic uncertainty coordination
Incorporation of distributionally robust set calibration and theoretical analyses of finite-sample generalization/regret under epistemic uncertainty
Application to cross-domain offline RL with stronger guarantees for safe adaptation

ERSAC demonstrates that structured uncertainty modeling, efficiently executed via Epinet parameterization or ensemble sampling, is a robust foundation for reliable, generalizable, and scalable offline reinforcement learning.

Conclusion

Epistemic Robust Soft Actor-Critic (ERSAC) introduces compact uncertainty sets for Q-value estimation, supplanting ensemble minima with structured pessimism adaptable to coverage bias inherent in offline data. Through both theoretical exposition and strong empirical performance across multiple domains, ERSAC enables robust policy optimization uniquely sensitive to epistemic uncertainty, with demonstrable gains in data-scarce and biased data regimes. Epinet-based ellipsoidal variants further enhance efficiency and scalability. The framework lays the groundwork for principled epistemic robustness in offline RL, with broad applicability and potential for extension to risk-aware, multi-agent, and distributionally robust settings.
