Provable Multi-Task Reinforcement Learning: A Representation Learning Framework with Low Rank Rewards

Published 4 Apr 2026 in cs.LG | (2604.03891v1)

Abstract: Multi-task representation learning (MTRL) is an approach that learns shared latent representations across related tasks, facilitating collaborative learning that improves the overall learning efficiency. This paper studies MTRL for multi-task reinforcement learning (RL), where multiple tasks have the same state-action space and transition probabilities, but different rewards. We consider T linear Markov Decision Processes (MDPs) where the reward functions and transition dynamics admit linear feature embeddings of dimension d. The relatedness among the tasks is captured by a low-rank structure on the reward matrices. Learning shared representations across multiple RL tasks is challenging due to the complex and policy-dependent nature of data that leads to a temporal progression of error. Our approach adopts a reward-free reinforcement learning framework to first learn a data-collection policy. This policy then informs an exploration strategy for estimating the unknown reward matrices. Importantly, the data collected under this well-designed policy enable accurate estimation, which ultimately supports the learning of a near-optimal policy. Unlike existing approaches that rely on restrictive assumptions such as Gaussian features, incoherence conditions, or access to optimal solutions, we propose a low-rank matrix estimation method that operates under more general feature distributions encountered in RL settings. Theoretical analysis establishes that accurate low-rank matrix recovery is achievable under these relaxed assumptions, and we characterize the relationship between representation error and sample complexity. Leveraging the learned representation, we construct near-optimal policies and prove a regret bound. Experimental results demonstrate that our method effectively learns robust shared representations and task dynamics from finite data.


Authors (2)

Summary

The paper presents a framework for multi-task RL that jointly recovers low-rank reward representations and near-optimal policies.
It combines reward-free exploration with SVD-based estimation to derive provable bounds on sample complexity and cumulative regret.
Empirical results in control and grid maze environments confirm its efficiency over conventional baselines.

Provable Multi-Task Reinforcement Learning with Low Rank Rewards: A Representation Learning Framework

Introduction

The paper "Provable Multi-Task Reinforcement Learning: A Representation Learning Framework with Low Rank Rewards" (2604.03891) presents a framework for Multi-Task Representation Learning in Reinforcement Learning (MTRL-RL). The central premise involves multiple linear MDPs sharing the state-action space and transition kernel, but exhibiting distinct reward functions. The crucial structural assumption is that the reward matrices admit a low-rank latent embedding, facilitating joint representation learning across tasks.

The authors' methodology circumvents conventional assumptions—such as Gaussian feature distributions, incoherence, or access to optimal solutions—by embracing more generic policy-dependent feature distributions. This perspective aligns with practical RL scenarios where idealized assumptions are frequently violated, thus enhancing the applicability of their results.

Problem Formulation

The paper formalizes the multi-task RL setting with $T$ tasks, each corresponding to an MDP $(S, A, \{R_{ht}\}_{h=1}^H, \{P_h\}_{h=1}^H)$; the tasks share states, actions, and transitions and differ only in their rewards. The reward of task $t$ at step $h$ is linear in the features:

$R_{ht}(s,a) = \langle \theta_{ht}, \psi(s,a) \rangle$

with feature embeddings $\psi(s,a) \in \mathbb{R}^d$ and reward parameters $\theta_{ht} \in \mathbb{R}^d$. For each step $h$, the reward matrix $\Theta_h$ (of shape $T \times d$) stacks the task parameters $\theta_{ht}$ as rows and is assumed to have rank $r \ll \min(d, T)$.
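
The low-rank reward model can be illustrated with a small NumPy sketch. The factorization $\Theta_h = W B$, the dimensions, and the random feature vector are assumptions made for illustration; the summary does not state how the paper parameterizes the factors.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, r = 20, 10, 2   # tasks, feature dimension, latent rank (illustrative values)

# Hypothetical factorization Theta = W @ B for a single step h:
# B spans the shared r-dimensional representation, W holds task-specific coefficients.
B = np.linalg.qr(rng.standard_normal((d, r)))[0].T   # r x d, orthonormal rows
W = rng.standard_normal((T, r))                      # T x r
Theta = W @ B                                        # T x d reward matrix, rank r

def reward(t, psi):
    """Linear reward <theta_t, psi(s, a)> for task t."""
    return float(Theta[t] @ psi)

psi = rng.standard_normal(d)   # stand-in for a feature vector psi(s, a)
print("rank of Theta:", np.linalg.matrix_rank(Theta))
print("reward of task 0:", reward(0, psi))
```

Under this factorization, all $T$ tasks share the $r \times d$ matrix `B`, so only $r$ coefficients per task remain task-specific.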

The learning objective is joint estimation of reward parameters and construction of $\epsilon$-optimal policies $\hat{\Pi}^\star(t)$ for each task. The critical technical goal is robust low-rank recovery of $\Theta_h$ and corresponding regret guarantees.

Algorithmic Framework

The MTRL-RL algorithm operates in four stages:

Stage 1: Reward-Free RL – Data-collection policies are obtained without reward access, guaranteeing broad state-action exploration.
Stage 2: Exploration Policy Construction – The reward-free policies are refined to maximize feature informativeness, ensuring that collected trajectories yield well-conditioned covariance matrices.
Stage 3: Low-Rank Reward Matrix Estimation – Leveraging the exploration policy, samples are used for joint low-rank estimation via SVD-based techniques, unlike baselines relying on independent or random exploration.
Stage 4: Policy Construction – Estimated rewards inform $\epsilon$-optimal policy synthesis for each task, exploiting the learned latent structure.
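
The idea behind Stage 3 can be sketched with a generic estimator: per-task least squares followed by rank-$r$ truncation via SVD. This is a stand-in and not necessarily the paper's estimator. The function name, the dimensions, and the noise level are hypothetical, and Gaussian features are used only to keep the example short, whereas the paper targets more general, policy-dependent features.

```python
import numpy as np

def estimate_low_rank_rewards(Psi, Y, r):
    """Per-task least squares followed by rank-r truncation (illustrative only).

    Psi: (T, n, d) features observed for each task
    Y:   (T, n)    noisy rewards observed for each task
    """
    T = Psi.shape[0]
    # Independent least-squares estimate of each task's reward parameter.
    Theta_ls = np.stack(
        [np.linalg.lstsq(Psi[t], Y[t], rcond=None)[0] for t in range(T)]
    )
    # Keep the top-r singular directions to exploit the shared structure.
    U, s, Vt = np.linalg.svd(Theta_ls, full_matrices=False)
    Theta_hat = (U[:, :r] * s[:r]) @ Vt[:r]
    return Theta_hat, Vt[:r], Theta_ls

rng = np.random.default_rng(1)
T, n, d, r = 30, 40, 10, 2
Theta = rng.standard_normal((T, r)) @ rng.standard_normal((r, d))  # true rank-r matrix
Psi = rng.standard_normal((T, n, d))                               # stand-in features
Y = np.einsum("tnd,td->tn", Psi, Theta) + 0.5 * rng.standard_normal((T, n))

Theta_hat, B_hat, Theta_ls = estimate_low_rank_rewards(Psi, Y, r)
err_joint = np.linalg.norm(Theta_hat - Theta)   # Frobenius error after truncation
err_indep = np.linalg.norm(Theta_ls - Theta)    # Frobenius error without sharing
print(f"joint: {err_joint:.3f}  independent: {err_indep:.3f}")
```

The truncation step discards the noise components lying outside the estimated $r$-dimensional subspace, which is where the benefit of sharing across tasks comes from in this toy setting.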

Theoretical Results

Rigorous sample complexity and regret analyses are provided. The key technical contributions include:

Low-Rank Recovery under Generic Features: Provable bounds are established for reward matrix estimation error and latent subspace distance, even absent Gaussianity or incoherence.
Sample Complexity: The number of samples needed to reach a target estimation error and subspace distance is characterized explicitly, with a rate that reflects the feature dimensionality and latent spectral properties.
Regret Bound: Leveraging the learned shared structure, the cumulative regret over all episodes and all $T$ tasks is bounded by a quantity that depends directly on the estimation error and the dimensionality.
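
The subspace distance referred to above can be computed as in the following sketch. It assumes one common definition, the spectral norm of the true basis projected onto the orthogonal complement of the estimated subspace; the summary does not state which definition the paper uses, and the function name is hypothetical.

```python
import numpy as np

def subspace_distance(B_hat, B):
    """Spectral norm of (I - P_hat) Q, where Q is an orthonormal basis of span(B)
    and P_hat projects onto span(B_hat). Inputs are d x r with bases as columns."""
    Q_hat = np.linalg.qr(B_hat)[0]
    Q = np.linalg.qr(B)[0]
    residual = Q - Q_hat @ (Q_hat.T @ Q)   # part of Q outside span(B_hat)
    return float(np.linalg.norm(residual, 2))

rng = np.random.default_rng(2)
d, r = 10, 2
B = rng.standard_normal((d, r))

# Same span expressed in a different basis: distance is numerically zero.
same = subspace_distance(B @ rng.standard_normal((r, r)), B)
# Unrelated random subspace: distance lies in (0, 1].
other = subspace_distance(rng.standard_normal((d, r)), B)
print(same, other)
```

With this definition the distance is invariant to the choice of basis, which matters because a low-rank factorization is only identifiable up to an invertible $r \times r$ transform.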

Figure 1: Subspace distance decays rapidly as the number of samples increases, demonstrating efficient low-rank adaptation.

Numerical Analysis

Empirical validation is provided in simulated control and grid maze environments, using metrics of subspace distance, reward parameter estimation error, and cumulative regret. The algorithm is reported to outperform three baselines:

Random Policy Baseline: Uniform random exploration yields degenerate feature distributions, undermining low-rank recovery.
MoM Baseline: Empirical moment estimators perform poorly without intentional exploration policy design.
Independent Task Baseline: Neglecting shared structure leads to suboptimal sample utilization and higher regret.

Experimental results confirm the theoretical prediction that intentional, reward-free exploration is pivotal for robust representation learning and policy performance.
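
The role of feature coverage can be illustrated by comparing the smallest eigenvalue of the empirical feature covariance for well-spread and degenerate features. Both feature sets below are constructed by hand; whether a given exploration policy produces either kind depends on the environment, and the function name is hypothetical.

```python
import numpy as np

def min_eig_feature_covariance(Psi):
    """Smallest eigenvalue of the empirical second-moment matrix (1/n) Psi^T Psi."""
    n = Psi.shape[0]
    return float(np.linalg.eigvalsh(Psi.T @ Psi / n)[0])  # eigvalsh sorts ascending

rng = np.random.default_rng(3)
n, d = 500, 6

# Features that cover every direction of R^d.
spread = rng.standard_normal((n, d))
# Degenerate features confined to a 2-dimensional subspace of R^d.
degenerate = rng.standard_normal((n, 2)) @ rng.standard_normal((2, d))

lam_spread = min_eig_feature_covariance(spread)
lam_degenerate = min_eig_feature_covariance(degenerate)
print(lam_spread, lam_degenerate)
```

A near-zero smallest eigenvalue means some directions of the reward parameter are never observed, so no estimator can recover them from that data regardless of sample size.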

Figure 2: Estimation error trends as a function of samples reveal fast convergence in reward parameter recovery, favoring joint low-rank MTRL-RL.

Practical and Theoretical Implications

Practically, the proposed framework enables efficient learning in multi-agent or multi-objective settings with intrinsic reward correlations, such as autonomous fleets and industrial automation. Theoretically, the main impact lies in demonstrating effective low-rank matrix recovery in RL under realistic, policy-dependent feature distributions, a departure from restrictive assumptions in prior literature.

The method's provable guarantees and empirical performance suggest that joint task representation learning can substantially improve sample efficiency in high-dimensional RL applications. Furthermore, the approach provides a foundation for extending multi-task RL to environments where policies must adapt dynamically to varied and complex rewards.

Future Directions

Future developments may focus on:

Extending to Nonlinear MDPs: Incorporating nonlinear reward or transition structures, possibly via kernel or deep latent embeddings.
Integration with Model-Free Algorithms: Embedding the representation learning pipeline within high-performance model-free RL algorithms like PPO/DQN.
Scaling to Large Task Collections: Optimized architectures for large-scale multi-task RL with greater heterogeneity and richer reward correlation structures.

Conclusion

The paper introduces a multi-task RL framework exploiting low-rank reward structures to jointly recover task representations and near-optimal policies, achieving sample-complexity and regret bounds without restrictive assumptions such as Gaussian features or incoherence. The results underscore the centrality of well-designed, reward-free exploration and the role of representation sharing in multi-task RL.
