- The paper presents an empirical study of deep reinforcement learning algorithms in continuing tasks, contrasted with episodic tasks, focusing on factors such as environment resets and reward centering.
- Key findings: algorithms perform significantly worse without resets; predefined resets and their associated costs heavily influence performance; and tasks with agent-controlled resets can be challenging.
- The study demonstrates that temporal-difference (TD)-based reward centering is effective in improving or maintaining algorithm performance in continuing tasks, addressing issues with large discount factors or common reward offsets.
- The paper examines the performance of DDPG, TD3, SAC, PPO, and DQN across MuJoCo- and Atari-based continuing task testbeds under various reset scenarios.
- Algorithms trained with predefined resets on continuing testbeds can achieve better performance in continuing evaluations compared to those trained on episodic variants.
The paper provides an empirical study of deep reinforcement learning (RL) algorithms in continuing tasks, contrasting them with episodic tasks. Continuing tasks are characterized by ongoing agent-environment interactions without predefined episodes, suiting real-world applications where environment resets are unavailable or agent-controlled.
The study assesses the performance of several deep RL algorithms, including Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), Soft Actor-Critic (SAC), Proximal Policy Optimization (PPO), and Deep Q-Network (DQN), on a suite of continuing task testbeds based on MuJoCo and Atari environments. It examines the impact of three reset scenarios (no resets, predefined resets, and agent-controlled resets), along with the effectiveness of temporal-difference (TD)-based reward centering for improving algorithm performance in continuing tasks.
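To make the "predefined resets" scenario concrete, here is a minimal sketch of how an episodic task might be turned into a continuing one. It assumes that a reset is triggered whenever the underlying task would have terminated, that a fixed reset cost is subtracted from the reward on that step, and that no end-of-episode signal is passed to the agent. `ToyEpisodicEnv`, `PredefinedResetWrapper`, and `reset_cost` are hypothetical names for illustration; the paper's actual testbed construction may differ.

```python
class ToyEpisodicEnv:
    """Hypothetical episodic task: walk on a line of `size` cells.

    The episode terminates when the agent steps off either end.
    """

    def __init__(self, size=5):
        self.size = size
        self.pos = 0

    def reset(self):
        # Start in the middle of the line.
        self.pos = self.size // 2
        return self.pos

    def step(self, action):
        # `action` is -1 (left) or +1 (right); each step yields reward 1.
        self.pos += action
        terminated = self.pos < 0 or self.pos >= self.size
        return self.pos, 1.0, terminated


class PredefinedResetWrapper:
    """Assumed construction of a continuing task with predefined resets.

    When the wrapped env terminates, it is reset immediately, the reset
    cost is subtracted from that step's reward, and interaction continues.
    The agent never observes a termination flag.
    """

    def __init__(self, env, reset_cost=10.0):
        self.env = env
        self.reset_cost = reset_cost
        self.num_resets = 0

    def reset(self):
        # Called once, at the start of the single continuing stream.
        return self.env.reset()

    def step(self, action):
        obs, reward, terminated = self.env.step(action)
        if terminated:
            obs = self.env.reset()
            reward -= self.reset_cost
            self.num_resets += 1
        return obs, reward  # no `done` flag: the task never ends
```

Under these assumptions, raising `reset_cost` makes reset-triggering actions less attractive, which is one way to read the finding that higher reset costs lead to fewer resets.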
Key findings from the empirical study include:
- Algorithms perform significantly worse in tasks without resets compared to those with predefined resets. Predefined resets limit the effective state space and help agents escape suboptimal states.
- Algorithms trained on continuing testbeds with predefined resets learn policies that, when evaluated on the continuing testbeds, outperform policies trained on the episodic testbed variants. They do so by choosing actions that yield higher rewards at the cost of more frequent resets.
- Increasing the reset cost reduces the number of resets and can improve overall rewards, suggesting that the reset cost can be treated as a tunable solution parameter rather than a fixed part of the problem.
- When agents are given control over resets, the learned policies sometimes perform worse than a random policy does in the corresponding tasks with predefined resets, suggesting that tasks with agent-controlled resets are challenging.
- All algorithms perform poorly in continuing tasks with large discount factors or with a large common offset added to all rewards.
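A short derivation helps explain why the last finding is plausible. The notation here is an assumption for illustration and is not taken from the paper: $v_\gamma(s)$ is the discounted value of state $s$ under a fixed policy, $\gamma$ is the discount factor, and $c$ is a constant added to every reward.

```latex
\begin{align*}
v_\gamma^{c}(s)
  &= \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\,(R_{t+1} + c) \,\middle|\, S_0 = s\right] \\
  &= \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t+1} \,\middle|\, S_0 = s\right]
     + c \sum_{t=0}^{\infty} \gamma^{t} \\
  &= v_\gamma(s) + \frac{c}{1-\gamma}.
\end{align*}
```

Every state's value is shifted by the same amount, $c/(1-\gamma)$, so the ranking of policies does not change, but the magnitude a function approximator must represent grows quickly as $\gamma \to 1$. For example, $c = 100$ with $\gamma = 0.999$ shifts every value by $100{,}000$, which can dwarf the state-to-state differences that actually matter for control.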
The paper also demonstrates the effectiveness of TD-based reward centering across a range of deep RL algorithms. Reward centering addresses the challenges posed by large discount factors and common reward offsets by subtracting an estimate of the average-reward rate from all rewards. The study shows that TD-based reward centering improves or maintains the performance of all tested algorithms and scales to larger tasks.
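The idea can be sketched in a tabular setting. The following is a minimal illustration, not the paper's implementation: it assumes the average-reward estimate `r_bar` is updated with the same TD error as the value estimates, scaled by a hypothetical step-size multiplier `eta`, and it uses a toy deterministic cycle of states instead of the MuJoCo or Atari testbeds. The function name and all parameters are illustrative.

```python
def td_prediction(rewards, gamma=0.99, alpha=0.1, eta=1.0,
                  steps=50_000, center=True):
    """Tabular TD(0) on a deterministic cycle of states.

    State i always moves to state (i + 1) % n and yields rewards[i].
    With center=True, an estimate of the average-reward rate (r_bar)
    is subtracted from each reward and is itself updated from the TD
    error (the assumed TD-based centering rule).
    """
    n = len(rewards)
    values = [0.0] * n
    r_bar = 0.0
    state = 0
    for _ in range(steps):
        next_state = (state + 1) % n
        reward = rewards[state]
        # Centered reward: subtract the current average-reward estimate.
        centered_reward = reward - r_bar if center else reward
        delta = centered_reward + gamma * values[next_state] - values[state]
        values[state] += alpha * delta
        if center:
            # Assumed TD-based update of the average-reward estimate.
            r_bar += eta * alpha * delta
        state = next_state
    return values, r_bar


# Rewards of -1 and +1 with a common offset of 11 added to both.
centered_values, r_bar = td_prediction([10.0, 12.0], center=True)
plain_values, _ = td_prediction([10.0, 12.0], center=False)
print("r_bar:", round(r_bar, 3))
print("centered values:", [round(v, 2) for v in centered_values])
print("uncentered values:", [round(v, 1) for v in plain_values])
```

In this toy setup the average-reward estimate should settle near the true average of 11 and the centered values should stay small, while the uncentered values grow to roughly 11 / (1 - gamma). Note that with a discount factor below 1 the estimate need not match the average reward exactly; it only has to absorb most of the offset.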
The paper is structured as follows:
- Introduction: Introduces the distinction between episodic and continuing tasks in RL, highlighting the relevance of continuing tasks in real-world scenarios. It also acknowledges the limited empirical studies on deep RL algorithms in continuing tasks.
- Evaluating Deep RL Algorithms on Continuing Tasks: Details the empirical study conducted on several well-known RL algorithms in a suite of continuing testbeds.
- Testbeds without Resets: Assesses the performance of DDPG, TD3, SAC, and PPO in five MuJoCo-based continuing testbeds without resets.
- Testbeds with Predefined Resets: Evaluates both continuous and discrete control algorithms on continuing task testbeds with predefined resets.
- Testbeds Where the Agent Controls Resets: Studies the behavior of algorithms in continuing tasks where predefined resets are unavailable and the agent decides when to reset.
- Failure to Address Large Discount Factors or Offsets in Rewards: Demonstrates that the performance of the tested continuous control algorithms deteriorates significantly when a large discount factor is used or when all rewards are shifted by a large constant.
- Evaluating Algorithms with Reward Centering: Empirically demonstrates that the TD-based reward centering method improves or maintains the performance of all tested algorithms in the testbeds.
- Conclusions and Limitations: Summarizes the key findings of the empirical study and acknowledges the limitations of the current research.
The paper includes an appendix with additional details of the experiment setup:
- Average-reward rate as the evaluation metric
- Tested hyperparameters for algorithms in testbeds without resets or with predefined resets
- Hyperparameters when applied to testbeds with agent-controlled resets
- Applying Reward Centering Methods to the Tested Algorithms
- Additional Evaluation Results of Tested RL Algorithms
- Additional Results of Algorithms with Reward Centering
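The first appendix item, the average-reward rate used as the evaluation metric, can be illustrated with a small helper. This is a sketch under the assumption that the metric is estimated as the mean per-step reward over an evaluation stream, with any reset costs already included in the rewards; `average_reward_rate` is a hypothetical name, and the paper's exact evaluation protocol may differ.

```python
def average_reward_rate(rewards):
    """Empirical average-reward rate: mean reward per time step.

    `rewards` is an iterable of per-step rewards from one continuing
    evaluation stream. Reset costs, if any, are assumed to be already
    subtracted from the rewards of the steps on which resets occurred.
    """
    rewards = list(rewards)
    if not rewards:
        raise ValueError("need at least one reward to compute a rate")
    return sum(rewards) / len(rewards)


# Five steps with reward 1 each, one of which also incurs a reset cost of 10.
print(average_reward_rate([1.0, 1.0, -9.0, 1.0, 1.0]))
```

Because the metric is a per-step rate rather than a per-episode return, a policy that earns more per step can score higher even if it triggers resets more often, provided the reset cost does not outweigh the gain.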