Introducing PFRL: A PyTorch-based Deep RL Library

Prabhat Nagarajan
Published in PyTorch · Oct 12, 2020

PFRL (“Preferred RL”) is a PyTorch-based open-source deep Reinforcement Learning (RL) library developed by Preferred Networks (PFN). PFN is the company behind the deep learning library Chainer and the ChainerRL library, to which PFRL is the PyTorch-based successor.

Goals

Oftentimes, when starting a new research project, obtaining or producing quality baseline algorithms can be the first — and sometimes the main — hurdle to overcome. This is especially true when aiming to surpass the state of the art, since sometimes neither public nor private implementations can reproduce the reported state-of-the-art performance. This issue makes reproducibility an appealing feature in a deep RL library.

Another issue is that many implementations of deep RL agents are standalone or few in number. This often forces users to rewrite code or switch between libraries and APIs if they want to use different algorithms. As such, a desirable characteristic when selecting a deep RL library is comprehensiveness in terms of features and algorithms. This is because working with a comprehensive library ensures that you can use a variety of algorithms and features without having to change libraries or APIs.

Additionally, RL algorithms often use several overlapping features/components that may be easily used in different settings. For example, the experience replay buffer is a feature that recurs across many off-policy deep RL algorithms. Other examples include ε-greedy exploration, dueling architectures, and N-step learning. Researchers and practitioners alike may want to be able to reuse features across algorithms in a modular and flexible way.

Considering these challenges, the main goals of PFRL are to enable reproducible research, to support a comprehensive set of algorithms and features, and to be modular and flexible.

To foster reproducibility, we have a large set of reproducibility scripts, which are individual scripts that are meant to reproduce the training/evaluation settings of the original research papers as closely as possible. We currently have reproducibility scripts for 9 key deep RL algorithms: A3C, DQN, IQN, Rainbow, DDPG, TRPO, PPO, TD3, and SAC. Each algorithm has been reproduced, thoroughly benchmarked, and compared to the performances of the original research papers under the same evaluation protocols.

PFRL’s benchmarked results for Rainbow. We specify the number of seeds run, provide a summarized comparison against the published results, and provide a full table of results for each environment. PFRL has similar benchmark tables for 9 different algorithms.

With the goal of being comprehensive and flexible, PFRL additionally supports numerous other algorithms and features. Additional algorithms include Persistent Advantage Learning, C51, ACER, A2C, and REINFORCE. Additional features include Noisy networks, Dueling networks, Prioritized Experience Replay, Normalized Advantage Functions, and recurrent network support for most agents. These features can be easily integrated with PFRL’s algorithms. Example scripts utilizing these algorithms and features can be found in the repository.

PFRL Structure

PFRL Structure.

Deep RL algorithms in PFRL are implemented as Agent classes. Each Agent implements act and observe methods. The act method takes an observation as input and returns an action. The observe method takes as input the consequences of the last performed action. This can be used, for example, for updating the agent’s network parameters during training or for updating the agent’s internal recurrent state. PFRL also has the BatchAgent and AsyncAgent classes to support agents that act in multiple environments in parallel (e.g. A2C and A3C, respectively). In essence, an Agent implements the action-selection and network update rules for an algorithm. In fact, an Agent class is virtually synonymous with an implementation of a specific algorithm. The full list of PFRL agents currently includes: A2C, A3C, ACER, AL (Advantage Learning), CategoricalDQN, CategoricalDoubleDQN, DDPG, DQN, DoubleDQN, PAL (Persistent Advantage Learning), DoublePAL, PPO, REINFORCE, SAC, TRPO, TD3, IQN, and DPP (Dynamic Policy Programming).
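
To make this interface concrete, here is a minimal sketch in the spirit of PFRL’s quickstart: it builds a small DQN agent and steps it through a Gym environment via act and observe. The network sizes and hyperparameters are illustrative, not tuned settings.

```python
import gym
import numpy
import torch
import pfrl

env = gym.make("CartPole-v1")
obs_size = env.observation_space.low.size
n_actions = env.action_space.n

# A small Q-network; the final head wraps outputs as discrete action values.
q_func = torch.nn.Sequential(
    torch.nn.Linear(obs_size, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, n_actions),
    pfrl.q_functions.DiscreteActionValueHead(),
)
optimizer = torch.optim.Adam(q_func.parameters(), eps=1e-2)
explorer = pfrl.explorers.ConstantEpsilonGreedy(
    epsilon=0.1, random_action_func=env.action_space.sample)
replay_buffer = pfrl.replay_buffers.ReplayBuffer(capacity=10 ** 5)

agent = pfrl.agents.DQN(
    q_func, optimizer, replay_buffer, gamma=0.99, explorer=explorer,
    replay_start_size=500, update_interval=1, target_update_interval=100,
    phi=lambda x: x.astype(numpy.float32, copy=False),  # cast observations
    gpu=-1,  # set to a GPU device id to train on GPU
)

# Manual interaction: act() selects actions, observe() feeds back the results
# (and triggers parameter updates once enough transitions are stored).
for episode in range(50):
    obs = env.reset()
    t = 0
    while True:
        action = agent.act(obs)
        obs, reward, done, _ = env.step(action)
        t += 1
        reset = t == 200  # truncate long episodes
        agent.observe(obs, reward, done, reset)
        if done or reset:
            break
```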

Example Environments used in PFRL: Atlas Robot (top left), Atari Boxing (top right), MuJoCo Humanoid (bottom left), Slime Volleyball (bottom right).

Users can customize their own training loops and environments by querying the agent’s act method for an action and executing that action in the environment. However, we recommend using PFRL’s easily extensible experiments module, which handles the Agent-Environment interactions for PFRL agents and environments that follow the OpenAI Gym Env API (the API that most popular deep RL environments follow). The experiments module provides a number of standard modes of training, evaluation, and logging of experiments. Some helpful features include:

  • Tracking agent performance statistics
  • Scheduling evaluations and allowing users to specify separate training and evaluation environments
  • Managing model saving

A basic experiments training function takes as input an agent and an OpenAI Gym environment (or batch of environments), queries the agent for an action (or actions), and executes it (or them) in the environment(s). Essentially, these experiment utilities simplify the management of training/evaluation loops and environment interactions for the different types of agents (i.e., Agent, BatchAgent, and AsyncAgent) without additional effort from the user.
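
For comparison with the manual loop in the earlier sketch, the same agent can be handed to the experiments module in a single call. The snippet below roughly follows pfrl.experiments.train_agent_with_evaluation, reusing the agent and env built above; the step counts and output directory are arbitrary choices for illustration.

```python
import pfrl

# Reuses `agent` and `env` from the earlier DQN sketch. The experiments module
# runs training, schedules evaluations, and writes statistics/models to outdir.
pfrl.experiments.train_agent_with_evaluation(
    agent,
    env,
    steps=50000,             # total environment steps of training
    eval_n_steps=None,       # evaluate by episodes rather than by steps
    eval_n_episodes=10,      # episodes per evaluation phase
    eval_interval=1000,      # evaluate every 1000 training steps
    train_max_episode_len=200,
    outdir="results",        # logs, scores, and saved models go here
)
```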

While an Agent implementation specifies the learning update rules of an algorithm, it is important to note that the user has a large amount of flexibility in parametrizing an agent with PFRL’s many building blocks, shown in light orange in the diagram at the beginning of this section. There are several ways in which the parametrization can be modified to suit the user’s individual needs. Some examples include:

  • Explorers: Users can parametrize their agents with one of several explorers, such as ε-greedy exploration or Boltzmann exploration (see the short sketch after this list).
  • Network architectures: PFRL supports any PyTorch Module, which can be chosen by the user and passed to the agent. PFRL also has several pre-defined architectures (i.e. PyTorch networks) that are useful for RL, such as dueling network architectures and certain recurrent architectures. PFRL also supports Noisy networks. Users can easily make a network use noisy exploration (see the Rainbow example below).
  • Replay Buffers: If supported by the agent, users can pass a prioritized replay buffer to the agent to use Prioritized Experience Replay. Users can also use N-step returns in their agent by passing an N-step replay buffer to the agent (see the Rainbow example below).
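
As a small illustration of how these building blocks are swapped, the explorers below are interchangeable from the agent’s point of view: any of them can be passed as the explorer argument when constructing an agent such as DQN. The constructor values are illustrative.

```python
import pfrl

# Epsilon-greedy with epsilon annealed linearly over the first 10k steps.
epsilon_greedy = pfrl.explorers.LinearDecayEpsilonGreedy(
    start_epsilon=1.0, end_epsilon=0.1, decay_steps=10 ** 4,
    random_action_func=env.action_space.sample,  # `env` as in the earlier sketch
)

# Boltzmann (softmax) exploration over the agent's action values.
boltzmann = pfrl.explorers.Boltzmann()

# Pure greedy action selection, e.g. when noisy networks provide exploration.
greedy = pfrl.explorers.Greedy()
```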

PFRL users can choose among the library’s many algorithms and combine them with a multitude of features, which is key for modern deep RL research. For users developing a new algorithm as an Agent, it is advisable to restrict the Agent class implementation to the algorithm’s unique update rules and action-selection procedure. The remainder of the implementation can live outside the agent, allowing for flexible parametrization.

Rainbow Example

One example that highlights the flexibility of PFRL agents is our implementation of Rainbow. Rainbow is an algorithm developed for Atari games that combines six independent improvements into a single agent: CategoricalDQN + Double updates + Dueling Architecture + Noisy networks + Prioritized Experience Replay + N-step target updates.

In PFRL, this is implemented quite simply, by combining an Agent with the appropriate parametrizations, in about a dozen lines.

First, we create a Distributional Q-function with a Dueling architecture (+ Dueling).
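
A sketch of this step using PFRL’s built-in DistributionalDuelingDQN architecture; the environment, atom count, and value range below are illustrative (51 atoms over [-10, 10] mirrors common C51 settings).

```python
import gym
import pfrl

# An illustrative Atari environment; in practice it would be wrapped with the
# standard Atari preprocessing and frame stacking used in the full script.
env = gym.make("BreakoutNoFrameskip-v4")
n_actions = env.action_space.n

# Distributional (categorical) dueling Q-function over 51 atoms.
n_atoms, v_min, v_max = 51, -10.0, 10.0
q_func = pfrl.q_functions.DistributionalDuelingDQN(n_actions, n_atoms, v_min, v_max)
```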

Second, we convert the network into a noisy network (+ Dueling + Noisy networks).
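
A sketch of this step using pfrl.nn.to_factorized_noisy, which replaces the network’s linear layers with factorized noisy layers in place; sigma_scale=0.5 is a typical value but shown here only for illustration. With noisy networks providing exploration, the explorer can simply be greedy.

```python
import pfrl

# Replace the Linear layers of q_func with factorized noisy linear layers.
pfrl.nn.to_factorized_noisy(q_func, sigma_scale=0.5)

# Exploration now comes from the noisy parameters, so act greedily.
explorer = pfrl.explorers.Greedy()
```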

Third, we initialize a Prioritized replay buffer with 3-step transitions for 3-step updates (+ Dueling + Noisy networks + Prioritized Experience Replay + N-step target updates).
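
A sketch of this step: a prioritized replay buffer that stores 3-step transitions, so that sampled minibatches yield 3-step returns. The capacity and prioritization hyperparameters are illustrative.

```python
import pfrl

replay_buffer = pfrl.replay_buffers.PrioritizedReplayBuffer(
    capacity=10 ** 6,
    alpha=0.5,          # how strongly TD errors determine sampling priority
    beta0=0.4,          # initial importance-sampling correction
    betasteps=10 ** 6,  # steps over which beta is annealed toward 1
    num_steps=3,        # store 3-step transitions for 3-step updates
)
```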

Finally, we pass these parameters into a CategoricalDoubleDQN agent, which is a CategoricalDQN agent that performs double updates (CategoricalDQN + Double updates + Dueling Architecture + Noisy networks + Prioritized Experience Replay + N-step target updates).
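
Putting the pieces together, a sketch of constructing the agent; the optimizer settings and agent hyperparameters below are illustrative rather than the benchmarked values (those are in the reproducibility script).

```python
import numpy
import torch
import pfrl

optimizer = torch.optim.Adam(q_func.parameters(), lr=6.25e-5, eps=1.5e-4)

agent = pfrl.agents.CategoricalDoubleDQN(
    q_func,                  # dueling + distributional + noisy (steps 1 and 2)
    optimizer,
    replay_buffer,           # prioritized, 3-step (step 3)
    gamma=0.99,
    explorer=explorer,       # Greedy(): the noisy network handles exploration
    minibatch_size=32,
    replay_start_size=2 * 10 ** 4,
    update_interval=4,
    target_update_interval=32000,
    phi=lambda x: numpy.asarray(x, dtype=numpy.float32) / 255.0,  # scale Atari frames
)
```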

For the full script that reproduces the Rainbow paper’s results, see the Rainbow reproducibility script in the PFRL repository.

Looking forward

PFN is co-organizing the 2020 NeurIPS MineRL competition with an organizing committee consisting of members from CMU, AICrowd, OpenAI, DeepMind, and Microsoft Research. In this competition, participants must develop a sample-efficient RL agent that obtains a diamond in Minecraft within a limited training time while leveraging human demonstrations. To help competitors get started, PFN will be providing the following baseline agents implemented in PFRL: Rainbow, SQIL, DQfD, and Prioritized Dueling Double DQN.

Looking forward, we have several exciting features and algorithms planned for PFRL, including Hindsight Experience Replay, Munchausen DQN, and a large zoo of pretrained models for our 9 reproduced algorithms. We hope that users find PFRL to be useful, and we look forward to community contributions to PFRL!

For more information on PFRL, check out the PFRL repository and documentation.

Prabhat Nagarajan is an engineer at Preferred Networks in Tokyo, Japan. At Preferred Networks, Prabhat works on RL applications and is one of the core maintainers of the PFRL library. The author would also like to thank Crissman Loomis, Yasuhiro Fujita, Avinash Ummadisingu, Darshan Thaker, and Brahma Pavse for giving feedback on drafts of this post.
