OpenAI Baselines: DQN

We’re open-sourcing OpenAI Baselines, our internal effort to reproduce reinforcement learning algorithms with performance on par with published results. We’ll release the algorithms over the upcoming months; today’s release includes DQN and three of its variants.

Reinforcement learning results are tricky to reproduce: performance is very noisy, algorithms have many moving parts that allow for subtle bugs, and many papers don’t report all the required tricks. By releasing known-good implementations (and best practices for creating them), we’d like to ensure that apparent RL advances are never due to comparison with buggy or untuned versions of existing algorithms.

This post contains some best practices we use for correct RL algorithm implementations, as well as the details of our first release: DQN and three of its variants, algorithms developed by DeepMind.

_Compare to a random baseline:_ in the video below, an agent is taking random actions in the game H.E.R.O. If you saw this behavior in early stages of training, it’d be really easy to trick yourself into believing that the agent is learning. So you should always verify your agent outperforms a random one.
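
As a rough illustration of that check, the sketch below scores a uniformly random policy and a candidate policy with the same evaluation loop. Everything in it is hypothetical: `CorridorEnv` is a toy environment invented for this example (not a Gym or Baselines class), and the "agent" is a hand-written stand-in for a trained policy.

```python
import random
from statistics import mean


class CorridorEnv:
    """Hypothetical toy environment: walk right along a corridor to reach the goal."""

    def __init__(self, length=8, max_steps=30):
        self.length = length
        self.max_steps = max_steps

    def reset(self):
        self.pos = 0
        self.steps = 0
        return self.pos

    def step(self, action):
        # Action 1 moves right; any other action moves left (bounded by the wall at 0).
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        self.steps += 1
        reached_goal = self.pos >= self.length
        done = reached_goal or self.steps >= self.max_steps
        reward = 1.0 if reached_goal else 0.0
        return self.pos, reward, done


def evaluate(env, policy, episodes=200):
    """Average undiscounted return of `policy` over several episodes."""
    returns = []
    for _ in range(episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done = env.step(policy(obs))
            total += reward
        returns.append(total)
    return mean(returns)


rng = random.Random(0)  # fixed seed so the baseline is repeatable
env = CorridorEnv()
random_score = evaluate(env, lambda obs: rng.randrange(2))
agent_score = evaluate(env, lambda obs: 1)  # stand-in for a trained agent
print(f"random: {random_score:.2f}  agent: {agent_score:.2f}")
```

Reporting both numbers side by side makes it harder to mistake lucky random behavior for learning.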

_Be wary of non-breaking bugs_: when we looked through a sample of ten popular reinforcement learning algorithm reimplementations, we noticed that six had subtle bugs found by a community member and confirmed by the author. These ranged from mild bugs that ignored gradients on some examples or implemented causal convolutions incorrectly to serious ones that reported scores higher than the true result.

_See the world as your agent does:_ like most deep learning approaches, for DQN we tend to convert images of our environments to grayscale to reduce the computation required during training. This can create its own bugs: when we ran our DQN algorithm on Seaquest we noticed that our implementation was performing poorly. When we inspected the environment we discovered this was because our post-processed images contained no fish, as this picture shows.

When transforming the screen images into grayscale we had incorrectly calibrated our coefficients for the green color values, which led to the fish disappearing. After we noticed the bug we tweaked the color values and our algorithm was able to see the fish again.

To debug issues like this in the future, Gym now contains a play function, which lets a researcher easily see the same observations as the AI agent would.
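
The grayscale pitfall described above can also be caught with a small numeric check. The sketch below is only an illustration: the two colors are made up (they are not the real Seaquest palette), and the "bad" weights simply assume the green coefficient was set to zero, which may not match the actual bug. The reference weights are the standard ITU-R BT.601 luma coefficients.

```python
# ITU-R BT.601 luma weights, a common choice for RGB-to-grayscale conversion.
LUMA_WEIGHTS = (0.299, 0.587, 0.114)


def to_gray(rgb, weights=LUMA_WEIGHTS):
    """Weighted sum of the (r, g, b) channels, each in 0..255."""
    return sum(w * c for w, c in zip(weights, rgb))


def contrast(color_a, color_b, weights=LUMA_WEIGHTS):
    """Absolute grayscale difference between two colors."""
    return abs(to_gray(color_a, weights) - to_gray(color_b, weights))


# Hypothetical colors that differ only in the green channel.
background = (0, 28, 136)
sprite = (0, 160, 136)

# Hypothetical miscalibration: the green channel contributes nothing.
bad_weights = (0.5, 0.0, 0.5)

good_contrast = contrast(sprite, background)
bad_contrast = contrast(sprite, background, bad_weights)
print(f"contrast with luma weights: {good_contrast:.1f}")
print(f"contrast with bad weights:  {bad_contrast:.1f}")
```

With the miscalibrated weights the sprite and the background map to the same gray level, so the object is invisible to the agent even though it is obvious on the color screen.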

_Fix bugs, then hyperparameters_: After debugging, we started to calibrate our hyperparameters. We ultimately found that the annealing schedule for epsilon, the hyperparameter that controls the exploration rate, had a huge impact on performance. Our final implementation decreases epsilon to 0.1 over the first million steps and then down to 0.01 over the next 24 million steps. If our implementation had contained bugs, we would likely have come up with different hyperparameter settings to compensate for faults we hadn’t yet diagnosed.
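
One way to express the schedule described above is as a piecewise-linear function of the training step. This is a minimal sketch rather than the released implementation: `piecewise_linear` and `epsilon` are hypothetical helpers, and the starting value of 1.0 is an assumption, since the post only gives the two targets (0.1 and 0.01).

```python
def piecewise_linear(step, endpoints):
    """Interpolate linearly between (step, value) endpoints.

    Before the first endpoint the first value is returned; after the last
    endpoint the last value is held.
    """
    if step <= endpoints[0][0]:
        return endpoints[0][1]
    for (s0, v0), (s1, v1) in zip(endpoints, endpoints[1:]):
        if s0 <= step < s1:
            frac = (step - s0) / (s1 - s0)
            return v0 + frac * (v1 - v0)
    return endpoints[-1][1]


EPSILON_ENDPOINTS = [
    (0, 1.0),            # assumed starting value; not stated in the post
    (1_000_000, 0.1),    # reached after the first million steps
    (25_000_000, 0.01),  # reached after a further 24 million steps
]


def epsilon(step):
    """Exploration rate at a given training step."""
    return piecewise_linear(step, EPSILON_ENDPOINTS)


for step in (0, 500_000, 1_000_000, 13_000_000, 25_000_000, 30_000_000):
    print(step, round(epsilon(step), 4))
```

Keeping the endpoints in one list makes the schedule easy to log next to results, which helps when comparing runs with different exploration settings.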

_Double check your interpretations of papers_: In the DQN Nature paper the authors write: “We also found it helpful to clip the error term from the update [...] to be between -1 and 1.” There are two ways to interpret this statement: clip the objective, or clip the multiplicative term when computing the gradient. The former seems more natural, but it causes the gradient to be zero on transitions with high error, which leads to suboptimal performance, as found in one DQN implementation. The latter is correct and has a simple mathematical interpretation: Huber Loss. You can spot bugs like these by checking that the gradients appear as you expect; this can be easily done within TensorFlow by using compute_gradients.
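
The difference between the two readings shows up directly in the gradients. The sketch below uses plain Python and finite differences instead of TensorFlow, so it only illustrates the idea; the function names are hypothetical, and the clipping threshold of 1 follows the quoted sentence.

```python
def clip(x, lo=-1.0, hi=1.0):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(hi, x))


def clipped_objective(delta):
    """Reading 1: clip the error term, then square it."""
    return 0.5 * clip(delta) ** 2


def huber_loss(delta):
    """Reading 2: quadratic for |delta| <= 1, linear beyond that."""
    if abs(delta) <= 1.0:
        return 0.5 * delta ** 2
    return abs(delta) - 0.5


def numerical_grad(f, x, h=1e-6):
    """Central finite-difference estimate of df/dx."""
    return (f(x + h) - f(x - h)) / (2 * h)


for delta in (-3.0, -0.5, 0.5, 3.0):
    g_clipped = numerical_grad(clipped_objective, delta)
    g_huber = numerical_grad(huber_loss, delta)
    print(f"delta={delta:+.1f}  clipped objective: {g_clipped:+.3f}  Huber: {g_huber:+.3f}")
```

For errors inside [-1, 1] the two gradients agree. Outside that range the clipped objective contributes no gradient at all, while the Huber loss keeps a gradient of magnitude 1, which is the clipped multiplicative term.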

The majority of bugs in this post were spotted by going over the code multiple times and thinking through what could go wrong with each line. Each bug seems obvious in hindsight, but even experienced researchers tend to underestimate how many passes over the code it can take to find all the bugs in an implementation.

To get started, run the following:

```bash
pip install baselines
# Train model and save the results to cartpole_model.pkl
python -m baselines.deepq.experiments.train_cartpole
# Load the model saved in cartpole_model.pkl and visualize the learned policy
python -m baselines.deepq.experiments.enjoy_cartpole
```

We’ve also provided trained agents, which you can obtain by running:

```bash
python -m baselines.deepq.experiments.atari.download_model --blob model-atari-prior-duel-breakout-1 --model-dir /tmp/models
python -m baselines.deepq.experiments.atari.enjoy --model-dir /tmp/models/model-atari-prior-duel-breakout-1 --env Breakout --dueling
```

We’ve included an iPython notebook showing the performance of our DQN implementations on Atari games. You can compare the performance of our various algorithms such as Dueling Double Q learning with Prioritized Replay (yellow), Double Q learning with Prioritized Replay (blue), Dueling Double Q learning (green) and Double Q learning (red).

AI is an empirical science, where the ability to do more experiments directly correlates with progress. With Baselines, researchers can spend less time implementing pre-existing algorithms and more time designing new ones. If you’d like to help us refine, extend, and develop AI algorithms then join us at OpenAI.

Szymon Sidor, John Schulman
