Proximal Policy Optimization

We’re releasing a new class of reinforcement learning algorithms, Proximal Policy Optimization (PPO), which perform comparably to or better than state-of-the-art approaches while being much simpler to implement and tune. PPO has become the default reinforcement learning algorithm at OpenAI because of its ease of use and good performance.

Policy gradient methods are fundamental to recent breakthroughs in using deep neural networks for control, from video games, to 3D locomotion, to Go. But getting good results via policy gradient methods is challenging because they are sensitive to the choice of stepsize: too small, and progress is hopelessly slow; too large, and the signal is overwhelmed by the noise, or one might see catastrophic drops in performance. They also often have very poor sample efficiency, taking millions (or billions) of timesteps to learn simple tasks.

Researchers have sought to eliminate these flaws with approaches like TRPO and ACER, by constraining or otherwise optimizing the size of a policy update. These methods have their own trade-offs: ACER is far more complicated than PPO, requiring the addition of code for off-policy corrections and a replay buffer, while only doing marginally better than PPO on the Atari benchmark; TRPO, though useful for continuous control tasks, isn’t easily compatible with algorithms that share parameters between a policy and value function or auxiliary losses, like those used to solve problems in Atari and other domains where the visual input is significant.

With supervised learning, we can easily implement the cost function, run gradient descent on it, and be very confident that we’ll get excellent results with relatively little hyperparameter tuning. The route to success in reinforcement learning isn’t as obvious—the algorithms have many moving parts that are hard to debug, and they require substantial effort in tuning in order to get good results. PPO strikes a balance between ease of implementation, sample complexity, and ease of tuning, trying to compute an update at each step that minimizes the cost function while ensuring the deviation from the previous policy is relatively small.

We’ve previously detailed a variant of PPO that uses an adaptive KL penalty to control the change of the policy at each iteration. The new variant uses a novel objective function not typically found in other algorithms:

$L^{CLIP}(\theta) = \hat{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$

This objective implements a trust region update that is compatible with stochastic gradient descent, and simplifies the algorithm by removing the KL penalty and the need to make adaptive updates. In tests, this algorithm has displayed the best performance on continuous control tasks and almost matches ACER’s performance on Atari, despite being far simpler to implement.
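As a rough illustration, the clipped objective above can be written in a few lines of plain Python. This is a minimal sketch, not the baselines implementation. It assumes that `r_t(θ)` is the ratio of the new policy’s probability of the sampled action to the old policy’s, that the advantage estimates are already computed, and that `epsilon=0.2` is a reasonable clip range. The names `clip_value` and `clipped_surrogate` and the sample numbers are hypothetical.

```python
def clip_value(x, low, high):
    """Clamp x to the closed interval [low, high]."""
    return max(low, min(x, high))


def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Batch average of min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratios:     assumed probability ratios r_t, one per timestep
    advantages: assumed advantage estimates A_t, one per timestep
    epsilon:    clip range (0.2 is an assumption, not taken from the post)
    """
    terms = []
    for r, a in zip(ratios, advantages):
        unclipped = r * a
        clipped = clip_value(r, 1.0 - epsilon, 1.0 + epsilon) * a
        # Taking the minimum keeps the more pessimistic of the two estimates.
        terms.append(min(unclipped, clipped))
    return sum(terms) / len(terms)


# Hypothetical batch: three timesteps with the same positive advantage.
example_ratios = [0.5, 1.0, 1.5]
example_advantages = [1.0, 1.0, 1.0]
example_objective = clipped_surrogate(example_ratios, example_advantages)
```

In this batch the ratio of 1.5 contributes as if it were 1.2, so pushing the policy further in that direction earns no extra objective value. The ratio of 0.5 is left unclipped, because the minimum always keeps the lower of the two terms. In practice the objective would be maximized with an automatic differentiation framework rather than evaluated on plain lists.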

## Controllable, complicated robots

We’ve created interactive agents based on policies trained by PPO. We can use the keyboard to set new target positions for a robot in an environment within Roboschool; though the input sequences are different from what the agent was trained on, it manages to generalize.

## Baselines: PPO, PPO2, ACER, and TRPO

This release of baselines includes scalable, parallel implementations of PPO and TRPO which both use MPI for data passing. Both use Python 3 and TensorFlow. We’re also adding pre-trained versions of the policies used to train the above robots to the Roboschool agent zoo.

Update: We’re also releasing a GPU-enabled implementation of PPO, called PPO2. This runs approximately 3x faster than the current PPO baseline on Atari. In addition, we’re releasing an implementation of Actor Critic with Experience Replay (ACER), a sample-efficient policy gradient algorithm. ACER makes use of a replay buffer, enabling it to perform more than one gradient update using each piece of sampled experience, as well as a Q-function approximator trained with the Retrace algorithm.
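To make the replay buffer idea concrete, here is a minimal sketch of a fixed-capacity buffer that lets stored experience be sampled more than once. It is not the ACER implementation from baselines: the `ReplayBuffer` class, its method names, and the transition format are all hypothetical, and the off-policy corrections and Retrace targets that ACER relies on are left out.

```python
import random
from collections import deque


class ReplayBuffer:
    """Hypothetical fixed-capacity store of past transitions."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest entry once it is full.
        self._storage = deque(maxlen=capacity)

    def add(self, transition):
        """Store one transition, e.g. (state, action, reward, next_state)."""
        self._storage.append(transition)

    def sample(self, batch_size):
        """Return up to batch_size stored transitions, without replacement."""
        count = min(batch_size, len(self._storage))
        return random.sample(list(self._storage), count)

    def __len__(self):
        return len(self._storage)


# Hypothetical usage: five transitions pushed into a buffer of capacity three.
buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add((step, "action", 0.0, step + 1))

# The same stored transitions can feed several gradient updates.
first_batch = buffer.sample(2)
second_batch = buffer.sample(2)
```

Because sampling does not remove anything from the buffer, a transition can appear in many batches before it is evicted, which is where the sample efficiency comes from.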

We’re looking for people to help build and optimize our reinforcement learning algorithm codebase. If you’re excited about RL, benchmarking, thorough experimentation, and open source, please apply, and mention that you read the baselines PPO post in your application.

John Schulman, Oleg Klimov, Filip Wolski, Prafulla Dhariwal, Alec Radford
