Though many approaches were tried, top results all came from tuning or extending existing algorithms such as PPO and Rainbow. There’s a long way to go: top performance was 4,692 after training while the theoretical max is 10,000. These results provide validation that our Sonic benchmark is a good problem for the community to double down on: the winning solutions are general machine learning approaches rather than competition-specific hacks, suggesting that one can’t cheat through this problem.
Over the two-month contest, 923 teams registered and 229 submitted solutions to the leaderboard. Our automated evaluation systems conducted a total of 4,448 evaluations of submitted algorithms, corresponding to about 20 submissions per team. The contestants got to see their scores rise on the leaderboard, which was based on a test set of five low-quality levels that we created using a level editor. You can watch the agents play one of these levels by clicking on theleaderboard entries(opens in a new window).
Because contestants got feedback about their submission in the form of a score and a video of the agent being tested on a level, they could easily overfit to the leaderboard test set. Therefore, we used a completely different test set for the final evaluation. Once submissions closed, we took the latest submission from the top 10 entrants and tested their agents against 11 custom Sonic levels made by skilled level designers. To reduce noise, we evaluated each contestant on each level three times, using different random seeds for the environment. The ranking changed in this final evaluation, but not to a large extent.
The top 5 scoring teams are:
RankTeamScore #1 Dharmaraja 4692 #2 mistake 4446 #3 aborg 4430 #4 whatever 4274 #5 Students of Plato 4269 Joint PPO baseline 4070 Joint Rainbow baseline 3843 Rainbow baseline 3498
Dharmarajatopped the scoreboard during the contest, and the lead remained on the final evaluation;mistakenarrowly won out overaborgfor second place. The top three teams will receive trophies.
Learning curves of the top three teams for all 11 levels are as follows (showing the standard error computed from three runs).
Averaging over all levels, we can see the following learning curves.
Note thatDharmarajaandaborgstart at similar scores, whereasmistakestarts much lower. As we will describe in more detail below, these two teams fine-tuned (using PPO) from a pre-trained network, whereasmistaketrained from scratch (using Rainbow DQN).mistake’s learning curves end early because they timed out at 12 hours.
Dharmaraja is a six-member team including Qing Da, Jing-Cheng Shi, Anxiang Zeng, Guangda Huzhang, Run-Ze Li, and Yang Yu.Qing Da(opens in a new window)and Anxiang Zeng are from the AI team within the search department of Alibaba in Hangzhou, China. In recent years, they have studied how to apply reinforcement learning to real world problems, especially in an e-commerce setting, together withYang Yu(opens in a new window), who is an Associate Professor of the Department of Computer Science at Nanjing University, Nanjing,China.
Dharmaraja’s solution is a variant of joint PPO (described in ourtech report(opens in a new window)) with a few improvements. First, it uses RGB images rather than grayscale; second, it uses a slightly augmented action space, with more common button combinations; third, it uses an augmented reward function, which rewards the agent for visiting new states (as judged by a perceptual hash of the screen). In addition to these modifications, the team also tried a number of things that didn’t pan out:DeepMimic(opens in a new window), object detection throughYOLO(opens in a new window), and some Sonic-specific ideas.
* Get the source code(opens in a new window)
Team mistake consists of Peng Xu and Qiaoling Zhong. Both are second-year graduate students in Beijing, China, studying at the CAS Key Laboratory of Network Data Science and the Technology Institute of Computing Technology, Chinese Academy of Sciences. In their spare time, Peng Xu enjoys playing basketball, and Qiaoling Zhong likes to play badminton. Their favorite video games are Contra and Mario.
Mistake’s solution is based on the Rainbow baseline. They made several modifications that helped boost performance: a better value of n for n-step Q learning; an extra CNN layer added to the model, which made training slower but better; and a lower DQN target update interval. Additionally, the team tried joint training with Rainbow, but found that it actually hurt performance in their case.
* Get the source code(opens in a new window)
Team Aborg is a solo effort from Alexandre Borghi. After completing a PhD in computer science in 2011, Alexandre worked for different companies in France before moving to the United Kingdom where he is a research engineer in deep learning. As both a video game and machine learning enthusiast, he spends most of his free time studying deep reinforcement learning, which led him to take part in the OpenAI Retro Contest.
Aborg’s solution, like Dharmaraja’s, is a variant of joint PPO with many improvements: more training levels from the Game Boy Advance and Master System Sonic games; a different network architecture; and fine-tuning hyper-parameters that were designed specifically for fast learning. Elaborating on the last point, Alexandre noticed that the first 150K timesteps of fine-tuning were unstable (i.e. the performance sometimes got worse), so he tuned the learning rate to fix this problem. In addition to the above changes, Alexandre tried several solutions that did not work: different optimizers,MobileNetV2(opens in a new window), using color images,etc.
* Get the source code(opens in a new window)
The Best Write-up Prize is awarded to contestants that produced high-quality essays describing the approaches they tried.
Now, let’s meet the winners of this prize category.
Dylan currently lives in Paris, France. He is a student in software development atschool 42 in Paris(opens in a new window)). He got into machine learning after watchinga video(opens in a new window)of a genetic algorithm learning how to play Mario a year and a half ago. This video sparked his interest and made him want to learn more about the field. His favorite video games are Zelda Twilight Princess and World of Warcraft.
Oleg Mürk(opens in a new window)hails from the San Francisco Bay Area, but is originally from Tartu, Estonia. During the day, he works with distributed data processing systems as a Chief Architect at Planet OS. In his free time, he burns “too much money” on renting GPUs for running deep learning experiments in TensorFlow. Oleg likes traveling, hiking, and kite-surfing and intends to finally learn to surf over the next 30 years. His favorite computer game (also the only one he has completed) is Wolfenstein 3D. His masterplan is to develop an automated programmer over the next 20 years and then retire.
Felix is an entrepreneur who lives in Hong Kong. His first exposure to machine learning was a school project where he applied PCA to analyse stock data. After several years pursuing entrpreneurship, he got into ML in late 2015; he has become anactive Kaggler(opens in a new window)and has worked on severalside projects(opens in a new window)on computer vision and reinforcement learning.
## Best Supporting Material
One of the best things that came from this contest was seeing contestants helping each other out. Lots of people contributed guides for getting started, useful scripts, and troubleshooting support for other contestants.
The winner of our Best Supporting Material award isTristan Sokol(opens in a new window), who wrotemany(opens in a new window)helpful(opens in a new window)blog(opens in a new window)posts(opens in a new window)throughout the contest andmade a tool(opens in a new window)for visualizing trajectories through Sonic levels.
During the day, Tristan works for Square, helping to build their developer platform; at night, he is a designer and entrepreneur. This was the first time that he has done any AI/ML, and also his first time using Python for any real use case. Looking forward, Tristan is going to try to make cool things with TensorFlow.js. Whenever he isn’t in front of a computer, Tristan is probably in his Oakland backyard watching plants grow.
## Lessons and next steps
Contests have the potential to overhaul the prevailing consensus on what works the best, since contestants will try a diverse set of different approaches and the best one will win. In this particular contest, the top performing approaches were not radically different from the ones that we at OpenAI had found to be successful prior to the contest.
We were glad to see several of the top solutions making use of transfer learning; fine-tuning from the training levels. However, we were surprised to find that some of the top submissions were simply tuned versions of our baseline algorithms. This emphasizes the importance of hyper-parameters, especially in RL algorithms such as Rainbow DQN.
We plan to start another rendition of the contest in a few months. We hope and expect that some of the more off-beat approaches will be successful in this second round, now that people know what to expect and have begun to think deeply about the problems of fast learning and generalization in reinforcement learning. We’ll see you then, and we look forward to watching your innovative solutions climb up the scoreboard.
John Schulman, Vicki Pfau, Alex Nichol, Christopher Hesse, Oleg Klimov, Larissa Schiavo
Thanks to Kevin Wong and Cathy Trang for creating the final evaluation levels.
Frontier risk and preparedness Safety Oct 26, 2023
OpenAI Red Teaming Network Safety Sep 19, 2023
Confidence-Building Measures for Artificial Intelligence: Workshop proceedings Conclusion Aug 1, 2023
Our Research * Research Index * Research Overview * Research Residency * OpenAI for Science * Economic Research
Latest Advancements * GPT-5.3 Instant * GPT-5.3-Codex * GPT-5 * Codex
Safety * Safety Approach * Security & Privacy * Trust & Transparency
ChatGPT * Explore ChatGPT(opens in a new window) * Business * Enterprise * Education * Pricing(opens in a new window) * Download(opens in a new window)
Sora * Sora Overview * Features * Pricing * Sora log in(opens in a new window)
API Platform * Platform Overview * Pricing * API log in(opens in a new window) * Documentation(opens in a new window) * Developer Forum(opens in a new window)
For Business * Business Overview * Solutions * Contact Sales
Company * About Us * Our Charter * Foundation * Careers * Brand
Support * Help Center(opens in a new window)
More * News * Stories * Livestreams * Podcast * RSS
Terms & Policies * Terms of Use * Privacy Policy * Other Policies
(opens in a new window)(opens in a new window)(opens in a new window)(opens in a new window)(opens in a new window)(opens in a new window)(opens in a new window)
OpenAI © 2015–2026 Manage Cookies
English United States