Learning to summarize with human feedback

Our models generate summaries that are better than summaries from 10x larger models trained only with supervised learning. Even though we train our models on the Reddit TL;DR dataset, the same models transfer to generate good summaries of CNN/DailyMail news articles without any further fine-tuning. Our techniques are not specific to summarization; in the long run, our goal is to make aligning AI systems with human preferences a central component of AI research and deployment in many domains.

# Human feedback models outperform much larger supervised models and reference summaries on TL;DR

Figure 1: The performance of various training procedures for different model sizes. Model performance is measured by how often summaries from that model are preferred to the human-written reference summaries. Our pre-trained models are early versions of GPT-3, our supervised baselines were fine-tuned to predict 117K human-written TL;DRs, and our human feedback models are additionally fine-tuned on a dataset of about 65K summary comparisons.

Large-scale language models are becoming increasingly capable on NLP tasks. These models are usually trained with the objective of next-word prediction on a dataset of human-written text. But this objective doesn’t capture exactly what we want; usually, we don’t want our models merely to imitate humans, we want them to give high-quality answers. This mismatch is clear when a model is trained to imitate low-quality human-written text, but it can also happen in more subtle ways. For example, a model trained to predict what a human would say might make up facts when it is unsure, or generate sentences reflecting harmful social bias, both failure modes that have been well-documented.[3,4,5,6]

As part of our work on safety, we want to develop techniques that align our models’ objectives with the end behavior we really care about. As our models become more powerful, we believe aligning them with our goals will be very important to ensure they are beneficial for humans. In the short term, we wanted to test if human feedback techniques could help our models improve performance on useful tasks.

We focused on English text summarization, as it’s a challenging problem where the notion of what makes a “good summary” is difficult to capture without human input. We apply our method primarily to an existing dataset[1] of posts submitted to the social network Reddit,[A] together with human-written “TL;DRs,” which are short summaries written by the original poster.

We first train a reward model via supervised learning to predict which summaries humans will prefer.[B] We then fine-tune a language model with reinforcement learning (RL) to produce summaries that score highly according to that reward model. We find that this significantly improves the quality of the summaries, as evaluated by humans, even on datasets very different from the one used for fine-tuning.

Our approach follows directly from our previous work on learning from human feedback.[7] There has also been other work on using human feedback to train summarization models.[8] We push the technique further by scaling to larger models, collecting more feedback data, closely monitoring researcher–labeler agreement, and providing frequent feedback to labelers. Human feedback has also been used to train models in several other domains, such as dialogue,[9,10,11] semantic parsing,[12] translation,[13,14] story[15] and review[16] generation, evidence extraction,[17] and more traditional RL tasks.[18,19]

Post from Reddit (r/AskReddit)

## To pursue a Computer Sc. PhD or continue working? Especially if one has no real intention to work in academia even after grad school ...

Rather self-explanatory post, but just to elaborate a little further...

Has anyone, after working for a period of time, decided for whatever reasons to head back into academia to pursue a PhD in Computer Sc, with no intention to join the world of academia but intend to head back into the industry? If so, what were the reasons. Also, how did it turn out? Was there anything that you regretted? Did it work out?

The reason I’m asking for advice is because currently I’m employed full-time, and know how fortunate I am to have a job in these hard times. However, being from a foreign country and working in the UK, they are making it difficult for one to extend one’s working visa because I do not possess a Masters/PhD and only a Bachelor’s.

It’s just a little frustrating because I have been here almost half a decade now, but I have been sitting on a visa that has no count towards residency and soon, I will have to apply for a visa with even more restrictions.

I’m thus considering boosting up my paper qualifications, so as to prevent myself from being in such a situation ever again. I believe the stipend from grad school actually pays almost as much as I am earning now (in terms of spending power, not direct currency conversion), but of course, given if I spent 5 years working as opposed to studying, there might be opportunities for raises/promotions.

Obviously, plunging into a PhD solely for that reason is bound to get looks of disapproval. I am, of course, interested in expanding my knowledge and having an opportunity to have time for myself. I have had research experience (published a paper, presented at conferences) so I’m aware of the challenges research presents.

Any advice from people who have gone through similar situations? Would love to hear from you.

Human-written reference summary

Software Engineer with a job I’m happy at (for now), deciding whether to pursue a PhD to improve qualifications and explore interests and a new challenge.

Human feedback 6B model

Currently employed, considering pursuing PhD in Computer Science to avoid being stuck with no residency visa ever again. Has anyone pursued a PhD purely for the sake of research, with no intention of joining the academic world?

Supervised 6B model

Have no intention to work in the industry after grad school, but want to pursue a PhD. What are the reasons? How did it turn out?

Pre-trained 6B model

I’m considering pursuing a PhD in Computer Science, but I’m worried about the future. I’m currently employed full-time, but I’m worried about the future.

We evaluated several different summarization models—some pre-trained on a broad distribution of text from the internet, some fine-tuned via supervised learning to predict TL;DRs, and some fine-tuned using human feedback.[D] To evaluate each model, we had it summarize posts from the validation set and asked humans to compare their summaries to the human-written TL;DR. The results are shown in Figure 1.

We found that RL fine-tuning with human feedback had a very large effect on quality compared to both supervised fine-tuning and scaling up model size. In particular, our 1.3 billion parameter (1.3B) model trained with human feedback outperforms our 12B model trained only with supervised learning. Summaries from both our 1.3B and 6.7B human feedback models are preferred by our labelers to the original human-written TL;DRs in the dataset.[E]

People make different trade-offs when writing summaries, including between conciseness and coverage of the original text; depending on the purpose of the summary, different summary lengths might be preferred. Our labelers tended to prefer longer summaries, so our models adapted to that preference and converged to the longest allowable length. Controlling for length reduced human preferences for our 6.7B model’s summaries from 70% to 65%, explaining a minority of our gains.[F]

# Human feedback models trained on Reddit transfer to generate excellent summaries of CNN/DM news articles without further training

The performance (human-rated summary quality on a 1–7 scale) of various training procedures and model sizes. Note that our human feedback models generate summaries that are significantly shorter than summaries from models trained on CNN/DM.

At a given summary length, our 6.7B human feedback model trained on Reddit performs almost as well as a fine-tuned 11B T5 model, despite not being re-trained on CNN/DM.

Article from CNN/DM (dailymail)

## Mom of nation’s first all-girl quintuplets Danielle Busby on moment she held the first baby

The nation’s first ever set of all-girl quintuplets is ‘doing fabulous’, according to their mother.

Danielle Busby looked tired but ecstatic as she appeared on Today for an interview just a week after giving birth.

Olivia Marie, Ava Lane, Hazel Grace, Parker Kate and Riley Paige are all still in an incubator at Houston’s Woman’s Hospital of Texas.

But they’re getting stronger every day.

’They’re doing fabulous. They’re so precious and they’re doing really, really well,’ Danielle said as she beamed holding her husband Adam’s hand.

SCROLL DOWN FOR VIDEO

Ecstatic: Danielle and Adam Busby said they are still in shock but delighted a week after the birth

’It was an emotional downpour’: Danielle described holding the first of her five all-girl babies

A journey ahead: Adam admitted it was ‘overwhelming’ changing the babies ‘because they’re so tiny’

Half a dozen girls: Along with their older daughter, Blayke (pictured), they now have six girls

Delivered in only four minutes via c-section, it is the first time the globe has seen all-girl quintuplets since 1969.

As the girls are still building strength, she has only managed to hold two of them so far.

But the experience was phenomenal.

’It was an emotional downpour,’ she said. ‘But that feeling when you finally get to hold your baby, there’s no better feeling than that.’

Adam admitted the logistics of parenting the five infants, as well as their older daughter Blayke, are still a mystery.

’I’m still getting used to changing them in an incu,’ he explained. ‘It’s pretty overwhelming just because they’re so tiny.’

Alongside them sat their doctors, who revealed the girls are gearing up for their first breastmilk feeding, which will be a major milestone.

And as for the mother, she’s feeling fine.

’Surprisingly I feel pretty good!’ she laughed. ‘It’s a bit of a shock!’

Speaking shortly after the birth last week, Danielle told KHOU. ‘We are so thankful and blessed. And I honestly give all the credit to my God.

’I am so thankful for this wonderful hospital and team of people here, they truly all are amazing.’

Dr Alexander Reiter, who delivered the babies at 28 weeks and two days, said the prognosis is good.

’Danielle is a trooper, and her positive attitude definitely shined through the pregnancy and the delivery,’ said Dr Reiter to KHOU.

Pride: Danielle and Adam Busby with one of their five new little girls at Houston’s Woman’s Hospital of Texas after Danielle delivered her quintuplets by C-section on April 7

Welcome to the world: Babies Riley Paige and Parker Kate (right) in their NICU incubators at Houston’s Woman’s Hospital of Texas

Little bit: Hazel Grace is the smallest of the five little girl’s born and was squished by her sisters inside her mother’s womb during the 28-week pregnancy

Sisters: Ava Lane (left) and Olivia Marie (right) sleeping peacefully after being delivered by C-section on April 7

Mom and dad: Adam and Danielle tend to one of their new daughters. All five girls are doing well and only need minimal amounts of assistance breathing after being born at 7 months last week

Dr Reiter told KHOU that he delivered the baby girls with a team of a dozen other medical personnel including seven board-certified neonatologists.

So far the quints are doing well and are getting only ‘modest support’ to breathe.

’They were all in the appropriate weight range for their gestational age, which is due to their mother’s excellent efforts to have the healthiest pregnancy possible,’ said Dr Finkowski-Rivera to KHOU.

’They have made a strong start in life, and we are excited to watch them grow and mature.’

Born via IVF, the Busbys believe that their new quintuplets and three-year-old daughter Blayke are their miracles.

Expanding family: Adam and Danielle already have a three-year-old daughter, Blayke (pictured) who they welcomed into the world via IVF therapy after thinking they might not be able to have children

’Never would we have imagined the plan that God had for us in our journey to grow our little family.’

Speaking in a video posted to their blog on Sunday, the new parents said they were ‘overwhelmed and extremely busy’ after their daughters’ births.

’Riley Paige is the feisty one, always kicking around, always punching around and she is pretty much typically that way,’ said Danielle.

’Parker, who was on my upper side on my right side, is just so relaxed. She is content and calm.

’Hazel, is the little bit. She was squished the whole time, so she loves to just keep her arms by her face, just so cute. She has a full head of blonde hair.

’The twins, Olivia Marie, Ava Lane are getting a little fesity and have a wild streak.’

’They are all doing pretty well.’

Ready to pop: Danielle gave birth two days after the picture (left) and (right) is prepping for her pre-arranged C-section with her husband Adam

Meet your new sister: Blayke is introduced to one of her five new sisters by mother Danielle at the hospital over the weekend

Beaming dad, Adam said that the main thing for him and the medical team was to keep a careful eye on his little girls over the coming weeks.

’We are gonna need to monitor them,’ said Adam on the blog, Itsabuzzworld, that he and Danielle have used to document the pregnancy.

When big sister Blayke met her new siblings, the Busby’s said their hearts felt like bursting.

’It was so cute, she had her little ‘Big Sister’ shirt on,’ said Danielle.

’We prepared her to see all the tubes in the NICU but she was so great.’

’She did so well and just smiled. She didn’t really say much but she just smiled.’

Peek-a-boo: Peering into one of the NICU incubators, Blayke stares at her new sister Riley at the Woman’s Hospital of Texas

Looking forward, the Busby’s admitted the future is going to fun, but tough.

’This week will be an interesting transition,’ said Adam.

’We have a lot of changes coming up in the next few days....trying to work out what our new norm is going to be.’

To help with the challenges ahead of them, the Busby’s have set up a gofundme page.

’One big item that we are going to have to be looking for in the coming weeks is a van,’ write Adam and Danielle.

’We don’t have a vehicle big enough to bring them home. We are looking at our options, but the Nissan NV passenger van and Ford Transit wagon are on the top of the list so far.’

Human-written reference summary

Danielle and Adam Busby welcomed five girls into the world last week. She has only held two but described the feeling as ‘amazing’ . Born at Houston’s Woman’s Hospital of Texas the babies are healthy. They are the first set of all-girl quintuplets born in the us and the first globally since 1969. Team of 12 doctors helped to deliver the babies by C-section. Delivery was at 28 weeks and took the team less than four minutes .

Human feedback 6B model (transfer)

The world has never seen all-girl quintuplets before, and the Busby family is doing great a week after the birth. The girls are getting stronger every day.

Supervised 6B model (transfer)

The world’s first all-girl quintuplets are doing well and are due to be released from Hospital in a few weeks.

Pre-trained 6B model

Danielle Busby gave birth to five all-girl quintuplets at 28 weeks and two days via C-section. The girls are now in the NICU and are doing well.

T5 11B model (fine-tuned on CNN/DM)

Olivia Marie, Ava Lane, Hazel Grace, Parker Kate and Riley Paige are all still in an incubator at Houston’s Woman’s Hospital of Texas. Delivered in only four minutes via C-section, it is the first time the globe has seen all-girl quintuplets since 1969. Danielle and Adam Busby said they are still in shock but delighted a week after the birth.

Supervised 6B model (fine-tuned on CNN/DM)

Danielle Busby gave birth to her quintuplets on April 7. They are the first all-girl quintuplets born in the united states since 1969. The girls are in an incubator at Houston’s Woman’s Hospital of Texas. They are getting stronger every day and are gearing up for their first breastmilk feeding.

To test our models’ generalization, we also applied them directly to the popular CNN/DM news dataset.[2] These articles are more than twice as long as Reddit posts and are written in a very different style. Our models have seen news articles during pre-training, but all of our human data and RL fine-tuning was on the Reddit TL;DR dataset.

This time we evaluated our models by asking our labelers to rate them on a scale from 1–7.[G] We discovered that our human feedback models transfer to generate excellent short summaries of news articles without any further training. When controlling for summary length, our 6.7B human feedback model generates summaries that are rated higher than the CNN/DM reference summaries written by humans. This suggests that our human feedback models have learned something more general about how to summarize text, and are not specific to Reddit posts.

A diagram of our method, which is similar to the one used in our previous work.

Our core method consists of four steps: training an initial summarization model, assembling a dataset of human comparisons between summaries, training a reward model to predict the human-preferred summary, and then fine-tuning our summarization models with RL to get a high reward.
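The shape of these four steps can be illustrated with a toy, self-contained sketch. All names and numbers below are invented for illustration only; the real system uses large transformer policies, a learned reward model, and PPO.

```python
# Toy stand-ins (all hypothetical): three canned candidate "summaries",
# a policy that is a distribution over them, and a scalar reward per
# summary fit from pairwise human comparisons by counting wins.
import math

summaries = ["short and vague", "accurate and concise", "long and rambling"]

# Step 1: supervised initialization -- a uniform policy.
policy = [1 / 3] * 3

# Step 2: human comparisons, as (preferred_index, rejected_index) pairs.
comparisons = [(1, 0), (1, 2), (1, 0), (0, 2)]

# Step 3: fit a scalar reward per summary from the comparisons.
reward = [0.0, 0.0, 0.0]
for win, lose in comparisons:
    reward[win] += 1.0
    reward[lose] -= 1.0

# Step 4: "RL" -- shift the policy toward high reward (softmax over rewards).
z = sum(math.exp(r) for r in reward)
policy = [math.exp(r) / z for r in reward]
best = summaries[policy.index(max(policy))]
assert best == "accurate and concise"
```

The key structural point is that human labor goes into step 2 (comparisons), not into writing demonstrations, and the learned reward in step 3 is what the optimization in step 4 actually maximizes.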

We trained several supervised baselines by starting from GPT-style transformer models trained on text from the Internet,[20] and fine-tuning them to predict the human-written TL;DR via supervised learning. We mainly use models with 1.3 and 6.7 billion parameters. As a sanity check, we confirmed that this training procedure led to competitive results[H] on the CNN/DM dataset.

We then collected a dataset of human quality judgments. For each judgment, a human compares two summaries of a given post and picks the one they think is better.[I] We use this data to train a reward model that maps a _(post, summary)_ pair to a reward _r_. The reward model is trained to predict which summary a human will prefer, using the rewards as logits.
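Treating the two scalar rewards as logits means the comparison loss is a logistic loss on their difference. A minimal sketch in plain Python (the function name and values are illustrative, not the paper's code):

```python
import math

def pairwise_preference_loss(r_chosen, r_rejected):
    """Negative log-likelihood that the human-preferred summary wins,
    treating the scalar rewards as logits:
    P(chosen beats rejected) = sigmoid(r_chosen - r_rejected).
    log(sigmoid(x)) is computed stably as -log1p(exp(-x))."""
    losses = [math.log1p(math.exp(-(rc - rr)))
              for rc, rr in zip(r_chosen, r_rejected)]
    return sum(losses) / len(losses)

# The loss is lower when the reward model already ranks the
# human-preferred summaries higher.
agree = pairwise_preference_loss([2.0, 1.5], [0.0, -0.5])
disagree = pairwise_preference_loss([0.0, -0.5], [2.0, 1.5])
assert agree < disagree
```

In training, this loss would be backpropagated through the reward model producing `r_chosen` and `r_rejected`; here the rewards are just fixed numbers to show the objective itself.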

Finally, we optimize the policy against the reward model using RL. We use PPO with 1 million episodes in total, where each episode consists of the policy summarizing a single article and then receiving a reward _r_. We include a KL penalty that incentivizes the policy to remain close to the supervised initialization.
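A common way to implement such a KL penalty, and a reasonable reading of the setup here, is to subtract the summed per-token log-probability gap between the policy and the supervised initialization from the episode reward. A sketch (the coefficient `beta` and all values are illustrative, not the paper's settings):

```python
def kl_penalized_reward(rm_score, policy_logprobs, sft_logprobs, beta=0.05):
    """Episode reward: the reward-model score minus beta times the summed
    per-token log-probability gap between the RL policy and the supervised
    initialization. The gap term discourages the policy from drifting far
    from the supervised model."""
    kl_term = sum(p - q for p, q in zip(policy_logprobs, sft_logprobs))
    return rm_score - beta * kl_term

# A summary the drifted policy assigns much higher probability than the
# supervised model did is penalized relative to one with no drift.
r_close = kl_penalized_reward(1.0, [-1.0, -2.0], [-1.0, -2.0])
r_drift = kl_penalized_reward(1.0, [-0.1, -0.2], [-1.0, -2.0])
assert r_close > r_drift
```

Raising `beta` trades reward-model score for staying close to the supervised policy, which is exactly the knob varied in the over-optimization experiments below.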

## Collecting data from humans

Any training procedure that uses human feedback is directly influenced by the actual humans labeling the data. In our previous work on fine-tuning language models from human preferences,[7] our labelers often gave high ratings to summaries we thought were average, which was reflected in the quality of our trained models.

In response, in this project we invested heavily in ensuring high data quality. We hired about 80 contractors using third-party vendor sites,[J] and paid them an hourly wage regardless of the number of summaries evaluated.[K] Hiring contractors rather than relying on crowdsourcing websites allowed us to maintain a hands-on relationship with labelers: we created an onboarding process, developed a website with a customizable labeler interface, answered questions in a shared chat room, and had one-on-one video calls with labelers. We also made sure to clearly communicate our definition of summary quality, after spending significant time reading summaries ourselves, and we carefully monitored agreement rates between us and labelers throughout the project.

## Optimizing the reward model

# Optimizing our reward model eventually leads to sample quality degradation

Starting from the 1.3B supervised baseline (point 0 on the x-axis), we use RL to optimize the policy against the reward model, which results in policies with different “distances” from the baseline (x-axis, measured using the KL divergence from the supervised baseline). Optimizing against the reward model initially improves summaries according to humans, but eventually overfits, giving worse summaries. This chart uses an older version of our reward model, which is why the peak preference fraction is less than 0.5.

Post from Reddit (r/AskReddit)

## I’m a 28yo man, and I would like to get into gymnastics for the first time.

Title said just about all of it. I’m 28, very athletic (bike/ surf/ snowboard) and I have always wanted to do gymnastics.

I like to do flips and spins off bridges and on my snowboard, and it seems to me gymnastics would be a great way to do those movements I like, in a controlled environment. The end goal of this is that it would be fun, and make me better at these movements in real life.

But is it too late for me? Should 28 year old guys such as myself be content with just watching those parkour guys on youtube? Or can I learn the ways of the gymnastic jedi? BTW, I live in San Jose CA.

I want to do gymnastics, but I’m 28 yrs old. Is it too late for me to be a gymnaste?!

28yo guy would like to get into gymnastics for the first time. Is it too late for me given I live in San Jose CA?

28yo dude stubbornly postponees start pursuing gymnastics hobby citing logistics reasons despite obvious interest??? negatively effecting long term fitness progress both personally and academically thoght wise? want change this dumbass shitty ass policy pls

Optimizing against our reward model is supposed to make our policy align with human preferences. But the reward model is only a proxy for human preferences, as it only sees a small amount of comparison data from a narrow distribution of summaries. While the reward model performs well on the kinds of summaries it was trained on, we wanted to know how much we could optimize against it until it started giving useless evaluations.

We trained policies at different “optimization strengths” against the reward model and asked our labelers to evaluate the summaries from these models. We did this by varying the KL coefficient, which trades off the incentive to get a higher reward against the incentive to remain close to the initial supervised policy. We found that the best samples had roughly the same predicted reward as the 99th percentile of reference summaries from the dataset. Eventually, however, optimizing against the reward model makes summary quality worse.
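The "distance" on the x-axis of the figure above, KL(policy || supervised baseline), can be estimated from summaries sampled from the policy as the average log-probability gap between the two models. A sketch, assuming total per-summary log-probabilities under both models are available (all values below are invented):

```python
def summary_kl_estimate(policy_logprob_sums, sft_logprob_sums):
    """Monte-Carlo estimate of KL(pi || pi_sft) in nats per summary: the
    average, over summaries sampled from the policy, of the gap between
    the policy's and the supervised baseline's total log-probability for
    that summary. Individual gaps can be negative, but the expectation
    under samples from pi is non-negative."""
    gaps = [p - q for p, q in zip(policy_logprob_sums, sft_logprob_sums)]
    return sum(gaps) / len(gaps)

# A lightly optimized policy stays near the baseline; a heavily
# optimized one assigns its own samples far more probability than
# the supervised model does.
near = summary_kl_estimate([-3.0, -4.0], [-3.3, -4.1])
far = summary_kl_estimate([-0.5, -0.6], [-3.3, -4.1])
assert near < far
```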

If we have a well-defined notion of the desired behavior for a model, our method of training from human feedback allows us to optimize for this behavior. However, this is not a method for determining what the desired model behavior _should be_. Deciding what makes a good summary is fairly straightforward, but doing this for tasks with more complex objectives, where different humans might disagree on the correct model behavior, will require significant care. In these cases, it is likely not appropriate to use researcher labels as the “gold standard”; rather, individuals from groups that will be impacted by the technology should be included in the process to define “good” behavior, and hired as labelers to reinforce this behavior in the model.

We trained on the Reddit TL;DR dataset[1] because the summarization task is significantly more challenging than on CNN/DM. However, since the dataset consists of user-submitted posts with minimal moderation, some posts contain content that is offensive or reflects harmful social biases. This means our models can generate biased or offensive summaries, as they have been trained to summarize such content.

Part of our success involves scaling up our reward model and policy size. This requires a large amount of compute, which is not available to all researchers: notably, fine-tuning our 6.7B model with RL required about 320 GPU-days. However, since smaller models trained with human feedback can exceed the performance of much larger models, our procedure can be more cost-effective than simply scaling up model size when training high-quality models for specific tasks.

Though we outperform the human-written reference summaries on TL;DR, our models have likely not reached human-level performance, as the reference summary baselines for TL;DR and CNN/DM are not the highest possible quality. When evaluating our model’s TL;DR summaries on a 7-point scale along several axes of quality (_accuracy_, _coverage_, _coherence_, and _overall_), labelers find our models can still generate inaccurate summaries, and give a perfect _overall_ score 45% of the time.[L] For cost reasons, we also do not directly compare to using a similar budget to collect high-quality demonstrations and training on those via standard supervised fine-tuning.

## Future directions

We’re interested in scaling human feedback to tasks where humans can’t easily evaluate the quality of model outputs. For example, we might want our models to answer questions that would take humans a lot of research to verify; getting enough human evaluations to train our models this way would take a long time. One approach to tackle this problem is to give humans tools to help them evaluate more quickly and accurately. If these tools use ML, we can also improve them with human feedback, which could allow humans to accurately evaluate model outputs for increasingly complicated tasks.23

In addition to tackling harder problems, we’re also exploring different types of feedback beyond binary comparisons: we can ask humans to provide demonstrations, edit model outputs to make them better, or give explanations as to why one model output is better than another. We’d like to figure out which kinds of feedback are most effective for training models that are aligned with human preferences.

_If you are interested in working on these research questions, we’re hiring!_

A. For training, we use the Reddit TL;DR dataset instead of the more popular CNN/DM dataset because simple copying baselines perform better than the human-written reference summaries on CNN/DM, which is not the case for TL;DR (see Appendix D of our paper). We performed a new web crawl to increase the TL;DR dataset size, required summaries to be between 24 and 48 tokens, and performed some other cleaning and filtering.

B. We hire human labelers to judge summary quality, and implement quality control to ensure that labeler judgments agree with our own. We describe our human data collection procedure below.

C. Interestingly, we found that human evaluators preferred the Lead-3 baseline (taking the first 3 sentences of the article) to the dataset’s reference summaries, and we confirmed this ourselves.

D. We generate all of our samples at temperature 0, which we found humans preferred most.

E. While we use human-written TL;DRs as our main point of comparison, they don’t always represent optimal human performance; they are sometimes intended to be funny or to summarize only a part of the post, and their grammar and style are all over the map.

F. We control for length by training a logistic regression model to predict the preferred summary given only the policy ID and the log ratio of the lengths of the summaries. We then report the regression coefficients on each policy ID, corresponding to a length ratio of 1 with the reference summaries.

G. We took this approach because it is hard to directly compare our TL;DR-trained models to models trained on CNN/DM; the CNN/DM summaries are much longer and written in bullet-point form.

H. In terms of ROUGE results on CNN/DM, our 6.7B supervised models are a bit worse than T5,[21] but a bit better than state-of-the-art models from mid-2019.

I. Our main models are trained on about 65K comparisons, though we achieve good results with as few as 8K comparisons.

J. Specifically, we use Upwork, Scale, and Lionbridge. Our contractors have a range of ages, genders, and educational backgrounds, and are mostly American or Filipino (see Appendix C of our paper for demographic data).

K. Our criteria for hiring contractors were: (1) they were willing to do the task, and (2) they passed a minimum threshold of speed and agreement with researcher labels. We paid all our contractors at least $15/hr.

L. This is impressive relative to the TL;DR reference summaries, which get a perfect _overall_ score 23% of the time, but indicates there is still room for improvement.

1. Völske, M., Potthast, M., Syed, S., & Stein, B. (2017). “TL;DR: Mining Reddit to learn automatic summarization.” In Proceedings of the Workshop on New Frontiers in Summarization, 2017.

2. Hermann, K. M., Kocisky, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M., & Blunsom, P. (2015). “Teaching machines to read and comprehend.” In Advances in Neural Information Processing Systems, 2015.

3. Maynez, J., Narayan, S., Bohnet, B., & McDonald, R. (2020). “On faithfulness and factuality in abstractive summarization.” arXiv preprint.

4. Sheng, E., Chang, K. W., Natarajan, P., & Peng, N. (2019). “The woman worked as a babysitter: On biases in language generation.” arXiv preprint.

5. Bordia, S., & Bowman, S. R. (2019). “Identifying and reducing gender bias in word-level language models.” arXiv preprint.

6. Nadeem, M., Bethke, A., & Reddy, S. (2020). “StereoSet: Measuring stereotypical bias in pretrained language models.” arXiv preprint.

7. Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., & Irving, G. (2019). “Fine-tuning language models from human preferences.” arXiv preprint.

8. Böhm, F., Gao, Y., Meyer, C. M., Shapira, O., Dagan, I., & Gurevych, I. (2019). “Better rewards yield better summaries: Learning to summarise without references.” arXiv preprint.

9. Jaques, N., Ghandeharioun, A., Shen, J. H., Ferguson, C., Lapedriza, A., Jones, N., Gu, S., & Picard, R. (2019). “Way off-policy batch deep reinforcement learning of implicit human preferences in dialog.” arXiv preprint.

10. Yi, S., Goel, R., Khatri, C., Cervone, A., Chung, T., Hedayatnia, B., ... & Hakkani-Tur, D. (2019). “Towards coherent and engaging spoken dialog response generation using automatic conversation evaluators.” arXiv preprint.

11. Hancock, B., Bordes, A., Mazare, P. E., & Weston, J. (2019). “Learning from dialogue after deployment: Feed yourself, chatbot!” arXiv preprint.

12. Lawrence, C., & Riezler, S. (2018). “Improving a neural semantic parser by counterfactual learning from human bandit feedback.” arXiv preprint.

13. Kreutzer, J., Khadivi, S., Matusov, E., & Riezler, S. (2018). “Can neural machine translation be improved with user feedback?” arXiv preprint.

14. Bahdanau, D., Brakel, P., Xu, K., Goyal, A., Lowe, R., Pineau, J., ... & Bengio, Y. (2016). “An actor-critic algorithm for sequence prediction.” arXiv preprint.

15. Zhou, W., & Xu, K. (2020). “Learning to compare for better training and evaluation of open domain natural language generation models.” In AAAI 2020.

16. Cho, W., Zhang, P., Zhang, Y., Li, X., Galley, M., Brockett, C., Wang, M., & Gao, J. (2018). “Towards coherent and cohesive long-form text generation.” arXiv preprint.

17. Perez, E., Karamcheti, S., Fergus, R., Weston, J., Kiela, D., & Cho, K. (2019). “Finding generalizable evidence by learning to convince Q&A models.” arXiv preprint.

18. Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). “Deep reinforcement learning from human preferences.” In Advances in Neural Information Processing Systems, 2017.

19. Ibarz, B., Leike, J., Pohlen, T., Irving, G., Legg, S., & Amodei, D. (2018). “Reward learning from human preferences and demonstrations in Atari.” In Advances in Neural Information Processing Systems, 2018.

20. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Agarwal, S. (2020). “Language models are few-shot learners.” arXiv preprint.

21. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., ... & Liu, P. J. (2019). “Exploring the limits of transfer learning with a unified text-to-text transformer.” arXiv preprint.

22. Zhang, Y., Li, D., Wang, Y., Fang, Y., & Xiao, W. (2019). “Exploring the limits of transfer learning with a unified text-to-text transformer.” In Applied Sciences.

23. Christiano, P., Shlegeris, B., & Amodei, D. (2018). “Supervising strong learners by amplifying weak experts.” arXiv preprint.

Nisan Stiennon, Paul Christiano, Daniel Ziegler, Ryan Lowe, Jeffrey Wu, Chelsea Voss, Long Ouyang

We’d like to thank the following people who gave feedback on various iterations of the blog post: Douwe Kiela, Zach Lipton, Alex Irpan, Jack Clark, Jacob Hilton, Raul Puri, Miles Brundage, Greg Brockman, Ilya Sutskever, Kelly Sims, Wojciech Kryscinski, and Dzmitry Bahdanau. We’d also like to thank Justin Jay Wang for driving the blog post design, Ashley Pilipiszyn for editing, Alec Radford and Dario Amodei for guidance on the project, Shan Carter for help designing the main diagram, Gretchen Krueger for co-writing the model card, Beth Barnes for help with labeler hiring and general encouragement, and many other people at OpenAI for training our large pre-trained models, supporting us through computing infrastructure improvements and maintenance, and writing fast GPU kernels. Finally, we’d like to thank all of our contractors for providing the data that was essential for training the models in this post.
