Improving language understanding with unsupervised learning

We’ve obtained state-of-the-art results on a suite of diverse language tasks with a scalable, task-agnostic system, which we’re also releasing. Our approach is a combination of two existing ideas: transformers and unsupervised pre-training. These results provide a convincing example that pairing supervised learning methods with unsupervised pre-training works very well; this is an idea that many have explored in the past, and we hope our result motivates further research into applying this idea on larger and more diverse datasets.

| Dataset | Task | SOTA | Ours |
|---|---|---|---|
| SNLI | Textual entailment | 89.3 | 89.9 |
| MNLI matched | Textual entailment | 80.6 | 82.1 |
| MNLI mismatched | Textual entailment | 80.1 | 81.4 |
| SciTail | Textual entailment | 83.3 | 88.3 |
| QNLI | Textual entailment | 82.3 | 88.1 |
| RTE | Textual entailment | 61.7 | 56.0 |
| STS-B | Semantic similarity | 81.0 | 82.0 |
| QQP | Semantic similarity | 66.1 | 70.3 |
| MRPC | Semantic similarity | 86.0 | 82.3 |
| RACE | Reading comprehension | 53.3 | 59.0 |
| ROCStories | Commonsense reasoning | 77.6 | 86.5 |
| COPA | Commonsense reasoning | 71.2 | 78.6 |
| SST-2 | Sentiment analysis | 93.2 | 91.3 |
| CoLA | Linguistic acceptability | 35.0 | 45.4 |
| GLUE | Multi task benchmark | 68.9 | 72.8 |

Our system works in two stages: first we train a transformer model on a very large amount of data in an unsupervised manner, using language modeling as a training signal; then we fine-tune this model on much smaller supervised datasets to help it solve specific tasks. We developed this approach following our sentiment neuron work, in which we noted that unsupervised learning techniques can yield surprisingly discriminative features when trained on enough data. Here, we wanted to further explore this idea: can we develop one model, train it in an unsupervised way on a large amount of data, and then fine-tune the model to achieve good performance on many different tasks? Our results indicate that this approach works surprisingly well; the same core model can be fine-tuned for very different tasks with minimal adaptation.
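In symbols, the two stages can be sketched as two log-likelihood objectives, each maximized in turn. The notation is an assumption introduced here for illustration and is not taken from this post: $\mathcal{U} = (u_1, \dots, u_n)$ is the unlabeled token stream, $k$ is the context window, $\mathcal{C}$ is a labeled dataset of inputs $x$ with labels $y$, and $\Theta$ denotes the model parameters shared by both stages.

```latex
\begin{align}
L_{\text{pretrain}}(\mathcal{U}) &= \sum_{i} \log P\left(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta\right) \\
L_{\text{finetune}}(\mathcal{C}) &= \sum_{(x, y) \in \mathcal{C}} \log P\left(y \mid x; \Theta\right)
\end{align}
```

The first objective is the usual language-modeling signal (predict the next token from the preceding ones); the second starts from the pre-trained $\Theta$ and adapts it to a specific supervised task.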

This work builds on the approach introduced in Semi-supervised Sequence Learning, which showed how to improve document classification performance by using unsupervised pre-training of an LSTM followed by supervised fine-tuning. It also extends ULMFiT, research that shows how a single dataset-agnostic LSTM language model can be fine-tuned to get state-of-the-art performance on a variety of document classification datasets; our work shows how a Transformer-based model can be used in this approach to succeed at a broader range of tasks beyond document classification, such as commonsense reasoning, semantic similarity, and reading comprehension. It is also similar to but more task-agnostic than ELMo, which incorporates pre-training but uses task-customized architectures to get state-of-the-art results on a broad suite of tasks.

Very little tuning was used to achieve our results. All datasets use a single forward language model, without any ensembling, and the majority of the reported results use the exact same hyperparameter settings.

A result we are particularly excited about is the performance of our approach on three datasets (COPA, RACE, and ROCStories) designed to test commonsense reasoning and reading comprehension. Our model obtains new state-of-the-art results on these datasets by a wide margin. These datasets are thought to require multi-sentence reasoning and significant world knowledge to solve, suggesting that our model improves these skills predominantly via unsupervised learning. This suggests there’s hope for developing complex language understanding capabilities via unsupervised techniques.

## Why unsupervised learning?

Supervised learning is at the core of most of the recent success of machine learning. However, it can require large, carefully cleaned, and expensive-to-create datasets to work well. Unsupervised learning is attractive because of its potential to address these drawbacks. Since unsupervised learning removes the bottleneck of explicit human labeling, it also scales well with current trends of increasing compute and availability of raw data. Unsupervised learning is a very active area of research, but practical uses of it are often still limited.

There’s been a recent push to try to further language capabilities by using unsupervised learning to augment systems with large amounts of unlabeled data; representations of words trained via unsupervised techniques can use large datasets consisting of terabytes of information and, when integrated with supervised learning, improve performance on a wide range of NLP tasks. Until recently, these unsupervised techniques for NLP (for example, GloVe and word2vec) used simple models (word vectors) and training signals (the local co-occurrence of words). Skip-Thought Vectors is a notable early demonstration of the potential improvements more complex approaches can realize. But new techniques are now being used which are further boosting performance. These include the use of pre-trained sentence representation models, contextualized word vectors (notably ELMo and CoVe), and approaches which use customized architectures to fuse unsupervised pre-training with supervised fine-tuning, like our own.

We also noticed we can use the underlying language model to begin to perform tasks without ever training on them. For example, performance on tasks like picking the right answer to a multiple choice question steadily increases as the underlying language model improves. While the absolute performance of these methods is still often quite low compared to the supervised state-of-the-art (for question answering it is still outperformed by a simple sliding-window baseline), it is encouraging that this behavior is robust across a broad set of tasks. Randomly initialized networks containing no information about the task and the world perform no better than random using these heuristics. This provides some insight into why generative pre-training can improve performance on downstream tasks.
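A multiple-choice heuristic of this kind might look like the sketch below. It is illustrative only: the scoring rule (highest average per-token log-probability) is an assumption rather than a rule stated in this post, and `TokenLogProbs`, `pick_answer`, and `toy_token_logprobs` are hypothetical names. The toy scorer stands in for a real language model.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: log-probability of each answer token given the context.
TokenLogProbs = Callable[[str, Sequence[str]], List[float]]


def pick_answer(context: str, candidates: Sequence[str],
                token_logprobs: TokenLogProbs) -> int:
    """Return the index of the candidate with the highest average token log-probability."""
    best_index, best_score = 0, float("-inf")
    for i, candidate in enumerate(candidates):
        tokens = candidate.split()
        scores = token_logprobs(context, tokens)
        # Average so that longer answers are not penalized for length alone.
        avg = sum(scores) / max(len(scores), 1)
        if avg > best_score:
            best_index, best_score = i, avg
    return best_index


def toy_token_logprobs(context: str, tokens: Sequence[str]) -> List[float]:
    """Stand-in for a language model: tokens already seen in the context score higher."""
    seen = {w.strip(".,?!").lower() for w in context.split()}
    return [-1.0 if t.strip(".,?!").lower() in seen else -3.0 for t in tokens]


context = "The cat sat on the mat. Where did the cat sit?"
choices = ["on the mat", "in the car"]
print(pick_answer(context, choices, toy_token_logprobs))
```

With a real model, `toy_token_logprobs` would be replaced by a call that scores the answer tokens conditioned on the passage and question; no task-specific training is involved.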

We can also use the existing language functionality in the model to perform sentiment analysis. For the Stanford Sentiment Treebank dataset, which consists of sentences from positive and negative movie reviews, we can use the language model to guess whether a review is positive or negative by inputting the word “very” after the sentence and seeing whether the model predicts the word “positive” or “negative” as more likely. This approach, without adapting the model at all to the task, performs on par with classic baselines, at roughly 80% accuracy.
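The “very” trick can be sketched in a few lines. This is a minimal illustration, not the released code: `NextWordProbs`, `zero_shot_sentiment`, and `toy_next_word_probs` are hypothetical names, and the toy function merely stands in for a real language model's next-word distribution.

```python
from typing import Callable, Dict

# Hypothetical interface: maps a prompt to probabilities for candidate next words.
NextWordProbs = Callable[[str], Dict[str, float]]


def zero_shot_sentiment(sentence: str, next_word_probs: NextWordProbs) -> str:
    """Append "very" and compare the probabilities of "positive" and "negative"."""
    prompt = f"{sentence} very"
    probs = next_word_probs(prompt)
    p_pos = probs.get("positive", 0.0)
    p_neg = probs.get("negative", 0.0)
    return "positive" if p_pos >= p_neg else "negative"


def toy_next_word_probs(prompt: str) -> Dict[str, float]:
    """Stand-in for a language model: counts a few cue words. Illustration only."""
    good = {"astonishing", "exhilarating", "great"}
    bad = {"dull", "boring", "awful"}
    words = {w.strip(".,!?").lower() for w in prompt.split()}
    pos_score = 1 + len(words & good)
    neg_score = 1 + len(words & bad)
    total = pos_score + neg_score
    return {"positive": pos_score / total, "negative": neg_score / total}


print(zero_shot_sentiment("The imagery is astonishing.", toy_next_word_probs))
```

The key design point is that the model itself is never adapted: the task is expressed entirely through the prompt and a comparison of two next-word probabilities.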

Our work is also a validation of the robustness and usefulness of the transformer architecture, indicating that it is sufficiently flexible to achieve state-of-the-art results on a wide range of tasks without requiring complicated task-specific customization or hyperparameter tuning.

This project has a few outstanding issues which are worth noting:

## Appendix: Dataset examples

| Dataset | Example | Label |
|---|---|---|
| SNLI | 1. A black race car starts up in front of a crowd of people. 2. A man is driving down a lonely road. | Contra. |
| SciTail | 1. Because type 1 diabetes is a relatively rare disease, you may wish to focus on prevention only if you know your child is at special risk for the disease. 2. Diabetes is unpreventable in the type one form but may be prevented by diet if it is of the second type. | Neutral |
| QNLI | Context: In meteorology, precipitation is any product of the condensation of atmospheric water vapor that falls under gravity. Statement: What causes precipitation to fall? | Entails |
| RTE | 1. Passions surrounding Germany’s final match turned violent when a woman stabbed her partner because she didn’t want to watch the game. 2. A woman passionately wanted to watch the game. | Contra. |
| STS-B | 1. They flew out of the nest in groups. 2. They flew into the nest together. | Similarity 2/5 |
| QQP | 1. What are natural numbers 2. What is the least natural number | Not same |
| MRPC | 1. If people took the pill daily, they would lower their risk of heart attack by 88 percent and of stroke by 80 percent, the scientists claim. 2. Taking the pill would lower the risk of heart attack by 88 percent and of stroke by 80 percent, the scientists said. | Same |
| RACE | In a small village in England about 150 years ago, a mail coach was standing on the street. It didn’t come to that village often. People had to pay a lot to get a letter. The person who sent the letter didn’t have to pay the postage, while the receiver had to. “Here’s a letter for Miss Alice Brown,” said the mailman. “ I’m Alice Brown,” a girl of about 18 said in a low voice. Alice looked at the envelope for a minute, and then handed it back to the mailman. “I’m sorry I can’t take it, I don’t have enough money to pay it”, she said. A gentleman standing around were very sorry for her. Then he came up and paid the postage for her. When the gentleman gave the letter to her, she said with a smile, “ Thank you very much, This letter is from Tom. I’m going to marry him. He went to London to look for work. I’ve waited a long time for this letter, but now I don’t need it, there is nothing in it.” “Really? How do you know that?” the gentleman said in surprise. “He told me that he would put some signs on the envelope. Look, sir, this cross in the corner means that he is well and this circle means he has found work. That’s good news.” The gentleman was Sir Rowland Hill. He didn’t forgot Alice and her letter. “The postage to be paid by the receiver has to be changed,” he said to himself and had a good plan. “The postage has to be much lower, what about a penny? And the person who sends the letter pays the postage. He has to buy a stamp and put it on the envelope.” he said . The government accepted his plan. Then the first stamp was put out in 1840. It was called the “Penny Black”. It had a picture of the Queen on it. The girl handed the letter back to the mailman because: 1. she didn’t know whose letter it was 2. she had no money to pay the postage 3. she received the letter but she didn’t want to open it 4. she had already known what was written in the letter | 4 |
| ROCStories | Karen was assigned a roommate her first year of college. Her roommate asked her to go to a nearby city for a concert. Karen agreed happily. The show was absolutely exhilarating. 1. Karen became good friends with her roommate. 2. Karen hated her roommate. | 1 |
| COPA | The man broke his toe. What was the CAUSE of this? 1. He got a hole in his sock. 2. He dropped a hammer on his foot. | 2 |
| SST-2 | Just the labor involved in creating the layered richness of the imagery in this chiaroscuro of madness and light is astonishing. | Positive |
| CoLA | As you eat the most, you want the least. | Not acceptable |

We’re increasingly interested in understanding the relationship between the compute we expend on training models and the resulting output. The total compute used to train this model was 0.96 petaflop days (pfs-days).

`8 P600 GPUs * 30 days * 12 TFLOPS/GPU * 0.33 utilization = .96 pfs-days`
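The estimate can be checked with a few lines of arithmetic. The sketch below assumes 8 GPUs, the count that is consistent with the reported total of roughly 0.96 pfs-days; the variable names are illustrative and not from the original post.

```python
# Illustrative names; the GPU count of 8 is an assumption consistent with the reported total.
num_gpus = 8
training_days = 30
peak_tflops_per_gpu = 12
utilization = 0.33

# Sustained throughput integrated over the training run, in teraflop/s-days.
tflops_days = num_gpus * training_days * peak_tflops_per_gpu * utilization

# 1 petaflop/s = 1000 teraflop/s.
pfs_days = tflops_days / 1000

print(f"{pfs_days:.2f} pfs-days")
```

The product comes to just over 0.95 pfs-days, in line with the roughly 0.96 figure quoted above.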
