We’ve obtained state-of-the-art results on a suite of diverse language tasks with a scalable, task-agnostic system, which we’re also releasing. Our approach is a combination of two existing ideas: transformers and unsupervised pre-training. These results provide a convincing example that pairing supervised learning methods with unsupervised pre-training works very well; this is an idea that many have explored in the past, and we hope our result motivates further research into applying this idea on larger and more diverse datasets.

| Dataset | Task | SOTA | Ours |
| --- | --- | --- | --- |
| SNLI | Textual entailment | 89.3 | 89.9 |
| MNLI matched | Textual entailment | 80.6 | 82.1 |
| MNLI mismatched | Textual entailment | 80.1 | 81.4 |
| SciTail | Textual entailment | 83.3 | 88.3 |
| QNLI | Textual entailment | 82.3 | 88.1 |
| RTE | Textual entailment | 61.7 | 56.0 |
| STS-B | Semantic similarity | 81.0 | 82.0 |
| QQP | Semantic similarity | 66.1 | 70.3 |
| MRPC | Semantic similarity | 86.0 | 82.3 |
| RACE | Reading comprehension | 53.3 | 59.0 |
| ROCStories | Commonsense reasoning | 77.6 | 86.5 |
| COPA | Commonsense reasoning | 71.2 | 78.6 |
| SST-2 | Sentiment analysis | 93.2 | 91.3 |
| CoLA | Linguistic acceptability | 35.0 | 45.4 |
| GLUE | Multi task benchmark | 68.9 | 72.8 |

Our system works in two stages: first we train a transformer model on a very large amount of data in an unsupervised manner, using language modeling as a training signal; then we fine-tune this model on much smaller supervised datasets to help it solve specific tasks. We developed this approach following our sentiment neuron work, in which we noted that unsupervised learning techniques can yield surprisingly discriminative features when trained on enough data. Here, we wanted to further explore this idea: can we develop one model, train it in an unsupervised way on a large amount of data, and then fine-tune the model to achieve good performance on many different tasks? Our results indicate that this approach works surprisingly well; the same core model can be fine-tuned for very different tasks with minimal adaptation.
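For reference, the two stages can be written as a pair of objectives to maximize. This is a sketch in standard notation, not a definition taken from this post: the unlabeled token corpus $\mathcal{U} = \{u_1, \ldots, u_n\}$, the context window size $k$, the model parameters $\Theta$, and the labeled dataset $\mathcal{C}$ of inputs $x$ with labels $y$ are all symbols introduced here for illustration.

```latex
% Stage 1: unsupervised pre-training with a language modeling objective
L_{\text{pretrain}}(\mathcal{U}) = \sum_{i} \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)

% Stage 2: supervised fine-tuning, starting from the pre-trained \Theta
L_{\text{finetune}}(\mathcal{C}) = \sum_{(x, y) \in \mathcal{C}} \log P\left(y \mid x; \Theta\right)
```

The point of the two-stage split is that the first sum runs over a very large unlabeled corpus, while the second runs over a much smaller labeled one and reuses the parameters learned in the first.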
This work builds on the approach introduced in Semi-supervised Sequence Learning, which showed how to improve document classification performance by using unsupervised pre-training of an LSTM followed by supervised fine-tuning. It also extends ULMFiT, research that shows how a single dataset-agnostic LSTM language model can be fine-tuned to get state-of-the-art performance on a variety of document classification datasets; our work shows how a Transformer-based model can be used in this approach to succeed at a broader range of tasks beyond document classification, such as commonsense reasoning, semantic similarity, and reading comprehension. It is also similar to, but more task-agnostic than, ELMo, which incorporates pre-training but uses task-customized architectures to get state-of-the-art results on a broad suite of tasks.
Very little tuning was used to achieve our results. All datasets use a single forward language model, without any ensembling, and the majority of the reported results use the exact same hyperparameter settings.
A result we are particularly excited about is the performance of our approach on three datasets (COPA, RACE, and ROCStories) designed to test commonsense reasoning and reading comprehension. Our model obtains new state-of-the-art results on these datasets by a wide margin. These datasets are thought to require multi-sentence reasoning and significant world knowledge to solve, suggesting that our model improves these skills predominantly via unsupervised learning. This suggests there’s hope for developing complex language understanding capabilities via unsupervised techniques.
## Why unsupervised learning?
Supervised learning is at the core of most of the recent success of machine learning. However, it can require large, carefully cleaned, and expensive-to-create datasets to work well. Unsupervised learning is attractive because of its potential to address these drawbacks. Since unsupervised learning removes the bottleneck of explicit human labeling, it also scales well with current trends of increasing compute and availability of raw data. Unsupervised learning is a very active area of research, but practical uses of it are often still limited.
There’s been a recent push to try to further language capabilities by using unsupervised learning to augment systems with large amounts of unlabeled data; representations of words trained via unsupervised techniques can use large datasets consisting of terabytes of information and, when integrated with supervised learning, improve performance on a wide range of NLP tasks. Until recently, these unsupervised techniques for NLP (for example, GloVe and word2vec) used simple models (word vectors) and training signals (the local co-occurrence of words). Skip-Thought Vectors is a notable early demonstration of the potential improvements more complex approaches can realize. Newer techniques are now boosting performance further. These include the use of pre-trained sentence representation models, contextualized word vectors (notably ELMo and CoVe), and approaches which use customized architectures to fuse unsupervised pre-training with supervised fine-tuning, like our own.
We also noticed we can use the underlying language model to begin to perform tasks without ever training on them. For example, performance on tasks like picking the right answer to a multiple choice question steadily increases as the underlying language model improves. While the absolute performance of these methods is still often quite low compared to the supervised state-of-the-art (for question answering it is still outperformed by a simple sliding-window baseline), it is encouraging that this behavior is robust across a broad set of tasks. Randomly initialized networks containing no information about the task and the world perform no better than random using these heuristics. This provides some insight into why generative pre-training can improve performance on downstream tasks.
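One heuristic of this kind can be sketched as follows: score each candidate answer by the average log-probability the language model assigns to its tokens given the context, and pick the highest-scoring candidate. This is a minimal sketch under stated assumptions, not the released system. The `token_logprobs` interface, the `pick_answer` helper, and the toy scorer (a stand-in that simply favors words already seen in the context) are all hypothetical names introduced here.

```python
import math
import re
from typing import Callable, List, Sequence

# Hypothetical interface: given a context and a candidate answer, return the
# model's log-probability for each token of the answer.
TokenLogProbs = Callable[[str, str], List[float]]


def pick_answer(context: str, candidates: Sequence[str],
                token_logprobs: TokenLogProbs) -> int:
    """Return the index of the candidate with the highest average token log-probability."""
    def avg_logprob(answer: str) -> float:
        lps = token_logprobs(context, answer)
        # An answer with no tokens can never win.
        return sum(lps) / len(lps) if lps else float("-inf")

    scores = [avg_logprob(c) for c in candidates]
    return max(range(len(candidates)), key=lambda i: scores[i])


def toy_token_logprobs(context: str, answer: str) -> List[float]:
    """Toy stand-in, NOT a real language model: words seen in the context are likelier."""
    seen = set(re.findall(r"[a-z]+", context.lower()))
    return [math.log(0.5 if tok in seen else 0.05)
            for tok in re.findall(r"[a-z]+", answer.lower())]


context = "The cat sat on the mat. Where did the cat sit?"
choice = pick_answer(context, ["in the car", "on the mat"], toy_token_logprobs)
print(choice)
```

Averaging over tokens, rather than summing, keeps longer answers from being penalized merely for their length. With a real model, `toy_token_logprobs` would be replaced by a call that conditions on the context and reads off the per-token log-probabilities of the answer.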
We can also use the existing language functionality in the model to perform sentiment analysis. For the Stanford Sentiment Treebank dataset, which consists of sentences from positive and negative movie reviews, we can use the language model to guess whether a review is positive or negative by inputting the word “very” after the sentence and seeing whether the model predicts the word “positive” or “negative” as more likely. This approach, without adapting the model at all to the task, performs on par with classic baselines, at around 80% accuracy.
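The “very” trick can be sketched in a few lines. The code below is an illustration under stated assumptions, not the released code: `next_word_prob` is a hypothetical callable standing in for the language model, and `toy_next_word_prob` is a cue-word counter used only so the example runs end to end.

```python
import math
from typing import Callable

# Hypothetical interface: probability that `word` comes next after `context`.
NextWordProb = Callable[[str, str], float]


def zero_shot_sentiment(sentence: str, next_word_prob: NextWordProb) -> str:
    """Append "very" to the sentence and compare the two candidate next words."""
    prompt = f"{sentence} very"
    p_positive = next_word_prob(prompt, "positive")
    p_negative = next_word_prob(prompt, "negative")
    return "positive" if p_positive >= p_negative else "negative"


def toy_next_word_prob(context: str, word: str) -> float:
    """Toy stand-in, NOT a real language model: counts a few cue words."""
    good = {"great", "wonderful", "moving"}
    bad = {"dull", "boring", "awful"}
    tokens = set(context.lower().replace(".", "").replace(",", "").split())
    score = len(tokens & good) - len(tokens & bad)
    if word == "negative":
        score = -score
    return 1.0 / (1.0 + math.exp(-score))  # squash the score into (0, 1)


print(zero_shot_sentiment("A wonderful, moving film.", toy_next_word_prob))
print(zero_shot_sentiment("A dull and boring plot.", toy_next_word_prob))
```

Nothing in `zero_shot_sentiment` is trained on sentiment labels; all task knowledge has to come from whatever model sits behind `next_word_prob`.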
Our work is also a validation of the robustness and usefulness of the transformer architecture, indicating that it is sufficiently flexible to achieve state-of-the-art results on a wide range of tasks without requiring complicated task-specific customization or hyperparameter tuning.
This project has a few outstanding issues which are worth noting:
## Appendix: Dataset examples
We’re increasingly interested in understanding the relationship between the compute we expend on training models and the resulting output. The total compute used to train this model was 0.96 petaflop days (pfs-days).
`8 P600 GPUs * 30 days * 12 TFLOPS/GPU * 0.33 utilization ≈ 0.96 pfs-days`
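The estimate can be checked by multiplying the factors out. The sketch below assumes 8 GPUs and takes the other figures as quoted; the `pfs_days` helper is a name introduced here for illustration.

```python
def pfs_days(num_gpus: int, days: float, tflops_per_gpu: float,
             utilization: float) -> float:
    """Petaflop/s-days: sustained TFLOPS across all GPUs, times days, divided by 1000."""
    sustained_tflops = num_gpus * tflops_per_gpu * utilization
    return sustained_tflops * days / 1000.0  # 1 petaflop/s = 1000 teraflop/s


estimate = pfs_days(num_gpus=8, days=30, tflops_per_gpu=12, utilization=0.33)
print(f"{estimate:.2f} pfs-days")
```

With these inputs the product comes to roughly 0.95 pfs-days, in line with the quoted 0.96; the small gap presumably reflects rounding in the stated inputs.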