Why language models hallucinate

At OpenAI, we’re working hard to make AI systems more useful and reliable. Even as language models become more capable, one challenge remains stubbornly hard to fully solve: hallucinations. By this we mean instances where a model confidently generates an answer that isn’t true. Our new research paper⁠(opens in a new window) argues that language models hallucinate because standard training and evaluation procedures reward guessing over acknowledging uncertainty.

ChatGPT also hallucinates. GPT‑5 has significantly fewer hallucinations especially when reasoning⁠, but they still occur. Hallucinations remain a fundamental challenge for all large language models, but we are working hard to further reduce them.

## What are hallucinations?

Hallucinations are plausible but false statements generated by language models. They can show up in surprising ways, even for seemingly straightforward questions. For example, when we asked a widely used chatbot for the title of the PhD dissertation by Adam Tauman Kalai (an author of this paper), it confidently produced three different answers—none of them correct. When we asked for his birthday, it gave three different dates, likewise all wrong.

## Teaching to the test

Hallucinations persist partly because current evaluation methods set the wrong incentives. While evaluations themselves do not directly cause hallucinations, most evaluations measure model performance in a way that encourages guessing rather than honesty about uncertainty.

Think about it like a multiple-choice test. If you do not know the answer but take a wild guess, you might get lucky and be right. Leaving it blank guarantees a zero. In the same way, when models are graded only on accuracy, the percentage of questions they get exactly right, they are encouraged to guess rather than say “I don’t know.”

As another example, suppose a language model is asked for someone’s birthday but doesn’t know. If it guesses “September 10,” it has a 1-in-365 chance of being right. Saying “I don’t know” guarantees zero points. Over thousands of test questions, the guessing model ends up looking better on scoreboards than a careful model that admits uncertainty.

For questions where there is a single “right answer,” one can consider three categories of responses: accurate responses, errors, and abstentions where the model does not hazard a guess. Abstaining is part of humility, one of OpenAI’s core values⁠. Most scoreboards prioritize and rank models based on accuracy, but errors are worse than abstentions. Our Model Spec⁠(opens in a new window) states that it is better to indicate uncertainty or ask for clarification than provide confident information that may be incorrect.

For a concrete example, consider the SimpleQA eval as an example from the GPT5 System Card⁠(opens in a new window).

Metricgpt-5-thinking-miniOpenAI o4-mini Abstention rate

(no specific answer is given)52%1% Accuracy rate

(right answer, higher is better)22%24% Error rate

(wrong answer, lower is better)26%75% Total 100%100%

In terms of accuracy, the older OpenAI o4-mini model performs slightly better. However, its error rate (i.e., rate of hallucination) is significantly higher. Strategically guessing when uncertain improves accuracy but increases errors and hallucinations.

When averaging results across dozens of evaluations, most benchmarks pluck out the accuracy metric, but this entails a false dichotomy between right and wrong. On simplistic evals like SimpleQA, some models achieve near 100% accuracy and thereby eliminate hallucinations. However, on more challenging evaluations and in real use, accuracy is capped below 100% because there are some questions whose answer cannot be determined for a variety of reasons such as unavailable information, limited thinking abilities of small models, or ambiguities that need to be clarified.

Nonetheless, accuracy-only scoreboards dominate leaderboards and model cards, motivating developers to build models that guess rather than hold back. That is one reason why, even as models get more advanced, they can still hallucinate, confidently giving wrong answers instead of acknowledging uncertainty.

#### A better way to grade evaluations

There is a straightforward fix. Penalize confident errors more than you penalize uncertainty, and give partial credit for appropriate expressions of uncertainty. This idea is not new. Some standardized tests have long used versions of negative marking for wrong answers or partial credit for leaving questions blank to discourage blind guessing. Several research groups have also explored evaluations that account for uncertainty and calibration.

Our point is different. It is not enough to add a few new uncertainty-aware tests on the side. The widely used, accuracy-based evals need to be updated so that their scoring discourages guessing. If the main scoreboards keep rewarding lucky guesses, models will keep learning to guess. Fixing scoreboards can broaden adoption of hallucination-reduction techniques, both newly developed and those from prior research.

## How hallucinations originate from next-word prediction

We’ve talked about why hallucinations are so hard to get rid of, but where do these highly-specific factual inaccuracies come from in the first place? After all, large pretrained models rarely exhibit other kinds of errors such as spelling mistakes and mismatched parentheses. The difference has to do with what kinds of patterns there are in the data.

Language models first learn through _pretraining_, a process of predicting the next word in huge amounts of text. Unlike traditional machine learning problems, there are no “true/false” labels attached to each statement. The model sees only positive examples of fluent language and must approximate the overall distribution.

It’s doubly hard to distinguish valid statements from invalid ones when you don’t have any examples labeled as invalid. But even with labels, some errors are inevitable. To see why, consider a simpler analogy. In image recognition, if millions of cat and dog photos are labeled as “cat” or “dog,” algorithms can learn to classify them reliably. But imagine instead labeling each pet photo by the pet’s birthday. Since birthdays are essentially random, this task would always produce errors, no matter how advanced the algorithm.

The same principle applies in pretraining. Spelling and parentheses follow consistent patterns, so errors there disappear with scale. But arbitrary low-frequency facts, like a pet’s birthday, cannot be predicted from patterns alone and hence lead to hallucinations. Our analysis explains which kinds of hallucinations should arise from next-word prediction. Ideally, further stages after pretraining should remove them, but this is not fully successful for reasons described in the previous section.

We hope that the statistical lens in our paper clarifies the nature of hallucinations and pushes back on common misconceptions:

Our latest models have lower hallucination rates, and we continue to work hard to further decrease the rates of confident errors output by our language models.

## Announcement contributors

Adam Kalai, Santosh Vempala (Georgia Tech), Ofir Nachum, Eddie Zhang, David Robinson, Saachi Jain, Eric Mitchell, Alex Beutel, Johannes Heidecke

GPT-5.5 System Card Safety Apr 23, 2026

Introducing OpenAI Privacy Filter Research Apr 22, 2026

Introducing GPT-Rosalind for life sciences research Research Apr 16, 2026

Our Research * Research Index * Research Overview * Research Residency * Economic Research

Latest Advancements * GPT-5.5 * GPT-5.4 * GPT-5.3 Instant * GPT-5.3-Codex

Safety * Safety Approach * Security & Privacy * Trust & Transparency

ChatGPT * Explore ChatGPT(opens in a new window) * Business * Enterprise * Education * Pricing(opens in a new window) * Download(opens in a new window)

Sora * Sora Overview * Features * Pricing * Sora log in(opens in a new window)

API Platform * Platform Overview * Pricing * API log in(opens in a new window) * Documentation(opens in a new window) * Developer Forum(opens in a new window)

For Business * Business Overview * Solutions * Contact Sales

Company * About Us * Our Charter * Foundation(opens in a new window) * Careers * Brand

Support * Help Center(opens in a new window)

More * News * Stories * Academy * Livestreams * Podcast * RSS

Terms & Policies * Terms of Use * Privacy Policy * Other Policies

(opens in a new window)(opens in a new window)(opens in a new window)(opens in a new window)(opens in a new window)(opens in a new window)(opens in a new window)

English United States

Why language models hallucinate

Ficci flags audit rule change proposal in Companies Act, warns of higher costs, reduced expertise

US State Department Highlights Allegations of AI Theft by Chinese Firms

$40 Billion from Google. $25 Billion from Amazon. Anthropic is now the most fought-over AI company on Earth

Indian IT Seeks Early Access to Anthropic's Claude Mythos AI Model

Latest Briefs