# Language models can explain neurons in language models

We use GPT‑4 to automatically write explanations for the behavior of neurons in large language models and to score those explanations. We release a dataset of these (imperfect) explanations and scores for every neuron in GPT‑2.

Language models have become more capable and more broadly deployed, but our understanding of how they work internally is still very limited. For example, it might be difficult to detect from their outputs whether they use biased heuristics or engage in deception. Interpretability research aims to uncover additional information by looking inside the model.

One simple approach to interpretability research is to first understand what the individual components (neurons and attention heads) are doing. This has traditionally required humans to manually inspect neurons to figure out what features of the data they represent. This process doesn’t scale well: it’s hard to apply it to neural networks with tens or hundreds of billions of parameters. We propose an automated process that uses GPT‑4 to produce and score natural language explanations of neuron behavior and apply it to neurons in another language model.

This work is part of the third pillar of our approach to alignment research⁠: we want to automate the alignment research work itself. A promising aspect of this approach is that it scales with the pace of AI development. As future models become increasingly intelligent and helpful as assistants, we will find better explanations.

Our methodology consists of running 3 steps on every neuron.

### Step 1: Generate explanation using GPT-4

Given a GPT-2 neuron, generate an explanation of its behavior by showing relevant text sequences and activations to GPT-4.
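To make the step concrete, here is a minimal sketch of how such an explanation request could be assembled: top-activating text excerpts are formatted with per-token activations, and GPT‑4 is asked to summarize the pattern. The prompt wording, the 0–10 activation scale, and the `openai` chat API call are illustrative assumptions for this sketch, not the exact setup we use.

```python
# Sketch: build an explanation prompt from (token, activation) pairs for one
# GPT-2 neuron and ask GPT-4 for a short natural-language explanation.
# Prompt wording and the 0-10 scale are assumptions, not the exact format used.
from openai import OpenAI

client = OpenAI()

def format_excerpt(tokens, activations, max_act):
    # Quantize activations to 0-10 so the explainer sees small integers.
    scaled = [round(10 * a / max_act) if max_act > 0 else 0 for a in activations]
    return "\n".join(f"{tok}\t{s}" for tok, s in zip(tokens, scaled))

def explain_neuron(excerpts):
    """excerpts: list of (tokens, activations) pairs for one neuron."""
    max_act = max(max(acts) for _, acts in excerpts)
    body = "\n\n".join(format_excerpt(t, a, max_act) for t, a in excerpts)
    prompt = (
        "The following are text excerpts with per-token activations (0-10) "
        "of a single neuron in a language model.\n\n"
        f"{body}\n\n"
        "In one short phrase, what pattern does this neuron respond to?"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```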


### Step 2: Simulate using GPT-4

Simulate what a neuron that fired for the explanation would do, again using GPT-4.

### Step 3: Compare

Score the explanation based on how well the simulated activations match the real activations.
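The sketch below illustrates steps 2 and 3 under simplifying assumptions: the simulator prompt format and its output parsing are hypothetical, and plain Pearson correlation stands in for our actual correlation-based scoring, which is more involved.

```python
# Sketch of steps 2 and 3: predict per-token activations from the explanation
# alone, then score the explanation by how well the predictions correlate with
# the neuron's real activations. Prompt wording and plain Pearson correlation
# are simplifying assumptions, not the exact method used.
import numpy as np
from openai import OpenAI

client = OpenAI()

def simulate_activations(explanation, tokens):
    """Ask the simulator model to guess a 0-10 activation for each token."""
    prompt = (
        f"A neuron in a language model activates on: {explanation}\n"
        "For each token below, output the token, a tab, and a predicted "
        "activation from 0 to 10, one token per line.\n\n" + "\n".join(tokens)
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    preds = []
    for line in resp.choices[0].message.content.splitlines():
        parts = line.rsplit("\t", 1)
        try:
            preds.append(float(parts[-1]))
        except ValueError:
            preds.append(0.0)  # fall back to 0 for malformed lines
    return preds

def score_explanation(real, simulated):
    """Correlation between real and simulated activations (1 = perfect)."""
    n = min(len(real), len(simulated))
    real = np.asarray(real[:n], dtype=float)
    sim = np.asarray(simulated[:n], dtype=float)
    if real.std() == 0 or sim.std() == 0:
        return 0.0  # degenerate case: no variation to explain
    return float(np.corrcoef(real, sim)[0, 1])
```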

Using our scoring methodology, we can start to measure how well our techniques work for different parts of the network and try to improve the technique for parts that are currently poorly explained. For example, our technique works poorly for larger models, possibly because later layers are harder to explain.

Although the vast majority of our explanations score poorly, we believe we can now use ML techniques to further improve our ability to produce explanations. For example, we found we were able to improve scores by iterating on explanations (asking GPT‑4 to propose counterexamples and then revising the explanation in light of their activations), by using more capable models to give explanations, and by changing the architecture of the explained model.

We are open-sourcing our datasets and visualization tools for GPT‑4‑written explanations of all 307,200 neurons in GPT‑2, as well as code for explanation and scoring using publicly available models on the OpenAI API. We hope the research community will develop new techniques for generating higher-scoring explanations and better tools for exploring GPT‑2 using explanations.

We found over 1,000 neurons with explanations that scored at least 0.8, meaning that according to GPT‑4 they account for most of the neuron’s top-activating behavior. Most of these well-explained neurons are not very interesting. However, we also found many interesting neurons that GPT‑4 didn't understand. We hope as explanations improve we may be able to rapidly uncover interesting qualitative understanding of model computations.
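As a usage sketch, once explanations and scores are loaded it is straightforward to pull out the well-explained neurons and compare layers. The local file name and the (layer, score) record layout below are assumptions for illustration, not the released dataset's exact schema.

```python
# Sketch: explore explanation scores once they are loaded locally.
# The JSON file name and the record layout (with "layer" and "score" fields)
# are assumptions for illustration, not the released dataset's schema.
import json
from collections import defaultdict

with open("neuron_explanations.json") as f:  # hypothetical local export
    records = json.load(f)

# Neurons whose explanation scores at least 0.8, i.e. the explanation
# accounts for most of the neuron's top-activating behavior.
well_explained = [r for r in records if r["score"] >= 0.8]
print(f"{len(well_explained)} neurons scored >= 0.8")

# Average score per layer, to see which parts of the network are
# currently well explained and which are not.
by_layer = defaultdict(list)
for r in records:
    by_layer[r["layer"]].append(r["score"])
for layer in sorted(by_layer):
    scores = by_layer[layer]
    print(f"layer {layer:2d}: mean score {sum(scores) / len(scores):.3f}")
```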

Interactive example: a passage about Japanese Kit Kat flavors viewed through neurons at different layers. Explanations range from surface-level patterns such as “uppercase ‘K’ followed by various combinations of letters” in early layers to “parts of words and phrases related to brand names and businesses” and “food-related terms and descriptions” in later layers; neurons in higher layers tend to represent more abstract features.

Our method currently has many limitations, which we hope can be addressed in future work.

We are excited about extensions and generalizations of our approach. Ultimately, we would like to use models to form, test, and iterate on fully general hypotheses just as an interpretability researcher would.

Eventually we want to interpret our largest models as a way to detect alignment and safety problems before and after deployment. However, we still have a long way to go before these techniques can surface behaviors like dishonesty.

Jan Leike, Jeffrey Wu, Steven Bills, William Saunders, Leo Gao, Henk Tillman, Daniel Mossing

Thomas Degry, Nick Cammarata

Hannah Wong, Greg Brockman, Ilya Sutskever, Kendra Rimbach
