Learning concepts with energy functions

We’ve developed an energy-based model that can quickly learn to identify and generate instances of concepts, such as near, above, between, closest, and furthest, expressed as sets of two-dimensional points. Our model learns these concepts after only five demonstrations. We also show cross-domain transfer: we use concepts learned in a two-dimensional particle environment to solve tasks on a three-dimensional physics-based robot.

Many hallmarks of human intelligence, such as generalizing from limited experience, abstract reasoning and planning, analogical reasoning, creative problem solving, and the capacity for language, require the ability to consolidate experience into _concepts_, which act as basic building blocks of understanding and reasoning. Our technique enables agents to learn and extract concepts from tasks, then use these concepts to solve other tasks in various domains. For example, our model can use concepts learned in a two-dimensional particle environment to carry out the same task in a three-dimensional physics-based robotic environment, without retraining in the new environment.

This work uses energy functions to let our agents learn to _classify_ and _generate_ simple concepts, which they can use to solve tasks like navigating between two points in dissimilar environments. Examples of concepts include visual (“red” or “square”), spatial (“inside”, “on top of”), temporal (“slow”, “after”), and social (“aggressive”, “helpful”) concepts, among others. Once learned, these concepts act as basic building blocks of an agent’s understanding and reasoning, as shown in other research from DeepMind and Vicarious.

Energy functions let us build systems that can generate (left) and also identify (right) basic concepts, like the notion of a square.

Energy functions work by encoding a _preference_ over states of the world, which allows an agent with different available actions (changing torque vs. directly changing position, say) to learn a policy that works in different contexts; this roughly translates to the development of a conceptual understanding of simple things.

To create the energy function, we mathematically represent concepts as energy models. The idea of energy models is rooted in physics, with the intuition that observed events and states represent low-energy configurations.

We define an energy function E(x, a, w) for each concept in terms of:

* The state of the world x: sets of entities with their properties and positions (like the dots below, which have both positional and color properties).
* An attention mask a over the entities in the state, used for “identification”: it represents the model’s focus on some set of entities.
* The parameters w of the energy model, which are what our training procedure optimizes.

The energy model outputs a single non-negative number indicating whether the concept is satisfied (when energy is zero) or not (when energy is high). A concept is satisfied when the attention mask is focused on a set of entities that represents the concept, which requires both that the entities are in the correct positions (modification of x, or generation) and that the right entities are being focused on (modification of a, or identification).
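To make these ingredients concrete, here is a hand-written toy energy for a hypothetical “near” concept. This is a deliberately simplified sketch: in our model the energy is a learned neural network, and the `margin` parameter below is an invented illustration device.

```python
import numpy as np

def energy_near(x, a, margin=0.1):
    """Toy hand-written energy for a 'near' concept.

    x: (n, 2) array of 2-D entity positions (the state of the world).
    a: (n,) attention mask selecting the entities the concept applies to.
    Returns a non-negative scalar: zero when the two attended entities
    lie within `margin` of each other, and large otherwise.
    """
    attended = x[a > 0.5]  # the entities the mask focuses on
    assert len(attended) == 2, "this toy concept relates exactly two entities"
    dist = np.linalg.norm(attended[0] - attended[1])
    return max(0.0, dist - margin)  # zero iff the concept is satisfied

x = np.array([[0.0, 0.0], [0.05, 0.0], [3.0, 4.0]])
a = np.array([1.0, 1.0, 0.0])  # attend to the first two entities
print(energy_near(x, a))                          # 0.0: 'near' is satisfied
print(energy_near(x, np.array([1.0, 0.0, 1.0])))  # ~4.9: not satisfied
```

Lowering this energy by moving points corresponds to generation; lowering it by changing the attention mask corresponds to identification.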

We construct the energy function as a neural network based on the relational network architecture, which allows it to take an arbitrary number of entities as input. The parameters of this energy function are what our training procedure optimizes; other functions are derived implicitly from the energy function.
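To illustrate why a relational architecture handles a variable number of entities, here is a minimal sketch (the layer sizes and the final squaring are our own choices, not the paper’s exact network): a small MLP scores every ordered pair of per-entity features, and the pair scores are summed, so the same parameters apply whether there are four entities or forty.

```python
import numpy as np

rng = np.random.default_rng(0)

class RelationalEnergy:
    """Sketch of a relational-network-style energy E(x, a).

    An MLP is applied to every ordered pair of per-entity features
    (position concatenated with attention value), and its outputs are
    summed, so any number of entities can be handled.
    """

    def __init__(self, ent_dim=3, hidden=16):
        pair_dim = 2 * ent_dim
        self.w1 = rng.normal(size=(pair_dim, hidden)) * 0.1
        self.w2 = rng.normal(size=(hidden, 1)) * 0.1

    def __call__(self, x, a):
        # One feature row per entity: position plus attention value.
        feats = np.concatenate([x, a[:, None]], axis=1)
        total = 0.0
        for f_i in feats:            # sum the pair MLP over all ordered pairs
            for f_j in feats:
                h = np.tanh(np.concatenate([f_i, f_j]) @ self.w1)
                total += (h @ self.w2).item()
        return total ** 2            # squaring keeps the energy non-negative

E = RelationalEnergy()
x = rng.normal(size=(4, 2))                    # four 2-D entities
print(E(x, np.array([1.0, 0.0, 1.0, 0.0])))    # a single non-negative scalar
print(E(rng.normal(size=(7, 2)), np.ones(7)))  # same network, seven entities
```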

This approach lets us use energy functions to learn a single network that can perform both generation and recognition. This allows us to transfer concepts learned through generation to identification, and vice versa. (A similar effect is observed in animals via mirror neurons.)
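Here is a sketch of how one energy function can serve both roles, using a toy squared-distance energy for a hypothetical two-entity “near” concept (the function names and the finite-difference optimizer are our inventions; the real model uses a learned network): generation descends the energy with respect to the state x, while identification searches over attention masks with x held fixed.

```python
from itertools import combinations

import numpy as np

def energy(x, a):
    """Toy 'near' energy: squared distance between the two attended entities."""
    p, q = x[np.flatnonzero(a)]
    return float(np.sum((p - q) ** 2))

def generate(x, a, steps=200, lr=0.1, eps=1e-4):
    """Generation: move entities (gradient descent on x) until the energy is
    low. A finite-difference gradient keeps the sketch dependency-free."""
    x = x.astype(float).copy()
    for _ in range(steps):
        g = np.zeros_like(x)
        for idx in np.ndindex(*x.shape):
            d = np.zeros_like(x)
            d[idx] = eps
            g[idx] = (energy(x + d, a) - energy(x - d, a)) / (2 * eps)
        x -= lr * g
    return x

def identify(x):
    """Identification: keep x fixed and pick the two-entity attention mask
    with the lowest energy."""
    n = len(x)
    best = min(combinations(range(n), 2),
               key=lambda ij: energy(x, np.isin(np.arange(n), ij)))
    return np.isin(np.arange(n), best).astype(float)

x = np.array([[0.0, 0.0], [5.0, 0.0], [5.1, 0.1]])
a = np.array([1.0, 1.0, 0.0])
x_gen = generate(x, a)
print(energy(x_gen, a))  # ~0: entities 0 and 1 were pulled together
print(identify(x))       # [0. 1. 1.]: entities 1 and 2 are the nearest pair
```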

## Single network training

Our training data is composed of trajectories of (attention mask, state) pairs, which we generate ahead of time for the specific concepts we’d like our model to learn. We train our model by giving it a set of demonstrations (typically five) of a given concept, then giving it a new environment (x0) and asking it to predict the next state (x1) and the next attention mask (a). We optimize the energy function so that the next state and next attention mask found in the training data are assigned low energy values. As with generative models like variational autoencoders, the model is incentivized to learn values that usefully compress aspects of the task. We trained our model on a variety of concepts involving visual, spatial, proximal, and temporal relations, as well as quantification, in a two-dimensional particle environment.
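The training idea can be caricatured with a single scalar parameter (everything below, including the “preferred distance” concept, the parameter w, and the closed-form gradient, is an invented toy rather than the paper’s network or optimizer): we descend on the energy of the demonstrations so that demonstrated configurations end up with low energy.

```python
import numpy as np

rng = np.random.default_rng(1)

def energy(x, a, w):
    """Parametric toy energy: zero when the two attended entities sit at
    distance w apart; w plays the role of the learned parameters."""
    p, q = x[np.flatnonzero(a)]
    return (np.linalg.norm(p - q) - w) ** 2

# Five demonstrations of an unknown concept: pairs of points 2.0 apart.
demos = []
for _ in range(5):
    p = rng.normal(size=2)
    direction = rng.normal(size=2)
    direction /= np.linalg.norm(direction)
    demos.append((np.stack([p, p + 2.0 * direction]), np.array([1.0, 1.0])))

# Training: gradient descent on w so the demonstrations get low energy.
w, lr = 0.0, 0.1
for _ in range(100):
    # dE/dw of (d - w)^2 is -2 (d - w), averaged over the demonstrations.
    grad = np.mean([-2.0 * (np.linalg.norm(xd[0] - xd[1]) - w)
                    for xd, _ in demos])
    w -= lr * grad
print(w)  # ~2.0: the energy function has recovered the concept
print(np.mean([energy(xd, ad, w) for xd, ad in demos]))  # ~0 on the demos
```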

We evaluated our approach across a suite of tasks designed to test how well a single system can learn to identify and generate things united by the same concept. Our system can learn to classify and generate specific sets of spatial relationships, navigate entities through a scene in a specific way, and develop good judgments about concepts like quantity (one, two, three, or more than three) and proximity.

Models perform better when they can share experience between learning to generate concepts (by moving entities within the state vector x) and learning to identify them (by changing the attention mask over a fixed state vector): when we evaluated models trained on _both_ of these operations, they performed _better_ on each single operation than models trained on that operation alone. We also discovered indications of transfer learning: an energy function trained only in a recognition context performs well on generation, even without being explicitly trained to do so.

In the future, we’re excited to explore a wider variety of concepts learned in richer, three-dimensional environments, to integrate concepts with the decision-making policies of our agents (so far we have only looked at concepts as things learned from passive experience), and to explore connections between concepts and language understanding. If you are interested in this line of research, consider working at OpenAI!


Thanks to those who contributed to this paper and blog post:

Blog post: Prafulla Dhariwal, Alex Nichol, Alec Radford, Yura Burda, Jack Clark, Greg Brockman, Ilya Sutskever, Ashley Pilipiszyn

Originally published on OpenAI News.