Introducing text and code embeddings

Embeddings are numerical representations of concepts converted to number sequences, which make it easy for computers to understand the relationships between those concepts. Our embeddings outperform top models in 3 standard benchmarks, including a 20% relative improvement in code search.

Embeddings are useful for working with natural language and code, because they can be readily consumed and compared by other machine learning models and algorithms like clustering or search.

Embeddings that are numerically similar are also semantically similar. For example, the embedding vector of “canine companions say” will be more similar to the embedding vector of “woof” than that of “meow.”

The new endpoint uses neural network models, which are descendants of GPT‑3, to map text and code to a vector representation—“embedding” them in a high-dimensional space. Each dimension captures some aspect of the input.

The new `/embeddings` endpoint in the OpenAI API provides text and code embeddings with a few lines of code.

We’re releasing three families of embedding models, each tuned to perform well on different functionalities: text similarity, text search, and code search. The models take either text or code as input and return an embedding vector.

| Model family | Models | Use cases |
| --- | --- | --- |
| Text similarity: captures semantic similarity between pieces of text. | `text-similarity-{ada, babbage, curie, davinci}-001` | Clustering, regression, anomaly detection, visualization |
| Text search: semantic information retrieval over documents. | `text-search-{ada, babbage, curie, davinci}-{query, doc}-001` | Search, context relevance, information retrieval |
| Code search: find relevant code with a query in natural language. | `code-search-{ada, babbage}-{code, text}-001` | Code search and relevance |

## Text similarity models

Text similarity models provide embeddings that capture the semantic similarity of pieces of text. These models are useful for many tasks, including clustering, data visualization, and classification.
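To make the clustering use case concrete, here is a minimal sketch of k-means run on embedding-like vectors. Everything in it is an assumption for illustration: the vectors are synthetic 4-dimensional points rather than real API output, and `kmeans` is a small hand-written helper, not part of the OpenAI API. In practice you would likely use a library implementation.

```python
import numpy as np

def kmeans(X, k, steps=20, seed=0):
    """Plain k-means: alternate nearest-centroid assignment and centroid update."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points (fancy indexing returns a copy).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(steps):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Synthetic stand-ins for embeddings: two well-separated groups of 20 vectors.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, scale=0.1, size=(20, 4))
group_b = rng.normal(loc=5.0, scale=0.1, size=(20, 4))
X = np.vstack([group_a, group_b])

labels, centroids = kmeans(X, k=2)
```

With groups this far apart, the first 20 vectors should share one label and the last 20 the other. Real embeddings are much higher-dimensional and rarely separate this cleanly.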

The following interactive visualization shows embeddings of text samples from the DBpedia dataset:

To compare the similarity of two pieces of text, you simply use the dot product on the text embeddings. When the embeddings have unit length, the result is a “similarity score” between –1 and 1 that equals the “cosine similarity,” and a higher number means more similarity. In most applications, the embeddings can be pre-computed, and then the dot product comparison is extremely fast to carry out.

```python
import openai
import numpy as np

# Embed both strings in a single request.
resp = openai.Embedding.create(
    input=["feline friends say", "meow"],
    engine="text-similarity-davinci-001")

embedding_a = resp['data'][0]['embedding']
embedding_b = resp['data'][1]['embedding']

# The dot product of the two embeddings is the similarity score.
similarity_score = np.dot(embedding_a, embedding_b)
```
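The link between the dot product and cosine similarity can be checked without calling the API. The sketch below uses two toy 3-dimensional vectors (made-up values, not real embeddings) and a hypothetical `cosine_similarity` helper to show that the two measures coincide once the vectors are scaled to unit length.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in [-1, 1]."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for embeddings (illustrative values only).
a = np.array([3.0, 4.0, 0.0])
b = np.array([4.0, 3.0, 0.0])

# Scale each vector to unit length.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

raw_dot = float(np.dot(a, b))              # depends on the vector lengths
unit_dot = float(np.dot(a_unit, b_unit))   # matches the cosine similarity
cos = cosine_similarity(a, b)
```

Here `raw_dot` is 24 while `unit_dot` and `cos` are both 0.96, so the plain dot product is only a bounded similarity score when the inputs are already unit length.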

One popular use of embeddings is to use them as features in machine learning tasks, such as classification. In machine learning literature, when using a linear classifier, this classification task is called a “linear probe.” Our text similarity models achieve new state-of-the-art results on linear probe classification in SentEval (Conneau et al., 2018), a commonly used benchmark for evaluating embedding quality.
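As a rough illustration of a linear probe, the sketch below fits a logistic regression classifier on fixed feature vectors with plain gradient descent. It is not the SentEval setup: the "embeddings" are synthetic 8-dimensional vectors, the helper names `train_linear_probe` and `predict` are hypothetical, and the reported accuracy is on the training data only.

```python
import numpy as np

def train_linear_probe(X, y, lr=0.5, steps=500):
    """Fit logistic regression (a linear classifier) on frozen feature vectors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid of the logits
        error = probs - y                           # gradient of the log loss w.r.t. logits
        w -= lr * (X.T @ error) / len(y)
        b -= lr * error.mean()
    return w, b

def predict(X, w, b):
    """Predict class 1 where the linear score is positive."""
    return (np.asarray(X, dtype=float) @ w + b > 0).astype(int)

# Synthetic stand-ins for embeddings: two classes centered at +1 and -1.
rng = np.random.default_rng(0)
dim = 8
X_pos = rng.normal(loc=1.0, scale=0.3, size=(50, dim))
X_neg = rng.normal(loc=-1.0, scale=0.3, size=(50, dim))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 50 + [0] * 50)

w, b = train_linear_probe(X, y)
accuracy = float((predict(X, w, b) == y).mean())
```

The embeddings themselves are never updated; only the weights `w` and bias `b` are trained, which is what makes the probe a measure of how much information the embeddings already carry. A real evaluation would score the probe on held-out examples.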

## Text search models

Text search models provide embeddings that enable large-scale search tasks, like finding a relevant document among a collection of documents given a text query. Embeddings for the documents and the query are produced separately, and then cosine similarity is used to compare the query with each document.

Embedding-based search can generalize better than word overlap techniques used in classical keyword search, because it captures the semantic meaning of text and is less sensitive to exact phrases or words. We evaluate the text search model’s performance on the BEIR (Thakur et al., 2021) search evaluation suite and obtain better search performance than previous methods. Our text search guide provides more details on using embeddings for search tasks.
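The ranking step of embedding-based search can be sketched in a few lines. The example assumes the query and document embeddings have already been computed (in the real API they would come from the paired `-query` and `-doc` models) and uses toy 3-dimensional vectors instead. `rank_documents` is a hypothetical helper, not an API function.

```python
import numpy as np

def rank_documents(query_embedding, doc_embeddings):
    """Return document indices from most to least similar, plus all scores."""
    q = np.asarray(query_embedding, dtype=float)
    docs = np.asarray(doc_embeddings, dtype=float)
    # Normalize so the dot product below is the cosine similarity.
    q = q / np.linalg.norm(q)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = docs @ q
    order = np.argsort(-scores)  # descending similarity
    return order, scores

# Toy stand-ins for precomputed document embeddings (one row per document).
doc_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
])
# Toy stand-in for the query embedding.
query_embedding = np.array([0.0, 2.0, 0.0])

order, scores = rank_documents(query_embedding, doc_embeddings)
```

For these vectors the documents rank as 1, 2, 0 with scores 1.0, 0.8, and 0.0. Because document embeddings can be computed once and stored, only the query needs to be embedded at search time.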

## Code search models

Code search models provide code and text embeddings for code search tasks. Given a collection of code blocks, the task is to find the relevant code block for a natural language query. We evaluate the code search models on the CodeSearchNet (Husain et al., 2019) evaluation suite, where our embeddings achieve significantly better results than prior methods. Check out the code search guide to use embeddings for code search.

## Examples of the embeddings API in action

### JetBrains Research

JetBrains Research’s Astroparticle Physics Lab analyzes data like The Astronomer’s Telegram and NASA’s GCN Circulars, which are reports that contain astronomical events that can’t be parsed by traditional algorithms.

Powered by OpenAI’s embeddings of these astronomical reports, researchers are now able to search for events like “crab pulsar bursts” across multiple databases and publications. Embeddings also achieved 99.85% accuracy on data source classification through k-means clustering.

### FineTune Learning

FineTune Learning is a company building hybrid human-AI solutions for learning, like adaptive learning loops that help students reach academic standards.

OpenAI’s embeddings significantly improved the task of finding textbook content based on learning objectives. Achieving a top-5 accuracy of 89.1%, OpenAI’s text-search-curie embeddings model outperformed previous approaches like Sentence-BERT (64.5%). While human experts are still better, the FineTune team is now able to label entire textbooks in a matter of seconds, in contrast to the hours that it took the experts.

### Fabius

Fabius helps companies turn customer conversations into structured insights that inform planning and prioritization. OpenAI’s embeddings allow companies to more easily find and tag customer call transcripts with feature requests.

For instance, customers might use words like “automated” or “easy to use” to ask for a better self-service platform. Previously, Fabius was using fuzzy keyword search to attempt to tag those transcripts with the self-service platform label. With OpenAI’s embeddings, they’re now able to find 2x more examples in general, and 6x–10x more examples for features with abstract use cases that don’t have a clear keyword customers might use.

All API customers can get started with the embeddings documentation for using embeddings in their applications.

Arvind Neelakantan, Lilian Weng, Boris Power, Joanne Jang
