
Building a Semantic Search Engine With OpenAI and Pinecone

March 23, 2023

In this post, we will walk through how to build a simple semantic search engine using an OpenAI embedding model and a Pinecone vector database. More specifically, we will see how to build searchthearxiv.com, a semantic search engine enabling students and researchers to search across more than 250,000 ML papers on arXiv using natural language. The principles covered will be general enough for you to apply the same techniques to your own dataset, so you can supercharge search across your own set of documents.

Step 0: Overview

Before we get our hands dirty, let us first break down the problem we are trying to solve into discrete steps that we can attack one by one:

  1. We need data. This could be in the form of raw PDF documents, an SQL database, or even a raw JSON file. In our case, we will be using the arXiv metadataset on Kaggle, which contains metadata (such as title and abstract) for every paper on arXiv in JSON format.
  2. We need embeddings. Once we have our data, we need to embed every entry (in our case, every paper), such that each one is represented by a high-dimensional vector.
  3. We need a vector index. Once we have our embeddings, we need to be able to efficiently search across them. Services like Pinecone can store millions of embeddings and allow you to do lightning-fast cosine similarity search given a query embedding (more on this later).
  4. We need an interface. The final step is to build an interface in which users can enter queries and search our database using natural language. Typically, this would be some sort of web frontend; here we will be using a simple command-line interface instead.

Step 1: Data

The arXiv dataset on Kaggle is maintained by Cornell University and is updated on a weekly basis. It contains every paper posted on arXiv across all STEM fields. We will be using a subset of this dataset, keeping only the ML papers. The latest version can be downloaded manually on the dataset page or, even better, using the Kaggle CLI:

kaggle datasets download -d Cornell-University/arxiv && unzip arxiv.zip

This will create a JSON file titled arxiv-metadata-oai-snapshot.json. The details of how we load and preprocess the data are quite boring and will hardly be applicable to your own dataset. (If you’re still curious, head over to the GitHub repo and inspect the code.) All you need to know is that we convert each JSON object to a custom Paper representation and filter out the ones published before 2012 and/or not belonging to any of the ML categories.

JSON_FILE_PATH = "arxiv-metadata-oai-snapshot.json"
CATEGORIES = ["cs.cv", "cs.lg", "cs.cl", "cs.ai", "cs.ne", "cs.ro"]
START_YEAR = 2012

print("Loading data...")
papers = list(load_data(JSON_FILE_PATH, CATEGORIES, START_YEAR))

The categories that we specify correspond roughly to all ML papers in the dataset and are the same ones used by Andrej Karpathy's arxiv-sanity-lite.
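If you'd rather not dig through the repo, a rough sketch of what such a load_data generator might look like is shown below. It is only a sketch: the real implementation converts each object to a Paper instance, whereas here we simply yield the raw dictionaries, and the field handling assumes the schema of the Kaggle snapshot (a "categories" string and a "versions" list per entry).

import json

def load_data(json_file_path, categories, start_year):
    # The snapshot is a JSON Lines file: one JSON object per line.
    with open(json_file_path) as f:
        for line in f:
            entry = json.loads(line)
            # A paper can belong to several categories, e.g. "cs.LG stat.ML".
            paper_categories = entry["categories"].lower().split()
            if not any(c in categories for c in paper_categories):
                continue
            # The first element of "versions" holds the original submission date,
            # e.g. "Mon, 2 Apr 2007 19:18:42 GMT"; the fourth token is the year.
            year = int(entry["versions"][0]["created"].split()[3])
            if year < start_year:
                continue
            yield entry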

Step 2: Embeddings

The magic of modern semantic search rests on the notion of embeddings, high-dimensional vectors that encode the semantics of their underlying text.
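Concretely, two pieces of text whose embedding vectors point in roughly the same direction are likely to mean similar things. Cosine similarity, the metric we will rely on later, captures exactly this. A quick sketch using made-up toy vectors (not real embeddings):

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means "same direction",
    # 0.0 means orthogonal (unrelated), -1.0 means opposite.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-dimensional "embeddings"; real ones have 1536 dimensions.
cat = np.array([0.9, 0.1, 0.0])
kitten = np.array([0.8, 0.2, 0.1])
car = np.array([0.0, 0.2, 0.9])

print(cosine_similarity(cat, kitten))  # high, ~0.98
print(cosine_similarity(cat, car))     # low, ~0.02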

In order to create embeddings of the papers in our dataset, we will be using an OpenAI embedding model named text-embedding-ada-002. At USD 0.0004 per 1K tokens, this is OpenAI’s cheapest embedding model. For reference, the total number of tokens in our dataset is approximately 70 million, meaning that embedding the entire thing will cost around USD 30.
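The estimate is simple arithmetic (keeping in mind that the 70-million-token figure is itself an estimate):

total_tokens = 70_000_000      # estimated tokens across all augmented abstracts
usd_per_1k_tokens = 0.0004     # text-embedding-ada-002 pricing
print(total_tokens / 1000 * usd_per_1k_tokens)  # 28.0 USD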

Peeking into the Paper class, we find the following helper method:

@property
def embedding_text(self):
    text = ["Title: " + self.title,
            "By: " + self.authors_string,
            "From: " + str(self.year),
            "Abstract: " + self.abstract]
    return ". ".join(text)

When embedding a paper, we will thus be using an “augmented abstract” consisting of the paper's title, the list of authors, and the year of publication in addition to the abstract itself. The reason we don't embed the raw abstract on its own is to have a chance of returning useful results for queries such as paper by yoshua bengio (although this type of query won't give the most useful results in practice).
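For the 2017 Transformer paper, for example, the augmented abstract would look roughly like this (the exact formatting of authors_string is up to your own implementation):

Title: Attention Is All You Need. By: Ashish Vaswani, Noam Shazeer, Niki Parmar, .... From: 2017. Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks ...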

⚠️ Danger zone: Embedding your data is not free, so be careful running the OpenAI embedding code haphazardly. It’s easy to call the API and retrieve the embeddings, only to lose them immediately once the program terminates.
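A simple safeguard (not part of the searchthearxiv.com code, and the helper name is made up) is to write every batch of embeddings to disk as soon as it comes back, so an interrupted run doesn't mean paying for the same tokens twice:

import json

def save_embeddings(papers, embed_data, path="embeddings.jsonl"):
    # Append one JSON line per paper so partial progress survives a crash.
    with open(path, "a") as f:
        for paper, entry in zip(papers, embed_data):
            f.write(json.dumps({"id": paper.id,
                                "embedding": entry["embedding"]}) + "\n")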

The helper function below takes a list of strings (in this case a list of “augmented abstracts”) and embeds them using the specified OpenAI model:

import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

def get_embeddings(texts, model="text-embedding-ada-002"):
    # The "data" field holds one entry per input text; each entry's
    # "embedding" key contains the 1536-dimensional vector.
    embed_data = openai.Embedding.create(input=texts, model=model)
    return embed_data["data"]

If you inspect the result of calling this function with $n$ augmented abstracts, the return value will be a list of $n$ entries, one per abstract, each holding a 1536-dimensional embedding vector under its "embedding" key. The next step is to store the embeddings in a safe place that allows for efficient search.
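As a quick sanity check (this does call the paid API, though two short strings cost a negligible amount):

embed_data = get_embeddings(["a paper about transformers",
                             "a paper about convolutional networks"])
print(len(embed_data))                  # 2
print(len(embed_data[0]["embedding"]))  # 1536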

Step 3: Vector index

To store the embeddings, we will be using Pinecone. The purpose of Pinecone is to persistently store your embeddings while enabling you to efficiently search across them through a simple API. Once you've signed up and created an index, you connect to it like this:

import pinecone

pinecone.init(api_key=os.environ["PINECONE_API_KEY"])
index_name = os.environ["PINECONE_INDEX_NAME"]
index = pinecone.Index(index_name)

We now create a new function that takes our list of Paper objects, a Pinecone index name, and the name of an OpenAI embedding model, then embeds the papers in batches and uploads them to Pinecone:

from tqdm import tqdm

def embed_and_upsert(papers, index_name, model, batch_size=100):
    with pinecone.Index(index_name, pool_threads=5) as index:
        for i in tqdm(range(0, len(papers), batch_size)):
            batch = papers[i:i+batch_size]
            texts = [paper.embedding_text for paper in batch]
            embed_data = get_embeddings(texts, model)

            # Pinecone expects (id, vector, metadata) tuples.
            pc_data = [(p.id, e["embedding"], p.metadata)
                       for p, e in zip(batch, embed_data)]
            index.upsert(pc_data)

Here, p.metadata is a dictionary {"title": paper.title, "authors": paper.authors, "abstract": paper.abstract, "year": paper.year, "month": paper.month}. When fetching search results from Pinecone, this will allow us to display the paper to the user. As you will notice, each paper also has a unique id, in this case the arXiv id associated with each paper in the original dataset.
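If you're building your own Paper class, the corresponding property could be as simple as the following sketch (mirroring the dictionary above):

@property
def metadata(self):
    # Stored alongside each vector in Pinecone and returned with every match,
    # so search results can be displayed without a second lookup.
    return {"title": self.title,
            "authors": self.authors,
            "abstract": self.abstract,
            "year": self.year,
            "month": self.month}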

If you call this function and keep an eye on your index in the Pinecone console, you should see the number of vectors tick up as the embeddings are uploaded.
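You can also check this programmatically. Assuming the index object from earlier, the index stats report the total vector count:

stats = index.describe_index_stats()
print(stats["total_vector_count"])  # should end up matching the number of papers upserted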

Step 4: Interface

The fourth and final step is to enable search across your recently created vector index. Fortunately, Pinecone makes this incredibly easy. When receiving a query, we simply embed it using the same embedding model that we used for the dataset. We then send the query embedding to Pinecone, which identifies the $k$ entries in the index with the highest cosine similarity to the query embedding (i.e. the ones that are most semantically similar).

To see this in action, run the following script:

import os
import openai
import pinecone

openai.api_key = os.environ["OPENAI_API_KEY"]

def get_embeddings(texts, model="text-embedding-ada-002"):
    embed_data = openai.Embedding.create(input=texts, model=model)
    return embed_data["data"]

pinecone.init(api_key=os.environ["PINECONE_API_KEY"])
index_name = os.environ["PINECONE_INDEX_NAME"]
index = pinecone.Index(index_name)

query = input("Enter your query: ")
embed = get_embeddings(query)[0]["embedding"]
response = index.query(vector=embed, top_k=5, include_metadata=True)
matches = response["matches"]

for i, match in enumerate(matches):
    metadata = match["metadata"]
    print(f"{i+1}: {metadata['title']}")

This will produce an output like the following:

Enter your query: model using only attention mechanism
1: Attention Is All You Need
2: On the Dynamics of Training Attention Models
3: Focus On What's Important: Self-Attention Model for Human Pose Estimation
4: Attention-Based Models for Speech Recognition
5: Self-Attentional Models Application in Task-Oriented Dialogue Generation Systems
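Each match also comes with a similarity score and the metadata we attached in Step 3, so it's easy to show a bit more context in the results, for example:

for i, match in enumerate(matches):
    metadata = match["metadata"]
    print(f"{i+1}: {metadata['title']} (score: {match['score']:.3f})")
    print(f"   {metadata['abstract'][:120]}...")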

Conclusion

Thanks to the increased availability of language models, building a custom semantic search engine, once a near-impossible task for a single person, is now remarkably simple. While launching a site like searchthearxiv.com involves additional challenges such as hosting, running a web server, keeping the database up-to-date, and so on, you should now be equipped to create your own semantic search engine using your own data.

If you would like to dive further into the details of how searchthearxiv.com works, check out the associated GitHub repo and give it a star if you find it useful ⭐️