It’s the holiday season, and you’re on the hunt for the perfect peppermint-infused sugar cookie recipe to wow your guests. But sifting through thousands of online recipe sites can be tedious. What if you could simply type your wish into a search bar and instantly see the best holiday cookie recipes? Even better – ranked and filtered precisely by your preferences and the ingredients you already have.
We’re going to walk through how to do this using modern semantic retrieval techniques. If we were using traditional information retrieval, we would create simple inverted indexes that map words to recipes and then find recipes based on keyword matching – but we live in the Renaissance Age of natural language processing and no longer need to rely exclusively on discrete string matching.
We can now retrieve results based on their abstract meaning and semantics, and even filter based on fuzzy criteria such as “I don’t want it to take too long to bake”. This kind of semantic retrieval is used in many LLM systems these days, but embeddings are still a lesser-known concept to many LLM practitioners, and rerankers even less so.
What Are Embeddings?
An embedding is a mechanism for squeezing the key details out of an arbitrary piece of data and representing those details in a small, fixed-size list of numbers (a.k.a. a vector).
The special thing about this list of numbers is that the more similar two pieces of data are, the more similar their lists of numbers will be.
They are also cheap to compute, so we can compute an embedding every time we see a new document and store that association in a database. Later on, when we want to find documents that are similar, we can search through the embeddings we’ve already computed and return the closest ones.
Importantly, using vector databases, we can search through large quantities of embeddings very efficiently (typically with approximate nearest-neighbor indexes). We can have billions of embeddings without needing to check billions of entries every time.
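To make this concrete, here’s a minimal sketch using the same open-source sentence-transformers model we’ll use later in this post; the example texts are made up for illustration:
from sentence_transformers import SentenceTransformer, util

# Load a small, CPU-friendly embedding model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Embed two similar recipes and one unrelated document
texts = [
    "Peppermint sugar cookies dusted with crushed candy canes",
    "Candy cane cookies with a minty sugar glaze",
    "Slow-cooker beef chili with beans",
]
embeddings = model.encode(texts, convert_to_numpy=True)

# Similar texts produce similar vectors, so their cosine similarity is higher
print(util.cos_sim(embeddings[0], embeddings[1]))  # cookie vs. cookie: relatively high
print(util.cos_sim(embeddings[0], embeddings[2]))  # cookie vs. chili: noticeably lower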
What Are Rerankers?
Rerankers tell you how well a piece of data relates to a given query. You give a reranker two values:
Query: The thing that we’re searching for (e.g. “Recipes with peppermint”)
Document: Any document (a.k.a. data) from our corpus of documents.
The reranker will give us a score that tells us how useful this document is for this query. Rerankers tend to be slower and more costly, and their scores can’t be precomputed (since we don’t know the query beforehand), so we need to use them more sparingly than embeddings.
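To make the interface concrete, here’s a minimal sketch using the CrossEncoder wrapper from sentence-transformers (the same family of model we’ll use in the full example below); the query and documents are made up for illustration:
from sentence_transformers import CrossEncoder

# Load a small cross-encoder reranking model
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')

# Score each (query, document) pair; a higher score means more relevant
scores = reranker.predict([
    ("Recipes with peppermint", "Peppermint bark cookies topped with crushed candy canes."),
    ("Recipes with peppermint", "Classic beef lasagna with ricotta and fresh basil."),
])
print(scores)  # expect the peppermint recipe to score noticeably higher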
Why Do We Need Both?
You might be wondering, “If we can use embeddings to figure out how closely related two things are, why do we need rerankers at all?”
The short answer is that embeddings are like a lossy compression algorithm. They’ll capture general characteristics about a given input, but not specific details. So if you compute the embedding for Food.com’s Bestest Butter Cookie, the embedding will capture high-level details like “cookies”, “butter”, and “eggs”, but it won’t capture small details like how much of each ingredient to use, or how long to bake.
Rerankers, on the other hand, can look at each recipe in full detail, along with a query like “I want to bake vegan cookies that have peppermint and take no longer than 20 minutes total prep + bake time.” and then tell you how well this recipe applies to that query.
Two-Stage Retrieval
Since rerankers can be far more precise, in an ideal world we could just run the reranker against every recipe we know about every single time a query comes in. We don’t do this in practice though because it is too slow and too expensive.
So instead retrieval is typically done in two stages:
1. Candidate Selection
We use embeddings to whittle our millions of known recipes down to a hundred or so that seem roughly related to the query at hand.
For example, we’d compute the embedding for the user’s query “I want to bake vegan cookies that have peppermint and take no longer than 20 minutes total prep + bake time.” and then grab the top 100 recipes with the closest embeddings to that query’s embedding.
These will be our candidates that we send to the reranker.
2. Candidate Ranking
Now that we’ve reduced our recipes for consideration from millions to a hundred, we can send those to the reranker and do a higher-fidelity ranking of them. We can then choose the top 10 to show to the user (or give them to the LLM, if we’re using this for RAG).
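Put together, the whole flow looks roughly like this sketch, where embed, vector_search, and rerank_score are hypothetical placeholders for the pieces we build in the walkthrough below:
def retrieve(query, num_candidates=100, num_results=10):
    # NOTE: embed, vector_search, and rerank_score are stand-ins for the
    # components built step by step in the rest of this post.
    # Stage 1: candidate selection using cheap, precomputed embeddings
    query_embedding = embed(query)
    candidates = vector_search(query_embedding, k=num_candidates)

    # Stage 2: candidate ranking using the slower but more precise reranker
    scored = [(doc, rerank_score(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in scored[:num_results]]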
Semantic Recipe Retrieval
Let’s run through a fully functioning end-to-end example to show these two techniques working together. This won’t be a production-worthy implementation, but it will run entirely locally on your CPU (no GPU needed!).
If you want to jump ahead and just see the code, here’s a complete working example (and in a slightly more usable format than the inline code below).
1. Gather Recipes
For this example we’re going to use Food.com’s Recipe data from Kaggle. It’s a large dataset, so we’re going to filter it down to recipes whose names mention “cookie” along with either “holiday” or “Christmas”.
import pandas as pd
# Example using a CSV from Kaggle
df = pd.read_csv("recipes.csv")
# Filter for rows containing 'cookie' in the name
cookie_df = df[df['Name'].str.lower().str.contains('cookie')]
# Filter cookies for rows containing 'holiday' or 'christmas' in the name
holiday_cookie_df = cookie_df[cookie_df['Name'].str.lower().str.contains('holiday|christmas')]
2. Index Recipes
Now we’re going to compute embeddings for each of the remaining recipes. There are many hosted embedding providers we could use, like OpenAI’s, but we’re going to use an open source model that runs locally in this example.
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
# Load our embedding model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# Embed all holiday cookie recipes (normalized so inner product = cosine similarity)
recipe_embeddings = model.encode(
    holiday_cookie_df['RecipeInstructions'].tolist(),
    convert_to_numpy=True,
    normalize_embeddings=True,
)
# Build a Faiss index (this is a vector database)
d = recipe_embeddings.shape[1]  # Dimensionality of embeddings
index = faiss.IndexFlatIP(d)    # Inner product on normalized vectors = cosine similarity
index.add(recipe_embeddings)
3. Search Recipes
Now we’ve got an indexed set of embeddings that we can quickly search across:
query = "Festive sugar cookies with a peppermint twist"
query_emb = model.encode([query], convert_to_numpy=True, normalize_embeddings=True)
top_k = 50
distances, indices = index.search(query_emb, top_k)
# Print the top matching recipes
print(f"\nTop {top_k} matches for query: '{query}'\n")
for i, (idx, score) in enumerate(zip(indices[0], distances[0])):
    print(f"#{i+1} (Score: {score:.3f})")
    print(f"Recipe: {holiday_cookie_df.iloc[idx]['Name']}")
    print(f"Description: {holiday_cookie_df.iloc[idx]['Description']}")
    print(f"Instructions: {holiday_cookie_df.iloc[idx]['RecipeInstructions'][:200]}...")  # First 200 chars of instructions
    print("-" * 80 + "\n")
4. Rerank Recipes
Now we’ve got a bunch of candidate recipes, and we can rerank them so the ones that best match our specific criteria rise to the top:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model_name = "cross-encoder/ms-marco-MiniLM-L-12-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
reranker = AutoModelForSequenceClassification.from_pretrained(model_name)
def get_relevance_score(query, recipe_text):
    # Cross-encoders score a (query, document) pair, passed to the tokenizer as a sentence pair
    inputs = tokenizer(query, recipe_text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        outputs = reranker(**inputs)
    # This model produces a single relevance logit: the higher it is, the better the match
    return outputs.logits[0].item()
# Rerank top-k candidates
recipe_scores = []
for i in indices[0]:
    recipe = holiday_cookie_df.iloc[i]
    recipe_text = recipe['Name'] + " " + recipe['Description'] + " " + recipe['RecipeInstructions']
    score = get_relevance_score(query, recipe_text)
    recipe_scores.append((i, score))
# Sort by descending score
ranked_recipes = sorted(recipe_scores, key=lambda x: x[1], reverse=True)
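Finally, a quick way to see what the reranker promoted is to print the top few reranked recipes, mirroring the printout from the embedding search above:
# Print the top 10 recipes after reranking
for rank, (i, score) in enumerate(ranked_recipes[:10], start=1):
    recipe = holiday_cookie_df.iloc[i]
    print(f"#{rank} (Score: {score:.3f}) {recipe['Name']}")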
5. Bake Some Cookies
Now when we search for “Simple Christmas cookie shaped like a wreath” we’ll get:
Before Reranking:
“Stained-Glass Christmas Tree Cookies”
“Healthy Christmas Cookies, Vegan and Gluten Free”
“Christmas Tree Cookies”
“Shelb's Christmas Gingerbread Cookies”
“Holiday Swirl Cookies”
After Reranking:
“Christmas Wreath Cookies”
“Candy Cane Filled Christmas Tree Cookies”
“Christmas Tree Cookies”
“Magical Sparkling Snowflakes: Christmas Butter Biscuits-Cookies”
“Holiday Spritz Cookies”
Food.com’s dataset even has a cookie recipe with “wreath” in the name, but that detail alone wasn’t enough for it to rank highly on embedding similarity. Sure enough, the reranker caught it and moved it to the top.
Interestingly, it seems “Christmas tree” is treated semantically very similarly to “Christmas wreath”, which is consistent with our intuition.
You Don’t Always Need a Reranker
There are many scenarios where embedding similarity alone will be more than sufficient for returning relevant results, so it’s usually worth trying that first. However, if you need a little extra oomph because your domain or data isn’t fully captured by the embedding, adding a reranker is a great next step for getting better results.
Try It Yourself
If you’d like to try this yourself, here’s a fully working code example. Just be sure to download the recipe data and have the `recipes.csv` file in the same directory as your Python code.
If you’re considering applying these techniques in production environments, I’d encourage you to look at hosted embeddings and rerankers like OpenAI’s and Cohere’s.
Learn more about LOGIC, Inc at https://logic.inc