If you’ve ever tried to integrate an LLM into your software stack, you’ve run into a challenge: how do you make sure that the LLM produces something you can parse and use with software rather than text meant for a human being?
Let’s start with an example. Say we want to produce a list for Santa of who’s been naughty or nice. Using the OpenAI API, we can send a request to the GPT-4o model with the following prompt:
Please output a JSON array of naughty-or-nice classifications. Each entry should have 'name', 'status', and 'reason' fields.
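For concreteness, here's roughly what sending that request might look like with OpenAI's Node SDK (the surrounding boilerplate and exact model name are assumptions, not part of the original example):

```typescript
import OpenAI from "openai";

// Assumes OPENAI_API_KEY is set in the environment.
const client = new OpenAI();

async function main() {
  const completion = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "user",
        content:
          "Please output a JSON array of naughty-or-nice classifications. " +
          "Each entry should have 'name', 'status', and 'reason' fields.",
      },
    ],
  });

  // The reply is free-form text: there's no guarantee it's parseable JSON.
  console.log(completion.choices[0].message.content);
}

main();
```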
We get the following output:
Certainly! Here's an example of a JSON array containing classifications of "naughty" or "nice" for different names, along with reasons for their classifications:
```json
[
  {
    "name": "Alice",
    "status": "nice",
    "reason": "Helped her younger brother with homework and volunteered at the local animal shelter."
  },
  ...
]
```
So close! We’re getting reasonable JSON; it’s just that there’s some pesky commentary from the LLM injected before the output we really want. This makes it inconvenient to use with other programmatic systems – we can’t just take this output and run a JSON parser on it without some finagling.
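To make that concrete, here's a tiny sketch: handing the raw reply straight to a JSON parser fails because of the leading commentary.

```typescript
const reply = `Certainly! Here's an example of a JSON array containing classifications...
[
  { "name": "Alice", "status": "nice", "reason": "Helped her younger brother." }
]`;

// Throws a SyntaxError, because the reply starts with prose rather than JSON.
JSON.parse(reply);
```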
These models are trained on a large corpus of data from a variety of domains, including articles, books, source code, and conversations – but that corpus generally skews towards human-to-human communication. After the models are trained, they are further refined to produce friendly, human-oriented responses. When we ask them to produce structured data like JSON, we're essentially asking them to ignore portions of their training.
Let’s try the obvious: ask the LLM to only generate the JSON we want:
Please output a JSON array of naughty-or-nice classifications. Each entry should have 'name', 'status', and 'reason' fields.
Please output only the JSON, no additional text.
This helps. We get:
```json
[
  {
    "name": "Alice",
    "status": "nice",
    "reason": "Helped her friend with homework."
  },
  ...
]
```
From a purity standpoint perhaps we’d like to get rid of the extraneous opening ```json and closing ```, but it’s pretty workable. We can just delete that unnecessary text.
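That cleanup might look something like the following sketch (parseFencedJson is a hypothetical helper, not part of any SDK):

```typescript
// A hypothetical helper that strips an optional ```json ... ``` wrapper before parsing.
function parseFencedJson(reply: string): unknown {
  const cleaned = reply
    .trim()
    .replace(/^```(?:json)?\s*/, "") // drop an opening ``` or ```json
    .replace(/\s*```$/, "");         // drop a trailing ```
  return JSON.parse(cleaned);
}

const reply = '```json\n[{ "name": "Alice", "status": "nice", "reason": "Helped her friend with homework." }]\n```';
console.log(parseFencedJson(reply));
```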
But if we run this a bunch of times, the extraneous text may or may not be present, and our JSON validation code will occasionally hit an error like the following:
Invalid enum value. Expected 'nice' | 'naughty', received 'Nice'
That is, sometimes the LLM decides to capitalize the enum values when we’re expecting lower-case. We could keep iterating on our prompt to be more specific about our enum values, or we could write our JSON parsing to be case-insensitive (there’s a sketch of that kind of validation after the list below), but if you’ve ever gone down this path you’ll know that there’s a long list of things that can go wrong, including:
Unexpected fields
Invalid JSON (e.g. missing quotes or braces)
Incorrectly nested fields
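For reference, the kind of validation that surfaces an error like the one above might look like this sketch using Zod (the exact schema here is an assumption):

```typescript
import { z } from "zod";

// A hypothetical validation schema for the entries we expect back.
const EntrySchema = z.object({
  name: z.string(),
  status: z.enum(["nice", "naughty"]),
  reason: z.string(),
});
const ListSchema = z.array(EntrySchema);

// Throws: Invalid enum value. Expected 'nice' | 'naughty', received 'Nice'
ListSchema.parse([
  { name: "Alice", status: "Nice", reason: "Helped her friend with homework." },
]);
```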
Ideally we could just give the LLM a schema and trust that our output would adhere to that 100% of the time. And we can!
Enter Structured Output
OpenAI and Google Gemini both support explicit structured output directly in their APIs. Let’s update our example with OpenAI’s implementation and see how it works.
OpenAI supports a subset of JSON Schema to define the structure of its output. It’s a pretty reasonable subset that supports most of what you might need. It’s limited to 5 levels of nesting and doesn’t support non-required fields, so you might need to tweak your schema a bit.
Using it can be a little verbose depending on your programming language, but basically you add the schema to your request. You can view the full code to do so here. Now when we run our example, we’re guaranteed that the output will match the schema that we requested.
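As a rough sketch of what that looks like with OpenAI's Node SDK (the schema shape and names here are assumptions; see the linked code for the full version):

```typescript
import OpenAI from "openai";

const client = new OpenAI();

// The root of a strict schema must be an object, so the array lives under an
// "entries" field. Every field is listed as required and additionalProperties
// is false, per the strict-mode rules.
const naughtyOrNiceSchema = {
  type: "object",
  properties: {
    entries: {
      type: "array",
      items: {
        type: "object",
        properties: {
          name: { type: "string" },
          status: { type: "string", enum: ["nice", "naughty"] },
          reason: { type: "string" },
        },
        required: ["name", "status", "reason"],
        additionalProperties: false,
      },
    },
  },
  required: ["entries"],
  additionalProperties: false,
};

async function main() {
  const completion = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "user",
        content:
          "Please output a JSON array of naughty-or-nice classifications. " +
          "Each entry should have 'name', 'status', and 'reason' fields.",
      },
    ],
    response_format: {
      type: "json_schema",
      json_schema: {
        name: "naughty_or_nice_list",
        strict: true,
        schema: naughtyOrNiceSchema,
      },
    },
  });

  // The content is guaranteed to be valid JSON that matches the schema.
  const result = JSON.parse(completion.choices[0].message.content ?? "{}");
  console.log(result.entries);
}

main();
```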
In practice, this works really well. The schema is adhered to 100% of the time, and you can stop adding increasingly desperate text to your prompts like “*** VERY IMPORTANT *** Please don’t add any JSON fields that shouldn’t be present”.
How It Works
So how does OpenAI actually implement their structured output? You might expect that it’s just prompt engineering, but it’s actually closer to the metal than that. To understand it fully, we need to get into some details about how LLMs generate their output. They produce it one token at a time, where a token is more-or-less a word, or a JSON character like {, }, or ".
The LLM computes the probability of the next token based on its internal state. For instance if it had produced the following text so far:
The quick brown fox jumps over the lazy
It would have an extremely high probability for the token "dog" being next, though a few other words might make some sense here as well.
It then effectively picks a random number and uses that to select the next token.
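Conceptually, the sampling step looks something like this sketch (an illustration with made-up numbers, not OpenAI's actual implementation):

```typescript
// Illustrative only: pick the next token by sampling from the model's
// probability distribution over the vocabulary.
function sampleNextToken(probabilities: Map<string, number>): string {
  let r = Math.random(); // the "random number"
  for (const [token, p] of probabilities) {
    r -= p;
    if (r <= 0) return token;
  }
  // Fallback for floating-point rounding at the tail of the distribution.
  return [...probabilities.keys()].pop()!;
}

// "The quick brown fox jumps over the lazy" ...
const next = sampleNextToken(new Map([
  ["dog", 0.95],       // overwhelmingly likely
  ["cat", 0.02],       // plausible but unlikely
  ["elephant", 0.01],  // barely plausible
  // ... every other token in the vocabulary, with tiny probabilities
]));
```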
When using structured output, the probabilities get updated dynamically based on what the schema allows. For example, if the LLM had produced the following so far:
```json
[
  {
    "name": "Alice",
    "status": "
```
The probabilities for the next token would ordinarily be spread across many candidates, but the schema forces the non-schema-conforming tokens to zero, allowing only tokens that can begin a valid enum value ("nice" or "naughty").
This is repeated for the remainder of the response, guaranteeing that the output conforms to the schema.
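Here's a conceptual sketch of that masking step (again, an illustration rather than the real implementation):

```typescript
// Illustrative only: zero out tokens the schema disallows, then renormalize.
function maskToSchema(
  probabilities: Map<string, number>,
  isAllowedBySchema: (token: string) => boolean,
): Map<string, number> {
  const masked = new Map<string, number>();
  let total = 0;
  for (const [token, p] of probabilities) {
    if (isAllowedBySchema(token)) {
      masked.set(token, p);
      total += p;
    }
  }
  if (total === 0) throw new Error("No tokens satisfy the schema");
  // Renormalize so the remaining probabilities sum to 1.
  for (const [token, p] of masked) masked.set(token, p / total);
  return masked;
}

// After `"status": "`, only tokens that can start "nice" or "naughty" survive.
const allowed = maskToSchema(
  new Map([["nice", 0.6], ["Nice", 0.2], ["naughty", 0.15], ["kind", 0.05]]),
  (token) => token === "nice" || token === "naughty",
);
```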
Drawbacks
Structured output, at least in OpenAI’s current implementation, is not without limitations. Its restrictions include:
The schema can’t have more than five levels of nesting
All fields have to be required
You can’t enable additional properties
It’s not hard to want some extra levels of nesting, and requiring all fields can be a bit troublesome, especially if you have pre-existing schemas you’d like to use. The workaround of making optional fields nullable works, but then requires some translation to interoperate with other systems.
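For example, with that workaround an optional reason field would be expressed as a required-but-nullable field, along these lines (a sketch based on our earlier schema):

```json
{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "status": { "type": "string", "enum": ["nice", "naughty"] },
    "reason": { "type": ["string", "null"] }
  },
  "required": ["name", "status", "reason"],
  "additionalProperties": false
}
```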
You can also find yourself in a situation where the LLM gets stuck in an effectively infinite loop. If you prompt it to output something that is impossible to generate while complying with the schema, it may produce an endless stream of tokens as it tries to follow your prompt but keeps getting overridden by the structured-output token masking.
Lastly, there can be a slight performance penalty on the first call, as the schema is compiled by OpenAI into an artifact that can be used for efficient token masking. Future calls will pull this artifact from a cache, assuming it hasn’t been evicted.
Takeaway
If you’re trying to incorporate LLMs into a service that interoperates with machines, structured output is an incredibly powerful tool. It eliminates some overly-complex prompt engineering and simplifies error handling, allowing you to focus on building your application rather than managing edge cases. Structured output is an essential tool for reliably bridging the non-deterministic nature of LLMs and the deterministic nature of traditional programmatic systems.
Learn more about LOGIC, Inc. at https://logic.inc