Structured Outputs and How to Use Them

Building Robustness and Determinism in LLM Applications


OpenAI recently announced support for Structured Outputs in its latest gpt-4o-2024-08-06 model. Structured outputs in relation to large language models (LLMs) are nothing new — developers have either used various prompt engineering techniques or relied on third-party tools to obtain them.

In this article, we’ll explain what structured outputs are, how they work, and how you can apply them in your own LLM-based applications. While OpenAI’s announcement makes them quite easy to implement using their APIs (as we’ll demonstrate here), you may want to opt for the open-source Outlines package (maintained by the lovely folks at dottxt) instead, as it can be applied both to self-hosted open-weight models (e.g. Mistral and LLaMA) and to proprietary APIs. (Disclaimer: due to this issue, Outlines does not yet support structured JSON generation via the OpenAI APIs at the time of writing, but that will change soon!)
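To give a flavour of that alternative route, the snippet below is a minimal sketch of constrained JSON generation with Outlines and a self-hosted model, based on the 0.x API at the time of writing; the model name and the small Pydantic schema are purely illustrative, so check the Outlines documentation for the current interface.

# Minimal sketch: constrained JSON generation with Outlines (0.x API).
# The model name and the Entity schema below are illustrative assumptions.
from pydantic import BaseModel
import outlines


class Entity(BaseModel):
    name: str
    type: str


# Load a self-hosted open-weight model through the transformers backend
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")

# Build a generator whose output is guaranteed to match the Entity schema
generator = outlines.generate.json(model, Entity)

entity = generator("Extract one named entity: OpenAI was founded in 2015.")
print(entity)  # e.g. Entity(name='OpenAI', type='Organization')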

What are structured outputs?

If the RedPajama dataset is any indication, the vast majority of pre-training data is human text. Therefore, “natural language” is the native domain of LLMs — both in input and output. However, when we build applications, we want to use machine-readable formal structures or schemas to encapsulate our data input/output. In this way, we build robustness and determinism into our applications.

Structured outputs are a mechanism by which we enforce a predefined schema on the LLM output. Typically this means enforcing a JSON schema, but it is not limited to just JSON — we can essentially enforce XML, Markdown, or a completely custom schema. The benefits of structured outputs are twofold:

  1. Simpler prompt design — we need not be too verbose in specifying what the output should look like
  2. Deterministic names and types — we can guarantee, for example, that we get an age attribute with JSON type Number in the LLM response (see the schema sketch right after this list)
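As a quick illustration of the second point, here is a minimal, purely illustrative schema fragment that pins the age attribute down to a JSON Number; the field names are made up for the example:

# Illustrative schema: "age" is guaranteed to come back as a JSON number,
# never as free-form text such as "thirty-two".
person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
    },
    "required": ["name", "age"],
    "additionalProperties": False,
}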

Implementing a JSON schema

For this example, we’ll use the first sentence from Sam Altman’s Wikipedia article…

Samuel Harris Altman (born April 22, 1985) is an American entrepreneur and investor best known as the CEO of OpenAI since 2019 (he was briefly fired and reinstated in November 2023).

…and we’re going to use the latest GPT-4o checkpoint as a named-entity recognition (NER) system. We’ll enforce the following JSON schema:

json_schema = {
    "name": "NamedEntities",
    "schema": {
        "type": "object",
        "properties": {
            "entities": {
                "type": "array",
                "description": "List of entity names and their corresponding types",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {
                            "type": "string",
                            "description": "The actual name as specified in the text, e.g. a person's name, or the name of the country"
                        },
                        "type": {
                            "type": "string",
                            "description": "The entity type, such as 'Person' or 'Organization'",
                            "enum": ["Person", "Organization", "Location", "DateTime"]
                        }
                    },
                    "required": ["name", "type"],
                    "additionalProperties": False
                }
            }
        },
        "required": ["entities"],
        "additionalProperties": False
    },
    "strict": True
}

Essentially, our LLM response should contain a NamedEntities object, which consists of an array of entities, each with a name and a type. There are a few things to note here. For example, we can enforce the enum type, which is very useful in NER since we can restrict the output to a fixed set of entity types. We have to specify all the fields in the required array; however, we can also emulate “optional” fields by setting the type to, e.g., ["string", "null"] (see the short sketch below).
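For instance, a hypothetical nickname field could be made effectively optional like this (the field name is made up for illustration):

# Emulating an "optional" field: "nickname" is still listed in "required",
# but its value is allowed to be either a string or null.
optional_field_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "nickname": {"type": ["string", "null"]},
    },
    "required": ["name", "nickname"],
    "additionalProperties": False,
}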

We can now pass our schema, along with the data and instructions, to the API. We need to fill the response_format argument with a dictionary in which we set the type to "json_schema" and then specify the corresponding schema.

from openai import OpenAI

client = OpenAI()

# `s` holds the input text, i.e. the Wikipedia sentence quoted above
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {
            "role": "system",
            "content": """You are a Named Entity Recognition (NER) assistant.
Your job is to identify and return all entity names and their
types for a given piece of text. You are to strictly conform
only to the following entity types: Person, Location, Organization
and DateTime. If uncertain about entity type, please ignore it.
Be careful of certain acronyms, such as role titles "CEO", "CTO",
"VP", etc - these are to be ignored.""",
        },
        {
            "role": "user",
            "content": s,
        },
    ],
    response_format={
        "type": "json_schema",
        "json_schema": json_schema,
    },
)

The output should look something like this:

{'entities': [{'name': 'Samuel Harris Altman', 'type': 'Person'},
              {'name': 'April 22, 1985', 'type': 'DateTime'},
              {'name': 'American', 'type': 'Location'},
              {'name': 'OpenAI', 'type': 'Organization'},
              {'name': '2019', 'type': 'DateTime'},
              {'name': 'November 2023', 'type': 'DateTime'}]}
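For completeness, one way to reproduce the pretty-printed dict above is to parse the JSON string returned in the message content (a minimal sketch; the variable names are our own):

import json
from pprint import pprint

# The structured response arrives as a JSON string in the message content;
# parse it into a Python dict and pretty-print it.
result = json.loads(completion.choices[0].message.content)
pprint(result)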

The full source code used in this article is available here.

How it works

The magic is in the combination of constrained sampling and context-free grammars (CFGs). We mentioned earlier that the overwhelming majority of pre-training data is “natural language”. Statistically, this means that at each decoding/sampling step, there is a non-negligible probability of sampling a random token from the learned vocabulary (and in modern LLMs, vocabularies typically span 40,000+ tokens). However, when working with formal schemas, we really want to eliminate all unlikely tokens quickly.

In the previous example, if we have already generated…

{'entities': [{'name': 'Samuel Harris Altman',

…then ideally we would want to place a very high logit bias on the ‘typ’ token in the next decoding step, and a very low probability on all other tokens in the vocabulary.

This is essentially what happens. When we provide the schema, it is transformed into a formal grammar (CFG), which serves to guide the logit bias values during the decoding step. CFGs are one of those old-school computer science and natural language processing (NLP) mechanisms that are making a comeback. A really nice introduction to CFGs was actually presented in this StackOverflow answer, but essentially it is a way to describe transformation rules for a collection of symbols.
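To make the idea concrete, here is a toy sketch of constrained decoding: at each step the grammar tells us which tokens are valid continuations, and we push the logits of every other token to negative infinity before sampling. The vocabulary and the allowed set below are made up for illustration; real implementations compile the JSON schema into a much richer state machine.

import numpy as np

# Toy vocabulary and the set of tokens our (toy) grammar allows next,
# e.g. right after generating "{'name': 'Samuel Harris Altman', '"
vocab = ["{", "}", "name", "type", ":", ",", "Person", "42"]
allowed_next = {"type"}

def constrained_sample(logits: np.ndarray) -> str:
    """Mask disallowed tokens and sample from what remains."""
    mask = np.array([0.0 if tok in allowed_next else -np.inf for tok in vocab])
    masked_logits = logits + mask                  # -inf -> probability 0
    probs = np.exp(masked_logits - masked_logits.max())
    probs /= probs.sum()
    return np.random.choice(vocab, p=probs)

# Whatever the raw model logits are, only an allowed token can be emitted.
raw_logits = np.random.randn(len(vocab))
print(constrained_sample(raw_logits))  # always prints "type"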

Conclusion

Structured outputs are nothing new, but they are definitely becoming top-of-mind with proprietary APIs and LLM services. They provide a bridge between the erratic and unpredictable “natural language” domain of LLMs and the deterministic, structured domain of software engineering. Structured outputs are essentially a must for anyone designing complex LLM applications where LLM output needs to be shared or “presented” across various components. While API-native support has finally arrived, builders should also consider using libraries such as Outlines, as they provide an LLM/API-agnostic way of handling structured output.


Structured Outputs and How to Use Them was originally published in Towards Data Science on Medium.