3 main components
There are really three good starting points for building reliable systems:
- Investing in simple evals early
- Leveraging synthetic data
- Tracking user data
Let’s see this in action
Building tooling for Reliable LLM Applications
For language model applications today, reliability and consistency are paramount.
As we shift from toy demos to enterprise-level applications, building systems that are accurate, consistent, and able to handle larger workloads is a new challenge.
Instructor is like a good Japanese knife. It does one thing really well, and that's structured outputs.
We don't hide much from the user or do anything fancy. We just give you a validated Pydantic model back from your LLM calls.
Let’s see this in action
import instructor
from openai import OpenAI
from pydantic import BaseModel


class ExtractUser(BaseModel):
    name: str
    age: int


client = instructor.from_openai(OpenAI())

resp = client.chat.completions.create(
    model="gpt-4",
    response_model=ExtractUser,
    messages=[{"role": "user", "content": "Extract Jason is 25 years old."}],
)

assert isinstance(resp, ExtractUser)
assert resp.name == "Jason"
assert resp.age == 25
import instructor
from anthropic import Anthropic
from pydantic import BaseModel


class ExtractUser(BaseModel):
    name: str
    age: int


client = instructor.from_anthropic(Anthropic())

resp = client.chat.completions.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    response_model=ExtractUser,
    messages=[{"role": "user", "content": "Extract Jason is 25 years old."}],
)

assert isinstance(resp, ExtractUser)
assert resp.name == "Jason"
assert resp.age == 25
import instructor
import google.generativeai as genai
from pydantic import BaseModel


class ExtractUser(BaseModel):
    name: str
    age: int


client = instructor.from_gemini(
    client=genai.GenerativeModel(model_name="models/gemini-1.5-flash-latest")
)

resp = client.chat.completions.create(
    response_model=ExtractUser,
    messages=[{"role": "user", "content": "Extract Jason is 25 years old."}],
)

assert isinstance(resp, ExtractUser)
assert resp.name == "Jason"
assert resp.age == 25
Notice how easy it is to switch providers, as the pattern remains unchanged.
Sure, there are some small provider-specific differences (e.g. Gemini takes the model name when the client is created), but otherwise the core logic remains the same.
You get a validated Pydantic model at all times. But what happens when our needs get more advanced?
import instructor
import openai
from pydantic import BaseModel


class User(BaseModel):
    name: str
    age: int


client = instructor.from_openai(openai.OpenAI())

user_stream = client.chat.completions.create_partial(
    model="gpt-4-turbo-preview",
    messages=[{"role": "user", "content": "Create a user"}],
    response_model=User,
)

for user in user_stream:
    print(user)  # Watch the response build incrementally
Watch as the response builds incrementally in real time.
We also support Jinja templating, so you can use the same values for prompt formatting and for your validation logic.
import openai
import instructor
from pydantic import BaseModel, field_validator, ValidationInfo


class QueryFilters(BaseModel):
    materials: list[str]

    @field_validator("materials")
    def check_materials(cls, value: list[str], info: ValidationInfo) -> list[str]:
        # The same context used to render the prompt is available to validators
        allowed_materials = info.context.get("materials", []) if info.context else []
        for material in value:
            if material not in allowed_materials:
                raise ValueError(
                    f"{material} is not a valid material. Valid materials are {allowed_materials}."
                )
        return value


async def extract_query_filters(
    query: str, client: openai.AsyncOpenAI, materials: list[str]
) -> QueryFilters:
    return await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """
                You are a helpful assistant that classifies clothing.

                References:
                - Materials: {{ materials }}

                Output the clothing material that you've identified.
                """,
            },
            {"role": "user", "content": query},
        ],
        context={
            "materials": materials,
        },
        response_model=QueryFilters,
    )
By using structured outputs and Instructor, we can abstract away a whole class of errors.
But it’s more than just JSON -> Python class
While most people think of structured outputs as a way to parse JSON function calls, I think using Pydantic has a few key benefits that people don’t realise
Let’s see an example
From our own experiments, changing field names can impact accuracy by up to 60%.
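The likely reason is that field names and descriptions are sent to the model as part of the JSON schema, so renaming a field changes the prompt the model sees. Here is a small sketch for inspecting that; the two models below are illustrative examples, not the ones from our experiments.

from pydantic import BaseModel, Field


class Query(BaseModel):
    q: str  # a terse, ambiguous field name


class SearchQuery(BaseModel):
    # A descriptive name and description both end up in the schema the model sees
    semantic_search_query: str = Field(
        description="A self-contained query rephrased for semantic search"
    )


print(Query.model_json_schema())
print(SearchQuery.model_json_schema())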
We can also define class methods that can interface directly with our systems.
from typing import Optional

from asyncpg import Connection
from jinja2 import Template
from openai import OpenAI
from pydantic import BaseModel, Field


class SearchIssues(BaseModel):
    """
    Use this when the user wants to get original issue information from the database
    """

    query: Optional[str]
    repo: str = Field(
        description="the repo to search for issues in, should be in the format of 'owner/repo'"
    )

    async def execute(self, conn: Connection, limit: int):
        if self.query:
            embedding = (
                OpenAI()
                .embeddings.create(input=self.query, model="text-embedding-3-small")
                .data[0]
                .embedding
            )
            args = [self.repo, limit, embedding]
        else:
            args = [self.repo, limit]
            embedding = None

        sql_query = Template(
            """
            SELECT *
            FROM {{ table_name }}
            WHERE repo_name = $1
            {%- if embedding is not none %}
            ORDER BY embedding <=> $3
            {%- endif %}
            LIMIT $2
            """
        ).render(table_name="github_issues", embedding=embedding)

        return await conn.fetch(sql_query, *args)
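To tie this together, the tool-selection step itself can be a single structured-output call that returns one or more of these models. This is only a rough sketch: SearchSummaries here is a simplified stand-in for a second tool, and this one_step_agent mirrors the one referenced in the evals later on.

from typing import Iterable, Optional, Union

import instructor
from openai import OpenAI
from pydantic import BaseModel


class SearchSummaries(BaseModel):
    """
    Use this when the user wants summarised information about issues in a repository
    """

    query: Optional[str]
    repo: str


client = instructor.from_openai(OpenAI())


def one_step_agent(question: str):
    # Let the model choose which tool(s) to call; each tool is just a Pydantic model
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=Iterable[Union[SearchIssues, SearchSummaries]],
        messages=[
            {"role": "system", "content": "Select the tools needed to answer the user's question."},
            {"role": "user", "content": question},
        ],
    )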
We can also define logic to repair and fix faulty values
from pydantic import BaseModel, field_validator
from fuzzywuzzy import fuzz


class Product(BaseModel):
    category: str

    @field_validator("category")
    def validate_category(cls, v: str) -> str:
        known_categories = ["t-shirts", "shirts", "pants"]
        matches = [(fuzz.ratio(v.lower(), cat), cat) for cat in known_categories]
        best_match = max(matches, key=lambda x: x[0])
        if best_match[0] > 80:  # Threshold for fuzzy matching
            return best_match[1]
        raise ValueError(f"No matching category found for {v}")
This is useful when we have, say, a list of categories that might contain misspellings or slight variations.
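A quick check of the validator above:

# "Tshirt" fuzzy-matches "t-shirts" above the 80 threshold, so the value is repaired
product = Product(category="Tshirt")
assert product.category == "t-shirts"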
Open source has been an incredible learning journey. Here’s what I’ve discovered:
- Managing an open source project means making tough decisions
- Open source truly shines through collaboration
The real magic of open source isn’t just about sharing code - it’s about creating solutions that persist and grow through community effort. It’s about the friends we made along the way :)
So we’ve solved a class of parsing errors with structured outputs. That helps us get reliable outputs.
But how can we make sure our system is reliable?
There are really three good starting points for building reliable systems: simple evals, synthetic data, and user data.
Let’s see this in action
Now that we have Pydantic objects, the easiest evals that we can run are just simple unit tests
tests = [
    [
        "What is the average time to first response for issues in the azure repository over the last 6 months? Has this metric improved or worsened?",
        [RunSQLReturnPandas],
    ],
    [
        "How many issues mentioned issues with Cohere in the 'vercel/next.js' repository in the last 6 months?",
        [SearchIssues],
    ],
    [
        "What were some of the big features that were implemented in the last 4 months for the scipy repo that addressed some previously open issues?",
        [SearchSummaries],
    ],
]

for query, expected_result in tests:
    response = one_step_agent(query)
    for expected_call, agent_call in zip(expected_result, response):
        assert isinstance(agent_call, expected_call)
These evals don’t need to be perfect. We can write some simple unit tests to check that the output is of the correct type and that we’re getting the right outputs.
Most importantly, they’re here to help us clarify and iterate on the ideal outputs we expect. It’s about aligning the human to the model’s output.
Binary metrics provide a fast, reliable way to evaluate LLM system performance before investing in expensive evaluation methods. By nailing down the simplest metrics first, we can iterate quickly and make sure we’re on the right track before spending our time on more complex methods that might not be as useful.
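One way to keep these evals binary but still get a number to track over time is to record a pass or fail per test case instead of hard-asserting. A minimal sketch, reusing the tests list and one_step_agent from above:

def run_evals(tests, agent):
    # Score each test case as a simple pass/fail instead of hard-asserting,
    # so we get one accuracy number to track as the system changes
    passed = 0
    for query, expected_tools in tests:
        response = list(agent(query))
        correct = len(response) == len(expected_tools) and all(
            isinstance(agent_call, expected_call)
            for expected_call, agent_call in zip(expected_tools, response)
        )
        passed += int(correct)
    return passed / len(tests)


print(f"Tool selection accuracy: {run_evals(tests, one_step_agent):.0%}")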
Instead of immediately jumping to complex LLM-based evaluation, start by measuring retrieval quality with simple binary metrics like recall and MRR.
Most importantly, we can use these queries to test different approaches to retrieval, because we can run the same queries against each approach and see which performs best. If we were comparing, say, BM25, vector search, and vector search with a re-ranker step, we can now identify the impact on recall, MRR, and latency.
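Here is a minimal sketch of what those binary retrieval metrics look like in code; the chunk ids below are made up for illustration, and any retriever that returns ranked ids can be plugged in.

def recall_at_k(retrieved_ids: list[str], relevant_id: str, k: int = 10) -> float:
    # 1.0 if the chunk the question was generated from appears in the top-k results
    return float(relevant_id in retrieved_ids[:k])


def reciprocal_rank(retrieved_ids: list[str], relevant_id: str) -> float:
    # 1/rank of the source chunk, 0.0 if it was not retrieved at all
    if relevant_id in retrieved_ids:
        return 1 / (retrieved_ids.index(relevant_id) + 1)
    return 0.0


# Run the same labelled (question, source chunk) pairs through each retriever
# and compare the averaged scores across BM25, vector search, re-ranking, etc.
retrieved = ["chunk_7", "chunk_2", "chunk_9"]
assert recall_at_k(retrieved, "chunk_2") == 1.0
assert reciprocal_rank(retrieved, "chunk_2") == 0.5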
Text-to-SQL is more complex than simple query generation. Consider how experienced data analysts actually work: before writing any SQL, they look up the relevant tables, column definitions, and example queries.
This is … just RAG again with a different context.
By ensuring that we’re able to obtain the right context from the database, we can now test different approaches for retrieval. This ensures that given some sort of user query, our system always has the necessary context before attempting query generation. This also means that it’s easier to debug failures and progressively improve system performance in a measurable way.
Therefore, we can measure the precision and recall of relevant tables + snippets given a synthetic question for each snippet/table.
This allows us to know quantitatively exactly how well each component of our system is performing.
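One way to get those labelled pairs is to generate a synthetic question for each snippet or table with the same structured-output pattern. A sketch, assuming an instructor-patched client; the model name and prompt wording here are illustrative.

import instructor
from openai import OpenAI
from pydantic import BaseModel

client = instructor.from_openai(OpenAI())


class SyntheticQuestion(BaseModel):
    question: str


def generate_question(snippet: str) -> SyntheticQuestion:
    # The (question, snippet) pair becomes a labelled example for measuring
    # whether retrieval brings back the snippet the question came from
    return client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=SyntheticQuestion,
        messages=[
            {
                "role": "user",
                "content": f"Write a question a user could only answer with this snippet:\n\n{snippet}",
            }
        ],
    )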
We can also apply this to the tool selection we did earlier, measuring the precision and recall of the tools our agent chooses to call.
This creates an important tension: we could achieve perfect recall by selecting every available tool, but this would tank our precision and likely create operational issues.
The balance between these metrics becomes crucial when considering real-world constraints. Unnecessary tool calls might incur API costs, add latency, or introduce potential failure points.
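A minimal sketch of these two numbers for a single query, treating the expected and selected tools as sets (the tool names reuse the classes from the evals above):

def tool_selection_metrics(expected: set[str], selected: set[str]) -> tuple[float, float]:
    # Precision: what fraction of the tools we called were actually needed
    # Recall: what fraction of the needed tools we actually called
    true_positives = len(expected & selected)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(expected) if expected else 1.0
    return precision, recall


# Calling every available tool gives perfect recall but poor precision
precision, recall = tool_selection_metrics(
    expected={"SearchIssues"},
    selected={"SearchIssues", "SearchSummaries", "RunSQLReturnPandas"},
)
assert recall == 1.0 and round(precision, 2) == 0.33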
We can also look at the ability of our language models to filter and generate metadata, both at ingestion time and at query time.
By ensuring that our language model can correctly extract and generate metadata, we can now test different approaches to filtering and retrieval.
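A small sketch of what that check might look like, with a hypothetical TicketMetadata model and a hand-labelled example:

import instructor
from openai import OpenAI
from pydantic import BaseModel

client = instructor.from_openai(OpenAI())


class TicketMetadata(BaseModel):
    # Hypothetical metadata fields we might filter on at query time
    category: str
    is_urgent: bool


resp = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=TicketMetadata,
    messages=[
        {
            "role": "user",
            "content": "Extract metadata: 'URGENT: checkout page returns a 500 error for every user'",
        }
    ],
)

# Compare the extracted metadata against our hand label, just like the other evals
assert resp.is_urgent is True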
In all four examples, we’ve seen how we can use synthetic data to evaluate our system. These are valuable because they allow us to test our system in a controlled environment and iterate on it.
Most importantly, we can test and generate these queries before we even ship to production. The key thing here is really aligning the human to the model’s output.
Now that we've started to test our system gradually, it's also important to start thinking about the specific kinds of queries we're getting.
This is because it allows us to see which kinds of queries matter most and focus on what really moves the needle. Segmentation is the name of the game here.
There are really only two kinds of queries we're scared of.
We always want to kill the second one because it’s a waste of resources. The first one is debatable.
We can use a few different methods for this; we'll look at Kura, since that's what I built out recently.
Eventually when we ship to production, we’ll want to know how well we’re performing for specific queries.
Topic modelling allows us to discover clusters or categories of queries that are doing well or very badly.
Kura takes a hierarchical approach to clustering:
For example, what might start as separate clusters for “React component questions” and “Django API issues” could be combined into a higher-level “Web Development Support” category that better reflects the technical nature of these conversations.
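This isn't Kura's actual API, but a rough sketch of the underlying idea: embed conversation summaries, then let a bottom-up (agglomerative) clustering merge low-level clusters into broader categories. The summaries and cluster count below are made up for illustration.

from openai import OpenAI
from sklearn.cluster import AgglomerativeClustering

client = OpenAI()

summaries = [
    "User asking how to manage state in a React component",
    "User debugging a Django REST API serializer error",
    "User asking about enterprise pricing tiers",
]

# Embed each conversation summary
embeddings = [
    item.embedding
    for item in client.embeddings.create(
        input=summaries, model="text-embedding-3-small"
    ).data
]

# Bottom-up clustering merges nearby clusters, so "React questions" and
# "Django questions" can end up inside a broader "Web Development Support" group
labels = AgglomerativeClustering(n_clusters=2).fit_predict(embeddings)
print(list(zip(summaries, labels)))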
Please give Kura a try at www.usekura.xyz if you have some time. I'd love to hear your feedback.
Building reliable LLM applications in 2025 requires a systematic approach focused on measurable outcomes. By combining structured outputs, binary metrics, and sophisticated query understanding, teams can create robust systems that improve steadily over time.
Most importantly, this approach transforms gut feelings into quantifiable metrics, allowing teams to make data-driven decisions about system improvements.
The goal isn’t to build a perfect system immediately, but to create one that can be measured, understood, and systematically enhanced over time.
To find out more, please visit improvingrag.com