3 main components
There are really three good starting points for building reliable systems:
- Investing in simple evals early
- Leveraging synthetic data
- Tracking user data
Let’s see this in action
Building tooling for Reliable LLM Applications
For language model applications today, reliability and consistency are paramount.
As we shift from toy demos to enterprise-level applications, building systems that are accurate, consistent, and able to handle larger workloads is a new challenge.
Instructor is like a good Japanese knife. It does one thing really well, and that's structured outputs.
We don't hide much from the user or do anything fancy. We just give you a validated Pydantic model back from your LLM calls.
Let’s see this in action
import instructor
from openai import OpenAI
from pydantic import BaseModel


class ExtractUser(BaseModel):
    name: str
    age: int


client = instructor.from_openai(OpenAI())

resp = client.chat.completions.create(
    model="gpt-4",
    response_model=ExtractUser,
    messages=[{"role": "user", "content": "Extract Jason is 25 years old."}],
)

assert isinstance(resp, ExtractUser)
assert resp.name == "Jason"
assert resp.age == 25
import instructor
from anthropic import Anthropic
from pydantic import BaseModel


class ExtractUser(BaseModel):
    name: str
    age: int


client = instructor.from_anthropic(Anthropic())

resp = client.chat.completions.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    response_model=ExtractUser,
    messages=[{"role": "user", "content": "Extract Jason is 25 years old."}],
)

assert isinstance(resp, ExtractUser)
assert resp.name == "Jason"
assert resp.age == 25
import instructor
import google.generativeai as genai
from pydantic import BaseModel


class ExtractUser(BaseModel):
    name: str
    age: int


client = instructor.from_gemini(
    client=genai.GenerativeModel(model_name="models/gemini-1.5-flash-latest")
)

resp = client.chat.completions.create(
    response_model=ExtractUser,
    messages=[{"role": "user", "content": "Extract Jason is 25 years old."}],
)

assert isinstance(resp, ExtractUser)
assert resp.name == "Jason"
assert resp.age == 25
Notice how easy it is to switch providers, as the pattern remains unchanged.
Sure, there are some small provider-specific differences (e.g. Gemini takes the model name when the client is created), but otherwise the core logic remains the same.
You get a validated Pydantic model at all times. But what happens when our needs get more advanced?
import instructor
import openai
from pydantic import BaseModel


class User(BaseModel):
    name: str
    age: int


client = instructor.from_openai(openai.OpenAI())

user_stream = client.chat.completions.create_partial(
    model="gpt-4-turbo-preview",
    messages=[{"role": "user", "content": "Create a user"}],
    response_model=User,
)

for user in user_stream:
    print(user)  # Watch the response build incrementally
Watch as the response builds incrementally in real time.
We also support Jinja templating, so you can use the same values for prompt formatting and for your validation logic.
import openai
import instructor
from pydantic import BaseModel, field_validator, ValidationInfo


class QueryFilters(BaseModel):
    materials: list[str]

    @field_validator("materials")
    def check_materials(cls, value: list[str], info: ValidationInfo) -> list[str]:
        # The same context used to render the prompt is available to validators
        allowed_materials = info.context.get("materials", []) if info.context else []
        for material in value:
            if material not in allowed_materials:
                raise ValueError(
                    f"{material} is not a valid material. Valid materials are {allowed_materials}."
                )
        return value


async def extract_query_filters(
    query: str, client: openai.AsyncOpenAI, materials: list[str]
) -> QueryFilters:
    return await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """
                You are a helpful assistant that classifies clothing.

                References:
                - Materials: {{ materials }}

                Output the clothing material that you've identified.
                """,
            },
            {"role": "user", "content": query},
        ],
        context={
            "materials": materials,
        },
        response_model=QueryFilters,
    )
By using structured outputs and Instructor, we can abstract away a whole class of errors.
But it’s more than just JSON -> Python class
While most people think of structured outputs as a way to parse JSON function calls, I think using Pydantic has a few key benefits that people don’t realise
Let’s see an example
From our own experiments, changing field names can impact accuracy by up to 60%.
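The likely reason is that field names and descriptions are sent to the model as part of the JSON schema, so renaming a field changes the prompt the model sees. Here is a small sketch for inspecting that; the two models below are illustrative examples, not the ones from our experiments.

from pydantic import BaseModel, Field


class Query(BaseModel):
    q: str  # a terse, ambiguous field name


class SearchQuery(BaseModel):
    # A descriptive name and description both end up in the schema the model sees
    semantic_search_query: str = Field(
        description="A self-contained query rephrased for semantic search"
    )


print(Query.model_json_schema())
print(SearchQuery.model_json_schema())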
We can also define class methods that can interface directly with our systems.
from typing import Optional

from asyncpg import Connection
from jinja2 import Template
from openai import OpenAI
from pydantic import BaseModel, Field


class SearchIssues(BaseModel):
    """
    Use this when the user wants to get original issue information from the database
    """

    query: Optional[str]
    repo: str = Field(
        description="the repo to search for issues in, should be in the format of 'owner/repo'"
    )

    async def execute(self, conn: Connection, limit: int):
        if self.query:
            embedding = (
                OpenAI()
                .embeddings.create(input=self.query, model="text-embedding-3-small")
                .data[0]
                .embedding
            )
            args = [self.repo, limit, embedding]
        else:
            args = [self.repo, limit]
            embedding = None

        sql_query = Template(
            """
            SELECT *
            FROM {{ table_name }}
            WHERE repo_name = $1
            {%- if embedding is not none %}
            ORDER BY embedding <=> $3
            {%- endif %}
            LIMIT $2
            """
        ).render(table_name="github_issues", embedding=embedding)

        return await conn.fetch(sql_query, *args)
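To tie this together, the tool-selection step itself can be a single structured-output call that returns one or more of these models. This is only a rough sketch: SearchSummaries here is a simplified stand-in for a second tool, and this one_step_agent mirrors the one referenced in the evals later on.

from typing import Iterable, Optional, Union

import instructor
from openai import OpenAI
from pydantic import BaseModel


class SearchSummaries(BaseModel):
    """
    Use this when the user wants summarised information about issues in a repository
    """

    query: Optional[str]
    repo: str


client = instructor.from_openai(OpenAI())


def one_step_agent(question: str):
    # Let the model choose which tool(s) to call; each tool is just a Pydantic model
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=Iterable[Union[SearchIssues, SearchSummaries]],
        messages=[
            {"role": "system", "content": "Select the tools needed to answer the user's question."},
            {"role": "user", "content": question},
        ],
    )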
We can also define logic to repair and fix faulty values
from pydantic import BaseModel, field_validator
from fuzzywuzzy import fuzz


class Product(BaseModel):
    category: str

    @field_validator("category")
    def validate_category(cls, v: str) -> str:
        known_categories = ["t-shirts", "shirts", "pants"]
        matches = [(fuzz.ratio(v.lower(), cat), cat) for cat in known_categories]
        best_match = max(matches, key=lambda x: x[0])
        if best_match[0] > 80:  # Threshold for fuzzy matching
            return best_match[1]
        raise ValueError(f"No matching category found for {v}")
This is useful when we have, say, a list of categories that might contain misspellings or slight variations.
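A quick check of the validator above:

# "Tshirt" fuzzy-matches "t-shirts" above the 80 threshold, so the value is repaired
product = Product(category="Tshirt")
assert product.category == "t-shirts"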
Open source has been an incredible learning journey. Here’s what I’ve discovered:
- Managing an open source project means making tough decisions
- Open source truly shines through collaboration
The real magic of open source isn’t just about sharing code - it’s about creating solutions that persist and grow through community effort. It’s about the friends we made along the way :)
So we’ve solved a class of parsing errors with structured outputs. That helps us get reliable outputs.
But how can we make sure our system is reliable?
There are really three good starting points for building reliable systems: simple evals, synthetic data, and user data.
Let’s see this in action
Now that we have Pydantic objects, the easiest evals that we can run are just simple unit tests
tests = [
    [
        "What is the average time to first response for issues in the azure repository over the last 6 months? Has this metric improved or worsened?",
        [RunSQLReturnPandas],
    ],
    [
        "How many issues mentioned issues with Cohere in the 'vercel/next.js' repository in the last 6 months?",
        [SearchIssues],
    ],
    [
        "What were some of the big features that were implemented in the last 4 months for the scipy repo that addressed some previously open issues?",
        [SearchSummaries],
    ],
]

for query, expected_result in tests:
    response = one_step_agent(query)
    for expected_call, agent_call in zip(expected_result, response):
        assert isinstance(agent_call, expected_call)
These evals don’t need to be perfect. We can write some simple unit tests to check that the output is of the correct type and that we’re getting the right outputs.
Most importantly, they’re here to help us clarify and iterate on the ideal outputs we expect. It’s about aligning the human to the model’s output.
Binary metrics provide a fast, reliable way to evaluate LLM system performance before investing in expensive evaluation methods. By nailing down the simplest metrics first, we can iterate quickly and make sure we’re on the right track before spending our time on more complex methods that might not be as useful.
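One way to keep these evals binary but still get a number to track over time is to record a pass or fail per test case instead of hard-asserting. A minimal sketch, reusing the tests list and one_step_agent from above:

def run_evals(tests, agent):
    # Score each test case as a simple pass/fail instead of hard-asserting,
    # so we get one accuracy number to track as the system changes
    passed = 0
    for query, expected_tools in tests:
        response = list(agent(query))
        correct = len(response) == len(expected_tools) and all(
            isinstance(agent_call, expected_call)
            for expected_call, agent_call in zip(expected_tools, response)
        )
        passed += int(correct)
    return passed / len(tests)


print(f"Tool selection accuracy: {run_evals(tests, one_step_agent):.0%}")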
Instead of immediately jumping to complex LLM-based evaluation, start by measuring retrieval quality with simple binary metrics like recall and MRR.
Most importantly, we can use these queries to test different approaches to retrieval, because we can run the same queries against each approach and see which performs best. If we were comparing, say, BM25, vector search, and vector search with a re-ranker step, we can now identify the impact on recall, MRR, and latency.
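Here is a minimal sketch of what those binary retrieval metrics look like in code; the chunk ids below are made up for illustration, and any retriever that returns ranked ids can be plugged in.

def recall_at_k(retrieved_ids: list[str], relevant_id: str, k: int = 10) -> float:
    # 1.0 if the chunk the question was generated from appears in the top-k results
    return float(relevant_id in retrieved_ids[:k])


def reciprocal_rank(retrieved_ids: list[str], relevant_id: str) -> float:
    # 1/rank of the source chunk, 0.0 if it was not retrieved at all
    if relevant_id in retrieved_ids:
        return 1 / (retrieved_ids.index(relevant_id) + 1)
    return 0.0


# Run the same labelled (question, source chunk) pairs through each retriever
# and compare the averaged scores across BM25, vector search, re-ranking, etc.
retrieved = ["chunk_7", "chunk_2", "chunk_9"]
assert recall_at_k(retrieved, "chunk_2") == 1.0
assert reciprocal_rank(retrieved, "chunk_2") == 0.5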
Text-to-SQL is more complex than simple query generation. Consider how experienced data analysts actually work: before writing any SQL, they look up the relevant tables, column definitions, and example queries.
This is … just RAG again with a different context.
By ensuring that we’re able to obtain the right context from the database, we can now test different approaches for retrieval. This ensures that given some sort of user query, our system always has the necessary context before attempting query generation. This also means that it’s easier to debug failures and progressively improve system performance in a measurable way.
Therefore, we can measure the precision and recall of relevant tables + snippets given a synthetic question for each snippet/table.
This allows us to know quantitatively exactly how well each component of our system is performing.
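One way to get those labelled pairs is to generate a synthetic question for each snippet or table with the same structured-output pattern. A sketch, assuming an instructor-patched client; the model name and prompt wording here are illustrative.

import instructor
from openai import OpenAI
from pydantic import BaseModel

client = instructor.from_openai(OpenAI())


class SyntheticQuestion(BaseModel):
    question: str


def generate_question(snippet: str) -> SyntheticQuestion:
    # The (question, snippet) pair becomes a labelled example for measuring
    # whether retrieval brings back the snippet the question came from
    return client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=SyntheticQuestion,
        messages=[
            {
                "role": "user",
                "content": f"Write a question a user could only answer with this snippet:\n\n{snippet}",
            }
        ],
    )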
We can also apply this to the tool selection we did earlier, measuring the precision and recall of the tools our agent chooses to call.
This creates an important tension: we could achieve perfect recall by selecting every available tool, but this would tank our precision and likely create operational issues.
The balance between these metrics becomes crucial when considering real-world constraints. Unnecessary tool calls might incur API costs, add latency, or introduce potential failure points.
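A minimal sketch of these two numbers for a single query, treating the expected and selected tools as sets (the tool names reuse the classes from the evals above):

def tool_selection_metrics(expected: set[str], selected: set[str]) -> tuple[float, float]:
    # Precision: what fraction of the tools we called were actually needed
    # Recall: what fraction of the needed tools we actually called
    true_positives = len(expected & selected)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(expected) if expected else 1.0
    return precision, recall


# Calling every available tool gives perfect recall but poor precision
precision, recall = tool_selection_metrics(
    expected={"SearchIssues"},
    selected={"SearchIssues", "SearchSummaries", "RunSQLReturnPandas"},
)
assert recall == 1.0 and round(precision, 2) == 0.33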
We can also look at the ability of our language models to filter and generate metadata, both at ingestion time and at query time.
By ensuring that our language model can correctly extract and generate metadata, we can now test different approaches to filtering and retrieval.
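A small sketch of what that check might look like, with a hypothetical TicketMetadata model and a hand-labelled example:

import instructor
from openai import OpenAI
from pydantic import BaseModel

client = instructor.from_openai(OpenAI())


class TicketMetadata(BaseModel):
    # Hypothetical metadata fields we might filter on at query time
    category: str
    is_urgent: bool


resp = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=TicketMetadata,
    messages=[
        {
            "role": "user",
            "content": "Extract metadata: 'URGENT: checkout page returns a 500 error for every user'",
        }
    ],
)

# Compare the extracted metadata against our hand label, just like the other evals
assert resp.is_urgent is True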
In all four examples, we’ve seen how we can use synthetic data to evaluate our system. These are valuable because they allow us to test our system in a controlled environment and iterate on it.
Most importantly, we can test and generate these queries before we even ship to production. The key thing here is really aligning the human to the model’s output.
Now that we've started to test our system gradually, it's also important to start thinking about the specific kinds of queries we're getting.
This is because it allows us to see which kinds of queries matter most and focus on what really moves the needle. Segmentation is the name of the game here.
There are really only two kinds of queries we're scared of.
We always want to kill the second one because it’s a waste of resources. The first one is debatable.
We can use a few different methods for this; we'll look at Kura, since that's what I built out recently.
Eventually when we ship to production, we’ll want to know how well we’re performing for specific queries.
Topic modelling allows us to discover clusters or categories of queries that are doing well or very badly.
Kura takes a hierarchical approach to clustering:
For example, what might start as separate clusters for “React component questions” and “Django API issues” could be combined into a higher-level “Web Development Support” category that better reflects the technical nature of these conversations.
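This isn't Kura's actual API, but a rough sketch of the underlying idea: embed conversation summaries, then let a bottom-up (agglomerative) clustering merge low-level clusters into broader categories. The summaries and cluster count below are made up for illustration.

from openai import OpenAI
from sklearn.cluster import AgglomerativeClustering

client = OpenAI()

summaries = [
    "User asking how to manage state in a React component",
    "User debugging a Django REST API serializer error",
    "User asking about enterprise pricing tiers",
]

# Embed each conversation summary
embeddings = [
    item.embedding
    for item in client.embeddings.create(
        input=summaries, model="text-embedding-3-small"
    ).data
]

# Bottom-up clustering merges nearby clusters, so "React questions" and
# "Django questions" can end up inside a broader "Web Development Support" group
labels = AgglomerativeClustering(n_clusters=2).fit_predict(embeddings)
print(list(zip(summaries, labels)))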
Please give Kura a try at www.usekura.xyz if you have some time. I'd love to hear your feedback.
Building reliable LLM applications in 2025 requires a systematic approach focused on measurable outcomes. By combining structured outputs, binary metrics, and sophisticated query understanding, teams can create robust systems that improve steadily over time.
Most importantly, this approach transforms gut feelings into quantifiable metrics, allowing teams to make data-driven decisions about system improvements.
The goal isn’t to build a perfect system immediately, but to create one that can be measured, understood, and systematically enhanced over time.
To find out more, please visit improvingrag.com