Your agent works. Now what?
"Works in a demo" is not the same as "works in production."
We talk to a lot of engineering teams thinking about building custom agents. One thing we’ve noticed over the last few months is that regardless of company, use case, or technology stack, the conversation tends to drift to the same place.
To quote a CTO at a healthcare startup on a recent call: “It’s so easy to build a quick proof of concept. As soon as it hits prod, it falls apart.”
Nobody calls us to talk about the demo. They’ve usually already got a hand-rolled prototype hitting OpenAI’s API or they’ve chained a few n8n nodes together. What they usually want to talk about is everything around it: testing, rollbacks, observability, model volatility, deployment velocity.
After enough of these calls, we decided to write up our common guidance in one place. The result is “How to Build an AI Agent”, a long-form field guide that gets into some theses we’ve developed internally about what “good” actually looks like for an agent. Think of it as answering two key questions:
What actual properties separate cute prototypes from agents taking meaningful, business-critical actions in production?
How do you go about actually building those production-ready agents?
The part I wanted to touch on here is our framework for answering question 1: inspired by the idea of the 12-factor app, we lay out 6 properties that make the difference between an agent that “sort of, mostly works” and one you can actually run in production.
The six properties
1. Reliable Responses
An LLM returns strings: sometimes they're valid JSON, sometimes they're almost-valid JSON, and, dangerously, sometimes they're structurally correct but semantically wrong (`"confidence": "high"` when your system expects a float between 0 and 1).
The solution here is consistently enforced, typed contracts. You need to validate two things: inputs before they hit the model, and outputs before they hit your app. We’ve previously written about how important this is for reliable agentic coding as well (end-to-end types with TypeScript, Kysely for Postgres, OpenAPI for API contracts).
The most common mistake we see is trying to prompt your way to reliability: adding sentences like "Please ensure you only return JSON" or "Do not include markdown formatting." This works often enough to be deceptive. Instead, enforce the data contract mechanically, whether through structured output mode, schema validation with retries, or your mechanism of choice.
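As a minimal sketch of the schema-validation-with-retries pattern, assuming Pydantic v2 and a caller-supplied `call_llm` function (both stand-ins for your own stack): validate the model's raw string against a typed schema, and on failure, feed the validation error back so the model can self-correct.

```python
from pydantic import BaseModel, Field, ValidationError


class Classification(BaseModel):
    label: str
    # A float in [0, 1] -- "high" fails validation instead of sneaking through.
    confidence: float = Field(ge=0.0, le=1.0)


def parse_with_retries(call_llm, prompt: str, max_retries: int = 2) -> Classification:
    """Validate LLM output against a typed contract; retry with the error on failure."""
    feedback = ""
    for _ in range(max_retries + 1):
        raw = call_llm(prompt + feedback)
        try:
            return Classification.model_validate_json(raw)
        except ValidationError as e:
            # Append the validation error so the next attempt can fix it.
            feedback = f"\nYour last reply was invalid: {e}. Return only valid JSON."
    raise ValueError("LLM failed to produce schema-valid output")
```

The same shape works with any validator; the important part is that the contract is enforced in code, not in the prompt.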
2. Testability
This is where we see the biggest gap on calls with prospects and customers. Teams know they should be testing, but testing a non-deterministic system is tough.
The answer is two layers: evals and back-tests. Evals check a pre-defined golden set of inputs where the ideal answer is known, and you measure whether the new version is equivalent, better, or worse. It's usually a fuzzy comparison using semantic similarity, but for some use cases it can be more discrete assertions. Back-tests run your new prompt and/or model configuration against real-world historical examples and compare how results drifted.
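The eval layer can be sketched in a few lines. Here `agent` and `similarity` are stand-ins for your own implementations (the similarity function is where semantic comparison or exact assertions would plug in), and the golden-set entries are illustrative:

```python
# A golden set: inputs paired with known-good answers (entries are examples).
GOLDEN_SET = [
    {"input": "blue cotton t-shirt", "expected": "apparel"},
    {"input": "stainless steel saucepan", "expected": "kitchen"},
]


def run_evals(agent, golden_set, similarity, threshold=0.8):
    """Score an agent against a golden set; a case passes if similarity >= threshold."""
    results = []
    for case in golden_set:
        output = agent(case["input"])
        score = similarity(output, case["expected"])
        results.append({"input": case["input"], "score": score, "passed": score >= threshold})
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results
```

A back-test is the same loop pointed at logged production inputs instead of a curated set, comparing the candidate's outputs against what actually shipped.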
Testing modern agents is getting harder as more of them rely on sophisticated tool usage. If a tool mutates a system or does something expensive, you need to mock it out. But you won't know the exact values the non-deterministic LLM will call you with, so hard-coded mocks or exact-match replays won't work. (shameless plug: Logic handles this for you automatically and transparently).
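A hand-rolled version of that idea matches on the tool's argument *shape* rather than exact values. This is a sketch under stated assumptions: `make_tool_mock` and the schema format (a JSON-Schema-style `required` list) are hypothetical, not any particular library's API:

```python
def make_tool_mock(tool_schema: dict, canned_result):
    """Mock a side-effecting tool by validating argument shape, not exact values,
    since the LLM won't call with the same arguments twice."""
    required = set(tool_schema.get("required", []))

    def mock(**kwargs):
        missing = required - set(kwargs)
        if missing:
            raise TypeError(f"mocked tool called without required args: {missing}")
        return canned_result  # fixed response; no real side effects

    return mock
```

Whatever arguments the model generates, the mock accepts them as long as they fit the contract, which is exactly what a replay keyed on exact values can't do.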
Without tests, every change to a prompt or model update is a gamble. It's tempting to manually try a few inputs and, if it looks good, ship it. That always comes back to bite you eventually.
3. Version Control
For software engineers this sounds obvious, but very rarely is it done correctly. The idea here is that the lifecycle of your agent(s) will likely be quite distinct from the codebase that calls them. Everything from ownership to deployments and testing will vary from your standard services.
You'll often even want to run multiple versions of your agent in parallel; if your agentic platform doesn't facilitate that, it's worth seriously reconsidering the platform.
An agent’s behavior depends on many factors, including the prompt, the model config, schemas, tool definitions, the RAG knowledge base, etc. It’s important that you version the entire bundle and are able to roll back to a known-good state quickly.
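One way to make "version the entire bundle" concrete is to content-address it: hash every behavior-affecting component together so any change to any piece yields a new version ID. A minimal stdlib-only sketch (the field names are illustrative):

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AgentVersion:
    """Everything that affects agent behavior, versioned as one bundle."""
    prompt: str
    model: str
    temperature: float
    output_schema: dict
    tool_definitions: list

    def version_id(self) -> str:
        # Content-addressed: changing any component produces a new ID,
        # and rolling back means redeploying a known-good bundle.
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()[:12]
```

In practice the bundle would also reference the RAG knowledge-base snapshot; the point is that "version 1.4 of the agent" names one reproducible state, not just one prompt string.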
4. Observability & Logging
A lot of teams deploy agents and then fly blind, with no way to debug once an issue is found. When things go wrong you want full execution traces: every input, every output, every tool call, reasoning, etc. You want to know how many experiences were impacted and by whom. Without traces, you’re debugging a non-deterministic system where, even if you gave the system the exact same inputs, you may not replicate the customer’s bad experience.
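The shape of a useful trace is simple; what matters is that it's captured for every run and tied to a user and an agent version. A stdlib-only sketch (the event kinds and field names are assumptions, not a standard):

```python
import json
import time
import uuid


class Trace:
    """Capture every step of an agent run so a bad experience can be found and replayed."""

    def __init__(self, user_id: str, agent_version: str):
        self.trace_id = str(uuid.uuid4())
        self.user_id = user_id
        self.agent_version = agent_version
        self.events = []

    def record(self, kind: str, payload: dict):
        # kind: "input", "llm_output", "tool_call", "tool_result", "error", ...
        self.events.append({"ts": time.time(), "kind": kind, "payload": payload})

    def to_json(self) -> str:
        return json.dumps({
            "trace_id": self.trace_id,
            "user_id": self.user_id,
            "agent_version": self.agent_version,
            "events": self.events,
        })
```

With `user_id` and `agent_version` on every trace, "how many experiences were impacted, and by whom" becomes a query instead of a guess.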
5. Model independence
Most teams start with one model and hardcode it throughout their stack. That's fine for prototyping. It becomes a problem when the model you depend on gets deprecated, or has performance issues, or a competitor releases something that performs comparably for half the cost, and swapping it out requires reworking your entire agentic stack because, say, Gemini handles function calls and structured output quite differently than OpenAI.
A better approach is to decouple model concerns from your agent logic entirely. Your agentic platform should allow easy and dynamic routing. Simple classification tasks might go to a fast, cheap model; complex requests go to a large frontier model.
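The routing itself can be a small, boring function that the rest of the agent calls instead of naming a model directly. The model names, thresholds, and config shape below are all illustrative, not recommendations:

```python
def route_model(task_type: str, estimated_complexity: float) -> dict:
    """Pick a model config per request instead of hardcoding one provider.
    Names and thresholds are placeholders for your own routing policy."""
    if task_type == "classification" and estimated_complexity < 0.3:
        return {"provider": "provider-a", "model": "small-fast-model", "max_tokens": 256}
    if estimated_complexity < 0.7:
        return {"provider": "provider-b", "model": "mid-tier-model", "max_tokens": 1024}
    return {"provider": "provider-a", "model": "frontier-model", "max_tokens": 4096}
```

Because callers only ever see the returned config, swapping a deprecated model or re-balancing the quality/speed/cost triangle is a one-line policy change rather than a rework of the stack.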
You’re always navigating a triangle of quality, speed, and cost, and no single model wins on all three. If your agent is hardcoded to one provider or one model, you’ve locked yourself into more than it might feel like at first glance.
6. Robust deployments
A prompt change shouldn’t require a full code deploy. If your compliance team needs to update a policy rule in the agent and that update has to go through the standard CI/CD pipeline, wait for a code review, and ship with the next release, you’ve created a bottleneck by coupling two lifecycles that should be separate.
Behavioral updates to agents should ship independently, with their own versioning and their own rollback mechanism. This also means the people closest to the problem should be able to refine how the agent behaves without touching service code. And ideally, before a new version goes live, you can run it in shadow mode against real production data with its outputs silenced and tool calls mocked, so you can compare performance before flipping the switch.
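The shadow-mode comparison reduces to a loop over real traffic where only the live agent's output reaches users. A sketch under stated assumptions: `live_agent`, `candidate_agent`, and `compare` are caller-supplied, and tool side effects are assumed to be mocked inside the candidate:

```python
def shadow_compare(live_agent, candidate_agent, requests, compare):
    """Run a candidate agent alongside the live one on real inputs.
    The candidate's outputs are recorded, never returned to users."""
    report = {"agreements": 0, "disagreements": [], "candidate_errors": 0}
    for req in requests:
        live_out = live_agent(req)  # this is what users actually see
        try:
            shadow_out = candidate_agent(req)  # silenced; tools mocked
        except Exception:
            report["candidate_errors"] += 1
            continue
        if compare(live_out, shadow_out):
            report["agreements"] += 1
        else:
            report["disagreements"].append(
                {"input": req, "live": live_out, "shadow": shadow_out}
            )
    return report
```

The resulting report is what you review before flipping the switch: a high agreement rate plus an explainable disagreement list is the evidence that the new version is safe to promote.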
Building your agent
These six properties are the checklist we keep coming back to when we evaluate whether an agent is actually ready for production.
The rest of our guide walks through building the same product listing classifier three ways: code-driven (Python, Pydantic, OpenAI SDK), visual workflow (e.g. n8n), and spec-driven (Logic), with a discussion of the tradeoffs around each approach.
Read the full guide here and let us know what you think.