AI · June 4, 2026

The eval harness is the product

Production-grade AI is not a clever prompt. It's a system with evaluation, observability, caching, guardrails, and cost control designed in from day one.

A demo is easy. You write a good prompt, the model does something impressive once, and everyone in the room nods. Then it ships, and the next ten thousand inputs are not the one you demoed. Some of them are adversarial, some are malformed, some are just unusual, and the clever prompt that wowed the room starts producing answers that are confidently wrong. The pilot that never reached production usually died right here — not because the model was incapable, but because nobody built the system that makes a capable model trustworthy.

That system is the work. The prompt is the smallest part of it.

Bolt-on prompts versus production AI

The difference is whether evaluation, observability, caching, guardrails, and cost control were designed in from the first day or bolted on after the first incident. Bolt-on AI is a prompt wired directly to a model wired directly to a user, with hope in between. Production AI wraps that core flow in an envelope of concerns, and the most important one is evaluation.

The harness comes first

Before we tune a prompt, we build the thing that tells us whether a change made the system better or worse: a set of golden inputs with known-good outputs, and a harness that runs the whole system against them on every change. This is unglamorous and it is the entire game. Without it, every prompt edit is a vibe — you change a word, the demo still looks fine, and you have no idea what you broke for the inputs you didn't think to try. With it, a regression is caught before it deploys, and "the new version is better" becomes a measurement instead of an opinion.
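To make that concrete, here is a minimal sketch of such a harness. Everything in it is an assumption for illustration: the names `GoldenCase`, `run_harness`, and `is_regression` are hypothetical, the two "systems" are stand-in functions rather than real model calls, and exact-match scoring is the simplest possible rule. A real harness would likely score with something more forgiving than string equality.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    name: str
    input_text: str
    expected: str  # the known-good output for this input


def run_harness(system: Callable[[str], str], cases: list[GoldenCase]) -> dict:
    """Run the whole system on every golden case and collect the failures."""
    failures = []
    for case in cases:
        actual = system(case.input_text)
        # Assumption: exact match after trimming whitespace is the scoring rule.
        if actual.strip() != case.expected.strip():
            failures.append(
                {"case": case.name, "expected": case.expected, "actual": actual}
            )
    passed = len(cases) - len(failures)
    return {
        "passed": passed,
        "total": len(cases),
        "pass_rate": passed / len(cases) if cases else 0.0,
        "failures": failures,
    }


def is_regression(baseline: dict, candidate: dict) -> bool:
    """A candidate regresses if it passes a smaller share of golden cases."""
    return candidate["pass_rate"] < baseline["pass_rate"]


# Hypothetical golden set, including an input nobody would think to demo.
GOLDEN = [
    GoldenCase("simple", "hello", "HELLO"),
    GoldenCase("padded", "  hi  ", "HI"),
    GoldenCase("empty", "", ""),
]


def current_system(text: str) -> str:
    """Stand-in for the system as deployed today."""
    return text.strip().upper()


def edited_system(text: str) -> str:
    """Stand-in for the system after a 'harmless' edit that breaks empty input."""
    cleaned = text.strip().upper()
    return cleaned if cleaned else "N/A"


if __name__ == "__main__":
    baseline = run_harness(current_system, GOLDEN)
    candidate = run_harness(edited_system, GOLDEN)
    print(candidate["failures"])
    print("regression:", is_regression(baseline, candidate))
```

The edited system still looks fine on the two inputs you would demo. The harness flags it anyway, because the golden set includes the empty input, and that check can run on every change before anything deploys.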

We treat the eval harness as the product because, operationally, it is. The prompt will change a hundred times. The model will change underneath you. The harness is the thing that lets you make those changes without fear, which means it's the thing that lets the system survive past the demo.

The rest of the envelope

Around the core flow — query, retrieve, model, structured output, response — the other concerns earn their place:

Observability. Traces and metrics on every call, so when something goes wrong in production you can see the actual prompt, the actual retrieval, and the actual output, not a guess.

Caching. Prompt and result caching, designed in, because the same questions get asked repeatedly and you should not pay a model to answer them twice.

Guardrails. Structured, validated output — not free text you parse with a regex and a prayer. The model returns something the system can check, and inputs that shouldn't be processed get stopped before they are.

Cost control. Model choice fit to the workload, not to the slide. A cheap, fast model for the ninety percent of calls that are routine; the expensive model reserved for the synthesis that actually needs it. Cost modeled before the build, not discovered in the bill.
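The four concerns above can be sketched as a single wrapper around the model call. This is an illustration under stated assumptions, not a reference design: `Envelope`, `GuardrailError`, `validate_output`, the `answer`/`confidence` output schema, the 4000-character input limit, and the keyword-based routing rule are all hypothetical, and the two models are stand-in functions that return JSON strings. Retrieval is left out to keep the sketch short.

```python
import hashlib
import json
import time
from typing import Callable

ModelFn = Callable[[str], str]  # hypothetical interface: prompt in, raw text out
MAX_INPUT_CHARS = 4000  # assumed limit for the input guardrail


class GuardrailError(ValueError):
    """Raised when an input is rejected or an output fails validation."""


def validate_output(raw: str) -> dict:
    """Output guardrail: require JSON with 'answer' (str) and 'confidence' in [0, 1]."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise GuardrailError(f"output is not JSON: {exc}") from exc
    if not isinstance(data, dict) or not isinstance(data.get("answer"), str):
        raise GuardrailError("missing string field 'answer'")
    conf = data.get("confidence")
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        raise GuardrailError("'confidence' must be a number in [0, 1]")
    return data


class Envelope:
    def __init__(self, cheap: ModelFn, strong: ModelFn,
                 needs_strong: Callable[[str], bool]):
        self.cheap, self.strong, self.needs_strong = cheap, strong, needs_strong
        self.cache = {}   # normalized-query hash -> validated result
        self.traces = []  # one record per call, including failed ones

    def ask(self, query: str) -> dict:
        trace = {"query": query, "route": None, "raw": None,
                 "output": None, "error": None, "seconds": None}
        self.traces.append(trace)  # observability: recorded even if we raise below
        start = time.perf_counter()
        try:
            # Input guardrail: stop inputs that should not be processed.
            if not query.strip() or len(query) > MAX_INPUT_CHARS:
                raise GuardrailError("input rejected before any model call")
            # Caching: key on a case- and whitespace-normalized query.
            normalized = " ".join(query.lower().split())
            key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
            if key in self.cache:
                trace["route"] = "cache"
                result = self.cache[key]
            else:
                # Cost control: the strong model only when the router asks for it.
                trace["route"] = "strong" if self.needs_strong(query) else "cheap"
                model = self.strong if trace["route"] == "strong" else self.cheap
                trace["raw"] = model(query)
                result = validate_output(trace["raw"])
                self.cache[key] = result
            trace["output"] = result
            return result
        except GuardrailError as exc:
            trace["error"] = str(exc)
            raise
        finally:
            trace["seconds"] = time.perf_counter() - start


# Stand-in models and router, purely for illustration.
def cheap_model(prompt: str) -> str:
    return json.dumps({"answer": "routine: " + prompt, "confidence": 0.9})


def strong_model(prompt: str) -> str:
    return json.dumps({"answer": "synthesis: " + prompt, "confidence": 0.8})


def needs_strong(query: str) -> bool:
    return "compare" in query.lower()


if __name__ == "__main__":
    env = Envelope(cheap_model, strong_model, needs_strong)
    env.ask("What is our refund policy?")
    env.ask("what is  our refund policy?")  # same question, served from cache
    env.ask("Compare plan A and plan B")
    print([t["route"] for t in env.traces])
```

The design choice worth noticing is that the trace is appended before anything else happens, so a rejected input or an invalid output still leaves a record of the actual query and the actual raw output. A production version would swap the in-memory dict and list for a real cache and tracing backend, and would replace the keyword router with something measured against the eval harness.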

Pragmatic about models, rigorous about systems

We're deliberately unromantic about which model to use. The right one is the one that fits the workload and the budget, and that answer changes as the providers leapfrog each other. What doesn't change is the system around the model. A good harness, good observability, and good guardrails make a mediocre model usable and a great model dependable. The absence of them makes even the best model a liability the moment it leaves the demo.

If you have an AI pilot that impressed everyone and then stalled on the way to production, the missing piece is almost never a better prompt. It's the harness that was never built.
