Evals that catch regressions before users do.

A prompt change that looks better on three hand-picked examples is not better. Evals are how you know: a fixed set of cases, a pass/fail rubric, and a gate that fails loud before the regression reaches anyone.

Write the golden set before the prompt.

Most teams write the prompt first and judge it by reading a few outputs. That is how regressions ship. You tuned for the three cases you happened to paste in, and you have no idea what the other forty look like.

Invert the order. Before you touch the prompt, write the cases. Each case is a representative input, the expected output (or the properties the output must have), and a rubric that decides pass or fail. The set is the spec. The prompt is just the current attempt at satisfying it.

Representative means the spread you actually see: the boring happy path, the empty input, the adversarial one, the multilingual one, the one with a malformed field. If a class of input matters in production, it gets a case.

A pass/fail rubric per case, not a vibe.

"Looks good" does not survive contact with a second person or a second week. Every case needs a check that returns a boolean.

Some checks are exact: the JSON parses, the category is one of the allowed five, the refund amount equals 42.50. Some are structural: the answer cites at least one source, contains no PII, stays under the length cap. The fuzzy ones — tone, helpfulness, faithfulness — get an LLM-as-judge with a tight rubric and a low-temperature call, scored against a reference. Judges drift, so keep a handful of cases with known-correct human labels to check the judge itself.

Each case resolves to pass or fail on its own terms. You are not averaging a vague quality score. You are counting how many specs hold.

A harness small enough to run on every change.

The harness is not a framework. It is a loop over cases, a call to your prompt, and a check. Keep it boring so it runs in CI without ceremony.

// evals/run.ts
import { cases } from "./golden";
import { runPrompt } from "../src/prompt";

let failed = 0;
for (const c of cases) {
  const out = await runPrompt(c.input);
  const { pass, reason } = c.check(out);
  if (!pass) {
    failed++;
    console.error(`FAIL ${c.id}: ${reason}`);
  }
}
console.log(`${cases.length - failed}/${cases.length} passed`);
process.exit(failed === 0 ? 0 : 1);

Gate the deploy on the result.

An eval suite that does not block anything is a dashboard nobody reads. Wire it into CI and make a non-zero exit fail the build.

Run it on every prompt edit, every model swap, every temperature or tool-definition change — the things that have no type signature and no compiler to catch them. Moving from Claude Sonnet 4.6 to Claude Opus 4.8 is a code change as far as the gate is concerned: same suite, same bar.

Set the bar where it belongs. For most surfaces that is 100% of the golden set passing, because every case earned its place. If you allow a known-failing case, mark it as an expected failure with a reason, so a silent regression can never hide inside a fuzzy threshold.

Regression detection: you get the commit that broke it.

This is the payoff. When the suite goes red, the diff that turned it red is right there. You are not bisecting production complaints a week later; you have the offending prompt edit and the exact cases it broke before it merged.

Per-case output makes the failure legible. `FAIL refund_partial: expected 42.50, got 45.00` tells you what changed and where to look. A failing eval is a stack trace for behavior.

That is the difference between "users started complaining about refunds on Tuesday" and "PR #418 broke the partial-refund case, here is the line."

Feed production traces back so the set grows toward what breaks.

Your golden set starts as guesses about what matters. Production tells you what actually matters. Sample real traces — especially the ones that errored, got a thumbs-down, hit a fallback, or triggered a refusal — and promote the interesting ones into cases.

Each promoted trace becomes a permanent test: capture the input, write the expected behavior, add the check. The bug you saw once cannot return silently, because now it is gated.

Over months the set stops being your imagination and becomes a map of the system's real failure surface. It grows toward the cases that break, which are the only cases worth defending.

Turning vibes into proof.

"This prompt feels better" is an opinion. "This prompt passes 48 of 48, the old one passed 45" is a fact you can put in a PR.

The shift is cultural as much as technical. Prompt changes get reviewed like code because they produce evidence like code. You can compare two models, two phrasings, two temperatures, and point at numbers instead of arguing taste.

Evals will not make the model perfect. They make its behavior legible and its regressions loud, which is the whole job. Catch it in CI, or your users catch it for you.

Common questions

How many cases does a golden set need to be useful?

Start with 20 to 50 hand-written cases covering the input classes you care about: happy path, edge cases, adversarial, malformed. That is enough to catch obvious regressions on day one. The set then grows from production traces, not from trying to enumerate everything up front. Coverage of real failure modes matters more than raw count.

When should I use an LLM judge versus a deterministic check?

Prefer deterministic checks whenever the property is mechanical: valid JSON, allowed category, exact value, length cap, no PII, required citation. They are fast, free, and never drift. Reach for an LLM judge only for genuinely fuzzy properties like tone, helpfulness, or faithfulness. Score at low temperature against a reference, and keep human-labeled cases to verify the judge itself stays honest.

Does gating the deploy on evals slow the team down?

It moves the cost earlier, which is cheaper. A suite of a few dozen cases runs in CI quickly and tells you exactly which change broke which case before it merges. The alternative — shipping a silent regression and reconstructing it from user complaints a week later — is far slower and far more expensive.