Engineering May 20, 2026

Evals before vibes.

Most AI features ship on gut feel. The model seemed right in staging and the demo went well, but production is different. Evaluation sets are not optional infrastructure. They are the handbrake.

There is a specific kind of production incident that only exists in AI products. The code did not change. The model did not change. The outputs changed. Someone edited the prompt, or the model provider pushed a silent update, or the input distribution shifted after a feature launch. The behavior regressed, and no one knows when it started.

The reason this happens is not that the team is careless. It is that the team is measuring the wrong thing. They are measuring confidence instead of correctness. Vibes instead of evals.

What an eval set actually is.

An eval set is a collection of inputs with expected outputs and a rubric for judging whether a response passes. It does not need to be large. Twenty well-chosen examples will catch more regressions than a thousand random samples, because the twenty were chosen to cover the failure modes you already know about.
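
To make the definition concrete, here is a minimal sketch of an eval set as data: each case carries an input, an expected output, and a rubric expressed as a function. Every name here (`EvalCase`, `run_evals`, `fake_model`, the two sample cases) is hypothetical and chosen for illustration, and `fake_model` stands in for whatever actually calls your model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    name: str
    input: str
    expected: str
    # The rubric: given (response, expected), decide whether the response passes.
    passes: Callable[[str, str], bool]


def exact_match(response: str, expected: str) -> bool:
    # Strict rubric: same text, ignoring case and surrounding whitespace.
    return response.strip().lower() == expected.strip().lower()


def contains_expected(response: str, expected: str) -> bool:
    # Looser rubric: the expected text must appear somewhere in the response.
    return expected.lower() in response.lower()


def run_evals(cases, model):
    """Return {case name: passed} for a model callable that maps str -> str."""
    return {c.name: c.passes(model(c.input), c.expected) for c in cases}


# Hypothetical cases, each aimed at a failure mode you already know about.
GOLDEN_SET = [
    EvalCase("refund_policy", "Can I get a refund after 30 days?", "no", exact_match),
    EvalCase("cites_order_id", "Where is order 1182?", "1182", contains_expected),
]


def fake_model(prompt: str) -> str:
    # Stand-in for a real model call (assumption: your client returns a string).
    return "No" if "refund" in prompt else "Your order is on its way."


results = run_evals(GOLDEN_SET, fake_model)
print(results)
```

With the stand-in model, the refund case passes and the order case fails because the response never mentions the order number. That is the kind of known failure mode a small, deliberate set is meant to pin down.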

Write your golden set before you write your prompt.

That ordering matters. If you write the prompt first and then construct the eval set from its outputs, you are writing a test that your implementation already passes. The test tells you nothing about whether the implementation is correct. It only tells you that it is consistent with itself.

Define the inputs first. Define what a good output looks like. Then build the prompt to pass those cases. The constraint makes the work more rigorous. It also makes the scope clearer: you are not building an AI that does something vague. You are building an AI that passes specific tests.

The minimum viable eval.

A spreadsheet with three columns: input, expected output, and pass/fail criteria. Run it manually before every prompt deploy. Record the results. Keep the history.
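
If the spreadsheet lives as a CSV file, the record-keeping half can be sketched in a few lines. This is an illustration under assumptions, not a prescribed format: the column names, the `prompt_version` label, and the sample rows are all made up, and the verdicts are still typed in by a person who read the outputs.

```python
import csv
import io
from datetime import date

# The three-column sheet, inlined here so the sketch is self-contained.
SHEET = """input,expected_output,pass_criteria
"Can I get a refund after 30 days?","No","States that refunds are not available after 30 days"
"Where is order 1182?","Tracking status for order 1182","Mentions order 1182 and gives a status"
"""

HISTORY_FIELDS = ["date", "prompt_version", "input", "verdict"]


def load_cases(text):
    """Parse the sheet into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def record_run(cases, verdicts, prompt_version, run_date):
    """Pair each case with a human verdict ('pass' or 'fail') for one run."""
    if len(cases) != len(verdicts):
        raise ValueError("one verdict per case is required")
    return [
        {
            "date": run_date.isoformat(),
            "prompt_version": prompt_version,
            "input": case["input"],
            "verdict": verdict,
        }
        for case, verdict in zip(cases, verdicts)
    ]


def append_history(history_file, rows):
    """Append rows to the history, writing the header only if it is empty."""
    writer = csv.DictWriter(history_file, fieldnames=HISTORY_FIELDS)
    if history_file.tell() == 0:
        writer.writeheader()
    writer.writerows(rows)


cases = load_cases(SHEET)
history = io.StringIO()  # stands in for a history file kept next to the sheet
append_history(history, record_run(cases, ["pass", "fail"], "v3", date(2026, 5, 20)))
print(history.getvalue())
```

The point of the history is the date and version on every row: when behavior regresses, you can see which deploy it started with.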

This is not impressive. It does not require infrastructure. It requires discipline. The teams that build reliable AI products are not doing something more sophisticated than this at the start. They are doing this consistently, and then automating it once the manual version is too slow to run before every change.

The automation comes later: eval runners, regression dashboards, CI checks that block deploys when accuracy drops below a threshold. But the habit — check the behavior before you ship, check it against a fixed standard, not your memory of what it used to do — that habit is what the tooling encodes. The tooling does not create the discipline. It enforces it.
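
A CI check of that kind can be very small. The sketch below assumes eval results arrive as a mapping of case name to pass/fail, and the 90% threshold is a placeholder, not a recommendation. The function and variable names are hypothetical.

```python
THRESHOLD = 0.90  # hypothetical bar; pick one that matches your own risk tolerance


def accuracy(results):
    """Fraction of cases that passed. `results` maps case name -> bool."""
    if not results:
        raise ValueError("no eval results to score")
    return sum(results.values()) / len(results)


def gate(results, threshold=THRESHOLD):
    """Return a process exit code: 0 allows the deploy, 1 blocks it."""
    score = accuracy(results)
    print(f"accuracy {score:.2%} (threshold {threshold:.0%})")
    for name in sorted(name for name, ok in results.items() if not ok):
        print(f"FAIL {name}")
    return 0 if score >= threshold else 1


# Made-up results for illustration: three of four cases pass.
sample = {
    "refund_policy": True,
    "cites_order_id": False,
    "tone": True,
    "language": True,
}
exit_code = gate(sample)
# In a CI job you would end with sys.exit(exit_code) so a nonzero code fails the step.
```

Here three of four cases pass, which is 75%, so the gate returns 1 and the deploy would be blocked.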

Model updates are silent deploys.

Every AI product running on a third-party model is implicitly accepting silent deploys. Providers update model weights. They improve behavior in some dimensions and change it in others. The contract is the API surface, not the behavior.

An eval set is the mechanism by which you learn about these changes before your customers do. Run your evals on a schedule. When accuracy drops on a subset of cases, treat a provider-side change as a suspect and investigate it before attributing the drop to other causes. Model providers do not send changelogs for inference behavior. Your eval set is the changelog.
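
One way to read that changelog is to compare each scheduled run against a stored baseline, subset by subset. The sketch below assumes every case carries a tag naming its subset. The tags, the sample runs, and the 5-point tolerance are all invented for illustration.

```python
from collections import defaultdict


def accuracy_by_tag(results):
    """results: iterable of (tag, passed) pairs -> {tag: accuracy}."""
    totals = defaultdict(lambda: [0, 0])  # tag -> [passed count, total count]
    for tag, passed in results:
        totals[tag][0] += int(passed)
        totals[tag][1] += 1
    return {tag: passed / total for tag, (passed, total) in totals.items()}


def regressions(baseline, current, tolerance=0.05):
    """Tags whose accuracy fell by more than `tolerance` since the baseline.

    A tag missing from the current run is treated as 0.0 accuracy.
    """
    return {
        tag: (baseline[tag], current.get(tag, 0.0))
        for tag in baseline
        if baseline[tag] - current.get(tag, 0.0) > tolerance
    }


# Made-up runs: same cases, same prompt, one week apart.
last_week = [("refunds", True), ("refunds", True), ("shipping", True), ("shipping", True)]
today = [("refunds", True), ("refunds", True), ("shipping", True), ("shipping", False)]

drift = regressions(accuracy_by_tag(last_week), accuracy_by_tag(today))
print(drift)
```

In this example only the shipping subset moved, from 1.0 to 0.5. If the code and the prompt did not change between the two runs, the model is the first place to look.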

Vibes have a place.

This is not an argument against intuition. Vibes are how you find the right direction fast. When a new model comes out, you run a few prompts and feel whether it is worth investigating. When a prompt edit looks wrong, you do not need a formal eval to notice. Human judgment is fast and cheap for obvious cases.

The problem is when vibes are the only check. When the demo is the entire QA process. When "it worked when I tested it" is the release criterion. At that point, vibes are not a first filter. They are the only filter. And they will let through the failure modes that do not show up in demos, which are the failure modes that matter most in production.

Evals before vibes means: run the check, then trust the feeling. The feeling comes after the check, not instead of it.