Engineering May 06, 2026

Prompts need version control.

WizPrompt exposed the obvious failure mode: production behavior changing because a prompt changed in place. Once prompts are treated like code, the debugging gets honest.

There is a moment in every LLM engineering team's history when something breaks in production and no one can explain why. The model is the same. The code is the same. The output is different. After an hour of investigation, someone checks the prompt, and there it is: an edit made two days ago, with no diff, no record, and no way to know what the text said before.

This is the prompt management failure mode. It is completely avoidable and almost universal.

The spreadsheet problem.

Most teams early in LLM product development store prompts in spreadsheets, Notion pages, or comments in application code. This is reasonable at the start: the prompts are simple, the team is small, and the product is moving fast.

The problem arrives when the team grows, the prompts grow in complexity, and the amount of production behavior that depends on them starts to matter. At that point, the spreadsheet becomes a liability. There are no diffs. There is no history. There is no way to run an eval against the old version and the new version before deploying. There is no rollback path.

A prompt without history is a production incident waiting for a name.

We built WizPrompt because we needed it ourselves, and because the gap between how teams manage code and how they manage prompts was wider than it should have been in 2025.

What version control gives you.

Diffs. The most basic thing. Being able to see exactly what changed between the version running in production and the version you are about to deploy. This sounds trivial and is not, because without it, the review process is either skipped or done by comparing two documents side by side, manually, under pressure.
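To make the diff idea concrete, here is a minimal sketch using Python's standard difflib module. It is not WizPrompt's implementation; the function name prompt_diff, the labels, and the two example prompts are hypothetical and exist only for illustration.

```python
import difflib


def prompt_diff(old: str, new: str,
                old_label: str = "production",
                new_label: str = "candidate") -> str:
    """Return a unified diff between two prompt versions.

    Hypothetical helper for illustration; an empty string means no change.
    """
    lines = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",  # inputs are split without newlines, so add none back
    )
    return "\n".join(lines)


# Example prompt versions (made up for this sketch).
production = "You are a support assistant.\nAnswer in two sentences.\n"
candidate = "You are a support assistant.\nAnswer in one sentence.\n"

DIFF_TEXT = prompt_diff(production, candidate)
print(DIFF_TEXT)
```

Even at this size, the reviewer sees the one line that changed instead of rereading two full documents.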

Eval runs before merge. The WizPrompt pattern is: branch, edit, run evals, review results, merge. Each eval run executes the proposed version against a golden set of inputs and scores its outputs against the expected ones. Regression alerts fire if accuracy drops below a threshold. This turns "I think this edit is better" into "the eval set agrees, or it does not."
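The eval gate can be sketched in a few lines. This is an assumption-heavy illustration, not WizPrompt's eval runner: GoldenCase, accuracy, passes_gate, the exact-match scoring, the 0.9 threshold, and the fake_model stub standing in for a real model call are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class GoldenCase:
    """One golden-set entry: an input and the output we expect."""
    input_text: str
    expected: str


def accuracy(run_prompt: Callable[[str, str], str],
             prompt: str,
             cases: Sequence[GoldenCase]) -> float:
    """Fraction of golden cases whose output exactly matches the expectation."""
    if not cases:
        raise ValueError("golden set is empty")
    hits = sum(
        1 for case in cases
        if run_prompt(prompt, case.input_text).strip() == case.expected
    )
    return hits / len(cases)


def passes_gate(score: float, threshold: float) -> bool:
    """A candidate may merge only if its accuracy meets the threshold."""
    return score >= threshold


def fake_model(prompt: str, text: str) -> str:
    """Deterministic stand-in for a model call (assumption for this sketch)."""
    label = "positive" if "good" in text else "negative"
    return label.upper() if "UPPERCASE" in prompt else label


golden_set = [
    GoldenCase("good product", "positive"),
    GoldenCase("bad product", "negative"),
    GoldenCase("good service", "positive"),
    GoldenCase("awful experience", "negative"),
]

baseline_acc = accuracy(fake_model, "Classify the sentiment.", golden_set)
candidate_acc = accuracy(
    fake_model, "Classify the sentiment. Answer in UPPERCASE.", golden_set
)

print("baseline passes:", passes_gate(baseline_acc, 0.9))
print("candidate passes:", passes_gate(candidate_acc, 0.9))
```

Exact string matching is the simplest possible scorer; a real golden set would likely need a more forgiving comparison, but the gate logic stays the same.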

Rollback. When something breaks in production, and it will, a rollback path that is one command instead of a manual archaeology session through a Notion page history is the difference between a five-minute incident and a fifty-minute one.
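One way to picture a one-command rollback is an append-only history with a pointer to the active version. The sketch below assumes that design; the PromptHistory class and its methods are hypothetical and do not describe how WizPrompt stores versions.

```python
class PromptHistory:
    """Append-only prompt history with a movable 'active' pointer.

    Hypothetical structure for illustration: rollback moves the pointer
    and never deletes a version, so the record of what ran is preserved.
    """

    def __init__(self, initial: str) -> None:
        self._versions = [initial]
        self._active = 0

    @property
    def active(self) -> str:
        """Text of the version currently served."""
        return self._versions[self._active]

    def deploy(self, text: str) -> int:
        """Record a new version, make it active, and return its index."""
        self._versions.append(text)
        self._active = len(self._versions) - 1
        return self._active

    def rollback(self, steps: int = 1) -> int:
        """Point 'active' at an earlier version and return its index."""
        target = self._active - steps
        if steps < 1 or target < 0:
            raise ValueError("no earlier version to roll back to")
        self._active = target
        return target


history = PromptHistory("Answer in two sentences.")
history.deploy("Answer in one sentence.")

# The new version misbehaves in production: go back one version.
history.rollback()
print(history.active)
```

Because nothing is overwritten, the bad version is still available afterwards for the diff and the postmortem.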

Multi-model comparison as a forcing function.

An unexpected benefit of side-by-side model comparison: it reveals how model-specific most prompts are. A prompt that works well on Claude often degrades on GPT-4o, not because either model is worse, but because the prompting conventions differ. Anthropic models tend to respond well to explicit structural guidance. OpenAI models have different sensitivities, so a prompt tuned on one provider should not be assumed to transfer to the other.

Running the same prompt across providers before deployment surfaces this before it becomes a customer-facing problem. It also makes prompt authors more precise — when you know the prompt will be tested against multiple models, you write it more carefully.
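A cross-model check can be sketched as running one prompt through several model callables and scoring each against the same cases. Everything here is an assumption for illustration: compare_models is a hypothetical helper, and model_a and model_b are deterministic stubs, not real provider SDK calls and not a claim about how any real model behaves.

```python
from typing import Callable, Dict, List, Tuple

# A model is any callable taking (prompt, input_text) and returning text.
ModelFn = Callable[[str, str], str]


def compare_models(prompt: str,
                   models: Dict[str, ModelFn],
                   cases: List[Tuple[str, str]]) -> Dict[str, float]:
    """Score one prompt against the same cases on every model."""
    if not cases:
        raise ValueError("no cases to evaluate")
    scores: Dict[str, float] = {}
    for name, call in models.items():
        hits = sum(
            1 for text, expected in cases
            if call(prompt, text).strip() == expected
        )
        scores[name] = hits / len(cases)
    return scores


def model_a(prompt: str, text: str) -> str:
    """Stub that always answers with a bare word."""
    return "yes" if "refund" in text else "no"


def model_b(prompt: str, text: str) -> str:
    """Stub that adds a trailing period unless told not to."""
    answer = "yes" if "refund" in text else "no"
    return answer if "no punctuation" in prompt else answer + "."


cases = [("I want a refund", "yes"), ("Where is my order?", "no")]
models = {"model_a": model_a, "model_b": model_b}

loose_scores = compare_models(
    "Is this a refund request? Reply yes or no.", models, cases
)
strict_scores = compare_models(
    "Is this a refund request? Reply yes or no, no punctuation.", models, cases
)
print(loose_scores)
print(strict_scores)
```

In this toy setup the looser prompt only works on one stub, while the more explicit prompt works on both, which is the kind of gap a pre-deployment comparison is meant to surface.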

The deploy discipline matters most.

The biggest behavioral change WizPrompt creates is not the version history or the evals. It is the habit of treating a prompt deploy the same way you treat a code deploy. Review required. Evals passing. Merge to main. The SDK then pulls the merged version automatically.

That habit is the actual product. The software is just what enforces it.