002 · Own product · 2025

WizPrompt

Git-style prompt versioning with branches, eval runs, and deploy-on-merge. Brought 800+ prompts under version control across 9 active engineering teams.

Timeline

6 weeks

Role

Product / engineering

Prompts versioned

800+

Active teams

9 teams

[Image: WizPrompt interface showing prompt version history and eval runs]

The problem

Teams shipping LLM products have no discipline around prompt changes.

The pattern is everywhere: prompts live in spreadsheets, in Notion docs, or, worse, hardcoded in application config. When something breaks in production, there's no diff to show what changed. When a prompt update regresses a hot path, the team has no baseline to roll back to.

The result: every prompt deploy is a manual, anxiety-producing process. Engineers copy-paste from a doc into production code during a deployment. There's no eval run before the change goes live. There's no history.

The approach

Prompt management that borrows the best ideas from git and CI/CD.

WizPrompt gives prompts the same lifecycle discipline as code. Every prompt lives in a versioned branch. Changes go through a PR process with side-by-side eval runs before merging.
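To make the lifecycle concrete, here is a minimal in-memory sketch of branch-based prompt versioning. It is an illustration only: `PromptRepo`, `PromptVersion`, and every method name below are hypothetical and do not describe WizPrompt's actual data model or API.

```typescript
// Hypothetical data model; none of these names come from WizPrompt itself.
interface PromptVersion {
  id: number;
  parentId: number | null; // previous version in this history, null for the first
  text: string;
  message: string;
}

class PromptRepo {
  private versions: PromptVersion[] = [];
  private branches = new Map<string, number>(); // branch name -> head version id

  // Record a new version on a branch and advance that branch's head.
  commit(branch: string, text: string, message: string): PromptVersion {
    const parentId = this.branches.get(branch) ?? null;
    const version: PromptVersion = {
      id: this.versions.length + 1,
      parentId,
      text,
      message,
    };
    this.versions.push(version);
    this.branches.set(branch, version.id);
    return version;
  }

  // Start a new branch at the current head of an existing one.
  createBranch(name: string, from: string): void {
    const head = this.branches.get(from);
    if (head === undefined) throw new Error(`unknown branch: ${from}`);
    this.branches.set(name, head);
  }

  head(branch: string): PromptVersion | undefined {
    const id = this.branches.get(branch);
    return id === undefined ? undefined : this.versions[id - 1];
  }

  // Walk parent links from the branch head back to the first version.
  history(branch: string): PromptVersion[] {
    const out: PromptVersion[] = [];
    let current = this.head(branch);
    while (current !== undefined) {
      out.push(current);
      current =
        current.parentId === null ? undefined : this.versions[current.parentId - 1];
    }
    return out;
  }
}
```

Committing to a feature branch leaves the `main` head untouched, which is what gives a proposed change a stable baseline to be evaluated against.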

Side-by-side eval runs compare the current production prompt against the proposed change across Anthropic, OpenAI, Google, and local models. Regression alerts fire automatically when accuracy drops below a configured threshold on any eval set.
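The threshold check described above could look roughly like the following. This is a sketch under assumptions: the function name `findRegressions`, the `RegressionAlert` shape, and the convention that accuracy is a number between 0 and 1 keyed by eval set name are all illustrative, not WizPrompt's actual interface.

```typescript
// Hypothetical shapes: accuracy per eval set is a number in [0, 1].
interface RegressionAlert {
  evalSet: string;
  baseline: number | null; // production prompt's score, if that set was run
  candidate: number;       // proposed prompt's score
}

// Flag every eval set where the proposed prompt scores below the configured
// minimum accuracy. The baseline score is attached only for side-by-side display.
function findRegressions(
  baseline: Record<string, number>,
  candidate: Record<string, number>,
  minAccuracy: number,
): RegressionAlert[] {
  const alerts: RegressionAlert[] = [];
  for (const [evalSet, score] of Object.entries(candidate)) {
    if (score < minAccuracy) {
      alerts.push({ evalSet, baseline: baseline[evalSet] ?? null, candidate: score });
    }
  }
  return alerts;
}
```

An empty result means the change is clear to merge; any returned alert names the eval set that fell below the threshold alongside the production score it is being compared with.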

Deploy-on-merge means the production prompt is always the merged version: no manual copy-paste, and no drift between the source-of-truth doc and what's actually running. WizPrompt integrates via an SDK, so existing pipelines don't need to change their calling convention.
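One way to picture deploy-on-merge from the application side is sketched below. The real SDK surface is not documented here, so `PromptStore`, `merge`, `getPrompt`, and the prompt name `support-summary` are hypothetical stand-ins; the store is in-memory purely for illustration.

```typescript
// Hypothetical client; these names are illustrative, not the WizPrompt SDK.
type DeployListener = (name: string, text: string) => void;

class PromptStore {
  private production = new Map<string, string>();
  private listeners: DeployListener[] = [];

  onDeploy(listener: DeployListener): void {
    this.listeners.push(listener);
  }

  // Merging is the only path that changes production, so the merged text
  // and the running text cannot drift apart.
  merge(name: string, approvedText: string): void {
    this.production.set(name, approvedText);
    for (const listener of this.listeners) listener(name, approvedText);
  }

  getPrompt(name: string): string {
    const text = this.production.get(name);
    if (text === undefined) {
      throw new Error(`no production version for prompt: ${name}`);
    }
    return text;
  }
}

// Application code asks for a prompt by name instead of hardcoding the text.
function buildRequest(store: PromptStore, ticket: string): string {
  return `${store.getPrompt("support-summary")}\n\n${ticket}`;
}
```

The calling code never changes when a prompt does: the next call to `buildRequest` after a merge simply picks up the new text.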

What we learned

The eval problem is harder than the versioning problem.

Getting prompts under version control is straightforward engineering. The hard part is writing evals that are actually predictive. Most teams don't have golden test sets — they have vague feelings about whether the model is "doing the right thing."

WizPrompt ships with eval scaffolding that makes it easy to define golden input/output pairs and correctness criteria. The first time a team runs a structured eval on a prompt they've been using in production for months, they usually find something surprising. That's the product doing its job.
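As a rough sketch of what golden pairs plus a correctness criterion can look like in code, consider the following. All names (`GoldenCase`, `Criterion`, `runEval`) are assumptions made for illustration, and the model is a synchronous stub; a real provider call would be asynchronous.

```typescript
// Hypothetical eval scaffolding; not WizPrompt's actual API.
interface GoldenCase {
  input: string;
  expected: string;
}

// A correctness criterion decides whether an output is acceptable for a case.
type Criterion = (output: string, expected: string) => boolean;

const exactMatch: Criterion = (output, expected) =>
  output.trim() === expected.trim();

const containsExpected: Criterion = (output, expected) =>
  output.toLowerCase().includes(expected.toLowerCase());

interface EvalResult {
  accuracy: number; // fraction of cases that passed, 0 when there are no cases
  failures: GoldenCase[];
}

// Run every golden case through the model and score it with the criterion.
function runEval(
  cases: GoldenCase[],
  model: (input: string) => string,
  passes: Criterion,
): EvalResult {
  const failures = cases.filter((c) => !passes(model(c.input), c.expected));
  const accuracy =
    cases.length === 0 ? 0 : (cases.length - failures.length) / cases.length;
  return { accuracy, failures };
}
```

Keeping the criterion separate from the cases lets the same golden set be scored strictly (`exactMatch`) or loosely (`containsExpected`), and the `failures` list is where the surprising cases tend to show up.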

Multi-model comparison also surfaces a second insight: prompt quality is model-specific. A prompt tuned for Claude often degrades on GPT-4o. The side-by-side view makes this visible before it becomes a production incident.

Result

800+ prompts versioned. No more copy-paste deploys.

Across 9 active engineering teams, WizPrompt brought all production prompts under version control with full diff history and pre-deploy eval coverage. Zero manual copy-paste deploys since rollout.

"We caught a regression before it shipped for the first time in a year. That alone was worth it."

The product is live at wizprompt.pro. Multi-provider routing and shared eval libraries are shipping in Q3 2026.

Stack

Technologies used

Next.js · TypeScript · Anthropic SDK · OpenAI SDK · Postgres · Vercel · Google AI SDK

Need prompt infrastructure?

We take a small number of client engagements each year. Real problems only.
