Isn't it simpler to just send everything to the frontier model?

Simpler to write, worse to run. Most traffic (classification, extraction, routing) doesn't need frontier reasoning, so you'd pay top-tier prices and accept higher latency for work a fast model handles. Reserve Claude Opus 4.8 for the calls that genuinely turn on hard reasoning, and route bulk work to Claude Haiku 4.5 or Claude Sonnet 4.6.

Why reroute on a refusal instead of just returning it?

A refusal means the call succeeded but the content path didn't produce a usable answer, so retrying the same provider only adds latency. Treating it as a reroute signal — alongside a 429 rate limit and a 529 overloaded — lets the router try another endpoint for the same tier inside the request, before the user sees a failure. Genuine errors should still bubble up; only reroute the signals you can recover from.

Does prompt caching still help when a fallback reroutes to another provider?

The cache is per provider, so a reroute lands on a cold prefix and you pay the full first-call cost there. It still helps on the primary path, which serves most requests. Just account for the cold prefix in the task's latency budget so a reroute degrades predictably rather than blowing the budget.

Architecture June 13, 2026

Multi-model routing that survives an outage.

A router that picks the right model per task and reroutes underneath when a provider refuses, throttles, or overloads. The goal is narrow: a bad minute at one provider should never reach the user.

Two decisions, not one.

People collapse "which model" and "which provider" into one config line. They are different decisions and they fail differently.

Model selection is about fit: a fast model for classification and extraction, a frontier model for reasoning that actually needs it, a cheaper tier for bulk work. Provider routing is about availability: the same model served through more than one endpoint, so a single bad upstream doesn't take down the request.

Keep them separate in code. The task picks a tier; the tier resolves to an ordered list of providers. Mix the two and you end up rerouting a hard reasoning job onto a model that can't do it.

Route by task, not by habit.

The default failure is sending everything to the best model because it's the one you trust. That's a habit, not a decision. Most production traffic is not hard reasoning — it's tagging a ticket, pulling fields out of a document, deciding which of three branches to take.

Classify the work first, then assign a tier. Classification and extraction go to a fast model like Claude Haiku 4.5. Summarization and routine structured generation sit in the middle with Claude Sonnet 4.6. The genuinely hard calls — ambiguous planning, multi-step reasoning, the edge cases that decide whether the product feels smart — go to Claude Opus 4.8.

If you can't say why a request needs the frontier tier, it probably doesn't. Cheap tiers handle the bulk; the frontier tier earns its cost on the calls that matter.
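
One way the classify-then-assign step could look, as a minimal sketch. The task kinds, the `forceTier` override, and the `frontierReason` field are all hypothetical names chosen for illustration; the only idea taken from the text is that a request with no stated reason for the frontier tier doesn't get it.

```typescript
type Tier = "fast" | "balanced" | "frontier";

// Hypothetical task kinds; a real system would define its own.
type TaskKind =
  | "classification"
  | "extraction"
  | "summarization"
  | "structured_generation"
  | "planning"
  | "multi_step_reasoning";

// Model selection only: task kind -> tier. No provider names appear here.
const TIER_FOR_KIND: Record<TaskKind, Tier> = {
  classification: "fast",
  extraction: "fast",
  summarization: "balanced",
  structured_generation: "balanced",
  planning: "frontier",
  multi_step_reasoning: "frontier",
};

// Hypothetical task shape for this sketch.
interface Task {
  kind: TaskKind;
  forceTier?: Tier;        // optional caller override
  frontierReason?: string; // required to make a frontier override stick
}

function classify(task: Task): Tier {
  const byKind = TIER_FOR_KIND[task.kind];
  if (task.forceTier === "frontier" && !task.frontierReason) {
    // Nobody said why this needs the frontier tier, so use the table.
    return byKind;
  }
  return task.forceTier ?? byKind;
}
```

An extraction task that asks for the frontier tier without a reason still resolves to `"fast"`; add a `frontierReason` and the override is honored.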

Fallbacks that fire inside the request.

A model is served through more than one provider. When the primary returns a refusal, a 429 rate limit, or a 529 overloaded, the router walks to the next provider for the same tier — inside the same request, before the user sees anything.

Three signals, three meanings. A 429 means you're throttled: back off and try the next endpoint. A 529 means the upstream is overloaded: don't hammer it, reroute. A refusal is the subtle one — the call succeeded, but the content path declined, so retrying the same provider just burns latency. Treat all three as reroute signals, not as errors to bubble up.

The user gets one slower answer instead of one failed request. That's the whole point.
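
The three signals can be folded into one small classifier. This is a sketch under assumptions: `CallOutcome` is a hypothetical normalized shape, and the `stopReason` field is a guess at how a refusal surfaces on a successful response, so check your provider's response schema before relying on it.

```typescript
type RerouteReason = "refusal" | "rate_limit" | "overloaded";

// Hypothetical normalized result shape; real provider SDKs differ.
interface CallOutcome {
  status: number;      // HTTP status of the call
  stopReason?: string; // assumed field that reads "refusal" on a declined 200
}

type Verdict =
  | { kind: "usable" }
  | { kind: "reroute"; reason: RerouteReason }
  | { kind: "error" };

function judge(o: CallOutcome): Verdict {
  if (o.status === 429) return { kind: "reroute", reason: "rate_limit" };
  if (o.status === 529) return { kind: "reroute", reason: "overloaded" };
  if (o.status >= 200 && o.status < 300) {
    // The call succeeded; whether the answer is usable is a separate question.
    return o.stopReason === "refusal"
      ? { kind: "reroute", reason: "refusal" }
      : { kind: "usable" };
  }
  // Anything else is a genuine error and should bubble up.
  return { kind: "error" };
}
```

Keeping `"error"` as its own verdict means a 400 or a 500 is never quietly retried on the next provider.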

A small router.

The shape that holds up: a tier map from task to an ordered provider list, and a loop that tries each provider until one returns a response you can use. Keep the fallback policy in one place so every task inherits it.

Note two things in the code. There's no retry storm — each provider gets one attempt per request, and the ordering does the work; if you need backoff on 429, add it per-provider, not as a blanket sleep that taxes every healthy path. And the loop checks that the response is usable before returning it, because a provider can answer with a refusal on the 200 path — a successful call is not the same as a usable answer.

type Tier = "fast" | "balanced" | "frontier";

const TIERS: Record<Tier, Provider[]> = {
  fast:      [haikuPrimary, haikuSecondary],
  balanced:  [sonnetPrimary, sonnetSecondary],
  frontier:  [opusPrimary, opusSecondary],
};

const REROUTE = new Set(["refusal", "rate_limit", "overloaded"]); // refusal / 429 / 529

async function route(task: Task, input: Input) {
  const tier = classify(task); // pick fit, not habit
  const deadline = Date.now() + task.budgetMs; // one budget for the whole request
  let last: Response | Error | undefined;

  for (const provider of TIERS[tier]) {
    const remainingMs = deadline - Date.now();
    if (remainingMs <= 0) break;                 // budget spent: stop chasing fallbacks
    try {
      const res = await provider.call(input, { budgetMs: remainingMs });
      if (!REROUTE.has(res.reason)) return res; // usable answer
      last = res;                                // refusal on the 200 path: reroute
    } catch (err) {
      if (!REROUTE.has(reason(err))) throw err;  // real error: don't mask it
      last = err as Error;
    }
  }
  throw new AllProvidersExhausted(tier, last);
}
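
If you do need backoff on 429, one per-provider shape is a cooldown table: a throttled provider is skipped for a while, and healthy providers are never delayed. The sketch below is an assumption about how that could look, not part of the router above; the `Cooldowns` class, its defaults, and the string provider ids are all hypothetical.

```typescript
// Hypothetical per-provider cooldown tracker.
class Cooldowns {
  private readonly until = new Map<string, number>();
  private readonly strikes = new Map<string, number>();
  private readonly baseMs: number;
  private readonly maxMs: number;

  constructor(baseMs = 500, maxMs = 30000) {
    this.baseMs = baseMs;
    this.maxMs = maxMs;
  }

  // Record a 429 from one provider; its cooldown doubles per consecutive hit.
  throttled(provider: string, now: number): void {
    const n = (this.strikes.get(provider) ?? 0) + 1;
    this.strikes.set(provider, n);
    const wait = Math.min(this.baseMs * 2 ** (n - 1), this.maxMs);
    this.until.set(provider, now + wait);
  }

  // A usable answer clears that provider's history.
  ok(provider: string): void {
    this.strikes.delete(provider);
    this.until.delete(provider);
  }

  // True while the router should skip this provider.
  coolingDown(provider: string, now: number): boolean {
    return (this.until.get(provider) ?? 0) > now;
  }
}

// Keep the configured order, dropping providers that are cooling down.
function available(order: string[], c: Cooldowns, now: number): string[] {
  return order.filter((p) => !c.coolingDown(p, now));
}
```

The router would iterate over `available(...)` instead of the raw list. If that comes back empty, treat it the same as an exhausted tier rather than sleeping inside the request.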

Budgets are part of the contract.

Every task carries a latency budget and a cost ceiling. They aren't dashboards you check later; they're inputs the router enforces while the request is live.

A latency budget bounds how long you'll chase fallbacks before giving up gracefully. A cost ceiling stops a high-volume path from quietly routing bulk traffic to the frontier tier because one branch got lazy. When a request would blow either budget, the router degrades on purpose rather than running long and expensive in silence.
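
Enforcing both limits while the request is live can come down to one check before each provider call. A minimal sketch, assuming the router has a latency and cost estimate for the next attempt; the `Budget` and `Attempt` shapes and the `admit` name are hypothetical.

```typescript
// Hypothetical budget shape: both limits travel with the task.
interface Budget {
  deadlineMs: number; // absolute time by which the request must answer
  maxCostUsd: number; // cost ceiling for the whole request
}

// Assumed estimates for the next attempt.
interface Attempt {
  estLatencyMs: number; // e.g. a recent p95 for this provider
  estCostUsd: number;   // e.g. derived from token counts and price
}

type Decision = "attempt" | "degrade";

// Decide before each provider call whether it fits what is left of the budget.
function admit(b: Budget, spentUsd: number, now: number, a: Attempt): Decision {
  const timeLeft = b.deadlineMs - now;
  const costLeft = b.maxCostUsd - spentUsd;
  if (a.estLatencyMs > timeLeft) return "degrade";
  if (a.estCostUsd > costLeft) return "degrade";
  return "attempt";
}
```

Passing `now` and `spentUsd` in as arguments keeps the check pure, so the same function can be replayed against logged requests when a budget needs tuning.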

Budgets also make the system observable. Log the tier, the provider that answered, and where the time went, and a regression shows up as a number instead of a vibe.
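
A sketch of what that log record could carry, assuming the router collects one entry per provider attempt. The field names are illustrative, not a schema from any particular tracing library.

```typescript
type AttemptOutcome = "ok" | "refusal" | "rate_limit" | "overloaded";

interface AttemptLog {
  provider: string;
  ms: number; // wall-clock time spent on this attempt
  outcome: AttemptOutcome;
}

interface RouteTrace {
  tier: string;
  answeredBy: string | null; // null when every provider was exhausted
  totalMs: number;
  fallbacks: number;         // attempts beyond the first
  attempts: AttemptLog[];
}

// Collapse per-attempt records into one line that is easy to chart and alert on.
function summarize(tier: string, attempts: AttemptLog[]): RouteTrace {
  const winner = attempts.find((a) => a.outcome === "ok");
  return {
    tier,
    answeredBy: winner ? winner.provider : null,
    totalMs: attempts.reduce((sum, a) => sum + a.ms, 0),
    fallbacks: Math.max(0, attempts.length - 1),
    attempts,
  };
}
```

Emitted as one JSON line per request, a rising `fallbacks` count or a shift in `answeredBy` shows up in a query rather than in a complaint.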

Cache the stable prefix.

Most prompts have a large fixed head — system instructions, tool definitions, few-shot examples — and a small variable tail. Caching that stable prefix cuts the cost and latency of re-sending it on every call.

Order the prompt so the cacheable part comes first and the request-specific part comes last. Keep the prefix byte-stable: a stray timestamp or a reordered tool list silently invalidates the cache, and you pay full price without knowing why.
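
A small builder can make the ordering and byte-stability rules hard to break by accident. This is a sketch, not a provider API: `buildPrompt` and the `Prompt` shape are hypothetical, and how the prefix is actually marked as cacheable depends on the provider.

```typescript
interface Tool {
  name: string;
  description: string;
}

// Hypothetical prompt shape: a cacheable prefix and a per-request tail.
interface Prompt {
  prefix: string;
  tail: string;
}

function buildPrompt(
  system: string,
  tools: Tool[],
  examples: string[],
  requestTime: string,
  userInput: string,
): Prompt {
  // Sort a copy and rebuild each tool with a fixed key order, so neither the
  // caller's list order nor its object layout can change the prefix bytes.
  const stableTools = [...tools]
    .sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0))
    .map((t) => ({ name: t.name, description: t.description }));
  const prefix = JSON.stringify({ system, tools: stableTools, examples });
  // Anything request-specific, timestamps included, stays in the tail.
  const tail = JSON.stringify({ requestTime, input: userInput });
  return { prefix, tail };
}
```

Two requests with differently ordered tool lists and different timestamps produce the same `prefix` string, which is the property the cache depends on.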

Caching is per provider. When a fallback reroutes to a second endpoint, expect a cold prefix there — fold that into the latency budget so a reroute degrades predictably instead of surprising you.

Degrade on purpose.

What happens when every provider for a tier is exhausted is a choice to make before the request starts, not after it fails. Sometimes the right move is to drop to a lower tier and answer; sometimes it's to return a clear, honest partial result.

Make the degradation explicit and visible. A frontier reasoning task that falls back to a balanced model should say so in the trace, because the answer quality changed even though the request succeeded. Silent downgrades are how you ship worse output and never find out.
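
One way to make the downgrade explicit is to write the plan down as data and have the result carry both the requested and the served tier. A minimal sketch: the ladder below is an illustrative assumption (many frontier tasks should not drop at all), and `isUp` stands in for "some provider for this tier returned a usable answer".

```typescript
type Tier = "fast" | "balanced" | "frontier";

// Decided ahead of time: which tiers a request may be served by, in order.
// When the list runs out, the caller returns a partial result.
const LADDER: Record<Tier, Tier[]> = {
  frontier: ["frontier", "balanced"],
  balanced: ["balanced", "fast"],
  fast: ["fast"],
};

interface Served {
  requestedTier: Tier;
  servedTier: Tier | null; // null means a partial result was returned
  degraded: boolean;       // goes into the trace so the downgrade is visible
}

function resolve(requested: Tier, isUp: (t: Tier) => boolean): Served {
  for (const tier of LADDER[requested]) {
    if (isUp(tier)) {
      return { requestedTier: requested, servedTier: tier, degraded: tier !== requested };
    }
  }
  return { requestedTier: requested, servedTier: null, degraded: true };
}
```

Because `degraded` is set whenever the served tier differs from the requested one, a request that succeeded on a weaker model can still be counted and reviewed.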

The router's job is to keep the product responsive under a bad minute upstream. Pick the task's tier deliberately, give it real fallbacks and real budgets, and the outage becomes a slower answer instead of an error page.

A refusal, a 429, or a 529 should reroute inside the request — not surface as a failure the user has to absorb.