Why trace the agent loop instead of individual model calls?

A single user turn fans out into many model and tool calls inside one loop. Tracing isolated calls can't tell you whether you're seeing one agent looping forty times or forty healthy agents — and the looping case is the one that pages you. A loop-shaped trace, with one root span per run and child spans per iteration, makes iteration count and termination visible, which is the first thing you need in an incident.

How should I attribute cost per tenant?

Record input and output tokens on the span where each model response happened, along with the model id and the tenant id. Convert to cost at query time using a rate table keyed by model, since different models in the same agent have very different rates. Then roll up per run and per tenant. Per-tenant cost is the number that shows up in margin conversations, so it has to be a first-class dimension, not something you reconstruct later.

What's the right way to handle PII in agent traces?

Redact at the boundary, inside the trace wrapper, before anything is written — not in a later cleanup pass. Run a detector over prompt and tool-result text and swap matches for typed placeholders like [EMAIL] or [ACCOUNT_ID], preserving structure and length so the trace stays useful for debugging. The bar is whether you'd hand the log to an on-call engineer with no reason to see customer data.

Engineering June 09, 2026

Observability for agents in production.

An agent is a loop, not a request. If your observability still thinks in single calls, you're blind to the exact thing that breaks in production — the loop that won't stop, the cost that creeps, the tool that quietly fails.

Trace the loop, not the call.

A request/response trace assumes one model call answers one user. Agents don't work that way. One user turn fans out into a loop: a prompt, a model response, a tool call, a result fed back, another model response, and so on until the agent decides it's done.

Structure your trace to mirror that. One root span per agent run. Child spans for every iteration. Inside each iteration, a span for the prompt sent, the model response, and each tool invocation with its arguments and result. The shape should let you count iterations at a glance, because iteration count is the single most diagnostic number you have.
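A minimal sketch of that shape, assuming a hypothetical in-memory `Span` class rather than any particular tracing library. In a real setup these would map onto your tracer's own spans, but the nesting is the point: the root's children are the iterations, so counting them is trivial.

```javascript
// Hypothetical in-memory span, for illustration only (not a real tracing API).
class Span {
  constructor(name, attrs = {}) {
    this.name = name;
    this.attrs = attrs;
    this.children = [];
  }
  // Open a child span under this one and return it.
  start(name, attrs = {}) {
    const child = new Span(name, attrs);
    this.children.push(child);
    return child;
  }
}

// Iteration count is the number of "iteration" children on the root span.
function iterationCount(runSpan) {
  return runSpan.children.filter((s) => s.name === "iteration").length;
}

// One root span per run, one child per iteration, and inside each
// iteration a span for the prompt, the model response, and a tool call.
const run = new Span("agent_run", { tenant: "tenant-a" });
for (let i = 0; i < 3; i++) {
  const iter = run.start("iteration", { index: i });
  iter.start("prompt");
  iter.start("model");
  iter.start("tool", { name: "search" });
}

console.log(iterationCount(run)); // 3
```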

If your trace is a flat list of model calls with no parent, you can't answer the first question that matters in an incident: was this one agent looping forty times, or forty agents each behaving normally?

Attribute tokens and cost at every step.

Every model response carries input and output token counts. Record them on the span where they happened, not aggregated at the end of the run. You want to see which iteration burned the tokens, and within an iteration, how much of the input was system prompt versus tool results versus accumulated history.

Convert tokens to cost at query time using a rate table keyed by model id. Don't store cost as a fixed number you'll have to recompute when prices change; store tokens and the model, and resolve cost in the query layer or with a versioned rate. Different models in the same agent (say Opus 4.8 for planning and Haiku 4.5 for a cheap classification step) have very different rates, so cost only means something when it's per-step and per-model.
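One way to sketch a versioned rate table, under stated assumptions: the model ids, dates, and prices below are made-up placeholders, not real provider pricing, and `rateFor` and `spanCost` are hypothetical helper names. The span stores only tokens, model, and date; cost is derived when you query.

```javascript
// Versioned rate table keyed by model id. All ids and prices are
// placeholders (USD per million tokens), not real provider pricing.
const RATES = [
  { model: "planner-model", from: "2026-01-01", inPerM: 10, outPerM: 50 },
  { model: "planner-model", from: "2026-06-01", inPerM: 8, outPerM: 40 },
  { model: "classifier-model", from: "2026-01-01", inPerM: 1, outPerM: 5 },
];

// Pick the latest rate for this model that was in effect on `date`.
// ISO dates compare correctly as plain strings.
function rateFor(model, date) {
  const candidates = RATES
    .filter((r) => r.model === model && r.from <= date)
    .sort((a, b) => (a.from < b.from ? 1 : -1));
  if (candidates.length === 0) {
    throw new Error(`no rate for ${model} at ${date}`);
  }
  return candidates[0];
}

// Resolve cost from stored tokens at query time; nothing is baked into the span.
function spanCost(span) {
  const r = rateFor(span.model, span.date);
  return (span.input_tokens * r.inPerM + span.output_tokens * r.outPerM) / 1e6;
}

const span = {
  model: "planner-model",
  date: "2026-06-09",
  input_tokens: 2000,
  output_tokens: 500,
};
console.log(spanCost(span)); // 0.036
```

Because the rate is looked up by date, a price change becomes a new row in the table, and historical spans keep resolving against the rate that applied when they ran.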

Then roll it up two ways: per run, and per tenant. Per run tells you which agent invocations are expensive. Per tenant tells you who is expensive, which is the number that shows up in margin conversations.
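Both roll-ups are the same grouping with a different key. The sketch below assumes each step record already carries a `cost` resolved at query time, plus hypothetical `runId` and `tenant` fields; the values are invented for the example.

```javascript
// Sum per-step cost by an arbitrary key ("runId" or "tenant").
function rollUp(steps, key) {
  const totals = new Map();
  for (const s of steps) {
    totals.set(s[key], (totals.get(s[key]) ?? 0) + s.cost);
  }
  return totals;
}

// Invented step records: two steps in run r1, one step in run r2.
const steps = [
  { runId: "r1", tenant: "acme", cost: 0.5 },
  { runId: "r1", tenant: "acme", cost: 0.25 },
  { runId: "r2", tenant: "globex", cost: 0.125 },
];

const perRun = rollUp(steps, "runId");
const perTenant = rollUp(steps, "tenant");

console.log(perRun.get("r1"));        // 0.75
console.log(perTenant.get("globex")); // 0.125
```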

Wrap the loop once.

You don't need a tracing call at every line. You need a wrapper around the three primitives an agent repeats — the model call, the tool call, and the iteration boundary — and everything else falls out of that.

The snippet below wraps a single model call: it opens a span, records the model, tenant, and iteration, captures token usage and the stop reason from the response, and records the error if the call fails. Attaching the tenant is what lets cost roll up correctly. The same pattern wraps tool calls, recording arguments, result size, and success or failure.

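import Anthropic from "@anthropic-ai/sdk";

// Assumed client: the call shape below matches the Anthropic SDK; swap in your own.
const client = new Anthropic();

// Wraps one model call in a child span of the current run or iteration span.
async function tracedModelCall(span, ctx, req) {
  const child = span.start("model", {
    model: req.model,
    tenant: ctx.tenantId, // lets cost roll up per tenant
    iteration: ctx.iteration,
  });
  try {
    const res = await client.messages.create(req);
    // Store raw token counts; cost is resolved later from tokens and model.
    child.set({
      input_tokens: res.usage.input_tokens,
      output_tokens: res.usage.output_tokens,
      stop_reason: res.stop_reason,
    });
    return res;
  } catch (err) {
    // status is only set on HTTP errors (for example 429 or 529).
    child.set({ error: err.name, status: err.status });
    throw err;
  } finally {
    child.end(); // always close the span, on success or failure
  }
}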
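For the tool-call side, a sketch along the same lines might look like this. It assumes the same span interface as the model-call wrapper (`start`, `set`, `end`); `tool` is a hypothetical `{ name, run(args) }` object, and `makeSpan` is only a stand-in so the example runs on its own.

```javascript
// Same pattern for a tool call: record arguments, result size, success or failure.
async function tracedToolCall(span, ctx, tool, args) {
  const child = span.start("tool", {
    tool: tool.name,
    tenant: ctx.tenantId,
    iteration: ctx.iteration,
    args, // redact first if arguments can carry user data
  });
  try {
    const result = await tool.run(args);
    // Size of the serialized result in characters, not the result itself.
    child.set({ ok: true, result_chars: JSON.stringify(result ?? null).length });
    return result;
  } catch (err) {
    child.set({ ok: false, error: err.name });
    throw err;
  } finally {
    child.end();
  }
}

// Minimal stand-in span that appends records to `log` (illustration only).
function makeSpan(log) {
  return {
    start(name, attrs) {
      const rec = { name, attrs: { ...attrs }, ended: false };
      log.push(rec);
      return {
        set(more) { Object.assign(rec.attrs, more); },
        end() { rec.ended = true; },
      };
    },
  };
}

const log = [];
const echo = { name: "echo", run: async (args) => ({ echoed: args.text }) };
tracedToolCall(makeSpan(log), { tenantId: "acme", iteration: 0 }, echo, { text: "hi" })
  .then(() => console.log(log[0].attrs.ok, log[0].ended)); // true true
```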

Alert on what actually hurts.

Most agent incidents are one of four shapes, and each has a signal you can watch directly. Runaway loops: iterations per run crossing a hard ceiling, or a single run that hasn't terminated. Climbing latency: p95 time-per-run trending up, usually because tool calls or context length are growing. Failure spikes: tool error rate or model error rate jumping over baseline. Cost drift: tokens-per-run or cost-per-tenant rising without a matching rise in traffic.

Alert on those four, not on raw call volume. A spike in model calls is meaningless on its own — it could be healthy growth or a single agent stuck in a loop. The loop-shaped metrics tell you which.
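As a rough sketch, the four signals can be evaluated from a current window and a baseline window. The thresholds and field names here are illustrative assumptions, not recommendations; tune them against your own traffic.

```javascript
// Illustrative thresholds only.
const LIMITS = {
  maxIterations: 25, // runaway loops: hard ceiling on iterations per run
  p95Growth: 1.25,   // climbing latency: p95 time per run vs. baseline
  errorGrowth: 2.0,  // failure spikes: error rate vs. baseline
  costGrowth: 1.2,   // cost drift: tokens per run vs. baseline
};

// Returns the names of the alerts that fire for this window.
function alerts(current, baseline) {
  const fired = [];
  if (current.maxIterationsSeen >= LIMITS.maxIterations) fired.push("runaway_loop");
  if (current.p95RunMs > baseline.p95RunMs * LIMITS.p95Growth) fired.push("climbing_latency");
  if (current.errorRate > baseline.errorRate * LIMITS.errorGrowth) fired.push("failure_spike");
  // Normalize by run count so healthy traffic growth does not trip the alert.
  const tokensPerRun = current.tokens / current.runs;
  const baseTokensPerRun = baseline.tokens / baseline.runs;
  if (tokensPerRun > baseTokensPerRun * LIMITS.costGrowth) fired.push("cost_drift");
  return fired;
}

const baseline = { p95RunMs: 4000, errorRate: 0.01, tokens: 1000000, runs: 1000 };
const current = {
  maxIterationsSeen: 12,
  p95RunMs: 4200,
  errorRate: 0.05,
  tokens: 4000000,
  runs: 2000,
};
console.log(alerts(current, baseline)); // [ 'failure_spike', 'cost_drift' ]
```

Note that traffic doubled in the example, yet cost drift fires only because tokens per run doubled as well; raw volume alone would not trigger it.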

Put a hard cap on iterations in code as a backstop, and alert before you hit it. An agent that reaches its ceiling and gets killed is a contained incident. One with no ceiling is an unbounded bill.
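A backstop like that can be a few lines in the loop itself. In this sketch the limits are arbitrary example values, `step` is an assumed async function that runs one iteration and returns `{ done }`, and `onWarn` stands in for wherever your alert would be emitted.

```javascript
// Example limits: warn at 80% of the ceiling, stop at the ceiling.
const MAX_ITERATIONS = 20;
const WARN_AT = 16;

// Runs `step` until it reports done or the ceiling is reached.
async function runAgent(step, onWarn) {
  for (let i = 0; i < MAX_ITERATIONS; i++) {
    if (i === WARN_AT) onWarn(i); // alert before the hard stop
    const { done } = await step(i);
    if (done) return { status: "completed", iterations: i + 1 };
  }
  // Ceiling reached: a contained incident instead of an unbounded bill.
  return { status: "killed_at_ceiling", iterations: MAX_ITERATIONS };
}

// A step that never finishes: the cap contains it.
runAgent(
  async () => ({ done: false }),
  (i) => console.warn(`nearing cap at iteration ${i}`)
).then((r) => console.log(r.status, r.iterations)); // killed_at_ceiling 20
```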

Watch provider fallbacks as a signal.

If you route across providers, a fallback fires on a refusal, a 429 rate-limit response, or a 529 overloaded response. Each of those reasons means something different, so record the reason on the span, not just the fact that a fallback happened.

A climbing 429 rate means you're brushing your quota and latency will follow. A run of 529s is the provider, not you, and the right response is patience plus capacity planning, not a code change. Refusals clustering on one tenant or one prompt template is a content or prompt problem worth a look. Collapse all three into one 'fallback' counter and you've thrown away the reason you'd act on.
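Keeping the reasons apart takes very little code. The sketch below uses the status codes named above; the `refused` flag is an assumed field that your router would set, and the function names are hypothetical.

```javascript
// Map what triggered a fallback to a reason worth recording on the span.
// 429 is a rate limit and 529 is overloaded; `refused` is an assumed flag.
function fallbackReason(outcome) {
  if (outcome.refused) return "refusal";
  if (outcome.status === 429) return "rate_limited";
  if (outcome.status === 529) return "overloaded";
  return "other";
}

// Count fallbacks per reason instead of in one combined counter.
function countByReason(outcomes) {
  const counts = {};
  for (const o of outcomes) {
    const reason = fallbackReason(o);
    counts[reason] = (counts[reason] ?? 0) + 1;
  }
  return counts;
}

console.log(
  countByReason([{ status: 429 }, { status: 429 }, { status: 529 }, { refused: true }])
);
// { rate_limited: 2, overloaded: 1, refusal: 1 }
```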

Redact PII at the boundary.

Prompts and tool results are the richest debugging data you have and the most dangerous to keep. They contain whatever the user typed and whatever your tools returned — names, emails, account numbers, free text.

Redact at the boundary, before the trace is written, not in a nightly cleanup job. By the time it's written, it's already leaked. Run a detector over prompt and result text in the trace wrapper and replace matches with typed placeholders like [EMAIL] or [ACCOUNT_ID]. Keep the structure and lengths so the trace is still useful for debugging; drop the values. The test is simple: would you hand this log to an on-call engineer who has no reason to see customer data?
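A minimal sketch of typed-placeholder redaction, to be called inside the trace wrapper before any text is attached to a span. The two regular expressions are illustrative stand-ins for a real PII detector, and the account-id format (`ACC-` followed by digits) is made up for the example.

```javascript
// Illustrative detectors only; a real deployment needs a proper PII detector.
// The account-id format is a made-up example.
const DETECTORS = [
  { label: "EMAIL", pattern: /[\w.+-]+@[\w-]+(\.[\w-]+)+/g },
  { label: "ACCOUNT_ID", pattern: /\bACC-\d{6,}\b/g },
];

// Replace each match with a typed placeholder before the text reaches the trace.
function redact(text) {
  return DETECTORS.reduce(
    (out, d) => out.replace(d.pattern, `[${d.label}]`),
    text
  );
}

console.log(redact("Refund ACC-0012345 for jo@example.com"));
// Refund [ACCOUNT_ID] for [EMAIL]
```

The placeholder keeps the type of what was there, so a reader can still follow the shape of the prompt without seeing the value.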

A dashboard an operator can read.

The point of all this is a screen someone can read at 2am without a data team. Lead with the four numbers that map to the four failure modes: iterations per run, p95 time per run, failure rate, and cost per run — each with its trend against yesterday.
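Two small helpers cover most of those headline tiles. This sketch assumes a nearest-rank p95 and a simple relative change against yesterday's value; the sample durations are invented.

```javascript
// Nearest-rank p95 over a non-empty list of values.
function p95(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((95 * sorted.length) / 100);
  return sorted[rank - 1];
}

// Relative change against yesterday, as a fraction (0.25 means +25%).
function trend(today, yesterday) {
  return (today - yesterday) / yesterday;
}

// Invented run durations in milliseconds: 100, 200, ..., 2000.
const durationsMs = Array.from({ length: 20 }, (_, i) => (i + 1) * 100);

console.log(p95(durationsMs));  // 1900
console.log(trend(1900, 1520)); // 0.25
```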

Below that, two breakdowns: cost by tenant and cost by model, sorted descending. That's the whole margin story on one screen — who costs you money and which model they're spending it on. A table of the longest and most expensive recent runs, each linking to its full trace, lets an operator go from 'something's off' to the exact loop in two clicks.

No SQL, no notebook, no query language. If reading your agent's health requires writing a query, it isn't observability — it's a database with extra steps.