If you’ve shipped an LLM feature, you’ve probably had this moment:

“User says the model went off the rails 10 minutes ago.
Which prompt? Which context? Which tool call? No idea.”

That gap is the missing telemetry layer.

We’ve spent a decade building observability for microservices. For LLMs, we’re mostly back to print().

What I Mean by “Telemetry” for LLMs

For LLM inference, telemetry isn’t magic. It’s just structured answers to four questions:

  • What did we send?
    Full prompt, system prompt, retrieved context, parameters (temperature, max_tokens, etc.).

  • What did we get back?
    Raw output, finish reason, any tool calls / function calls.

  • How did it behave?
    Latency per step, tokens in/out, retries, error types.

  • What happened around it?
    User, session, upstream/downstream services, model version, prompt template version.

It’s trace_id + prompt + output + timing + metadata for every call.
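Concretely, one call’s worth of telemetry can be a single structured record. A minimal sketch in Python — the schema and field names here (e.g. prompt_template_version, cost_estimate) are illustrative, not any standard:

```python
import uuid
from dataclasses import dataclass, field, asdict

@dataclass
class LLMCallRecord:
    """One structured record per LLM call (hypothetical schema)."""
    trace_id: str
    model: str
    prompt: str                       # full prompt incl. system + retrieved context
    params: dict = field(default_factory=dict)   # temperature, max_tokens, ...
    output: str = ""                  # raw model output
    finish_reason: str = ""
    tool_calls: list = field(default_factory=list)
    latency_ms: float = 0.0
    tokens_in: int = 0
    tokens_out: int = 0
    retries: int = 0
    # context around the call
    user_id: str = ""
    session_id: str = ""
    prompt_template_version: str = ""

record = LLMCallRecord(
    trace_id=uuid.uuid4().hex,
    model="gpt-4o",
    prompt="Summarize this support ticket...",
    params={"temperature": 0.2, "max_tokens": 512},
)
# asdict(record) is what you'd ship to your log/trace backend
```

Everything later in this post — debugging, cost attribution, drift detection, audit — is a query over records shaped like this.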

Why Most LLM Stacks Don’t Have This

A few reasons:

  1. The hype outran the plumbing.
    Everyone wired up /v1/chat/completions and shipped something before thinking about ops.

  2. Old tools don’t fit quite right.
    Traditional APMs know about HTTP handlers and SQL queries, not “tool-using agents” or “RAG chains.”

  3. LLMs are non-deterministic.
    With classic code, you can often reproduce a bug locally. With LLMs, the same prompt might not misbehave twice. If you didn’t log it the first time, it’s gone.

Meanwhile we keep citing “Attention Is All You Need” (Vaswani et al., 2017) and “Language Models are Few-Shot Learners” (Brown et al., 2020), but in production the story is closer to “Logs Are Missing So Good Luck.”

The Pain You Feel Without Telemetry

Here’s what “no telemetry” actually looks like:

  • Debugging = vibes.
    Model output is wrong → you guess: bad prompt? bad retrieval? wrong model version? You can’t see the chain.

  • Latency + cost are black boxes.
    You know your OpenAI/compute bill is high, but not which flows or prompts are burning tokens.

  • Quality silently drifts.
    Provider ships a new model version, or your data distribution changes. Outputs degrade, but you only notice when users complain.

  • No audit trail.
    In regulated domains you need to answer: “What did the AI say, to whom, and why?” If you’re not logging prompts/outputs/metadata, you can’t answer that.

At scale, this stops being an annoyance and becomes: we cannot responsibly run this system.

What’s Emerging as the Telemetry Layer

You’re starting to see a stack form:

  • OpenTelemetry (OTel):
    The boring but important part. Standard spans/metrics/logs for “LLM prompt”, “embedding lookup”, “RAG step”, etc. So your LLM traces can show up next to HTTP traces in Jaeger/Grafana/Datadog instead of in some random JSON file.

  • LLM-native tracing tools:

    • Langfuse – open-source traces + prompt/output/cost logging, nice UI, OTel-friendly.

    • LangSmith – if you’re in LangChain land, it shows full chain/agent runs, lets you replay and evaluate them.

    • Weights & Biases (W&B) – extends from training into inference: log prompts/outputs, attach evals, see regressions.

  • Guardrails / policies:
    Libraries that validate outputs and also emit metrics (pass/fail rates, guard latency). These become just more spans in your trace.

The pattern is clear: treat LLM calls like any other critical RPC, but with extra semantic metadata (prompt, context, model, eval scores).
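To make that pattern concrete, here is a stdlib-only stand-in for what a span around an LLM call would carry. The names (llm.chat, prompt_template_id) and the wrapper itself are made up for illustration; with the OpenTelemetry SDK installed you’d get the same shape from tracer.start_as_current_span(...) plus span.set_attribute(...):

```python
import time
import uuid
from contextlib import contextmanager

# In-memory "exporter" standing in for Jaeger/Grafana/Datadog.
SPANS = []

@contextmanager
def llm_span(name, **attributes):
    """Toy span: records name, semantic attributes, and wall-clock latency."""
    span = {
        "span_id": uuid.uuid4().hex[:16],
        "name": name,
        "attributes": dict(attributes),
    }
    start = time.perf_counter()
    try:
        yield span
    finally:
        span["latency_ms"] = (time.perf_counter() - start) * 1000
        SPANS.append(span)

with llm_span("llm.chat", model="gpt-4o", prompt_template_id="support-v3") as span:
    # ... call the model here; attach what came back as attributes ...
    span["attributes"]["tokens_in"] = 812
    span["attributes"]["tokens_out"] = 143
```

The point is that “RAG step”, “guardrail check”, and “tool call” are just more spans with more attributes, nested under the same trace as your HTTP handler.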

If I Were Building a New LLM Product Today

I’d keep it stupidly simple:

  1. Every LLM call is a span.
    trace_id, user_id, model, prompt_template_id, latency_ms, tokens_in, tokens_out, cost_estimate.

  2. Store the raw prompt + output (with redaction where needed).
    That’s your goldmine for debugging, eval, and fine-tuning later.

  3. Adopt OTel early.
    So you can swap backends (Langfuse, LangSmith, W&B, Datadog, whatever) without rewriting instrumentation.

  4. Tie in evaluation.
    Periodically run automatic checks (accuracy on a labeled set, toxicity, policy violations) over logged traces. Alert if metrics move.

  5. Make traces a first-class artifact.
    PR reviews don’t just ask “did you add a new prompt?” but “how does this show up in traces, and how will we debug it?”
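Point 4 can start out embarrassingly simple: a scheduled job that recomputes a pass rate over logged traces and alerts when it moves. A toy sketch — the check function is a placeholder for whatever real evals you run (accuracy on a labeled set, toxicity, policy violations):

```python
def pass_rate(traces, check):
    """Fraction of logged outputs that pass a given check function."""
    if not traces:
        return 1.0
    return sum(1 for t in traces if check(t["output"])) / len(traces)

def no_refusal(output):
    # Placeholder eval: flag outputs containing a refusal marker.
    return "REFUSED" not in output

# Logged traces, as pulled from your telemetry store.
traces = [
    {"output": "Here is your summary."},
    {"output": "REFUSED: cannot answer."},
    {"output": "Done."},
]

THRESHOLD = 0.9
rate = pass_rate(traces, no_refusal)
if rate < THRESHOLD:
    print(f"ALERT: pass rate dropped to {rate:.0%}")
```

Once this exists, “quality silently drifts” from the earlier section becomes a pager alert instead of a support ticket.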

The models are powerful enough. The missing piece in most stacks isn’t a bigger transformer; it’s simply being able to answer “what happened?” with data instead of guesses.

That’s the telemetry layer.
