
What I Learned Monitoring LLMs in Production for a Year

I spent 2024 as an LLM Observability Engineer at DigitizedLLC, deploying Claude 3.5 Sonnet and Llama 3.2 via AWS Bedrock into production systems. My job was to make sure we could answer one question at any given moment: "Are our AI systems healthy?"

That sounds straightforward. It isn't. LLMs break every assumption traditional monitoring is built on. Here's what I learned.


Traditional APM Doesn't Work for LLMs

Application Performance Monitoring tools like Datadog, New Relic, and Grafana are built around a simple model: requests come in, responses go out, and you measure latency and error rate. If latency spikes or errors increase, alert.

LLMs violate this model in several ways:

No binary success/failure. A traditional API returns a 200 or a 500. An LLM returns a 200 with text that might be completely wrong. The HTTP status code tells you the request completed, not that the output was correct. You need quality metrics, not just availability metrics.

Latency is not constant. A 10-token response takes 200ms. A 4,000-token response takes 8 seconds. Both are "normal." Setting a static latency threshold doesn't work. You need to normalize latency by token count and track time-to-first-token separately from total generation time (a short sketch of this normalization follows below).

Cost scales with input, not with requests. A single request with a 100K-token context window can cost as much as 1,000 requests with 100-token prompts. Request count is a meaningless metric for cost monitoring. You need token-level tracking with cost attribution per model and per feature.

Behavior drifts over time. Model providers update their models. Prompt templates get modified. Context documents change. The same system can produce measurably different outputs this month versus last month with zero code changes. You need longitudinal quality tracking.
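
To make the latency point concrete, here is a minimal sketch of normalizing latency by output tokens so that short and long responses become comparable. The record fields and numbers are illustrative, not pulled from our stack:

    # Hypothetical per-request record; field names are illustrative.
    from dataclasses import dataclass

    @dataclass
    class LlmRequest:
        model: str
        input_tokens: int
        output_tokens: int
        total_latency_ms: float

    def ms_per_output_token(req: LlmRequest) -> float:
        """Latency normalized by response length, comparable across short and long outputs."""
        return req.total_latency_ms / max(req.output_tokens, 1)

    # The two "normal" requests from above: 200ms for 10 tokens vs. 8s for 4,000 tokens.
    short = LlmRequest("claude-3-5-sonnet", 500, 10, 200.0)
    long_ = LlmRequest("claude-3-5-sonnet", 500, 4_000, 8_000.0)
    print(ms_per_output_token(short), ms_per_output_token(long_))  # 20.0 vs. 2.0
    # The remaining gap is mostly fixed per-request overhead (time to first token),
    # which is why TTFT is worth tracking as its own metric.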


The Metrics That Actually Matter

After a year of iteration, here's what our production LLM monitoring stack converged on:

1. Token Economics

Track these per request, per model, per feature:

  • Input tokens — how much context are you sending?
  • Output tokens — how verbose are the responses?
  • Cost per request — calculated from token counts and model pricing
  • Cost per user action — the business-relevant number

We found that 15% of our prompt templates were responsible for 60% of our token spend. Optimizing those templates (better system prompts, more focused context retrieval) cut costs significantly without affecting output quality.
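
Here is a minimal sketch of per-request cost attribution, assuming a DogStatsD-style metrics client. The model names, metric names, and per-1K-token prices are placeholders; substitute your provider's current rates:

    # Illustrative per-1K-token pricing; these numbers are placeholders, not quoted rates.
    PRICE_PER_1K = {
        "claude-3-5-sonnet": {"input": 0.003, "output": 0.015},
        "llama-3-2": {"input": 0.002, "output": 0.002},
    }

    def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
        """Dollar cost of one request, derived from token counts and model pricing."""
        rates = PRICE_PER_1K[model]
        return (input_tokens / 1000) * rates["input"] + (output_tokens / 1000) * rates["output"]

    def record_request(metrics, model: str, feature: str, input_tokens: int, output_tokens: int) -> None:
        """Emit token and cost metrics tagged by model and by feature (the business dimension)."""
        tags = [f"model:{model}", f"feature:{feature}"]
        metrics.increment("llm.tokens.input", input_tokens, tags=tags)
        metrics.increment("llm.tokens.output", output_tokens, tags=tags)
        metrics.increment("llm.cost.usd", request_cost(model, input_tokens, output_tokens), tags=tags)

Summing the cost metric by the feature tag is what surfaced the 15%-of-templates, 60%-of-spend skew mentioned above.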

2. Latency Decomposition

Don't just track total latency. Break it down:

  • Time to First Token (TTFT) — how long until the user sees something
  • Tokens per second — generation speed after first token
  • Total generation time — wall clock duration
  • Queue time — time waiting for model capacity (especially relevant with Bedrock)

TTFT is the metric users feel. If TTFT is under 500ms, the response feels instant even if total generation takes 5 seconds. If TTFT is 3 seconds, users think the system is broken. We set our alerts on TTFT, not total latency.
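
TTFT is easiest to capture by wrapping the streaming response itself. Here is a sketch that wraps a generic chunk iterator rather than a specific SDK; the metric names and the metrics client are assumptions:

    import time
    from typing import Iterable, Iterator

    def timed_stream(chunks: Iterable[str], metrics, tags: list[str]) -> Iterator[str]:
        """Re-yield streamed chunks while recording TTFT, total time, and generation speed."""
        start = time.monotonic()
        first_token_at = None
        n_chunks = 0
        for chunk in chunks:
            if first_token_at is None:
                first_token_at = time.monotonic()
                metrics.distribution("llm.ttft_ms", (first_token_at - start) * 1000, tags=tags)
            n_chunks += 1
            yield chunk
        total_s = time.monotonic() - start
        metrics.distribution("llm.total_generation_ms", total_s * 1000, tags=tags)
        if first_token_at is not None and total_s > 0:
            # Chunks are a rough proxy for tokens; use real token counts if the provider returns them.
            metrics.distribution("llm.chunks_per_second", n_chunks / total_s, tags=tags)

Because the TTFT metric is recorded independently of response length, it can be alerted on directly, which is what the P95 TTFT alert described later relies on.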

3. Quality Signals

This is the hard part. How do you measure if an LLM output is "good"?

Output length distribution. Sudden changes in average output length often indicate a problem — the model is being too terse (context issue) or too verbose (prompt regression).

Refusal rate. Track how often the model refuses to answer or returns safety-filtered responses. A spike in refusals usually means your prompt template or context retrieval is feeding the model content that trips its safety filters.

User feedback signals. Thumbs up/down, regeneration rate, follow-up question rate. These are imperfect but they're the closest thing to ground truth in production.

Structured output compliance. If you're expecting JSON, track the parse success rate. If you're expecting a specific format, validate it. Format compliance failures are the canary in the coal mine for broader quality issues.
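
The format-compliance signal is cheap to implement. A sketch for the JSON case, again assuming a DogStatsD-style metrics client and illustrative metric names:

    import json
    from typing import Any, Optional

    def track_json_compliance(raw_output: str, metrics, tags: list[str]) -> Optional[Any]:
        """Record output length and whether a response that should be JSON actually parses."""
        metrics.distribution("llm.output_chars", len(raw_output), tags=tags)  # length distribution
        try:
            parsed = json.loads(raw_output)
            metrics.increment("llm.structured_output.parse_ok", tags=tags)
            return parsed
        except json.JSONDecodeError:
            metrics.increment("llm.structured_output.parse_fail", tags=tags)
            return None

The parse failure rate used in the alerting section below is just parse failures divided by total attempts over a rolling window.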

4. Provider Health

If you're using a managed service like AWS Bedrock:

  • Throttle rate — how often are you hitting rate limits?
  • Invocation errors — model unavailable, timeout, capacity issues
  • Cross-model comparison — if you run the same prompt through multiple models, track quality and cost differences

We built dashboards that compared Claude 3.5 Sonnet and Llama 3.2 side-by-side on the same workloads. This data informed our routing decisions — when to use which model for which task.
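
On Bedrock, throttles and capacity errors surface as ClientError responses from boto3, so provider health can be counted at the call site. Here is a sketch using the Converse API; verify the error-code strings against your own logs, since they can vary by operation:

    import boto3
    from botocore.exceptions import ClientError

    bedrock = boto3.client("bedrock-runtime")

    def invoke_with_health_metrics(metrics, tags: list[str], **converse_kwargs):
        """Call Bedrock and classify failures into throttles vs. other invocation errors."""
        try:
            response = bedrock.converse(**converse_kwargs)
            metrics.increment("llm.bedrock.invocations", tags=tags)
            return response
        except ClientError as err:
            code = err.response.get("Error", {}).get("Code", "Unknown")
            if code == "ThrottlingException":
                metrics.increment("llm.bedrock.throttles", tags=tags)
            else:
                metrics.increment("llm.bedrock.errors", tags=tags + [f"error_code:{code}"])
            raise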


Building the Alerting Stack

Traditional alerting (static thresholds, anomaly detection) partially works, but you need to layer on LLM-specific alerts:

Cost anomaly detection. We used Datadog's anomaly detection on hourly token spend. A runaway prompt loop can burn through your budget in minutes if you're using a large model with big context windows.

TTFT degradation. Alert when P95 TTFT exceeds 2 seconds for more than 5 minutes. This catches provider-side issues and capacity problems before they affect user experience.

Quality regression. Alert when structured output parse failure rate exceeds 5% over a 1-hour window. This catches prompt regressions and model behavior changes.

The PagerDuty integration mattered. We integrated Datadog alerts with PagerDuty for on-call routing. LLM issues need the same incident response rigor as traditional infrastructure issues. The difference is that your runbook includes "check the prompt template" and "compare outputs to last week's baseline" in addition to the usual infrastructure checks.
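
For reference, the generic metrics client in the sketches above can be DogStatsD from the datadog Python package; the alerts then sit on top of these metrics in Datadog. The agent address and tag values below are placeholders:

    from datadog import initialize, statsd

    # Point DogStatsD at the local Datadog agent; host and port are deployment-specific.
    initialize(statsd_host="127.0.0.1", statsd_port=8125)

    tags = ["env:prod", "model:claude-3-5-sonnet", "feature:search_summary"]

    statsd.distribution("llm.ttft_ms", 420, tags=tags)               # feeds the P95 TTFT alert
    statsd.increment("llm.cost.usd", 0.0123, tags=tags)              # feeds hourly cost anomaly detection
    statsd.increment("llm.structured_output.parse_fail", tags=tags)  # feeds the parse-failure-rate alert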


The 63% MTTR Reduction

When I started, the team's mean time to resolution for LLM-related incidents was measured in hours. By the end of 2024, we'd reduced it by 63%. Here's what drove that improvement:

Structured dashboards. Instead of digging through logs, engineers could look at a single dashboard and see: token spend (normal/abnormal), latency decomposition (where is time being spent), quality signals (are outputs degrading), and provider health (is Bedrock throttling us). Having everything in one place eliminated the investigation phase for most incidents.

Runbooks for LLM-specific failure modes. We documented the top 10 LLM failure patterns with clear decision trees. "Output quality degraded" → check prompt template version → check context retrieval results → check model provider status → check recent deployment changes. Runbooks turn a 2-hour investigation into a 15-minute checklist.

Automated context collection. When an alert fires, our system automatically collects: the last 10 requests that triggered the alert, the prompt template version, the model being used, and the context window contents. The on-call engineer gets all of this in the PagerDuty alert, so they can start diagnosing immediately.
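
Conceptually, the collection step is just an alert-webhook handler that bundles recent request metadata into the incident payload. A rough sketch; every helper, store, and field name here is hypothetical:

    from dataclasses import dataclass, asdict

    @dataclass
    class IncidentContext:
        alert_name: str
        model_id: str
        prompt_template_version: str
        recent_requests: list  # summaries of the last N requests that matched the alert

    def build_incident_context(alert_name: str, request_log, template_registry, n: int = 10) -> dict:
        """Gather the debugging context an on-call engineer needs before opening a dashboard."""
        recent = request_log.most_recent(matching=alert_name, limit=n)  # hypothetical log-store API
        ctx = IncidentContext(
            alert_name=alert_name,
            model_id=recent[0].model_id if recent else "unknown",
            prompt_template_version=template_registry.current_version(),  # hypothetical registry API
            recent_requests=[r.summary() for r in recent],
        )
        # Attach as custom details on the PagerDuty event so the context arrives with the page.
        return asdict(ctx)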


Takeaways

If you're deploying LLMs in production and don't have observability yet, start here:

  1. Track tokens, not just requests. Token count is the fundamental unit of LLM cost and performance.
  2. Measure TTFT separately. It's the metric users actually feel.
  3. Build quality signals early. Even imperfect ones (output length, format compliance) are better than nothing.
  4. Alert on cost anomalies. A single bug can cost you thousands with LLMs. Set up cost alerts before anything else.
  5. Treat LLM incidents like infrastructure incidents. On-call, runbooks, post-mortems. The same SRE rigor applies.

LLM observability is still an emerging field. Most companies deploying AI don't have adequate monitoring. If you're an SRE or platform engineer, this is one of the highest-leverage areas to build expertise in right now.


For more on this topic, check out my post on OpenRouter + Datadog Observability for a specific integration guide.
