OpenRouter + Datadog Observability

In the world of traditional software engineering, deploying an API without logging is malpractice. You wouldn't ship a database query without knowing how long it takes, or an HTTP endpoint without tracking its status codes.

Yet, we see AI applications shipped to production every day where the core logic—the LLM call—is a black box. We treat it like magic: we send a prompt, we get an answer, and we cross our fingers.

As a Site Reliability Engineer, this terrifies me.

If you are routing requests through OpenRouter, you already have a powerful advantage: a unified interface for 100+ models. But the real game-changer is observability. By coupling OpenRouter's Broadcast feature with Datadog LLM Observability, you can treat your AI features like any other production dependency: measurable, traceable, and debuggable.

Here is how to set it up, and more importantly, why you should.

The Integration

The architecture is simple but effective. OpenRouter acts as middleware that asynchronously "broadcasts" telemetry data (request traces, token counts, costs, and latencies) directly to your Datadog instance. This happens out-of-band, so it adds no latency to your user-facing requests.

Setup Guide (5 Minutes)

You don't need to install new SDKs or wrap your code in complex tracing logic.
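
Your application code stays exactly as it is. Here is a minimal sketch of a typical OpenRouter call through the OpenAI-compatible SDK (the model and key are placeholders); nothing in it changes when Broadcast is turned on, because telemetry is emitted server-side by OpenRouter:

```python
# A standard OpenRouter request via the OpenAI-compatible endpoint.
# Enabling Broadcast requires no changes to this code.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter API key
)

response = client.chat.completions.create(
    model="anthropic/claude-3.5-sonnet",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```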

1. Generate a Datadog API Key

Navigate to your Datadog dashboard: Organization Settings > API Keys. Create a new key specifically for this integration (e.g., openrouter-broadcast-key).
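
Before wiring the key into OpenRouter, it is worth a quick sanity check against Datadog's key-validation endpoint. A sketch, assuming the US1 site (adjust the host to your site, as described in step 3):

```python
# Confirm the new API key is valid before handing it to OpenRouter.
import requests

resp = requests.get(
    "https://api.datadoghq.com/api/v1/validate",
    headers={"DD-API-KEY": "<openrouter-broadcast-key>"},
)
print(resp.status_code, resp.json())  # expect 200 and {"valid": true}
```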

2. Configure OpenRouter

Head over to OpenRouter Settings > Broadcast. Toggle Enable Broadcast to ON.

3. Connect the Pipes

Click the edit icon next to Datadog and fill in the details:

  • API Key: The key you just created.
  • ML App: A logical name for your service (e.g., production-chatbot or content-engine).
  • Site URL: This defaults to https://api.us5.datadoghq.com. Check your browser's address bar to find your Datadog site: if you are on app.datadoghq.com, use https://api.datadoghq.com; if you are on us3.datadoghq.com, use https://api.us3.datadoghq.com; and so on (see the mapping sketch after this list).
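
For quick reference, here is how the common Datadog sites map to API hosts. This is a sketch; confirm against Datadog's site documentation, since your organization may be on a site not listed here:

```python
# Datadog site (what the browser shows) -> Broadcast "Site URL" value.
DATADOG_API_HOSTS = {
    "app.datadoghq.com": "https://api.datadoghq.com",      # US1
    "us3.datadoghq.com": "https://api.us3.datadoghq.com",  # US3
    "us5.datadoghq.com": "https://api.us5.datadoghq.com",  # US5
    "app.datadoghq.eu": "https://api.datadoghq.eu",        # EU1
}
```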

4. Verify

Click Test Connection. If it turns green, you are live.


Why This Matters for Production

Once the data starts flowing, you move from "guessing" to "engineering". Here is what you get out of the box:

1. Cost Attribution

Datadog will track the exact cost of every request. You can break this down by model, by user (if you pass user IDs in headers), or by feature.
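
One way to make per-user breakdowns possible is to tag each request with an end-user identifier. A sketch using the OpenAI-compatible user field; exactly how it surfaces in your Datadog traces is worth verifying:

```python
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")

# Attach an end-user identifier so spend can be attributed per user.
# The `user` field is the OpenAI-compatible option; "user-42" is a
# hypothetical internal ID.
response = client.chat.completions.create(
    model="anthropic/claude-3.5-sonnet",
    messages=[{"role": "user", "content": "Summarize my inbox."}],
    user="user-42",
)
```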

  • SRE Take: Set up a monitor that alerts when your hourly spend spikes (say, 200% above baseline). Catch infinite loops or abusive traffic before the monthly bill arrives; a monitor sketch follows below.
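
A sketch of such a monitor created through Datadog's official Python client. The metric name is an assumption; substitute whatever cost metric your LLM Observability data actually emits:

```python
# Hypothetical cost-spike monitor. Reads DD_API_KEY / DD_APP_KEY
# (and DD_SITE) from the environment.
from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.monitors_api import MonitorsApi
from datadog_api_client.v1.model.monitor import Monitor
from datadog_api_client.v1.model.monitor_type import MonitorType

body = Monitor(
    name="LLM hourly spend spike",
    type=MonitorType("query alert"),
    # `llm.estimated_cost` is a placeholder metric name. For the
    # "200% spike" variant, a pct_change() query works as well.
    query="sum(last_1h):sum:llm.estimated_cost{ml_app:production-chatbot} > 50",
    message="Hourly LLM spend breached $50. Check for loops or abusive traffic.",
)

with ApiClient(Configuration()) as api_client:
    monitor = MonitorsApi(api_client).create_monitor(body=body)
    print(monitor.id)
```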

2. Latency Waterfalls

Is gpt-4 feeling slow today? Is claude-3.5-sonnet beating it on speed? The traces show you the full breakdown: Time to First Token (TTFT) and total generation time.

  • SRE Take: Use this data to drive dynamic routing. If a model's P99 latency breaches your SLA, fail over to a faster, smaller model; a fallback sketch follows below.
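
OpenRouter's models parameter gives you a static version of this failover for free: it tries the listed models in order until one succeeds. A sketch (the model choices are illustrative; a dynamic version would populate the list from your Datadog latency data):

```python
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")

# If the first model errors or is unavailable, OpenRouter falls
# through to the next one in the list.
response = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"models": ["openai/gpt-4o", "anthropic/claude-3.5-sonnet"]},
)
print(response.model)  # which model actually served the request
```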

3. Quality & Error Tracking

When a request fails, you need to know if it was a timeout, a rate limit, or a content policy violation. Datadog captures the full error trace.

  • SRE Take: Don't just retry blindly. Analyze the error rates per provider. If one provider is unstable, you have the data to justify changing routing priorities; a retry-policy sketch follows below.
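
A sketch of a status-aware retry policy using the OpenAI SDK's exception types; the classification choices here are illustrative, not a complete policy:

```python
import time
from openai import OpenAI, APIStatusError, APITimeoutError, RateLimitError

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")

def call_with_policy(messages, retries=2):
    """Retry only the errors worth retrying; surface the rest."""
    for attempt in range(retries + 1):
        try:
            return client.chat.completions.create(
                model="anthropic/claude-3.5-sonnet", messages=messages
            )
        except (APITimeoutError, RateLimitError):
            # Transient (timeout) or throttled (429): back off and retry.
            time.sleep(2 ** attempt)
        except APIStatusError:
            # Other 4xx/5xx, e.g. content policy violations: retrying
            # blindly wastes tokens, so let the caller decide.
            raise
    raise RuntimeError("Retries exhausted")
```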

Conclusion

Observability is the art of asking questions about your system from the outside. With OpenRouter and Datadog, you can finally answer: "Is my AI application healthy?"

It turns the magic black box into a reliable, engineered component. And for an SRE, that is the only way to ship.