Observability for LLM Systems in Production
What you'll learn: By the end of this guide you will be able to instrument every LLM call with OpenTelemetry, build a cost and latency dashboard in Grafana, define and alert on the four golden signals for AI systems, and run a structured incident response process when your LLM feature degrades in production.
You cannot improve what you cannot measure. LLM systems are notoriously opaque — a response that looks correct might be hallucinated; a latency spike might be model provider instability or a prompt injection attack. Observability is what turns "something feels wrong" into "the p95 latency on the /support endpoint increased 340% at 14:23 UTC and correlates with a 12% drop in the faithfulness score."
The Three Pillars for LLM Systems
Standard observability (metrics, logs, traces) applies to LLM systems with AI-specific additions.
What to Measure: The Four Golden Signals for AI
The classic four golden signals (latency, traffic, errors, saturation) apply — with AI-specific extensions:
1. Latency
Standard latency plus LLM-specific breakdown:
- TTFT (Time to First Token): the user-experience signal
- Total generation time: full response latency
- Retrieval latency (for RAG): tracked separately from LLM latency
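import time
from openai import OpenAI
from opentelemetry import trace

tracer = trace.get_tracer(__name__)
client = OpenAI()  # reads OPENAI_API_KEY from the environment

@tracer.start_as_current_span("llm_call")
def call_llm(prompt: str, context: str) -> str:
    span = trace.get_current_span()
    start = time.perf_counter()  # monotonic clock, safe for measuring durations
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            # Assumption: retrieved context is passed as a system message
            {"role": "system", "content": context},
            {"role": "user", "content": prompt},
        ],
    )
    latency_ms = (time.perf_counter() - start) * 1000
    # Record as span attributes
    span.set_attribute("llm.latency_ms", latency_ms)
    span.set_attribute("llm.input_tokens", response.usage.prompt_tokens)
    span.set_attribute("llm.output_tokens", response.usage.completion_tokens)
    span.set_attribute("llm.model", "gpt-4o")
    return response.choices[0].message.content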
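The call_llm function above records total latency only. TTFT can only be measured on a streaming response, because the clock has to stop when the first chunk arrives. The following is a minimal sketch under stated assumptions: `measure_stream` and `StreamTiming` are hypothetical names (not part of any SDK), and `open_stream` is assumed to be a zero-argument callable that issues the request and returns an iterator of text chunks.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StreamTiming:
    text: str
    ttft_ms: Optional[float]  # None if the stream produced no chunks
    total_ms: float


def measure_stream(open_stream: Callable[[], Iterable[str]]) -> StreamTiming:
    """Issue a streaming request and time the first chunk and the full response."""
    start = time.perf_counter()  # start before the request is issued
    ttft_ms = None
    parts = []
    for chunk in open_stream():
        if ttft_ms is None:
            # First chunk received: this is the time to first token
            ttft_ms = (time.perf_counter() - start) * 1000
        parts.append(chunk)
    total_ms = (time.perf_counter() - start) * 1000
    return StreamTiming("".join(parts), ttft_ms, total_ms)
```

Both values could then be attached to the current span next to llm.latency_ms, for example as llm.ttft_ms (an attribute name chosen here for illustration).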
2. Traffic + Cost
LLM-specific traffic metrics: tokens per minute and cost per request.
Track these as counters so you can alert on cost spikes before they become billing surprises:
from prometheus_client import Counter

# Prometheus counters
input_tokens_total = Counter('llm_input_tokens_total', 'Total input tokens', ['model', 'endpoint'])
output_tokens_total = Counter('llm_output_tokens_total', 'Total output tokens', ['model', 'endpoint'])
cost_dollars_total = Counter('llm_cost_dollars_total', 'Total cost in USD', ['model', 'endpoint'])

# Example prices; provider pricing changes, so check the current price list
GPT4O_INPUT_COST = 5.0 / 1_000_000   # $5 per 1M input tokens
GPT4O_OUTPUT_COST = 15.0 / 1_000_000  # $15 per 1M output tokens

def record_usage(model: str, endpoint: str, input_tokens: int, output_tokens: int):
    input_tokens_total.labels(model=model, endpoint=endpoint).inc(input_tokens)
    output_tokens_total.labels(model=model, endpoint=endpoint).inc(output_tokens)
    # Note: GPT-4o prices are applied whatever `model` is; use a per-model
    # price table if this function is called for more than one model
    cost = (input_tokens * GPT4O_INPUT_COST) + (output_tokens * GPT4O_OUTPUT_COST)
    cost_dollars_total.labels(model=model, endpoint=endpoint).inc(cost)
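record_usage above prices every request at GPT-4o rates, whatever the model label says. One way around that is a per-model price table. The sketch below is illustrative only: `PRICES_USD_PER_MILLION` and `request_cost` are hypothetical names, and the single entry reuses the example prices from this guide rather than a verified price list.

```python
from typing import Dict, Tuple

# (input price, output price) in USD per 1M tokens.
# Assumption: example figures from this guide; replace with current provider pricing.
PRICES_USD_PER_MILLION: Dict[str, Tuple[float, float]] = {
    "gpt-4o": (5.0, 15.0),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the cost of one request in USD, priced by model."""
    if model not in PRICES_USD_PER_MILLION:
        # Fail loudly rather than silently pricing an unknown model at zero
        raise KeyError(f"no price configured for model {model!r}")
    input_price, output_price = PRICES_USD_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000
```

Raising on an unknown model is a deliberate choice in this sketch: a missing price shows up as an error instead of as an under-reported cost counter.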
3. Errors
Standard HTTP errors plus LLM-specific failure modes:
- Provider errors (429 rate limit, 500 server error)
- Quality errors (hallucination detected, output validation failed)
- Safety violations (content filtered, injection detected)
Log every error with full context: prompt hash, user ID, endpoint, model version.
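A minimal sketch of such an error record, assuming the three failure categories listed above and a truncated SHA-256 digest as the prompt hash. `build_error_record`, `log_llm_error`, and the field names are hypothetical choices for illustration, not an established schema.

```python
import hashlib
import json
import logging

logger = logging.getLogger("llm_errors")

# Categories taken from the failure modes listed above
ERROR_CATEGORIES = {"provider", "quality", "safety"}


def build_error_record(category: str, detail: str, prompt: str,
                       user_id: str, endpoint: str, model_version: str) -> dict:
    """Build a structured error record that identifies the prompt without storing it."""
    if category not in ERROR_CATEGORIES:
        raise ValueError(f"unknown error category: {category!r}")
    # Hash the prompt so identical prompts can be correlated without logging raw text
    prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]
    return {
        "category": category,
        "detail": detail,
        "prompt_hash": prompt_hash,
        "user_id": user_id,
        "endpoint": endpoint,
        "model_version": model_version,
    }


def log_llm_error(**fields: str) -> None:
    """Emit the record as a single JSON line at ERROR level."""
    logger.error(json.dumps(build_error_record(**fields)))
```

Logging the hash rather than the prompt keeps user text out of the log pipeline while still letting you group errors caused by the same input.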
4. Saturation
For LLM systems: context window utilisation and queue depth.
Alert when average context utilisation exceeds 80% — you're approaching the limit where truncation silently degrades quality.
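The 80% rule can be sketched as a simple ratio of prompt tokens to the model's context window. This is an illustration under assumptions: the function names are hypothetical, and the 128,000-token window for gpt-4o is an assumed value that should be checked against the provider's model documentation.

```python
from statistics import mean
from typing import Sequence

# Assumed context window sizes in tokens; verify against provider documentation.
CONTEXT_WINDOW_TOKENS = {"gpt-4o": 128_000}


def context_utilisation(model: str, input_tokens: int) -> float:
    """Fraction of the model's context window consumed by the prompt."""
    return input_tokens / CONTEXT_WINDOW_TOKENS[model]


def utilisation_alert(samples: Sequence[float], threshold: float = 0.80) -> bool:
    """True when average utilisation across recent requests exceeds the threshold."""
    return bool(samples) and mean(samples) > threshold
```

In practice the per-request value would be exported as a metric and the averaging left to the alerting system; the second function only shows the arithmetic behind the rule.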
Structured Logging for LLM Calls
Every LLM call should produce a JSON log entry:
{
  "timestamp": "2026-06-01T14:23:01.456Z",
  "trace_id": "abc123",
  "span_id": "def456",
  "service": "support-bot",
  "endpoint": "/api/support/answer",
  "user_id": "u-789",
  "tenant_id": "acme-corp",
  "model": "gpt-4o",
  "input_tokens": 1847,
  "output_tokens": 312,
  "latency_ms": 1240,
  "cost_usd": 0.0139,
  "cache_hit": false,
  "retrieval_chunks": 5,
  "faithfulness_score": 0.94,
  "answer_relevance_score": 0.88
}
The last two fields (faithfulness, answer relevance) require an automated eval step — expensive but worth it for high-stakes endpoints.
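One way to produce entries of this shape is a JSON formatter on the standard logging module. The sketch below is one possible arrangement, not a prescribed one: `JsonFormatter`, `make_llm_logger`, and the convention of passing per-call fields under an `llm` key in `extra` are all assumptions made for illustration.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            # ISO 8601 in UTC with millisecond precision and a trailing Z
            "timestamp": created.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Per-call fields (model, tokens, cost, ...) passed via extra={"llm": {...}}
        entry.update(getattr(record, "llm", {}))
        return json.dumps(entry)


def make_llm_logger(stream) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("llm_calls")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated setup
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

A call might then look like `log.info("llm_call", extra={"llm": {"model": "gpt-4o", "input_tokens": 1847}})`, with trace_id and span_id added from the active span.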
Grafana Dashboard Layout
Build your LLM observability dashboard with these panels:
Row 1 — Real-time health:
- Requests/min by endpoint
- p50/p95/p99 latency
- Error rate %
- Active users
Row 2 — Cost:
- Total cost today vs. yesterday
- Cost per request trend (last 7 days)
- Token consumption breakdown (input vs. output)
- Projected monthly cost
Row 3 — Quality (if eval pipeline exists):
- Faithfulness score trend
- Answer relevance trend
- Cache hit rate
- Context window utilisation
Row 4 — Infrastructure:
- Provider error rate by model
- Retry rate
- Circuit breaker state
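The "Projected monthly cost" panel in Row 2 needs a projection rule. A minimal sketch, assuming a plain linear extrapolation of month-to-date spend (a simplification that ignores weekday and seasonal patterns); `projected_monthly_cost` is a hypothetical helper name.

```python
import calendar
from datetime import date


def projected_monthly_cost(month_to_date_usd: float, today: date) -> float:
    """Linearly extrapolate month-to-date spend to the end of the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    days_elapsed = today.day  # treats today as a full day of spend
    return month_to_date_usd / days_elapsed * days_in_month
```

For example, $150 spent by 15 June projects to $300 for the 30-day month.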
Alerting Rules
# Alert: Cost spike (sum() aggregates across model and endpoint labels)
- alert: LLMCostSpike
  expr: sum(rate(llm_cost_dollars_total[5m])) * 3600 > 50
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "LLM hourly cost exceeds $50 (current: {{ $value | humanize }}/hr)"
# Alert: Latency regression
- alert: LLMLatencyHigh
  expr: histogram_quantile(0.95, sum by (le) (rate(llm_latency_ms_bucket[5m]))) > 3000
  for: 3m
  labels:
    severity: critical
  annotations:
    summary: "p95 LLM latency above 3s"
# Alert: Error rate
- alert: LLMHighErrorRate
  expr: sum(rate(llm_errors_total[5m])) / sum(rate(llm_requests_total[5m])) > 0.05
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "LLM error rate above 5%"
# Alert: Quality degradation (requires eval scores)
- alert: LLMQualityDegradation
  expr: avg_over_time(llm_faithfulness_score[1h]) < 0.75
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Average faithfulness score below 0.75 for 30 minutes"
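The LLMCostSpike expression relies on rate() returning a per-second increase, which is then multiplied by 3600 to get dollars per hour. The arithmetic can be sketched in a few lines. This is a simplified illustration with hypothetical function names: unlike Prometheus, it does not handle counter resets or extrapolate to the window boundaries.

```python
def hourly_cost_rate(first_sample_usd: float, last_sample_usd: float,
                     window_seconds: float) -> float:
    """Convert a counter increase over a window into USD per hour."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    increase = last_sample_usd - first_sample_usd
    # Per-second rate scaled to one hour
    return increase * 3600 / window_seconds


def cost_spike(first_sample_usd: float, last_sample_usd: float,
               window_seconds: float, threshold_usd_per_hour: float = 50.0) -> bool:
    """Mirror the alert condition: hourly rate above the threshold."""
    return hourly_cost_rate(first_sample_usd, last_sample_usd, window_seconds) > threshold_usd_per_hour
```

A counter that grows by $5 in a 5-minute window corresponds to $60 per hour, which is above the $50 threshold used in the rule.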
Incident Response for LLM Degradation
When an alert fires, follow this runbook:
- Identify scope — which endpoints, which models, which tenants?
- Check provider status — status.openai.com, status.anthropic.com
- Review recent deployments — did a prompt change ship in the last 30 minutes?
- Check cost anomalies — a spike in tokens often precedes a quality issue
- Activate fallback — switch to secondary model if primary provider is degraded
- Capture a trace — sample 10 recent requests from the affected endpoint
- Post-mortem — document root cause, timeline, and prevention within 24 hours
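The "activate fallback" step is easier to execute under pressure if the switch is already automated. Below is a minimal circuit-breaker sketch, not a production implementation: `CircuitBreaker` and `call_with_fallback` are hypothetical names, `primary` and `fallback` are assumed to be callables that take a prompt and return text, and the broad `except Exception` stands in for whatever provider-specific errors your client raises.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Open after N consecutive failures; allow a retry after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        if self._clock() - self._opened_at >= self.reset_after_s:
            # Cool-down elapsed: close the breaker so the primary is retried
            self._opened_at = None
            self._failures = 0
            return False
        return True

    def record_success(self) -> None:
        self._failures = 0

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


def call_with_fallback(prompt: str, primary: Callable[[str], str],
                       fallback: Callable[[str], str], breaker: CircuitBreaker) -> str:
    """Try the primary model unless the breaker is open; otherwise use the fallback."""
    if not breaker.is_open:
        try:
            result = primary(prompt)
            breaker.record_success()
            return result
        except Exception:  # assumption: narrow this to provider errors in real code
            breaker.record_failure()
    return fallback(prompt)
```

The injectable clock exists so the cool-down behaviour can be tested without waiting, which is what makes it practical to rehearse the fallback before a real incident.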
Key Takeaways
- You cannot improve what you cannot measure — every LLM endpoint needs latency, token, cost, and error metrics from day one
- The four golden signals for AI: Latency + TTFT, Traffic + Cost, Errors + Quality, Saturation + Context utilisation
- Structured JSON logs with trace_id + user_id + tokens + cost + quality scores enable every investigation
- Alert on cost trends, not just absolute values — a growing cost/request trend at 3am is more dangerous than a known peak
- Quality metrics (faithfulness, relevance) require an eval pipeline, but they are the most direct measure of whether an LLM feature is still producing good answers
- Separate retrieval latency from LLM latency in traces — they degrade for completely different reasons and require different fixes
- Keep a fallback model ready — circuit breakers + model switching should be tested before you need them in a real incident
Practice Exercises
Exercise 1 — Starter (2 hours): Instrument an existing LLM application with OpenTelemetry. Add span attributes for input_tokens, output_tokens, latency_ms, and model. Export to a local Jaeger instance and verify you can see the trace. Calculate the cost of each request from the span attributes.
Exercise 2 — Intermediate (half day): Build a Grafana dashboard from scratch with 8 panels covering the four golden signals. Add a Prometheus alert rule for cost spikes (>$50/hour) and p95 latency (>3s). Simulate the alert conditions locally and verify the alert fires within 2 minutes.
Exercise 3 — Advanced (full day): Implement a weekly quality report. Set up RAGAS to score faithfulness and answer relevance on a sample of 50 production queries per week (using stored logs). Build a Grafana panel that shows the quality trend over the last 8 weeks. Write an automated alert that sends a Slack notification if the weekly average drops more than 5% from the prior week.
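As a starting point for the alert condition in Exercise 3, the week-over-week comparison can be sketched as below. It assumes "drops more than 5%" means a relative drop in the weekly mean score; the function names are hypothetical, and the Slack notification itself is left out.

```python
from statistics import mean
from typing import Sequence


def relative_drop(prior_scores: Sequence[float], current_scores: Sequence[float]) -> float:
    """Relative decrease of the current weekly mean versus the prior weekly mean."""
    prior_avg = mean(prior_scores)
    current_avg = mean(current_scores)
    if prior_avg == 0:
        raise ValueError("prior weekly average is zero; relative drop is undefined")
    return (prior_avg - current_avg) / prior_avg


def should_notify(prior_scores: Sequence[float], current_scores: Sequence[float],
                  threshold: float = 0.05) -> bool:
    """True when the weekly average fell by more than the threshold (5% by default)."""
    return relative_drop(prior_scores, current_scores) > threshold
```

A negative return value from relative_drop means quality improved, so only positive drops above the threshold trigger a notification.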