Observability for LLM Systems in Production

What you'll learn: By the end of this guide you will be able to instrument every LLM call with OpenTelemetry, build a cost and latency dashboard in Grafana, define and alert on the four golden signals for AI systems, and run a structured incident response process when your LLM feature degrades in production.

You cannot improve what you cannot measure. LLM systems are notoriously opaque — a response that looks correct might be hallucinated; a latency spike might be model provider instability or a prompt injection attack. Observability is what turns "something feels wrong" into "the p95 latency on the /support endpoint increased 340% at 14:23 UTC and correlates with a 12% drop in the faithfulness score."

The Three Pillars for LLM Systems

Standard observability (metrics, logs, traces) applies to LLM systems with AI-specific additions.

graph TD
    subgraph Instrumentation["Instrumentation Layer"]
        APP[Your LLM Application]
        OT[OpenTelemetry SDK]
        APP --> OT
    end
    subgraph Collection["Collection"]
        OTELC[OTel Collector]
        OT --> OTELC
    end
    subgraph Backends["Observability Backends"]
        PROM[Prometheus\nMetrics]
        JAEGER[Jaeger / Tempo\nTraces]
        LOKI[Loki\nLogs]
        OTELC --> PROM
        OTELC --> JAEGER
        OTELC --> LOKI
    end
    subgraph Visualisation["Visualisation + Alerting"]
        GRAFANA[Grafana Dashboard]
        ALERT[Alert Manager]
        PROM --> GRAFANA
        JAEGER --> GRAFANA
        LOKI --> GRAFANA
        GRAFANA --> ALERT
    end
    style Instrumentation fill:#0d2d3a,stroke:#06b6d4,color:#fff
    style Backends fill:#1a0d2d,stroke:#9333ea,color:#fff

What to Measure: The Four Golden Signals for AI

The classic four golden signals (latency, traffic, errors, saturation) apply — with AI-specific extensions:

1. Latency

Standard latency plus LLM-specific breakdown:

TTFT (Time to First Token) — user experience signal
Total generation time — full response latency
Retrieval latency (for RAG) — separately from LLM latency

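import time

from openai import OpenAI
from opentelemetry import trace

client = OpenAI()  # reads OPENAI_API_KEY from the environment
tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("llm_call")
def call_llm(prompt: str, context: str, model: str = "gpt-4o") -> str:
    span = trace.get_current_span()

    # perf_counter is monotonic, so system clock adjustments cannot skew the measurement
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[
            # Assumption: the context is passed to the model as a system message
            {"role": "system", "content": context},
            {"role": "user", "content": prompt},
        ],
    )
    latency_ms = (time.perf_counter() - start) * 1000

    # Record as span attributes
    span.set_attribute("llm.latency_ms", latency_ms)
    span.set_attribute("llm.input_tokens", response.usage.prompt_tokens)
    span.set_attribute("llm.output_tokens", response.usage.completion_tokens)
    span.set_attribute("llm.model", model)

    return response.choices[0].message.content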
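
The snippet above records total latency only. TTFT needs a streaming response, with the clock stopped at the first non-empty chunk. Here is a minimal sketch, assuming your provider SDK can hand you an iterable of text chunks. `measure_stream`, `StreamTiming`, and the `open_stream` callable are hypothetical names for illustration, not part of any SDK.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StreamTiming:
    text: str
    ttft_ms: Optional[float]  # None if the stream produced no text at all
    total_ms: float


def measure_stream(open_stream: Callable[[], Iterable[str]]) -> StreamTiming:
    """Open a stream and consume it, timing the first token and the full response.

    open_stream is called after the timer starts, so request setup time
    counts towards TTFT, which is what the user actually experiences.
    """
    start = time.perf_counter()
    ttft_ms = None
    parts = []
    for chunk in open_stream():
        if not chunk:
            continue  # skip empty chunks (for example role-only or keep-alive events)
        if ttft_ms is None:
            ttft_ms = (time.perf_counter() - start) * 1000
        parts.append(chunk)
    total_ms = (time.perf_counter() - start) * 1000
    return StreamTiming("".join(parts), ttft_ms, total_ms)
```

You could then record `ttft_ms` next to the total, for example as an `llm.ttft_ms` span attribute (an attribute name chosen here to match the ones above).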

2. Traffic + Cost

LLM-specific traffic metric: tokens/minute and cost/request.

Track these as counters so you can alert on cost spikes before they become billing surprises:

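from prometheus_client import Counter

# Prometheus counters
input_tokens_total = Counter('llm_input_tokens_total', 'Total input tokens', ['model', 'endpoint'])
output_tokens_total = Counter('llm_output_tokens_total', 'Total output tokens', ['model', 'endpoint'])
cost_dollars_total = Counter('llm_cost_dollars_total', 'Total cost in USD', ['model', 'endpoint'])

# Example rates; provider pricing changes, so load these from config in practice
GPT4O_INPUT_COST  = 5.0   / 1_000_000   # $5 per 1M input tokens
GPT4O_OUTPUT_COST = 15.0  / 1_000_000   # $15 per 1M output tokens

# USD per token as (input, output), keyed by model so other models are not billed at GPT-4o rates
PRICING = {"gpt-4o": (GPT4O_INPUT_COST, GPT4O_OUTPUT_COST)}

def record_usage(model: str, endpoint: str, input_tokens: int, output_tokens: int):
    input_tokens_total.labels(model=model, endpoint=endpoint).inc(input_tokens)
    output_tokens_total.labels(model=model, endpoint=endpoint).inc(output_tokens)
    if model not in PRICING:
        return  # unknown model: tokens are still counted, cost is skipped rather than guessed
    input_cost, output_cost = PRICING[model]
    cost = (input_tokens * input_cost) + (output_tokens * output_cost)
    cost_dollars_total.labels(model=model, endpoint=endpoint).inc(cost)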

3. Errors

Standard HTTP errors plus LLM-specific failure modes:

Provider errors (429 rate limit, 500 server error)
Quality errors (hallucination detected, output validation failed)
Safety violations (content filtered, injection detected)

Log every error with full context: prompt hash, user ID, endpoint, model version.
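
As a rough sketch of what "full context" can look like, the helpers below sort a failure into one of the three categories above and build a log entry that stores a hash of the prompt rather than the prompt itself. The function names, field names, and category strings are assumptions made for this example.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def classify_error(status_code: Optional[int] = None,
                   validation_failed: bool = False,
                   safety_flagged: bool = False) -> str:
    """Map a failure to one of the categories listed above."""
    if safety_flagged:
        return "safety"
    if validation_failed:
        return "quality"
    if status_code == 429 or (status_code is not None and status_code >= 500):
        return "provider"
    return "other"


def error_record(prompt: str, user_id: str, endpoint: str,
                 model_version: str, category: str, detail: str) -> dict:
    """Build a JSON-serialisable error entry; the prompt is stored as a hash only."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": "error",
        "error_category": category,
        "detail": detail,
        # The hash lets you group identical prompts without logging their content
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "user_id": user_id,
        "endpoint": endpoint,
        "model_version": model_version,
    }
```

Hashing keeps raw prompts out of your logs while still letting you group failures caused by the same prompt.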

4. Saturation

For LLM systems: context window utilisation and queue depth.

Alert when average context utilisation exceeds 80% — you're approaching the limit where truncation silently degrades quality.
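
One way to compute the utilisation figure, sketched under the assumption that you count the prompt plus the output budget you reserve against the model's window. The function names are hypothetical, and the 128,000-token window in the usage note is an assumed value you should replace with your model's documented limit.

```python
from typing import List


def context_utilisation(input_tokens: int, max_output_tokens: int, context_window: int) -> float:
    """Fraction of the context window a request can occupy (prompt plus reserved output)."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    return (input_tokens + max_output_tokens) / context_window


def should_alert(utilisations: List[float], threshold: float = 0.80) -> bool:
    """True when the average utilisation over a window of requests exceeds the threshold."""
    if not utilisations:
        return False  # no traffic in the window, nothing to alert on
    return sum(utilisations) / len(utilisations) > threshold
```

For example, a 100,000-token prompt with 4,000 tokens reserved for output in an assumed 128,000-token window gives 104,000 / 128,000 = 0.8125, which is already past the 80% line.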

Structured Logging for LLM Calls

Every LLM call should produce a JSON log entry:

{
  "timestamp": "2026-06-01T14:23:01.456Z",
  "trace_id": "abc123",
  "span_id": "def456",
  "service": "support-bot",
  "endpoint": "/api/support/answer",
  "user_id": "u-789",
  "tenant_id": "acme-corp",
  "model": "gpt-4o",
  "input_tokens": 1847,
  "output_tokens": 312,
  "latency_ms": 1240,
  "cost_usd": 0.0139,
  "cache_hit": false,
  "retrieval_chunks": 5,
  "faithfulness_score": 0.94,
  "answer_relevance_score": 0.88
}

The last two fields (faithfulness, answer relevance) require an automated eval step — expensive but worth it for high-stakes endpoints.
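
A minimal sketch of emitting an entry of this shape with only the standard library. `log_llm_call`, the `REQUIRED` set, and the logger name are assumptions for illustration. The eval scores are passed separately because only some endpoints will have them.

```python
import json
import logging
import sys
from datetime import datetime, timezone
from typing import Optional

logger = logging.getLogger("llm_calls")
logger.setLevel(logging.INFO)
if not logger.handlers:
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(logging.Formatter("%(message)s"))  # the message is already JSON
    logger.addHandler(handler)

# Fields every entry must carry so that logs can be joined with traces and cost data
REQUIRED = {"trace_id", "span_id", "endpoint", "model",
            "input_tokens", "output_tokens", "latency_ms"}


def log_llm_call(fields: dict, scores: Optional[dict] = None) -> str:
    """Serialise one LLM call as a single JSON line and send it to the logger."""
    missing = REQUIRED - fields.keys()
    if missing:
        raise ValueError(f"missing log fields: {sorted(missing)}")
    entry = {
        # UTC timestamp with millisecond precision and a trailing Z
        "timestamp": datetime.now(timezone.utc)
        .isoformat(timespec="milliseconds")
        .replace("+00:00", "Z"),
        **fields,
    }
    if scores:  # eval scores are optional and often filled in by a later eval step
        entry.update(scores)
    line = json.dumps(entry)
    logger.info(line)
    return line
```

Failing loudly on a missing `trace_id` or token count is a deliberate choice here: a log line you cannot join to a trace is of little use during an incident.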

Grafana Dashboard Layout

Build your LLM observability dashboard with these panels:

Row 1 — Real-time health:

Requests/min by endpoint
p50/p95/p99 latency
Error rate %
Active users

Row 2 — Cost:

Total cost today vs. yesterday
Cost per request trend (last 7 days)
Token consumption breakdown (input vs. output)
Projected monthly cost

Row 3 — Quality (if eval pipeline exists):

Faithfulness score trend
Answer relevance trend
Cache hit rate
Context window utilisation

Row 4 — Infrastructure:

Provider error rate by model
Retry rate
Circuit breaker state

Alerting Rules

# Alert: Cost spike
- alert: LLMCostSpike
  # sum() aggregates across the model and endpoint labels so the rule tracks total spend
  expr: sum(rate(llm_cost_dollars_total[5m])) * 3600 > 50
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "LLM hourly cost exceeds $50 (current: ${{ $value | humanize }}/hr)"

# Alert: Latency regression
- alert: LLMLatencyHigh
  expr: histogram_quantile(0.95, rate(llm_latency_ms_bucket[5m])) > 3000
  for: 3m
  labels:
    severity: critical
  annotations:
    summary: "p95 LLM latency above 3s"

# Alert: Error rate
- alert: LLMHighErrorRate
  expr: rate(llm_errors_total[5m]) / rate(llm_requests_total[5m]) > 0.05
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "LLM error rate above 5%"

# Alert: Quality degradation (requires eval scores)
- alert: LLMQualityDegradation
  expr: avg_over_time(llm_faithfulness_score[1h]) < 0.75
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Average faithfulness score below 0.75 for 30 minutes"
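
The latency rule relies on `histogram_quantile`, which estimates p95 from bucket counts rather than from raw samples. The pure-Python sketch below is a simplified illustration of that estimate (linear interpolation inside the bucket that contains the target rank). It is not Prometheus's implementation, and the bucket boundaries in the usage note are made up for the example.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with (math.inf, total_count),
    mirroring the `le` buckets of a Prometheus histogram.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0  # the first bucket is assumed to start at 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound  # rank falls in +Inf: report the highest finite bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Assume observations are spread evenly inside the bucket
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper_bound, count
    return lower_bound
```

With cumulative buckets `[(500, 40), (1000, 70), (2000, 90), (4000, 98), (math.inf, 100)]`, the 95th observation lands in the 2000 to 4000 ms bucket and the estimate is 2000 + 2000 × 5/8 = 3250 ms, which would trip the 3000 ms threshold. The estimate is only as fine as your buckets, so place a bucket boundary near any threshold you alert on.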

Incident Response for LLM Degradation

When an alert fires, follow this runbook:

Identify scope — which endpoints, which models, which tenants?
Check provider status — status.openai.com, status.anthropic.com
Review recent deployments — did a prompt change ship in the last 30 minutes?
Check cost anomalies — a spike in tokens often precedes a quality issue
Activate fallback — switch to secondary model if primary provider is degraded
Capture a trace — sample 10 recent requests from the affected endpoint
Post-mortem — document root cause, timeline, and prevention within 24 hours
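
The "activate fallback" step is easier to trust if it is automated and tested ahead of time. Below is a minimal circuit-breaker sketch, assuming `primary` and `secondary` are callables that wrap your two model clients. The class, its thresholds, and the function names are hypothetical, not taken from a specific library.

```python
import time
from typing import Callable, Optional, Tuple


class CircuitBreaker:
    """Opens after consecutive failures, then allows a trial call after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock  # injectable so tests do not need to sleep
        self._failures = 0
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        # After the cool-down the breaker is half-open: one trial call is let through
        return self._clock() - self._opened_at < self.reset_after_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open, or re-open after a failed trial


def call_with_fallback(prompt: str,
                       primary: Callable[[str], str],
                       secondary: Callable[[str], str],
                       breaker: CircuitBreaker) -> Tuple[str, str]:
    """Return (answer, model_used); use primary unless its breaker is open or it fails."""
    if not breaker.is_open():
        try:
            answer = primary(prompt)
            breaker.record_success()
            return answer, "primary"
        except Exception:
            breaker.record_failure()
    return secondary(prompt), "secondary"
```

Returning which model answered makes it straightforward to label metrics and logs with the model that actually served the request, so a fallback shows up on the dashboard rather than hiding behind a healthy error rate.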

Key Takeaways

You cannot improve what you cannot measure — every LLM endpoint needs latency, token, cost, and error metrics from day one
The four golden signals for AI: Latency + TTFT, Traffic + Cost, Errors + Quality, Saturation + Context utilisation
Structured JSON logs with trace_id + user_id + tokens + cost + quality scores enable every investigation
Alert on cost trends, not just absolute values — a growing cost/request trend at 3am is more dangerous than a known peak
Quality metrics (faithfulness, relevance) require an eval pipeline, but they are the most direct measure of whether responses are actually correct; latency and error rate cannot tell you that
Separate retrieval latency from LLM latency in traces — they degrade for completely different reasons and require different fixes
Keep a fallback model ready — circuit breakers + model switching should be tested before you need them in a real incident

Practice Exercises

Exercise 1 — Starter (2 hours): Instrument an existing LLM application with OpenTelemetry. Add span attributes for input_tokens, output_tokens, latency_ms, and model. Export to a local Jaeger instance and verify you can see the trace. Calculate the cost of each request from the span attributes.

Exercise 2 — Intermediate (half day): Build a Grafana dashboard from scratch with 8 panels covering the four golden signals. Add a Prometheus alert rule for cost spikes (>$50/hour) and p95 latency (>3s). Simulate the alert conditions locally and verify the alert fires within 2 minutes.

Exercise 3 — Advanced (full day): Implement a weekly quality report. Set up RAGAS to score faithfulness and answer relevance on a sample of 50 production queries per week (using stored logs). Build a Grafana panel that shows the quality trend over the last 8 weeks. Write an automated alert that sends a Slack notification if the weekly average drops more than 5% from the prior week.
