LLMOps
    June 2, 2026

    Observability for LLM Systems in Production

    How to instrument, monitor, and alert on LLM apps — distributed tracing, cost dashboards, quality metrics, and incident response for AI systems.


    What you'll learn: By the end of this guide you will be able to instrument every LLM call with OpenTelemetry, build a cost and latency dashboard in Grafana, define and alert on the four golden signals for AI systems, and run a structured incident response process when your LLM feature degrades in production.

    You cannot improve what you cannot measure. LLM systems are notoriously opaque — a response that looks correct might be hallucinated; a latency spike might be model provider instability or a prompt injection attack. Observability is what turns "something feels wrong" into "the p95 latency on the /support endpoint increased 340% at 14:23 UTC and correlates with a 12% drop in the faithfulness score."

    The Three Pillars for LLM Systems

    Standard observability (metrics, logs, traces) applies to LLM systems with AI-specific additions.

    graph TD
        subgraph Instrumentation["Instrumentation Layer"]
            APP[Your LLM Application]
            OT[OpenTelemetry SDK]
            APP --> OT
        end
        subgraph Collection["Collection"]
            OTELC[OTel Collector]
            OT --> OTELC
        end
        subgraph Backends["Observability Backends"]
            PROM[Prometheus\nMetrics]
            JAEGER[Jaeger / Tempo\nTraces]
            LOKI[Loki\nLogs]
            OTELC --> PROM
            OTELC --> JAEGER
            OTELC --> LOKI
        end
        subgraph Visualisation["Visualisation + Alerting"]
            GRAFANA[Grafana Dashboard]
            ALERT[Alert Manager]
            PROM --> GRAFANA
            JAEGER --> GRAFANA
            LOKI --> GRAFANA
            GRAFANA --> ALERT
        end
        style Instrumentation fill:#0d2d3a,stroke:#06b6d4,color:#fff
        style Backends fill:#1a0d2d,stroke:#9333ea,color:#fff

    What to Measure: The Four Golden Signals for AI

    The classic four golden signals (latency, traffic, errors, saturation) apply — with AI-specific extensions:

    1. Latency

    Standard latency plus LLM-specific breakdown:

    • TTFT (Time to First Token) — user experience signal
    • Total generation time — full response latency
    • Retrieval latency (for RAG), tracked separately from LLM latency
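
    import time

    from openai import OpenAI
    from opentelemetry import trace

    tracer = trace.get_tracer(__name__)
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    MODEL = "gpt-4o"

    @tracer.start_as_current_span("llm_call")
    def call_llm(prompt: str, context: str) -> str:
        span = trace.get_current_span()

        # perf_counter is monotonic, so wall-clock adjustments cannot skew the measurement
        start = time.perf_counter()
        response = client.chat.completions.create(
            model=MODEL,
            messages=[
                # Assumption: retrieved context is passed as a system message
                {"role": "system", "content": f"Answer using this context:\n{context}"},
                {"role": "user", "content": prompt},
            ],
        )
        latency_ms = (time.perf_counter() - start) * 1000

        # Record as span attributes
        span.set_attribute("llm.latency_ms", latency_ms)
        span.set_attribute("llm.input_tokens", response.usage.prompt_tokens)
        span.set_attribute("llm.output_tokens", response.usage.completion_tokens)
        span.set_attribute("llm.model", MODEL)

        return response.choices[0].message.content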
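
The function above records total latency only. TTFT needs a streamed response, because the first token is only observable as the first chunk arrives. The following is a minimal sketch, assuming the response is available as an iterable of text chunks. The helper name `measure_stream` and the injectable `clock` parameter are illustrative, not part of any SDK.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[str, float, float]:
    """Consume a stream of text chunks and time it.

    Returns (full_text, ttft_ms, total_ms). ttft_ms is the delay until the
    first non-empty chunk; it equals total_ms if every chunk is empty.
    """
    start = clock()
    first_token_at = None
    parts = []
    for chunk in chunks:
        # Ignore empty chunks (e.g. role-only deltas) when timing the first token
        if chunk and first_token_at is None:
            first_token_at = clock()
        parts.append(chunk)
    end = clock()
    if first_token_at is None:
        first_token_at = end
    ttft_ms = (first_token_at - start) * 1000
    total_ms = (end - start) * 1000
    return "".join(parts), ttft_ms, total_ms
```

With the OpenAI SDK, the chunks would typically come from a call made with `stream=True`, yielding each chunk's `choices[0].delta.content` (or an empty string when it is missing). The two timings can then be set as span attributes next to `llm.latency_ms`, for example under a name such as `llm.ttft_ms`.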

    2. Traffic + Cost

    LLM-specific traffic metrics: tokens per minute and cost per request.

    Track these as counters so you can alert on cost spikes before they become billing surprises:

    from prometheus_client import Counter

    # Prometheus counters
    input_tokens_total = Counter('llm_input_tokens_total', 'Total input tokens', ['model', 'endpoint'])
    output_tokens_total = Counter('llm_output_tokens_total', 'Total output tokens', ['model', 'endpoint'])
    cost_dollars_total = Counter('llm_cost_dollars_total', 'Total cost in USD', ['model', 'endpoint'])

    # USD per token, keyed by model. Illustrative rates; check your provider's current price list.
    PRICES = {
        "gpt-4o": {"input": 5.0 / 1_000_000, "output": 15.0 / 1_000_000},  # $5 / $15 per 1M tokens
    }

    def record_usage(model: str, endpoint: str, input_tokens: int, output_tokens: int):
        # Look up the rate for this model; a KeyError is better than silently mis-costing it
        price = PRICES[model]
        input_tokens_total.labels(model=model, endpoint=endpoint).inc(input_tokens)
        output_tokens_total.labels(model=model, endpoint=endpoint).inc(output_tokens)
        cost = (input_tokens * price["input"]) + (output_tokens * price["output"])
        cost_dollars_total.labels(model=model, endpoint=endpoint).inc(cost)

    3. Errors

    Standard HTTP errors plus LLM-specific failure modes:

    • Provider errors (429 rate limit, 500 server error)
    • Quality errors (hallucination detected, output validation failed)
    • Safety violations (content filtered, injection detected)

    Log every error with full context: prompt hash, user ID, endpoint, model version.
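
One way to build such an error record is sketched below. It hashes the prompt rather than storing it, so identical prompts can be correlated without writing user text to the logs. The function name `build_error_record`, the field names, and the 16-character hash truncation are assumptions made for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_error_record(prompt: str, user_id: str, endpoint: str,
                       model: str, error_type: str, detail: str = "") -> dict:
    """Return a JSON-serialisable error record that never contains the raw prompt."""
    # A truncated SHA-256 digest is enough to group repeated prompts
    prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": "error",
        "error_type": error_type,  # e.g. "provider_rate_limit", "validation_failed"
        "detail": detail,
        "prompt_hash": prompt_hash,
        "user_id": user_id,
        "endpoint": endpoint,
        "model": model,
    }


if __name__ == "__main__":
    record = build_error_record(
        "How do I reset my password?", "u-789",
        "/api/support/answer", "gpt-4o", "provider_rate_limit", "HTTP 429",
    )
    print(json.dumps(record))
```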

    4. Saturation

    For LLM systems: context window utilisation and queue depth.

    Alert when average context utilisation exceeds 80% — you're approaching the limit where truncation silently degrades quality.
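
A rough sketch of the utilisation calculation follows. It assumes utilisation is defined as input tokens plus the output budget reserved for the response, divided by the model's context window. That definition, the function names, and the 128,000-token window used in the example are assumptions; substitute the figures for your model.

```python
from typing import Sequence


def context_utilisation(input_tokens: int, reserved_output_tokens: int,
                        context_window: int) -> float:
    """Fraction of the context window that one request would occupy."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    return (input_tokens + reserved_output_tokens) / context_window


def context_saturated(utilisations: Sequence[float], threshold: float = 0.80) -> bool:
    """True when average utilisation over a window of requests exceeds the threshold."""
    if not utilisations:
        return False
    return sum(utilisations) / len(utilisations) > threshold


if __name__ == "__main__":
    # Example: 96,000 prompt tokens plus a 6,400-token output budget in a 128,000 window
    print(context_utilisation(96_000, 6_400, 128_000))
```

Exporting the per-request value as a histogram or gauge lets the 80% rule be expressed as an ordinary alert rather than application logic.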

    Structured Logging for LLM Calls

    Every LLM call should produce a JSON log entry:

    {
      "timestamp": "2026-06-01T14:23:01.456Z",
      "trace_id": "abc123",
      "span_id": "def456",
      "service": "support-bot",
      "endpoint": "/api/support/answer",
      "user_id": "u-789",
      "tenant_id": "acme-corp",
      "model": "gpt-4o",
      "input_tokens": 1847,
      "output_tokens": 312,
      "latency_ms": 1240,
      "cost_usd": 0.0139,
      "cache_hit": false,
      "retrieval_chunks": 5,
      "faithfulness_score": 0.94,
      "answer_relevance_score": 0.88
    }

    The last two fields (faithfulness, answer relevance) require an automated eval step — expensive but worth it for high-stakes endpoints.
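
The entry above could be emitted with a small helper like the following sketch, which uses only the standard library. The function name `log_llm_call` and the logger name are hypothetical. Eval scores are optional, so endpoints without an eval step simply omit those fields.

```python
import json
import logging
from datetime import datetime, timezone
from typing import Optional

logger = logging.getLogger("llm_calls")  # hypothetical logger name


def log_llm_call(*, trace_id: str, span_id: str, endpoint: str, model: str,
                 input_tokens: int, output_tokens: int, latency_ms: float,
                 cost_usd: float,
                 faithfulness_score: Optional[float] = None,
                 answer_relevance_score: Optional[float] = None,
                 **extra) -> str:
    """Serialise one LLM call as a single JSON line, log it, and return the line."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "trace_id": trace_id,
        "span_id": span_id,
        "endpoint": endpoint,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
    }
    # Only include eval scores when the eval step actually ran
    if faithfulness_score is not None:
        entry["faithfulness_score"] = faithfulness_score
    if answer_relevance_score is not None:
        entry["answer_relevance_score"] = answer_relevance_score
    entry.update(extra)  # e.g. user_id, tenant_id, cache_hit, retrieval_chunks
    line = json.dumps(entry, sort_keys=True)
    logger.info(line)
    return line
```

Writing one JSON object per line keeps the output easy to ingest for a log backend such as Loki, and the shared `trace_id` lets a log entry be joined to its trace.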

    Grafana Dashboard Layout

    Build your LLM observability dashboard with these panels:

    Row 1 — Real-time health:

    • Requests/min by endpoint
    • p50/p95/p99 latency
    • Error rate %
    • Active users

    Row 2 — Cost:

    • Total cost today vs. yesterday
    • Cost per request trend (last 7 days)
    • Token consumption breakdown (input vs. output)
    • Projected monthly cost

    Row 3 — Quality (if eval pipeline exists):

    • Faithfulness score trend
    • Answer relevance trend
    • Cache hit rate
    • Context window utilisation

    Row 4 — Infrastructure:

    • Provider error rate by model
    • Retry rate
    • Circuit breaker state

    Alerting Rules

    # Alert: Cost spike (summed across all models and endpoints)
    - alert: LLMCostSpike
      expr: sum(rate(llm_cost_dollars_total[5m])) * 3600 > 50
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "LLM hourly cost exceeds $50 (current: ${{ $value | humanize }}/hr)"
    
    # Alert: Latency regression (assumes an llm_latency_ms histogram is exported)
    - alert: LLMLatencyHigh
      expr: histogram_quantile(0.95, sum by (le) (rate(llm_latency_ms_bucket[5m]))) > 3000
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "p95 LLM latency above 3s"
    
    # Alert: Error rate (aggregate both sides so mismatched labels cannot empty the result)
    - alert: LLMHighErrorRate
      expr: sum(rate(llm_errors_total[5m])) / sum(rate(llm_requests_total[5m])) > 0.05
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "LLM error rate above 5%"
    
    # Alert: Quality degradation (requires eval scores exported as a gauge)
    - alert: LLMQualityDegradation
      expr: avg(avg_over_time(llm_faithfulness_score[1h])) < 0.75
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: "1-hour average faithfulness score below 0.75 for 30 minutes"

    Incident Response for LLM Degradation

    When an alert fires, follow this runbook:

    1. Identify scope — which endpoints, which models, which tenants?
    2. Check provider status — status.openai.com, status.anthropic.com
    3. Review recent deployments — did a prompt change ship in the last 30 minutes?
    4. Check cost anomalies — a spike in tokens often precedes a quality issue
    5. Activate fallback — switch to secondary model if primary provider is degraded
    6. Capture a trace — sample 10 recent requests from the affected endpoint
    7. Post-mortem — document root cause, timeline, and prevention within 24 hours
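
Step 5 is easier to execute under pressure if the switch is automatic. The sketch below shows one possible shape for it: a small circuit breaker that routes calls to a fallback after repeated primary failures, then retries the primary after a cool-down. The class, function names, and thresholds are hypothetical. `primary` and `fallback` stand in for whatever functions call your two models.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        if self._clock() - self._opened_at >= self.reset_after_s:
            # Cool-down elapsed: close the breaker, but one more failure re-opens it
            self._opened_at = None
            self._failures = self.failure_threshold - 1
            return False
        return True

    def record_success(self) -> None:
        self._failures = 0

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


def call_with_fallback(prompt: str, primary: Callable[[str], str],
                       fallback: Callable[[str], str],
                       breaker: CircuitBreaker) -> str:
    """Try the primary model unless the breaker is open; otherwise use the fallback."""
    if not breaker.is_open:
        try:
            result = primary(prompt)
            breaker.record_success()
            return result
        except Exception:
            breaker.record_failure()
    return fallback(prompt)
```

Exposing the breaker state as a metric feeds the "Circuit breaker state" panel on the dashboard, and exercising the fallback path in staging confirms that the secondary model accepts the same prompts.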

    Key Takeaways

    • You cannot improve what you cannot measure — every LLM endpoint needs latency, token, cost, and error metrics from day one
    • The four golden signals for AI: Latency + TTFT, Traffic + Cost, Errors + Quality, Saturation + Context utilisation
    • Structured JSON logs with trace_id + user_id + tokens + cost + quality scores enable every investigation
    • Alert on cost trends, not just absolute values — a growing cost/request trend at 3am is more dangerous than a known peak
    • Quality metrics (faithfulness, relevance) require an eval pipeline but are the only true measure of LLM health
    • Separate retrieval latency from LLM latency in traces — they degrade for completely different reasons and require different fixes
    • Keep a fallback model ready — circuit breakers + model switching should be tested before you need them in a real incident

    Practice Exercises

    Exercise 1 — Starter (2 hours): Instrument an existing LLM application with OpenTelemetry. Add span attributes for input_tokens, output_tokens, latency_ms, and model. Export to a local Jaeger instance and verify you can see the trace. Calculate the cost of each request from the span attributes.

    Exercise 2 — Intermediate (half day): Build a Grafana dashboard from scratch with 8 panels covering the four golden signals. Add Prometheus alert rules for cost spikes (>$50/hour) and p95 latency (>3s). Simulate the alert conditions locally and verify that each alert fires once its `for` window has elapsed (5 minutes for the cost rule and 3 minutes for the latency rule as written above).

    Exercise 3 — Advanced (full day): Implement a weekly quality report. Set up RAGAS to score faithfulness and answer relevance on a sample of 50 production queries per week (using stored logs). Build a Grafana panel that shows the quality trend over the last 8 weeks. Write an automated alert that sends a Slack notification if the weekly average drops more than 5% from the prior week.
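
As a starting point for the alert in Exercise 3, the week-over-week comparison might be sketched as follows. It assumes "drops more than 5%" means a relative drop in the weekly mean score. The function names are hypothetical, and sending the Slack notification is left to the exercise.

```python
from statistics import mean
from typing import Sequence


def weekly_drop(prev_scores: Sequence[float], curr_scores: Sequence[float]) -> float:
    """Relative drop in the mean score from the prior week to the current week."""
    prev_avg = mean(prev_scores)
    curr_avg = mean(curr_scores)
    if prev_avg == 0:
        return 0.0  # no meaningful baseline to compare against
    return (prev_avg - curr_avg) / prev_avg


def quality_dropped(prev_scores: Sequence[float], curr_scores: Sequence[float],
                    threshold: float = 0.05) -> bool:
    """True when the weekly mean fell by more than the threshold (5% by default)."""
    return weekly_drop(prev_scores, curr_scores) > threshold
```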
