Building Fault-Tolerant AI Systems

What you'll learn: By the end of this guide you will be able to implement circuit breakers and fallback chains for LLM calls, design graceful degradation UX patterns, apply bulkhead isolation between AI components, and test failure modes systematically before they hit production.

LLM providers have outages. OpenAI's status page shows multiple incidents per quarter. Anthropic, Groq, and every other provider have the same pattern. The question is not whether your LLM dependency will fail — it is whether your system handles that failure gracefully or cascades into a full outage.

The patterns in this guide come from operating AI systems that serve production traffic 24/7 with a 99.9% uptime SLA.

The Resilience Stack

flowchart TD
    REQ[Incoming Request] --> CB{Circuit Breaker\nOpen?}
    CB -->|Closed - normal| PRIMARY[Primary LLM\nGPT-4o / Claude Sonnet]
    CB -->|Open - failing| FALLBACK
    PRIMARY -->|Success| RESP[Response]
    PRIMARY -->|Failure| RETRY{Retry?\nAttempt < 3}
    RETRY -->|Yes| WAIT[Exponential Backoff\n+ Jitter]
    WAIT --> PRIMARY
    RETRY -->|No| FALLBACK[Fallback LLM\nGPT-4o-mini / Claude Haiku]
    FALLBACK -->|Success| RESP
    FALLBACK -->|Failure| DEGRADE[Graceful Degradation\nCached / Static Response]
    DEGRADE --> RESP
    style CB fill:#2d1a0d,stroke:#e67e22,color:#fff
    style PRIMARY fill:#0d2d3a,stroke:#06b6d4,color:#fff
    style FALLBACK fill:#1a0d2d,stroke:#9333ea,color:#fff
    style DEGRADE fill:#1a2d0d,stroke:#10b981,color:#fff

Circuit Breakers for LLM Calls

A circuit breaker stops calling a failing dependency for a defined period, preventing cascade failures and allowing the downstream service to recover.
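Before reaching for a library, it can help to see the state machine on its own. The following is a minimal sketch, not how Resilience4j is implemented: `SimpleCircuitBreaker` is a hypothetical class that counts consecutive failures (rather than a failure rate over a sliding window) and takes an injectable clock so the open period can be exercised in a test.

```java
import java.time.Duration;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Toy circuit breaker: consecutive-failure threshold and a fixed open period. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock;   // injectable time source (milliseconds)
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    SimpleCircuitBreaker(int failureThreshold, Duration openDuration, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openDuration.toMillis();
        this.clock = clock;
    }

    synchronized State state() {
        // Once the open period has elapsed, let trial calls through again
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    <T> T call(Supplier<T> primary, Supplier<T> fallback) {
        if (state() == State.OPEN) {
            return fallback.get();      // short-circuit: the dependency is not called
        }
        try {
            T result = primary.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        // A failed trial call in HALF_OPEN reopens the circuit immediately
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }
}
```

While the circuit is open the primary supplier is never invoked, which is what gives the failing provider room to recover.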

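import java.util.concurrent.CompletableFuture;

import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.retry.annotation.Retry;
import io.github.resilience4j.timelimiter.annotation.TimeLimiter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;

@Service
public class ResilientLLMService {

    private static final Logger log = LoggerFactory.getLogger(ResilientLLMService.class);

    private final ChatClient primaryClient;    // GPT-4o
    private final ChatClient fallbackClient;   // GPT-4o-mini
    private final DegradationStrategy degradation;

    public ResilientLLMService(ChatClient primaryClient, ChatClient fallbackClient,
                               DegradationStrategy degradation) {
        this.primaryClient = primaryClient;
        this.fallbackClient = fallbackClient;
        this.degradation = degradation;
    }

    // Circuit breaker: opens when the failure rate over the sliding window
    // reaches the configured threshold, then probes again after 30s.
    // In Resilience4j's default aspect order Retry is the outermost decorator,
    // so the fallback is declared on @Retry. Declared on @CircuitBreaker it
    // would absorb every failure and the retry would never see an exception.
    @Retry(name = "openai-retry", fallbackMethod = "fallbackAnswer")  // up to 3 attempts with exponential backoff
    @CircuitBreaker(name = "openai-primary")
    @TimeLimiter(name = "openai-timeout")  // 8 second hard timeout
    public CompletableFuture<String> answer(String question) {
        return CompletableFuture.supplyAsync(() ->
            primaryClient.prompt()
                .user(question)
                .call()
                .content()
        );
    }

    // Called when the circuit is open or the retry attempts are exhausted
    public CompletableFuture<String> fallbackAnswer(String question, Exception ex) {
        log.warn("Primary LLM unavailable ({}), using fallback", ex.getClass().getSimpleName());

        return CompletableFuture.supplyAsync(() ->
            fallbackClient.prompt()
                .user(question)
                .call()
                .content()
        )
        // If the fallback model also fails, degrade instead of propagating the error
        .exceptionally(e -> degradation.getDegradedResponse(question));
    }
}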

Resilience4j configuration:

resilience4j:
  circuitbreaker:
    instances:
      openai-primary:
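        slidingWindowSize: 10             # Count-based window over the last 10 calls
        minimumNumberOfCalls: 5           # Start evaluating the failure rate after 5 recorded calls
        failureRateThreshold: 50          # Open once 50% or more of the recorded calls failed
        waitDurationInOpenState: 30s      # Wait 30s before half-open
        permittedNumberOfCallsInHalfOpenState: 3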
  retry:
    instances:
      openai-retry:
        maxAttempts: 3                    # Total attempts, including the first call
        waitDuration: 1s
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
        retryExceptions:
          - java.io.IOException
          - org.springframework.web.client.ResourceAccessException
  timelimiter:
    instances:
      openai-timeout:
        timeoutDuration: 8s

Fallback Chain Design

Design your fallback chain based on cost/quality trade-offs:

Tier	Model	Cost	Quality	Use when
Primary	GPT-4o / Claude Sonnet	High	Best	Normal operation
Fallback 1	GPT-4o-mini / Claude Haiku	Low	Good	Primary degraded
Fallback 2	Cached response	None	Stale	Both models down
Fallback 3	Static message	None	Minimal	Complete outage

The fallback chain must be tested regularly — a fallback you haven't tested is a fallback that won't work when you need it.
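
One way to keep the tiers in the table explicit and testable is to model the chain as an ordered list of handlers. This is a sketch under assumptions: `FallbackChain` and `Tier` are hypothetical names, each tier is reduced to a function from question to answer, and a tier signals failure by throwing or returning null.

```java
import java.util.List;
import java.util.function.Function;

/** Tries each tier in order and ends with a static message that cannot fail. */
final class FallbackChain {
    /** One tier: a label for logging plus the function that tries to answer. */
    record Tier(String name, Function<String, String> handler) {}

    private final List<Tier> tiers;
    private final String staticMessage;

    FallbackChain(List<Tier> tiers, String staticMessage) {
        this.tiers = List.copyOf(tiers);
        this.staticMessage = staticMessage;
    }

    String answer(String question) {
        for (Tier tier : tiers) {
            try {
                String result = tier.handler().apply(question);
                if (result != null) {
                    return result;          // first tier that produces an answer wins
                }
            } catch (RuntimeException e) {
                System.err.println("Tier '" + tier.name() + "' failed: " + e.getMessage());
            }
        }
        return staticMessage;               // final tier: complete outage
    }
}
```

Because every tier is just a function, each one can be replaced with a failing stub in a test to confirm that the next tier takes over.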

Graceful Degradation UX

When AI is unavailable, the user experience should degrade gracefully — not break entirely.

Pattern 1: Cached response — return the last known-good response for the same query hash. Stale is better than nothing for most queries.

Pattern 2: Template response — return a structured but non-AI response: "I'm having trouble answering right now. Here are the top 3 related articles: [links]."

Pattern 3: Queue + async — accept the request, queue it, process when AI recovers, notify the user. Appropriate for non-real-time queries.

@Service
public class DegradationStrategy {

    private final ResponseCache cache;
    private final ArticleRepository articles;

    public String getDegradedResponse(String userQuery) {
        // Try cache first
        Optional<String> cached = cache.get(userQuery);
        if (cached.isPresent()) {
            return cached.get() + "\n\n_(Cached response — live AI temporarily unavailable)_";
        }

        // Fall back to template with related articles
        List<Article> related = articles.findRelated(userQuery, 3);
        return buildTemplateResponse(userQuery, related);
    }

    private String buildTemplateResponse(String query, List<Article> related) {
        StringBuilder sb = new StringBuilder();
        sb.append("I'm having trouble connecting to my AI right now. ");
        sb.append("Here are the most relevant resources for your question:\n\n");
        related.forEach(a -> sb.append("- **").append(a.getTitle()).append("**\n"));
        sb.append("\nPlease try again in a few minutes.");
        return sb.toString();
    }
}
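
Pattern 3 is not covered by the class above, so here is a minimal in-memory sketch of it. `DeferredQueryQueue` and `Pending` are hypothetical names, and the model and notifier are passed in as plain functions. A production version would more likely use a durable queue so that pending requests survive a restart.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.BiConsumer;
import java.util.function.Function;

/** Holds queries accepted during an outage and replays them once the model recovers. */
final class DeferredQueryQueue {
    record Pending(String userId, String query) {}

    private final Queue<Pending> pending = new ArrayDeque<>();
    private final int capacity;

    DeferredQueryQueue(int capacity) {
        this.capacity = capacity;
    }

    /** Accepts a query while AI is unavailable; returns false when the queue is full. */
    synchronized boolean enqueue(String userId, String query) {
        if (pending.size() >= capacity) {
            return false;
        }
        pending.add(new Pending(userId, query));
        return true;
    }

    /** Answers queued queries in arrival order and notifies each user; returns the count processed. */
    synchronized int drain(Function<String, String> model, BiConsumer<String, String> notifier) {
        int processed = 0;
        while (!pending.isEmpty()) {
            Pending next = pending.peek();
            String answer = model.apply(next.query());   // if this throws, the item stays queued
            pending.remove();
            notifier.accept(next.userId(), answer);
            processed++;
        }
        return processed;
    }
}
```

The bounded capacity matters: when the queue is full, the caller can fall back to the template response instead of accepting work it may never finish.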

Bulkhead Pattern: Isolate AI Components

Without bulkheads, a slow AI call can exhaust your entire thread pool, taking down unrelated features.

// Separate thread pools for each AI component
@Bean
public ThreadPoolBulkhead aiChatBulkhead() {
    return ThreadPoolBulkhead.of("ai-chat",
        ThreadPoolBulkheadConfig.custom()
            .maxThreadPoolSize(20)
            .coreThreadPoolSize(10)
            .queueCapacity(50)
            .build()
    );
}

@Bean
public ThreadPoolBulkhead aiSearchBulkhead() {
    return ThreadPoolBulkhead.of("ai-search",
        ThreadPoolBulkheadConfig.custom()
            .maxThreadPoolSize(10)
            .coreThreadPoolSize(5)
            .queueCapacity(25)
            .build()
    );
}

This ensures a slow chat response never blocks search — and vice versa.
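
The isolation can be demonstrated without any library by using two small thread pools from the JDK. This is an illustrative sketch, not the Resilience4j implementation: `BulkheadDemo` is a hypothetical class, the pool sizes are deliberately tiny, and a `CountDownLatch` stands in for a hung LLM call.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class BulkheadDemo {

    /** Fixed-size pool with a bounded queue; work beyond that is rejected, not queued forever. */
    static ThreadPoolExecutor bulkhead(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(queueCapacity), new ThreadPoolExecutor.AbortPolicy());
    }

    /** Saturates the chat pool, then checks that search still answers. */
    static String run() throws Exception {
        ThreadPoolExecutor chat = bulkhead(1, 1);
        ThreadPoolExecutor search = bulkhead(1, 1);
        CountDownLatch slowModel = new CountDownLatch(1);   // stands in for a hung LLM call
        try {
            // One task occupies the only chat thread, the other fills the chat queue
            for (int i = 0; i < 2; i++) {
                chat.submit(() -> { slowModel.await(); return "chat"; });
            }
            boolean chatRejected = false;
            try {
                chat.submit(() -> "chat");
            } catch (RejectedExecutionException e) {
                chatRejected = true;    // chat sheds load instead of growing without bound
            }
            // Search has its own threads, so it is unaffected by the blocked chat pool
            String searchResult = search.submit(() -> "search ok").get(1, TimeUnit.SECONDS);
            return "chatRejected=" + chatRejected + ", search=" + searchResult;
        } finally {
            slowModel.countDown();
            chat.shutdown();
            search.shutdown();
        }
    }
}
```

If both features shared one pool, the blocked chat tasks would hold the only thread and the search call would time out.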

Testing Failure Modes

Never find out about resilience gaps during a real outage. Test proactively:

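@Test
void circuitBreakerOpensAfterFiveFailures() throws Exception {
    // Simulate a primary model that fails on every call
    when(primaryLLM.call(any())).thenThrow(new IOException("Connection refused"));

    // answer() returns a CompletableFuture and has a fallback, so each call
    // completes with a fallback answer instead of throwing
    for (int i = 0; i < 5; i++) {
        assertThat(service.answer("test").join()).isNotNull();
    }

    // Circuit should now be open: the primary is skipped and the fallback answers
    clearInvocations(primaryLLM);
    String response = service.answer("what happens now?").join();
    assertThat(response).isNotNull();
    verify(primaryLLM, never()).call(any());
    verify(fallbackLLM, atLeastOnce()).call(any());
}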

@Test
void gracefulDegradationWhenBothModelsDown() throws Exception {
    when(primaryLLM.call(any())).thenThrow(new IOException());
    when(fallbackLLM.call(any())).thenThrow(new IOException());

    // join() unwraps the CompletableFuture returned by answer()
    String response = service.answer("help me").join();

    // Should return degraded response, not throw
    assertThat(response).contains("having trouble");
    assertThat(response).doesNotContain("500");
}
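
Mocks that always fail cover the outage case, but recovery is worth testing too. A small hand-written test double makes that easy. The sketch below assumes the model can be treated as a function from prompt to answer; `FlakyModel` is a hypothetical name for this example.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Test double: throws for the first {@code failures} calls, then answers normally. */
final class FlakyModel implements Function<String, String> {
    private final int failures;
    private final AtomicInteger calls = new AtomicInteger();

    FlakyModel(int failures) {
        this.failures = failures;
    }

    @Override
    public String apply(String prompt) {
        if (calls.incrementAndGet() <= failures) {
            throw new IllegalStateException("simulated provider outage");
        }
        return "answer to: " + prompt;
    }

    /** Number of calls received so far, including the failed ones. */
    int callCount() {
        return calls.get();
    }
}
```

A double like this lets a test assert both that the retry logic eventually succeeds and how many attempts it took to get there.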

Key Takeaways

Assume your LLM provider will fail — build the fallback chain before you need it, not during an incident
Circuit breakers prevent cascade failures by stopping calls to a failing service for a defined period
The fallback chain should be: primary → smaller model → cached response → static message
Bulkheads isolate AI components so slow LLM calls don't starve unrelated features of threads
Retry with exponential backoff + jitter — never linear retries. Jitter prevents thundering herd at provider recovery
Test every failure mode before production. An untested fallback is a false safety net
Graceful degradation is a product decision as much as a technical one — agree with stakeholders on acceptable degraded behaviour

Practice Exercises

Exercise 1 — Starter (1 hour): Add Resilience4j to an existing Spring Boot LLM service. Configure a circuit breaker with a 30-second wait time and 50% failure threshold. Write a unit test that verifies the circuit opens after 5 failures. Confirm the fallback method is called.

Exercise 2 — Intermediate (2–3 hours): Implement a complete fallback chain: primary GPT-4o → fallback GPT-4o-mini → cached response. Write integration tests for each tier. Then simulate provider failure by blocking the OpenAI endpoint in your hosts file and verify each fallback activates in sequence.

Exercise 3 — Advanced (half day): Implement bulkhead isolation for 3 AI features (chat, search, summarisation). Write a load test that saturates the chat bulkhead. Verify that search and summarisation remain responsive during chat saturation. Set up Grafana panels showing thread pool utilisation per bulkhead and add alerts when queue depth exceeds 80% capacity.
