Architecture
    June 2, 2026

    Building Fault-Tolerant AI Systems

    Resilience patterns for production AI — circuit breakers, fallback chains, and graceful degradation so systems survive provider outages and rate limits.


    What you'll learn: By the end of this guide you will be able to implement circuit breakers and fallback chains for LLM calls, design graceful degradation UX patterns, apply bulkhead isolation between AI components, and test failure modes systematically before they hit production.

    LLM providers have outages. OpenAI's status page shows multiple incidents per quarter, and Anthropic, Groq, and every other provider show the same pattern. The question is not whether your LLM dependency will fail, but whether your system handles that failure gracefully or cascades into a full outage.

    The patterns in this guide come from operating AI systems that serve production traffic 24/7 with a 99.9% uptime SLA.

    The Resilience Stack

    flowchart TD
        REQ[Incoming Request] --> CB{Circuit Breaker\nOpen?}
        CB -->|Closed - normal| PRIMARY[Primary LLM\nGPT-4o / Claude Sonnet]
        CB -->|Open - failing| FALLBACK
        PRIMARY -->|Success| RESP[Response]
        PRIMARY -->|Failure| RETRY{Retry?\nAttempt < 3}
        RETRY -->|Yes| WAIT[Exponential Backoff\n+ Jitter]
        WAIT --> PRIMARY
        RETRY -->|No| FALLBACK[Fallback LLM\nGPT-4o-mini / Claude Haiku]
        FALLBACK -->|Success| RESP
        FALLBACK -->|Failure| DEGRADE[Graceful Degradation\nCached / Static Response]
        DEGRADE --> RESP
        style CB fill:#2d1a0d,stroke:#e67e22,color:#fff
        style PRIMARY fill:#0d2d3a,stroke:#06b6d4,color:#fff
        style FALLBACK fill:#1a0d2d,stroke:#9333ea,color:#fff
        style DEGRADE fill:#1a2d0d,stroke:#10b981,color:#fff

    Circuit Breakers for LLM Calls

    A circuit breaker stops calling a failing dependency for a defined period, preventing cascade failures and allowing the downstream service to recover.

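    import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
    import io.github.resilience4j.retry.annotation.Retry;
    import io.github.resilience4j.timelimiter.annotation.TimeLimiter;
    import java.util.concurrent.CompletableFuture;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.springframework.ai.chat.client.ChatClient;
    import org.springframework.stereotype.Service;

    @Service
    public class ResilientLLMService {

        private static final Logger log = LoggerFactory.getLogger(ResilientLLMService.class);

        private final ChatClient primaryClient;    // GPT-4o
        private final ChatClient fallbackClient;   // GPT-4o-mini

        // Assumes two ChatClient beans named primaryClient and fallbackClient are defined elsewhere
        public ResilientLLMService(ChatClient primaryClient, ChatClient fallbackClient) {
            this.primaryClient = primaryClient;
            this.fallbackClient = fallbackClient;
        }

        // Circuit breaker: opens when the failure rate in the sliding window reaches 50%,
        // then allows trial calls again after 30s.
        // Resilience4j's default aspect order is Retry(CircuitBreaker(TimeLimiter(call))),
        // so the fallback is declared on @Retry, the outermost layer. Declared on
        // @CircuitBreaker it would absorb the first failure and the retries would never run.
        @Retry(name = "openai-retry", fallbackMethod = "fallbackAnswer")  // up to 3 attempts with exponential backoff
        @CircuitBreaker(name = "openai-primary")
        @TimeLimiter(name = "openai-timeout")  // 8 second hard timeout
        public CompletableFuture<String> answer(String question) {
            return CompletableFuture.supplyAsync(() ->
                primaryClient.prompt()
                    .user(question)
                    .call()
                    .content()
            );
        }

        // Called when the circuit is open, the call times out, or the retry attempts are exhausted
        public CompletableFuture<String> fallbackAnswer(String question, Throwable ex) {
            log.warn("Primary LLM unavailable ({}), using fallback", ex.getClass().getSimpleName());

            return CompletableFuture.supplyAsync(() ->
                fallbackClient.prompt()
                    .user(question)
                    .call()
                    .content()
            );
        }
    }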

    Resilience4j configuration:

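    resilience4j:
      circuitbreaker:
        instances:
          openai-primary:
            slidingWindowType: COUNT_BASED    # Default: the window counts calls, not seconds
            slidingWindowSize: 10             # Evaluate the last 10 calls
            minimumNumberOfCalls: 5           # Start evaluating the failure rate after 5 calls
            failureRateThreshold: 50          # Open at a 50% or higher failure rate in the window
            waitDurationInOpenState: 30s      # Wait 30s before half-open
            permittedNumberOfCallsInHalfOpenState: 3
      retry:
        instances:
          openai-retry:
            maxAttempts: 3                    # Total attempts, including the first call
            waitDuration: 1s
            enableExponentialBackoff: true
            exponentialBackoffMultiplier: 2
            enableRandomizedWait: true        # Jitter; check that your Resilience4j version allows it together with exponential backoff
            retryExceptions:
              - java.io.IOException
              - org.springframework.web.client.ResourceAccessException
      timelimiter:
        instances:
          openai-timeout:
            timeoutDuration: 8s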
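    To make the state transitions behind these settings concrete, here is a minimal plain-Java sketch of a circuit breaker. It is an illustration only, not how Resilience4j is implemented: the class name `SimpleCircuitBreaker` is hypothetical, it counts consecutive failures instead of a failure rate over a sliding window, and the caller passes the current time in milliseconds so the behaviour is easy to test.

```java
import java.util.function.Supplier;

/** Simplified circuit breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED. */
final class SimpleCircuitBreaker {

    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures that open the circuit
    private final long openMillis;        // how long the circuit stays open
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    /** Current state; an open circuit becomes half-open once the wait has elapsed. */
    synchronized State state(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAt >= openMillis) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    /** Runs the primary call if permitted, otherwise goes straight to the fallback. */
    <T> T call(Supplier<T> primary, Supplier<T> fallback, long nowMillis) {
        if (state(nowMillis) == State.OPEN) {
            return fallback.get();        // dependency is not touched while open
        }
        try {
            T result = primary.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure(nowMillis);
            return fallback.get();
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    private synchronized void onFailure(long nowMillis) {
        consecutiveFailures++;
        // A failed probe in half-open reopens immediately
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = nowMillis;
        }
    }
}
```

    The key property is in `call`: while the circuit is open, the primary supplier is never invoked, which is what gives a struggling provider room to recover.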

    Fallback Chain Design

    Design your fallback chain based on cost/quality trade-offs:

    | Tier | Model | Cost | Quality | Use when |
    |---|---|---|---|---|
    | Primary | GPT-4o / Claude Sonnet | High | Best | Normal operation |
    | Fallback 1 | GPT-4o-mini / Claude Haiku | Low | Good | Primary degraded |
    | Fallback 2 | Cached response | None | Stale | Both models down |
    | Fallback 3 | Static message | None | Minimal | Complete outage |

    The fallback chain must be tested regularly — a fallback you haven't tested is a fallback that won't work when you need it.
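    The tiers above can be sketched as an ordered list that is walked until one tier produces an answer. This is a minimal illustration under assumptions: `FallbackChain`, `Tier`, and `Result` are hypothetical names, each tier is modelled as a plain function from question to answer, and a `null` return stands for a cache miss.

```java
import java.util.List;
import java.util.function.Function;

/** Tries each tier in order and returns the first usable answer. */
final class FallbackChain {

    /** One tier of the chain: a label plus a function from question to answer. */
    static final class Tier {
        final String name;
        final Function<String, String> handler;

        Tier(String name, Function<String, String> handler) {
            this.name = name;
            this.handler = handler;
        }
    }

    /** The answer text together with the tier that produced it. */
    static final class Result {
        final String tier;
        final String text;

        Result(String tier, String text) {
            this.tier = tier;
            this.text = text;
        }
    }

    private final List<Tier> tiers;

    FallbackChain(List<Tier> tiers) {
        this.tiers = List.copyOf(tiers);
    }

    Result answer(String question) {
        for (Tier tier : tiers) {
            try {
                String text = tier.handler.apply(question);
                if (text != null) {               // null models a cache miss
                    return new Result(tier.name, text);
                }
            } catch (RuntimeException e) {
                // This tier failed; fall through to the next one
            }
        }
        throw new IllegalStateException(
            "Every tier failed; end the chain with a static tier that cannot fail");
    }
}
```

    Returning the tier name alongside the text makes it straightforward to log or emit a metric for how often each fallback is actually serving traffic, which is also a cheap way to notice that a fallback has silently stopped working.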

    Graceful Degradation UX

    When AI is unavailable, the user experience should degrade gracefully — not break entirely.

    Pattern 1: Cached response — return the last known-good response for the same query hash. Stale is better than nothing for most queries.

    Pattern 2: Template response — return a structured but non-AI response: "I'm having trouble answering right now. Here are the top 3 related articles: [links]."

    Pattern 3: Queue + async — accept the request, queue it, process when AI recovers, notify the user. Appropriate for non-real-time queries.

    import java.util.List;
    import java.util.Optional;
    import org.springframework.stereotype.Service;

    // ResponseCache, ArticleRepository, and Article are application types defined elsewhere
    @Service
    public class DegradationStrategy {

        private final ResponseCache cache;
        private final ArticleRepository articles;

        public DegradationStrategy(ResponseCache cache, ArticleRepository articles) {
            this.cache = cache;
            this.articles = articles;
        }

        public String getDegradedResponse(String userQuery) {
            // Try cache first (the cache is expected to key entries by a hash of the query)
            Optional<String> cached = cache.get(userQuery);
            if (cached.isPresent()) {
                return cached.get() + "\n\n_(Cached response, live AI temporarily unavailable)_";
            }

            // Fall back to template with related articles
            List<Article> related = articles.findRelated(userQuery, 3);
            return buildTemplateResponse(userQuery, related);
        }

        private String buildTemplateResponse(String query, List<Article> related) {
            StringBuilder sb = new StringBuilder();
            sb.append("I'm having trouble connecting to my AI right now. ");
            sb.append("Here are the most relevant resources for your question:\n\n");
            related.forEach(a -> sb.append("- **").append(a.getTitle()).append("**\n"));
            sb.append("\nPlease try again in a few minutes.");
            return sb.toString();
        }
    }
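    Pattern 3 (queue + async) is not covered by the class above, so here is a minimal in-memory sketch of it. Everything in it is an assumption for illustration: `DeferredRequestQueue` is a hypothetical name, the model is passed in as a plain function, and the notifier stands in for whatever channel (email, push, websocket) you would use. A production version would more likely sit on a durable queue.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.BiConsumer;
import java.util.function.Function;

/** Holds requests while the AI is down and replays them once it recovers. */
final class DeferredRequestQueue {

    private static final class Pending {
        final String userId;
        final String query;

        Pending(String userId, String query) {
            this.userId = userId;
            this.query = query;
        }
    }

    private final Queue<Pending> pending = new ArrayDeque<>();
    private final int capacity;

    DeferredRequestQueue(int capacity) {
        this.capacity = capacity;
    }

    /** Returns false when the queue is full so the caller can use another degraded response. */
    synchronized boolean enqueue(String userId, String query) {
        if (pending.size() >= capacity) {
            return false;
        }
        pending.add(new Pending(userId, query));
        return true;
    }

    synchronized int size() {
        return pending.size();
    }

    /** Processes queued requests in order; stops at the first failure and keeps the rest queued. */
    synchronized int drain(Function<String, String> model, BiConsumer<String, String> notifier) {
        int processed = 0;
        while (!pending.isEmpty()) {
            Pending next = pending.peek();
            String answer;
            try {
                answer = model.apply(next.query);
            } catch (RuntimeException e) {
                break;                    // model still failing: try again on the next drain
            }
            pending.poll();
            notifier.accept(next.userId, answer);
            processed++;
        }
        return processed;
    }
}
```

    The bounded capacity matters: an unbounded queue during a long outage turns a provider problem into a memory problem, and a burst of replayed requests at recovery can itself trip rate limits.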

    Bulkhead Pattern: Isolate AI Components

    Without bulkheads, a slow AI call can exhaust your entire thread pool, taking down unrelated features.

    // Separate thread pools for each AI component
    @Bean
    public ThreadPoolBulkhead aiChatBulkhead() {
        return ThreadPoolBulkhead.of("ai-chat",
            ThreadPoolBulkheadConfig.custom()
                .maxThreadPoolSize(20)
                .coreThreadPoolSize(10)
                .queueCapacity(50)
                .build()
        );
    }
    
    @Bean
    public ThreadPoolBulkhead aiSearchBulkhead() {
        return ThreadPoolBulkhead.of("ai-search",
            ThreadPoolBulkheadConfig.custom()
                .maxThreadPoolSize(10)
                .coreThreadPoolSize(5)
                .queueCapacity(25)
                .build()
        );
    }

    This ensures a slow chat response never blocks search — and vice versa.
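    The isolation property can be demonstrated without any framework. The sketch below uses two bounded `ThreadPoolExecutor` instances from the JDK as stand-ins for the bulkheads; `BulkheadDemo` and its pool sizes are hypothetical and chosen small so saturation is easy to trigger.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class BulkheadDemo {

    /** Fixed-size pool with a bounded queue; extra work is rejected instead of piling up. */
    static ThreadPoolExecutor boundedPool(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
            threads, threads, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(queueCapacity),
            new ThreadPoolExecutor.AbortPolicy());
    }

    /** Saturates the chat pool, then reports the fate of one more chat call and one search call. */
    static String[] saturateChatThenSearch() {
        ThreadPoolExecutor chat = boundedPool(2, 1);
        ThreadPoolExecutor search = boundedPool(1, 1);
        CountDownLatch release = new CountDownLatch(1);
        try {
            // 2 running + 1 queued fills the chat bulkhead with "slow LLM calls"
            for (int i = 0; i < 3; i++) {
                chat.submit(() -> { release.await(); return "chat"; });
            }
            String extraChat;
            try {
                chat.submit(() -> "chat");
                extraChat = "accepted";
            } catch (RejectedExecutionException e) {
                extraChat = "rejected";   // chat sheds load instead of blocking callers
            }
            // Search has its own threads, so it is unaffected by the stuck chat calls
            String searchResult = search.submit(() -> "search-ok").get(2, TimeUnit.SECONDS);
            return new String[] { extraChat, searchResult };
        } catch (Exception e) {
            return new String[] { "error", e.toString() };
        } finally {
            release.countDown();
            chat.shutdown();
            search.shutdown();
        }
    }
}
```

    Rejecting the extra chat call is deliberate: a fast failure can be routed to a fallback, whereas an unbounded queue only hides the saturation until latency collapses.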

    Testing Failure Modes

    Never find out about resilience gaps during a real outage. Test proactively:

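    // Assumes a Spring test context (so the Resilience4j aspects are active), an injected
    // CircuitBreakerRegistry named registry, and mocked model clients primaryLLM / fallbackLLM
    // whose call(...) returns the completion text.
    @Test
    void circuitBreakerOpensAfterFiveFailures() {
        // Unchecked exception: Mockito rejects checked exceptions the stubbed method does not declare
        when(primaryLLM.call(any())).thenThrow(new ResourceAccessException("Connection refused"));
        when(fallbackLLM.call(any())).thenReturn("fallback answer");

        // answer() returns a CompletableFuture; failures are absorbed by the fallback, so nothing throws
        for (int i = 0; i < 5; i++) {
            assertThat(service.answer("test").join()).isEqualTo("fallback answer");
        }

        // Circuit should now be open (requires minimumNumberOfCalls of 5 or lower for this breaker)
        assertThat(registry.circuitBreaker("openai-primary").getState())
            .isEqualTo(CircuitBreaker.State.OPEN);

        // Further calls skip the primary and go straight to the fallback
        clearInvocations(primaryLLM);
        assertThat(service.answer("what happens now?").join()).isNotNull();
        verify(primaryLLM, never()).call(any());
        verify(fallbackLLM, atLeastOnce()).call(any());
    }

    @Test
    void gracefulDegradationWhenBothModelsDown() {
        when(primaryLLM.call(any())).thenThrow(new ResourceAccessException("primary down"));
        when(fallbackLLM.call(any())).thenThrow(new ResourceAccessException("fallback down"));

        // Assumes the fallback path ends in DegradationStrategy.getDegradedResponse
        String response = service.answer("help me").join();

        // Should return degraded response, not throw
        assertThat(response).contains("having trouble");
        assertThat(response).doesNotContain("500");
    }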
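    For failure-mode tests that do not need a mocking framework, a scripted stub is often enough. The sketch below is one hypothetical shape for it: `FlakyModelStub` models the LLM client as a plain function that fails a fixed number of times before recovering, which lets a test exercise retries, fallbacks, and recovery deterministically.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Test double that fails the first N calls, then answers normally. */
final class FlakyModelStub implements Function<String, String> {

    private final int failuresBeforeSuccess;
    private final AtomicInteger calls = new AtomicInteger();

    FlakyModelStub(int failuresBeforeSuccess) {
        this.failuresBeforeSuccess = failuresBeforeSuccess;
    }

    @Override
    public String apply(String prompt) {
        if (calls.incrementAndGet() <= failuresBeforeSuccess) {
            throw new IllegalStateException("simulated provider outage");
        }
        return "answer to: " + prompt;
    }

    /** Total number of calls received, including the failed ones. */
    int callCount() {
        return calls.get();
    }
}
```

    Asserting on `callCount()` is what catches the subtle bugs: a retry policy that never retries, or a circuit that keeps calling a provider it should have stopped calling.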

    Key Takeaways

    • Assume your LLM provider will fail — build the fallback chain before you need it, not during an incident
    • Circuit breakers prevent cascade failures by stopping calls to a failing service for a defined period
    • The fallback chain should be: primary → smaller model → cached response → static message
    • Bulkheads isolate AI components so slow LLM calls don't starve unrelated features of threads
    • Retry with exponential backoff + jitter — never linear retries. Jitter prevents thundering herd at provider recovery
    • Test every failure mode before production. An untested fallback is a false safety net
    • Graceful degradation is a product decision as much as a technical one — agree with stakeholders on acceptable degraded behaviour
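    The backoff-with-jitter takeaway can be sketched in a few lines. This is one common variant ("full jitter", where the delay is drawn uniformly between zero and the exponential ceiling), shown as an assumption for illustration; it is not the formula Resilience4j uses internally, and the `Backoff` class and its parameters are hypothetical.

```java
import java.util.Random;

final class Backoff {

    /** Exponential ceiling for a 1-based attempt: base * multiplier^(attempt - 1), capped. */
    static long ceilingMillis(int attempt, long baseMillis, double multiplier, long capMillis) {
        double raw = baseMillis * Math.pow(multiplier, attempt - 1);
        return (long) Math.min(raw, (double) capMillis);
    }

    /** Full jitter: a uniformly random delay in [0, ceiling). */
    static long delayMillis(int attempt, long baseMillis, double multiplier,
                            long capMillis, Random rng) {
        long ceiling = ceilingMillis(attempt, baseMillis, multiplier, capMillis);
        return (long) (rng.nextDouble() * ceiling);
    }
}
```

    With a 1 second base and a multiplier of 2, the ceilings for attempts 1, 2, and 3 are 1 s, 2 s, and 4 s. Because each client draws its own random delay below that ceiling, clients that failed at the same moment do not all retry at the same moment when the provider comes back.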

    Practice Exercises

    Exercise 1 — Starter (1 hour): Add Resilience4j to an existing Spring Boot LLM service. Configure a circuit breaker with a 30-second wait time, a 50% failure threshold, and minimumNumberOfCalls set to 5 so the breaker can trip after five calls. Write a test that verifies the circuit opens after 5 consecutive failures. Confirm the fallback method is called.

    Exercise 2 — Intermediate (2–3 hours): Implement a complete fallback chain: primary GPT-4o → fallback GPT-4o-mini → cached response. Write integration tests for each tier. Then simulate provider failure by blocking the OpenAI endpoint in your hosts file and verify each fallback activates in sequence.

    Exercise 3 — Advanced (half day): Implement bulkhead isolation for 3 AI features (chat, search, summarisation). Write a load test that saturates the chat bulkhead. Verify that search and summarisation remain responsive during chat saturation. Set up Grafana panels showing thread pool utilisation per bulkhead and add alerts when queue depth exceeds 80% capacity.
