Building Fault-Tolerant AI Systems
What you'll learn: By the end of this guide you will be able to implement circuit breakers and fallback chains for LLM calls, design graceful degradation UX patterns, apply bulkhead isolation between AI components, and test failure modes systematically before they hit production.
LLM providers have outages. OpenAI's status page shows multiple incidents per quarter, and Anthropic, Groq, and other providers show a similar pattern. The question is not whether your LLM dependency will fail, but whether your system handles that failure gracefully or cascades into a full outage.
The patterns in this guide come from operating AI systems that serve production traffic 24/7 with a 99.9% uptime SLA.
The Resilience Stack
Circuit Breakers for LLM Calls
A circuit breaker stops calling a failing dependency for a defined period, preventing cascade failures and allowing the downstream service to recover.
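import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.retry.annotation.Retry;
import io.github.resilience4j.timelimiter.annotation.TimeLimiter;
import java.util.concurrent.CompletableFuture;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;

@Service
public class ResilientLLMService {

    private static final Logger log = LoggerFactory.getLogger(ResilientLLMService.class);

    private final ChatClient primaryClient;  // GPT-4o
    private final ChatClient fallbackClient; // GPT-4o-mini

    // Two ChatClient beans are assumed to exist; qualify each parameter so Spring can tell them apart
    public ResilientLLMService(ChatClient primaryClient, ChatClient fallbackClient) {
        this.primaryClient = primaryClient;
        this.fallbackClient = fallbackClient;
    }

    // Default aspect order is Retry(CircuitBreaker(TimeLimiter(method))), so the fallback
    // belongs on the outermost @Retry. A fallback on @CircuitBreaker would absorb every
    // failure before @Retry could see it, and no retry would ever happen.
    @Retry(name = "openai-retry", fallbackMethod = "fallbackAnswer") // up to 3 attempts, exponential backoff
    @CircuitBreaker(name = "openai-primary") // opens at a 50% failure rate, probes again after 30s
    @TimeLimiter(name = "openai-timeout")    // 8 second timeout on the returned future
    public CompletableFuture<String> answer(String question) {
        return CompletableFuture.supplyAsync(() ->
            primaryClient.prompt()
                .user(question)
                .call()
                .content()
        );
    }

    // Called when retries are exhausted or the circuit is open
    public CompletableFuture<String> fallbackAnswer(String question, Exception ex) {
        log.warn("Primary LLM unavailable ({}), using fallback", ex.getClass().getSimpleName());
        return CompletableFuture.supplyAsync(() ->
            fallbackClient.prompt()
                .user(question)
                .call()
                .content()
        );
    }
}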
Resilience4j configuration:
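resilience4j:
  circuitbreaker:
    instances:
      openai-primary:
        slidingWindowType: COUNT_BASED   # Default; the window counts calls, not seconds
        slidingWindowSize: 10
        minimumNumberOfCalls: 5          # Evaluate the failure rate once 5 calls are recorded
        failureRateThreshold: 50         # Open at >= 50% failures in the window
        waitDurationInOpenState: 30s     # Wait 30s before half-open
        permittedNumberOfCallsInHalfOpenState: 3
  retry:
    instances:
      openai-retry:
        maxAttempts: 3                   # Total attempts, including the first call
        waitDuration: 1s
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
        enableRandomizedWait: true       # Jitter, so clients do not retry in lockstep
        randomizedWaitFactor: 0.5
        retryExceptions:
          - java.io.IOException
          - org.springframework.web.client.ResourceAccessException
  timelimiter:
    instances:
      openai-timeout:
        timeoutDuration: 8s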
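To make the thresholds in this configuration concrete, here is a minimal count-based breaker in plain Java. It is a teaching sketch, not Resilience4j's implementation: the class name `SimpleCircuitBreaker`, its constructor parameters, and the injectable clock are all assumptions made for illustration, and the half-open state is simplified to a single probe call.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

/** Simplified count-based circuit breaker (illustrative, not Resilience4j's code). */
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;            // plays the role of slidingWindowSize
    private final int minimumCalls;          // plays the role of minimumNumberOfCalls
    private final double failureRatePercent; // plays the role of failureRateThreshold
    private final long openNanos;            // plays the role of waitDurationInOpenState
    private final LongSupplier clock;        // nanosecond clock, injectable so tests need not sleep
    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failed call
    private State state = State.CLOSED;
    private long openedAt;

    SimpleCircuitBreaker(int windowSize, int minimumCalls, double failureRatePercent,
                         Duration openFor, LongSupplier clock) {
        this.windowSize = windowSize;
        this.minimumCalls = minimumCalls;
        this.failureRatePercent = failureRatePercent;
        this.openNanos = openFor.toNanos();
        this.clock = clock;
    }

    /** Returns true if a call may proceed; moves OPEN to HALF_OPEN once the wait has elapsed. */
    synchronized boolean tryAcquire() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openNanos) {
            state = State.HALF_OPEN;
        }
        return state != State.OPEN;
    }

    /** Records the outcome of a call that was allowed through. */
    synchronized void record(boolean failed) {
        if (state == State.OPEN) {
            return; // nothing should be recorded while calls are blocked
        }
        if (state == State.HALF_OPEN) {
            // Simplification: one probe decides. Resilience4j evaluates several probe calls.
            if (failed) open(); else close();
            return;
        }
        window.addLast(failed);
        if (window.size() > windowSize) {
            window.removeFirst(); // keep only the most recent windowSize outcomes
        }
        if (window.size() >= minimumCalls) {
            long failures = window.stream().filter(f -> f).count();
            if (100.0 * failures / window.size() >= failureRatePercent) {
                open();
            }
        }
    }

    synchronized State state() { return state; }

    private void open() { state = State.OPEN; openedAt = clock.getAsLong(); window.clear(); }

    private void close() { state = State.CLOSED; window.clear(); }
}
```

With a window of 10, a minimum of 5 calls, and a 50% threshold, five consecutive failures open this breaker, while four do not because the failure rate has not yet been evaluated.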
Fallback Chain Design
Design your fallback chain based on cost/quality trade-offs:
| Tier | Model | Cost | Quality | Use when |
|---|---|---|---|---|
| Primary | GPT-4o / Claude Sonnet | High | Best | Normal operation |
| Fallback 1 | GPT-4o-mini / Claude Haiku | Low | Good | Primary degraded |
| Fallback 2 | Cached response | None | Stale | Both models down |
| Fallback 3 | Static message | None | Minimal | Complete outage |
The fallback chain must be tested regularly — a fallback you haven't tested is a fallback that won't work when you need it.
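One way to express the four tiers in the table is an ordered list of handlers that are tried in turn. The sketch below is plain Java (16 or later, for records) and assumes hypothetical names: `FallbackChain`, `Tier`, and the convention that a handler returns `null` to signal "nothing to offer", such as a cache miss. In a real service each handler would wrap a model client or cache lookup.

```java
import java.util.List;
import java.util.function.Function;

/** Tries each tier in order and returns the first usable answer. */
class FallbackChain {
    /** One tier: a label for logging plus the function that produces an answer. */
    record Tier(String name, Function<String, String> handler) {}

    private final List<Tier> tiers;
    private final String staticMessage;

    FallbackChain(List<Tier> tiers, String staticMessage) {
        this.tiers = List.copyOf(tiers);
        this.staticMessage = staticMessage;
    }

    String answer(String question) {
        for (Tier tier : tiers) {
            try {
                String result = tier.handler().apply(question);
                if (result != null) {
                    return result; // first tier with an answer wins
                }
                // null means this tier had nothing to offer (for example, a cache miss)
            } catch (RuntimeException e) {
                System.err.println("Tier " + tier.name() + " failed: " + e);
            }
        }
        return staticMessage; // final tier: always available, never throws
    }
}
```

Because the static message is returned outside the loop, the chain cannot throw even when every tier fails, which is the property the last row of the table depends on.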
Graceful Degradation UX
When AI is unavailable, the user experience should degrade gracefully — not break entirely.
Pattern 1: Cached response — return the last known-good response for the same query hash. Stale is better than nothing for most queries.
Pattern 2: Template response — return a structured but non-AI response: "I'm having trouble answering right now. Here are the top 3 related articles: [links]."
Pattern 3: Queue + async — accept the request, queue it, process when AI recovers, notify the user. Appropriate for non-real-time queries.
import java.util.List;
import java.util.Optional;
import org.springframework.stereotype.Service;

// ResponseCache, ArticleRepository and Article are application types defined elsewhere
@Service
public class DegradationStrategy {

    private final ResponseCache cache;
    private final ArticleRepository articles;

    public DegradationStrategy(ResponseCache cache, ArticleRepository articles) {
        this.cache = cache;
        this.articles = articles;
    }

    public String getDegradedResponse(String userQuery) {
        // Pattern 1: try the cache first
        Optional<String> cached = cache.get(userQuery);
        if (cached.isPresent()) {
            return cached.get() + "\n\n_(Cached response, live AI temporarily unavailable)_";
        }
        // Pattern 2: fall back to a template with related articles
        List<Article> related = articles.findRelated(userQuery, 3);
        return buildTemplateResponse(userQuery, related);
    }

    private String buildTemplateResponse(String query, List<Article> related) {
        StringBuilder sb = new StringBuilder();
        sb.append("I'm having trouble connecting to my AI right now. ");
        sb.append("Here are the most relevant resources for your question:\n\n");
        related.forEach(a -> sb.append("- **").append(a.getTitle()).append("**\n"));
        sb.append("\nPlease try again in a few minutes.");
        return sb.toString();
    }
}
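Pattern 3 (queue + async) is not covered by the class above, so here is a minimal in-memory sketch of it. Everything in it is an assumption for illustration: the `DeferredRequestQueue` name, the ticket id returned to the caller, and the `notifier` callback. A production version would more likely sit on a durable queue so that pending requests survive a restart.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.UUID;
import java.util.function.BiConsumer;
import java.util.function.Function;

/** Holds requests accepted during an outage and replays them once the LLM is healthy. */
class DeferredRequestQueue {
    private record Pending(String ticketId, String query) {}

    private final Queue<Pending> pending = new ArrayDeque<>();
    private final int capacity;

    DeferredRequestQueue(int capacity) {
        this.capacity = capacity;
    }

    /** Accepts a request; returns a ticket id, or null when the queue is full. */
    synchronized String enqueue(String query) {
        if (pending.size() >= capacity) {
            return null; // shed load rather than grow without bound
        }
        String ticketId = UUID.randomUUID().toString();
        pending.add(new Pending(ticketId, query));
        return ticketId;
    }

    /**
     * Replays queued requests in arrival order. Stops at the first failure and keeps
     * the remaining requests queued. Returns the number of requests answered.
     */
    synchronized int drain(Function<String, String> llm, BiConsumer<String, String> notifier) {
        int processed = 0;
        while (!pending.isEmpty()) {
            Pending next = pending.peek();
            String answer;
            try {
                answer = llm.apply(next.query());
            } catch (RuntimeException e) {
                break; // provider still unhealthy; try again on the next drain
            }
            pending.remove();
            notifier.accept(next.ticketId(), answer); // e.g. email or push notification
            processed++;
        }
        return processed;
    }

    synchronized int size() {
        return pending.size();
    }
}
```

A scheduler or a circuit-breaker state-change listener could call `drain` when the primary model recovers; which trigger fits best depends on how quickly users expect their queued answer.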
Bulkhead Pattern: Isolate AI Components
Without bulkheads, a slow AI call can exhaust your entire thread pool, taking down unrelated features.
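import io.github.resilience4j.bulkhead.ThreadPoolBulkhead;
import io.github.resilience4j.bulkhead.ThreadPoolBulkheadConfig;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Separate thread pools for each AI component (the class name is illustrative)
@Configuration
public class AiBulkheadConfig {

    @Bean
    public ThreadPoolBulkhead aiChatBulkhead() {
        return ThreadPoolBulkhead.of("ai-chat",
            ThreadPoolBulkheadConfig.custom()
                .maxThreadPoolSize(20)
                .coreThreadPoolSize(10)
                .queueCapacity(50)
                .build()
        );
    }

    @Bean
    public ThreadPoolBulkhead aiSearchBulkhead() {
        return ThreadPoolBulkhead.of("ai-search",
            ThreadPoolBulkheadConfig.custom()
                .maxThreadPoolSize(10)
                .coreThreadPoolSize(5)
                .queueCapacity(25)
                .build()
        );
    }
}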
Provided each feature submits its LLM calls through its own bulkhead, a slow chat response cannot take threads away from search, and vice versa.
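The isolation property can be demonstrated without any library. The sketch below builds a bulkhead from a plain `ThreadPoolExecutor` with a bounded queue; `SimpleBulkhead` and its method names are hypothetical, and the pool sizes are kept tiny so saturation is easy to trigger. It needs Java 9 or later for `CompletableFuture.failedFuture`.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** A bounded thread pool per feature: when it is full, new work fails fast. */
class SimpleBulkhead {
    private final ThreadPoolExecutor pool;

    SimpleBulkhead(int threads, int queueCapacity) {
        // Fixed number of threads plus a bounded queue; AbortPolicy rejects once both are full
        this.pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy());
    }

    /** Runs the task on this bulkhead's own threads, or returns a failed future when saturated. */
    <T> CompletableFuture<T> submit(Supplier<T> task) {
        try {
            return CompletableFuture.supplyAsync(task, pool);
        } catch (RejectedExecutionException e) {
            return CompletableFuture.failedFuture(e);
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

With one bulkhead for chat and another for search, filling the chat pool and its queue causes further chat submissions to fail immediately, while search keeps its own threads and queue untouched. Failing fast is a deliberate choice here: a rejected call can go straight to the degraded response instead of waiting behind slow work.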
Testing Failure Modes
Never find out about resilience gaps during a real outage. Test proactively:
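// Assumes JUnit 5, Mockito and AssertJ. primaryLLM and fallbackLLM are mocks of a simplified
// client with a single call(...) method, wired into the service under test.
@Test
void circuitBreakerOpensAfterFiveFailures() {
    // Requires minimumNumberOfCalls <= 5 on the breaker; otherwise the failure rate
    // is not evaluated until the 10-call window is full.
    // Use an unchecked exception: Mockito rejects checked exceptions the method does not declare.
    when(primaryLLM.call(any())).thenThrow(new ResourceAccessException("Connection refused"));
    when(fallbackLLM.call(any())).thenReturn("fallback answer");

    // Each failure is absorbed by the fallback, so answer() completes normally
    for (int i = 0; i < 5; i++) {
        assertThat(service.answer("test").join()).isEqualTo("fallback answer");
    }

    // Circuit should now be open: the primary is skipped and the fallback answers
    clearInvocations(primaryLLM);
    String response = service.answer("what happens now?").join();
    assertThat(response).isEqualTo("fallback answer");
    verify(primaryLLM, never()).call(any());
    verify(fallbackLLM, atLeastOnce()).call(any());
}

@Test
void gracefulDegradationWhenBothModelsDown() {
    when(primaryLLM.call(any())).thenThrow(new ResourceAccessException("primary down"));
    when(fallbackLLM.call(any())).thenThrow(new ResourceAccessException("fallback down"));

    // Assumes the fallback path ends in DegradationStrategy.getDegradedResponse(...) when the
    // fallback model also fails; ResilientLLMService as shown earlier does not wire that in.
    String response = service.answer("help me").join();

    // Should return degraded response, not throw
    assertThat(response).contains("having trouble");
    assertThat(response).doesNotContain("500");
}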
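For tests that do not need a mocking framework, a small hand-written test double can inject failures deterministically. The sketch below is one possible shape; `FlakySupplier` is a hypothetical helper, not part of JUnit or Mockito.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Test double that throws for the first N calls, then returns a fixed value. */
class FlakySupplier<T> implements Supplier<T> {
    private final int failuresBeforeSuccess;
    private final T value;
    private final AtomicInteger calls = new AtomicInteger();

    FlakySupplier(int failuresBeforeSuccess, T value) {
        this.failuresBeforeSuccess = failuresBeforeSuccess;
        this.value = value;
    }

    @Override
    public T get() {
        int n = calls.incrementAndGet();
        if (n <= failuresBeforeSuccess) {
            throw new IllegalStateException("injected failure #" + n);
        }
        return value;
    }

    /** Total number of invocations so far, including the failed ones. */
    int calls() {
        return calls.get();
    }
}
```

Setting the failure count just below and just above a retry or circuit-breaker threshold gives two tests that pin down exactly where the behaviour changes.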
Key Takeaways
- Assume your LLM provider will fail — build the fallback chain before you need it, not during an incident
- Circuit breakers prevent cascade failures by stopping calls to a failing service for a defined period
- The fallback chain should be: primary → smaller model → cached response → static message
- Bulkheads isolate AI components so slow LLM calls don't starve unrelated features of threads
- Retry with exponential backoff + jitter — never linear retries. Jitter prevents thundering herd at provider recovery
- Test every failure mode before production. An untested fallback is a false safety net
- Graceful degradation is a product decision as much as a technical one — agree with stakeholders on acceptable degraded behaviour
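The backoff-with-jitter takeaway comes down to a few lines of arithmetic. The sketch below uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling. The `Backoff` class and its parameter names are assumptions for illustration; Resilience4j computes its own intervals from the retry configuration.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Exponential backoff with full jitter (illustrative helper). */
class Backoff {
    /** Upper bound for a 1-based attempt: base * multiplier^(attempt - 1), capped at maxMillis. */
    static long ceilingMillis(int attempt, long baseMillis, double multiplier, long maxMillis) {
        double raw = baseMillis * Math.pow(multiplier, attempt - 1);
        return (long) Math.min(raw, maxMillis);
    }

    /** Full jitter: a uniformly random delay between 0 and the ceiling, inclusive. */
    static long delayWithJitterMillis(int attempt, long baseMillis, double multiplier, long maxMillis) {
        long ceiling = ceilingMillis(attempt, baseMillis, multiplier, maxMillis);
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }
}
```

With a 1 second base and a multiplier of 2, the ceilings for attempts 1, 2 and 3 are 1000 ms, 2000 ms and 4000 ms. Because every client picks a different point below the ceiling, retries spread out instead of arriving together at the moment the provider recovers.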
Practice Exercises
Exercise 1 — Starter (1 hour): Add Resilience4j to an existing Spring Boot LLM service. Configure a circuit breaker with a 30-second wait time, a 50% failure threshold, and minimumNumberOfCalls set to 5 so the failure rate is evaluated after five calls. Write a unit test that verifies the circuit opens after 5 consecutive failures. Confirm the fallback method is called.
Exercise 2 — Intermediate (2–3 hours): Implement a complete fallback chain: primary GPT-4o → fallback GPT-4o-mini → cached response. Write integration tests for each tier. Then simulate provider failure by blocking the OpenAI endpoint in your hosts file and verify each fallback activates in sequence.
Exercise 3 — Advanced (half day): Implement bulkhead isolation for 3 AI features (chat, search, summarisation). Write a load test that saturates the chat bulkhead. Verify that search and summarisation remain responsive during chat saturation. Set up Grafana panels showing thread pool utilisation per bulkhead and add alerts when queue depth exceeds 80% capacity.