Zero to Production: Building Your First Enterprise LLM Application
What you'll learn: By the end of this guide you will have a four-phase mental model for taking an LLM application to production, know what to build by day 2, day 7, day 30, and day 90, and be able to implement circuit breakers, prompt caching, model routing, and evaluation pipelines.
Every enterprise LLM project I've seen starts the same way. A developer spends a weekend with the OpenAI API. The demo wows the stakeholders. Three months later, the team is fighting hallucinations, runaway costs, 3am paging incidents, and a product that works beautifully in the office and embarrassingly in production.
The gap between "it works in the notebook" and "it works at 3am when the CTO is watching" is not a gap in model capability — it is a gap in engineering discipline.
This post walks through the four phases I use to take any LLM application from prototype to production. I've applied this pattern across AI platforms serving tens of thousands of enterprise users.
Phase 1 — Day 1: The Right Kind of Prototype
Most prototypes are too clever. Engineers reach for LangChain, agents, and tool orchestration before validating whether the core LLM interaction is actually useful. Resist that.
On day one, build the simplest thing that proves the value:
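    import org.springframework.ai.chat.client.ChatClient;
    import org.springframework.stereotype.Service;

    @Service
    public class LLMService {

        private final ChatClient chatClient;

        // Spring AI supplies a ChatClient.Builder for the configured provider
        public LLMService(ChatClient.Builder builder) {
            this.chatClient = builder
                .defaultSystem("""
                    You are an enterprise support assistant for Acme Corp.
                    Answer questions about our products concisely and accurately.
                    If you are not certain, say so. Do not guess.
                    """)
                .build();
        }

        // One blocking call: user question in, model text out
        public String answer(String question) {
            return chatClient.prompt()
                .user(question)
                .call()
                .content();
        }
    }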
That is your entire application on day one. No vector database. No agents. No framework abstractions. One LLM call.
What you must get right on day one:
- System prompt structure: write it as a contract, not a suggestion. Cover role, constraints, output format, and a fallback instruction.
- Error handling: every LLM call can fail. Set a timeout and handle provider errors from the start; Phase 3 hardens this with a circuit breaker.
- An evaluation set: before writing more code, collect 20–30 representative questions and their correct answers. This becomes your regression test suite for every change.
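The evaluation set can start as a plain list and a scoring loop. The sketch below is one minimal way to get a baseline number. `EvalCase`, `EvalRunner`, and the substring-match scoring rule are assumptions made for illustration, not part of any framework, and a real eval will usually need a stricter grader.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

/** One evaluation case: a question plus a phrase a correct answer must contain. */
record EvalCase(String question, String expectedPhrase) {}

class EvalRunner {
    /** Returns the fraction of cases whose answer contains the expected phrase (case-insensitive). */
    static double accuracy(List<EvalCase> cases, Function<String, String> answerFn) {
        if (cases.isEmpty()) {
            return 0.0; // no cases means no evidence, so report zero rather than divide by zero
        }
        long passed = cases.stream()
            .filter(c -> {
                String answer = answerFn.apply(c.question());
                return answer != null
                    && answer.toLowerCase(Locale.ROOT)
                             .contains(c.expectedPhrase().toLowerCase(Locale.ROOT));
            })
            .count();
        return (double) passed / cases.size();
    }
}
```

With the service above, a baseline would be `EvalRunner.accuracy(cases, llmService::answer)`. Passing the answer function as a parameter means the same cases can later score the RAG and routed versions unchanged.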
Figure 1: Day 1 architecture — deliberately minimal
Phase 2 — Day 7: Add Knowledge with RAG
Hallucination is rarely something you can fix with prompt tweaks alone; it is usually an architectural gap. The model does not know your product documentation, your internal policies, or anything that happened after its training cutoff. The fix is Retrieval-Augmented Generation (RAG).
RAG gives the model access to a searchable knowledge store at inference time. The model is not memorising your documents; it reads the relevant sections at the moment of the question.
Ingestion pipeline (run once, or on document update):
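    import org.springframework.ai.document.Document;
    import org.springframework.ai.transformer.splitter.TokenTextSplitter;
    import org.springframework.ai.vectorstore.VectorStore;

    @Service
    public class DocumentIngestionService {
        private final VectorStore vectorStore;
        // Constructor arguments differ between Spring AI releases; check the signature for your version
        private final TokenTextSplitter splitter = new TokenTextSplitter(512, 64);

        public DocumentIngestionService(VectorStore vectorStore) {
            this.vectorStore = vectorStore; // constructor injection initialises the final field
        }

        public void ingest(String content, Map<String, Object> metadata) {
            List<Document> chunks = splitter.apply(
                List.of(new Document(content, metadata))
            );
            vectorStore.add(chunks);
        }
    }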
Query pipeline (every user request):
    public String answerWithContext(String question) {
        // Retrieve the 5 most similar chunks. This is the milestone-era Spring AI API;
        // later releases use SearchRequest.builder() and Document.getText(), so check your version.
        List<Document> context = vectorStore.similaritySearch(
            SearchRequest.query(question).withTopK(5)
        );
        // Join chunks with a visible separator so the model can tell them apart
        String contextText = context.stream()
            .map(Document::getContent)
            .collect(Collectors.joining("\n\n---\n\n"));
        return chatClient.prompt()
            .system("""
                Answer ONLY from the provided context.
                If the answer is not in the context, say:
                "I don't have information about this."
                Never supplement with information outside the context.
                """)
            .user(u -> u.text("Context:\n{ctx}\n\nQuestion: {q}")
                .param("ctx", contextText)
                .param("q", question))
            .call()
            .content();
    }
Figure 2: RAG pipeline — ingestion (top) runs independently from query (bottom)
Use pgvector first. It runs inside your existing PostgreSQL instance. No new infrastructure, no new ops burden. Move to a dedicated vector store (Pinecone, Weaviate) only when you've outgrown it.
Phase 3 — Day 30: Production Hardening
Your prototype is now delivering real value. Users are using it. Which means it will fail in ways you did not anticipate. This phase is about building the guardrails before they are needed.
Circuit Breaker + Fallback Model
Never let a single LLM provider outage take your entire product down:
    @CircuitBreaker(name = "openai", fallbackMethod = "fallbackAnswer")
    @RateLimiter(name = "openai")
    public String answer(String question) {
        return primaryClient.answer(question);
    }

    // Falls back to a secondary model automatically
    public String fallbackAnswer(String question, Exception ex) {
        log.warn("Primary LLM unavailable, using fallback: {}", ex.getMessage());
        return fallbackClient.answer(question);
    }
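Configure the circuit breaker to open after roughly 5 failures in 10 seconds, stay open for 30 seconds, then let a trial call through. Resilience4j expresses this as a failure-rate threshold over a sliding window (failureRateThreshold, slidingWindowSize, minimumNumberOfCalls) plus waitDurationInOpenState, rather than as an absolute failure count.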
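To make the open, wait, retry cycle concrete, here is a library-free sketch of the state machine. It is not how Resilience4j is implemented: `SimpleCircuitBreaker` is a hypothetical class, it counts consecutive failures instead of a failure rate over a time window, and it holds a lock for the duration of the call, which is acceptable for illustration but not for production traffic.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Minimal circuit breaker sketch: CLOSED -> OPEN -> HALF_OPEN -> CLOSED. */
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;  // consecutive failures before opening
    private final long openMillis;       // how long to reject calls once open
    private final LongSupplier clock;    // injectable time source, eases testing
    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** Runs primary unless the breaker is open; any failure is answered by fallback. */
    synchronized <T> T call(Supplier<T> primary, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (clock.getAsLong() - openedAt < openMillis) {
                return fallback.get();      // still open: skip the primary entirely
            }
            state = State.HALF_OPEN;        // wait elapsed: allow one trial call
        }
        try {
            T result = primary.get();
            failures = 0;                   // success closes the breaker
            state = State.CLOSED;
            return result;
        } catch (RuntimeException ex) {
            failures++;
            if (state == State.HALF_OPEN || failures >= failureThreshold) {
                state = State.OPEN;         // trip (or re-trip) and restart the wait
                openedAt = clock.getAsLong();
            }
            return fallback.get();
        }
    }

    synchronized State state() { return state; }
}
```

A breaker matching the numbers above would be `new SimpleCircuitBreaker(5, 30_000L, System::currentTimeMillis)`, called as `breaker.call(() -> primaryClient.answer(q), () -> fallbackClient.answer(q))`. While the breaker is open the primary provider is never contacted, which is what protects you from piling requests onto a provider that is already failing.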
Prompt Caching
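Many enterprise users ask similar questions. Two layers help: provider-side prompt caching (for example Anthropic's) discounts the repeated static prefix of each request, and a Redis response cache skips the LLM call entirely for repeated questions. Together they can cut costs by 60–80% on workloads with heavy repetition: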
    // Response cache: check before calling the LLM
    public String answerWithCache(String question) {
        // Key derived from the question's embedding. Hashing a raw embedding only matches
        // near-identical text; true semantic matching needs a similarity lookup.
        String cacheKey = semanticCacheKey(question);
        String cached = redis.get(cacheKey);
        if (cached != null) return cached;
        String answer = answerWithContext(question);
        redis.setex(cacheKey, 3600, answer); // cache for 1 hour
        return answer;
    }
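A cache keyed on a hash only hits when two questions produce the same key. To match paraphrases, a semantic cache usually compares embeddings by similarity instead. The following in-memory sketch shows that lookup under stated assumptions: `SemanticCache` is a hypothetical class, embeddings are supplied by the caller, the scan is linear, and the similarity threshold is a value you would tune on your own traffic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** In-memory semantic cache sketch: returns a stored answer when a query embedding is close enough. */
class SemanticCache {
    private record Entry(double[] embedding, String answer) {}

    private final List<Entry> entries = new ArrayList<>();
    private final double threshold; // minimum cosine similarity that counts as a hit

    SemanticCache(double threshold) {
        this.threshold = threshold;
    }

    void put(double[] embedding, String answer) {
        entries.add(new Entry(embedding.clone(), answer));
    }

    /** Returns the answer of the most similar stored question, if it clears the threshold. */
    Optional<String> get(double[] queryEmbedding) {
        Entry best = null;
        double bestScore = threshold;
        for (Entry e : entries) {
            double score = cosine(e.embedding(), queryEmbedding);
            if (score >= bestScore) {
                best = e;
                bestScore = score;
            }
        }
        return Optional.ofNullable(best).map(Entry::answer);
    }

    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("embedding dimensions differ");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // a zero vector has no direction, so treat it as dissimilar
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

A threshold set too low returns answers to questions the user did not ask, so it is worth checking cache hits against the evaluation set before trusting them. At larger scale the linear scan would be replaced by a vector index.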
Structured Observability
Every LLM interaction should produce a structured log entry:
    {
      "traceId": "abc-123",
      "userId": "u-456",
      "inputTokens": 1847,
      "outputTokens": 312,
      "latencyMs": 1240,
      "model": "gpt-4o",
      "cacheHit": false,
      "costUsd": 0.0124
    }
Track these metrics in Grafana. Alert when cost per request exceeds your budget, when latency p95 exceeds 3 seconds, or when error rate exceeds 1%.
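The cost field and the alert conditions are simple arithmetic, sketched here in plain Java. `LlmCallLog` and `AlertRules` are hypothetical names, the per-million-token prices are parameters you would fill in from your provider's current price list, and in practice the alert rules would live in your monitoring system, not in application code.

```java
import java.util.ArrayList;
import java.util.List;

/** Fields mirror the structured log entry; cost is derived from token counts. */
record LlmCallLog(String model, int inputTokens, int outputTokens, long latencyMs, boolean cacheHit) {
    /** Cost in USD given per-million-token prices supplied by the caller. */
    double costUsd(double inputPricePerMillion, double outputPricePerMillion) {
        return inputTokens / 1_000_000.0 * inputPricePerMillion
             + outputTokens / 1_000_000.0 * outputPricePerMillion;
    }
}

class AlertRules {
    /** Returns the names of the thresholds that were breached, in a fixed order. */
    static List<String> check(double costPerRequestUsd, double p95LatencyMs,
                              double errorRate, double costBudgetUsd) {
        List<String> alerts = new ArrayList<>();
        if (costPerRequestUsd > costBudgetUsd) alerts.add("cost");
        if (p95LatencyMs > 3000) alerts.add("latency");   // p95 above 3 seconds
        if (errorRate > 0.01) alerts.add("errors");       // error rate above 1%
        return alerts;
    }
}
```

Computing cost at log time, with the price that applied when the call was made, keeps historical figures stable if prices change later.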
Phase 4 — Day 90: Scale and Optimise
Your system is stable. Now make it economical and scalable.
Model Routing
Not every question needs GPT-4o. Build a router that sends simple questions to a cheaper, faster model:
    public String routedAnswer(String question) {
        // Classify complexity; use a tiny model for this
        boolean isComplex = complexityClassifier.isComplex(question);
        return isComplex
            ? smartClient.answer(question)   // GPT-4o or Claude Sonnet
            : fastClient.answer(question);   // GPT-4o-mini or Claude Haiku
    }
In practice, 60–70% of questions in most enterprise support applications can be handled by the cheaper model with little or no measurable quality loss. Confirm this against your own evaluation set before routing live traffic.
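The saving from routing follows directly from the routed share and the price gap between the two models. This sketch works through the arithmetic. `RoutingCost` is a hypothetical helper, the per-query costs are inputs you supply, and it assumes both models consume about the same number of tokens per query.

```java
class RoutingCost {
    /** Blended cost per query when cheapShare of traffic goes to the cheap model. */
    static double blendedCost(double cheapShare, double cheapCost, double expensiveCost) {
        if (cheapShare < 0 || cheapShare > 1) {
            throw new IllegalArgumentException("cheapShare must be between 0 and 1");
        }
        return cheapShare * cheapCost + (1 - cheapShare) * expensiveCost;
    }

    /** Fractional saving versus sending every query to the expensive model. */
    static double saving(double cheapShare, double cheapCost, double expensiveCost) {
        return 1 - blendedCost(cheapShare, cheapCost, expensiveCost) / expensiveCost;
    }
}
```

For example, routing 70% of queries to a model that costs a tenth as much gives `saving(0.7, 1.0, 10.0)`, a 63% reduction. The saving can never exceed the routed share, so a 70% routing rate caps the reduction at 70% even if the cheap model were free.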
Evaluation Pipeline
Every two weeks, run your evaluation set against the live system and score each answer. Track quality score over time. A drop in quality score is an early warning that something has regressed — a prompt change, a model update, or document drift.
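Tracking the score over time only helps if something flags the drop. One minimal way to do that is to compare the latest run against the average of earlier runs. `QualityTrend` and the tolerance value are assumptions for illustration; a real pipeline might use a fixed baseline or a statistical test instead.

```java
import java.util.List;

class QualityTrend {
    /** True when the latest score falls more than tolerance below the mean of earlier runs. */
    static boolean hasRegressed(List<Double> scores, double tolerance) {
        if (scores.size() < 2) {
            return false; // nothing to compare against yet
        }
        double latest = scores.get(scores.size() - 1);
        double baseline = scores.subList(0, scores.size() - 1).stream()
            .mapToDouble(Double::doubleValue)
            .average()
            .orElse(latest);
        return baseline - latest > tolerance;
    }
}
```

With scores of 0.90, 0.92, 0.91 and then 0.80, `hasRegressed(scores, 0.05)` returns true, which is the point at which to look for a prompt change, a model update, or document drift.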
Figure 3: Day 90 production architecture — semantic cache, model routing, RAG, circuit breaker, observability
What Not To Do
These are the mistakes I see most often on first enterprise LLM projects:
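Leaving vector search on the transactional database as traffic grows. pgvector inside your existing PostgreSQL is a fine starting point, but retrieval load scales differently from transactional load. Plan to move it to its own instance before it competes with production queries.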
Skipping evals. "It looks good" is not an evaluation. You need numbers. Build the eval set before you write production code, not after the first incident.
No fallback model. LLM providers have outages. OpenAI has status page incidents multiple times per quarter. A fallback is not optional.
Logging just the response. Log input tokens, output tokens, latency, cost, and model. You will need this data when your LLM bill is higher than expected.
Overcomplicated day-one architecture. Adding agents, multi-step chains, and custom tool orchestration before you have proven the core value is engineering debt disguised as ambition.
Summary: The Four-Phase Checklist
| Phase | Timeline | What to build |
|---|---|---|
| Prototype | Day 1–2 | Single LLM call · system prompt · eval set · error handling |
| RAG | Day 3–7 | Ingestion pipeline · pgvector · context injection · grounding |
| Hardening | Day 8–30 | Circuit breaker · prompt cache · rate limiting · observability |
| Scale | Day 31–90 | Model routing · semantic cache · eval pipeline · cost dashboard |
The engineers who build the best enterprise AI systems are the ones who treat LLMs like any other external dependency: unreliable, expensive, and absolutely worth abstracting behind a well-designed service boundary.
Start simple. Measure everything. Add complexity only when your data tells you to.
Key Takeaways
- Day 1: one LLM call, a system prompt, and an eval set. Nothing else. Validate the core value before adding infrastructure
- Circuit breakers, plus retries with exponential backoff, are not optional: LLM providers have outages multiple times per quarter
- Prompt caching (static prefix first, dynamic content last) combined with a response cache can cut costs by 60–80% on repetitive workloads; implement by Day 30
- Model routing: send the 60–70% of queries that are simple to the cheap model and the rest to the expensive one. The saving is bounded by the routed share (at most about 70%), so measure quality on your eval set before and after
- Evaluation is the most important investment — without a quality score, you fly blind through every change
- The four-phase checklist (Day 1 → 7 → 30 → 90) maps directly to prototype → RAG → hardening → optimisation
- "It looks good" is not an evaluation. You need numbers before shipping to production
Practice Exercises
Exercise 1 — Starter (2 hours): Build the Day 1 prototype. Use Spring AI or LangChain to create a single-endpoint LLM service with a well-structured system prompt. Write an evaluation set of 20 questions with expected answers. Run your eval and get a baseline accuracy score before writing any more code.
Exercise 2 — Intermediate (1 day): Add RAG to your Day 1 prototype. Ingest 50 documents into pgvector. Measure how accuracy improves vs. the baseline. Then add a circuit breaker with Resilience4j and test it by pointing your service at a non-existent endpoint. Verify the fallback model activates correctly.
Exercise 3 — Advanced (2–3 days): Implement the full Day 90 architecture. Add semantic response caching with Redis, model routing based on a complexity classifier, and a bi-weekly eval pipeline that runs automatically. Deploy with a canary at 10% traffic and set up an alert that auto-rolls-back if error rate exceeds 0.5% during the canary period.
Related reading
- RAG Systems Explained — the retrieval layer that turns a demo into a useful enterprise app.