    June 2, 2026

    Zero to Production: Building Your First Enterprise LLM Application

    A four-phase guide to taking an LLM prototype to a production enterprise app — RAG, caching, observability, cost control, and multi-model routing.

    What you'll learn: By the end of this guide you will have a 4-phase mental model for taking any LLM application to production, know exactly what to build in the first 48 hours vs. 30 days vs. 90 days, and be able to implement circuit breakers, prompt caching, model routing, and evaluation pipelines.

    Every enterprise LLM project I've seen starts the same way. A developer spends a weekend with the OpenAI API. The demo wows the stakeholders. Three months later, the team is fighting hallucinations, runaway costs, 3am paging incidents, and a product that works beautifully in the office and embarrassingly in production.

    The gap between "it works in the notebook" and "it works at 3am when the CTO is watching" is not a gap in model capability — it is a gap in engineering discipline.

    This post walks through the four phases I use to take any LLM application from prototype to production. I've applied this pattern across AI platforms serving tens of thousands of enterprise users.


    Phase 1 — Day 1: The Right Kind of Prototype

    Most prototypes are too clever. Engineers reach for LangChain, agents, and tool orchestration before validating whether the core LLM interaction is actually useful. Resist that.

    On day one, build the simplest thing that proves the value:

    @Service
    public class LLMService {
    
        private final ChatClient chatClient;
    
        public LLMService(ChatClient.Builder builder) {
            this.chatClient = builder
                .defaultSystem("""
                    You are an enterprise support assistant for Acme Corp.
                    Answer questions about our products concisely and accurately.
                    If you are not certain, say so — do not guess.
                    """)
                .build();
        }
    
        public String answer(String question) {
            return chatClient.prompt()
                .user(question)
                .call()
                .content();
        }
    }

    That is your entire application on day one. No vector database. No agents. No framework abstractions. One LLM call.

    What you must get right on day one:

    1. System prompt structure — write it as if it were a contract, not a suggestion. Role, constraints, output format, fallback instruction.
    2. Error handling — every LLM call can fail. Wrap it in a circuit breaker from the start.
    3. An evaluation set — before writing more code, collect 20–30 representative questions and their correct answers. This becomes your regression test suite for every change.

    flowchart LR
        U([User]) --> API[Spring Boot\nREST API]
        API --> LLM[OpenAI GPT-4o\nChatClient]
        LLM --> API
        API --> LOG[Structured\nLogger]
        API --> U
        style U fill:#0d2d3a,stroke:#00e5ff,color:#fff
        style LLM fill:#1a1a3a,stroke:#9b59b6,color:#fff
        style LOG fill:#2d1a0d,stroke:#e67e22,color:#fff

    Figure 1: Day 1 architecture — deliberately minimal
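
The evaluation set from item 3 can start as something this small. This is a minimal sketch, not a full eval framework: `EvalHarness` and its substring scorer are hypothetical names and choices of mine, and the `answerer` function stands in for whatever calls your LLM service.

```java
import java.util.Map;
import java.util.function.Function;

// Minimal regression harness: each case maps a question to a phrase the
// answer must contain. Substring matching is a crude, assumed scorer.
class EvalHarness {

    // Returns the fraction of cases (0.0 to 1.0) whose answer contains the expected phrase.
    static double accuracy(Map<String, String> cases, Function<String, String> answerer) {
        if (cases.isEmpty()) {
            throw new IllegalArgumentException("evaluation set is empty");
        }
        int passed = 0;
        for (Map.Entry<String, String> c : cases.entrySet()) {
            String answer = answerer.apply(c.getKey());
            if (answer != null
                    && answer.toLowerCase().contains(c.getValue().toLowerCase())) {
                passed++;
            }
        }
        return (double) passed / cases.size();
    }
}
```

Passing `llmService::answer` as the `answerer` gives a baseline number you can re-run after every prompt or model change. Substring matching will miss correct paraphrases, so treat the score as a floor and swap in a stricter or model-graded scorer once the set grows.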


    Phase 2 — Day 7: Add Knowledge with RAG

    In enterprise applications, most hallucination is less a model defect than an architectural gap. The model does not know your product documentation, your internal policies, or anything that happened after its training cutoff. The fix is Retrieval-Augmented Generation (RAG).

    RAG gives the model access to a searchable knowledge store at inference time. The model is not memorising your documents — it is reading the relevant sections at the moment of the question.

    Ingestion pipeline (run once, or on document update):

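    import java.util.List;
    import java.util.Map;
    import org.springframework.ai.document.Document;
    import org.springframework.ai.transformer.splitter.TokenTextSplitter;
    import org.springframework.ai.vectorstore.VectorStore;
    import org.springframework.stereotype.Service;
    
    @Service
    public class DocumentIngestionService {
    
        private final VectorStore vectorStore;
        // 512-token chunks; the other arguments are the library defaults. Check the constructor for
        // your Spring AI version. As far as I know the stock splitter takes no overlap argument.
        private final TokenTextSplitter splitter = new TokenTextSplitter(512, 350, 5, 10000, true);
    
        public DocumentIngestionService(VectorStore vectorStore) {
            this.vectorStore = vectorStore;
        }
    
        public void ingest(String content, Map<String, Object> metadata) {
            List<Document> chunks = splitter.apply(
                List.of(new Document(content, metadata))
            );
            vectorStore.add(chunks);
        }
    }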
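
If you want overlapping chunks, so that a sentence cut at a chunk boundary still appears whole in one of its neighbours, the windowing logic itself is short. The sketch below is an illustration under assumptions: `OverlapChunker` is a hypothetical class, and it counts whitespace-separated words rather than real tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sliding-window chunker. Words stand in for tokens here, which is an
// approximation: a real pipeline would count tokens with the model's tokenizer.
class OverlapChunker {

    static List<String> chunk(String text, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("need 0 <= overlap < chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks;
        }
        String[] words = trimmed.split("\\s+");
        int step = chunkSize - overlap; // how far each window advances
        for (int start = 0; start < words.length; start += step) {
            int end = Math.min(start + chunkSize, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
            if (end == words.length) {
                break; // the last window reached the end of the text
            }
        }
        return chunks;
    }
}
```

With `chunk(text, 512, 64)` each chunk repeats the final 64 words of the previous one. Larger overlaps improve recall at chunk boundaries but increase the number of embeddings you store and pay for.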

    Query pipeline (every user request):

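    // Written against the Spring AI 1.0 milestone API. In 1.0 GA the equivalents are
    // SearchRequest.builder().query(question).topK(5).build() and Document::getText.
    public String answerWithContext(String question) {
        List<Document> context = vectorStore.similaritySearch(
            SearchRequest.query(question).withTopK(5)
        );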
    
        String contextText = context.stream()
            .map(Document::getContent)
            .collect(Collectors.joining("\n\n---\n\n"));
    
        return chatClient.prompt()
            .system("""
                Answer ONLY from the provided context.
                If the answer is not in the context, say:
                "I don't have information about this."
                Never supplement with information outside the context.
                """)
            .user(u -> u.text("Context:\n{ctx}\n\nQuestion: {q}")
                .param("ctx", contextText)
                .param("q", question))
            .call()
            .content();
    }

    flowchart TD
        subgraph INGEST[Ingestion Pipeline - run on document update]
            D[Source Documents\nConfluence, PDF, DB] --> SP[Text Splitter\n512 tokens, 64 overlap]
            SP --> EM[Embedding Model\ntext-embedding-3-small]
            EM --> VS[(pgvector\nPostgreSQL)]
        end
        subgraph QUERY[Query Pipeline - every user request]
            Q([User Question]) --> QEM[Embed Query]
            QEM --> SR[Similarity Search\nTop-5 chunks]
            VS --> SR
            SR --> PB[Prompt Builder\nSystem + Context + Question]
            PB --> LLM[LLM Generation\nGPT-4o]
            LLM --> A([Grounded Answer])
        end
        style INGEST fill:#0d1a2d,stroke:#00e5ff,color:#fff
        style QUERY fill:#1a0d2d,stroke:#9b59b6,color:#fff
        style VS fill:#0d2d1a,stroke:#2ecc71,color:#fff
        style A fill:#1a2d0d,stroke:#2ecc71,color:#fff

    Figure 2: RAG pipeline — ingestion (top) runs independently from query (bottom)
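
To make the "Similarity Search" box concrete, here is roughly what a vector store does for you, written as a brute-force in-memory sketch. `TopKSearch` is a hypothetical class, the embeddings are assumed to come from your embedding model, and a real store such as pgvector would use an index instead of scanning every chunk.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

class TopKSearch {

    // Cosine similarity: 1.0 for the same direction, 0.0 for orthogonal vectors.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // zero vector: treat as unrelated
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the ids of the k chunks most similar to the query, best first.
    // Similarities are recomputed during the sort, which is fine for a sketch.
    static List<String> topK(double[] query, Map<String, double[]> chunks, int k) {
        List<String> ids = new ArrayList<>(chunks.keySet());
        ids.sort(Comparator.comparingDouble(
                (String id) -> cosine(query, chunks.get(id))).reversed());
        return ids.subList(0, Math.min(k, ids.size()));
    }
}
```

The chunks returned here are what gets joined into the `Context:` section of the prompt. Whether five is the right value of k is an empirical question, and your evaluation set is the way to answer it.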

    Use pgvector first. It runs inside your existing PostgreSQL instance. No new infrastructure, no new ops burden. Move to a dedicated vector store (Pinecone, Weaviate) only when you've outgrown it.


    Phase 3 — Day 30: Production Hardening

    Your prototype is now delivering real value and users depend on it, which means it will fail in ways you did not anticipate. This phase is about building the guardrails before they are needed.

    Circuit Breaker + Fallback Model

    Never let a single LLM provider outage take your entire product down:

    @CircuitBreaker(name = "openai", fallbackMethod = "fallbackAnswer")
    @RateLimiter(name = "openai")
    public String answer(String question) {
        return primaryClient.answer(question);
    }
    
    // Falls back to a secondary model automatically
    public String fallbackAnswer(String question, Exception ex) {
        log.warn("Primary LLM unavailable, using fallback: {}", ex.getMessage());
        return fallbackClient.answer(question);
    }

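    Configure the circuit breaker to open after roughly 5 failures within a 10-second window, stay open for 30 seconds, then let a trial call through. Resilience4j expresses this as a time-based sliding window (slidingWindowType, slidingWindowSize) combined with failureRateThreshold, minimumNumberOfCalls, and waitDurationInOpenState, so you set a failure rate over the window rather than a raw failure count.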
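
If the state machine behind the annotation feels opaque, it helps to see it written out. The class below is a hand-rolled illustration, not Resilience4j: `SimpleCircuitBreaker` is a hypothetical name, it trips on consecutive failures instead of a sliding-window failure rate, and the clock is injected so the behaviour can be tested without waiting.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Illustration only. Methods are synchronized for brevity, which serialises calls.
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock; // injected so tests can control time
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN; // wait elapsed: allow a trial call
        }
        return state;
    }

    // Runs primary unless the breaker is open; uses fallback on failure or when open.
    synchronized <T> T call(Supplier<T> primary, Supplier<T> fallback) {
        if (state() == State.OPEN) {
            return fallback.get();
        }
        try {
            T result = primary.get();
            consecutiveFailures = 0;
            state = State.CLOSED;
            return result;
        } catch (RuntimeException ex) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;
                openedAt = clock.getAsLong();
            }
            return fallback.get();
        }
    }
}
```

The important property is that an open breaker skips the primary call entirely, so a struggling provider is not hit with more traffic while it recovers. In production I would still reach for the library, which handles concurrency, metrics, and half-open call limits properly.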

    Prompt Caching

    Many enterprise users ask similar questions. Provider-side prompt caching (such as Anthropic's) and a Redis response cache together can cut costs by 60–80%, depending on how repetitive your traffic is:

    // Response cache: check before calling the LLM
    public String answerWithCache(String question) {
        // semanticCacheKey is a helper you supply. A plain hash only matches
        // identical (normalised) questions; matching paraphrases needs a vector lookup.
        String cacheKey = semanticCacheKey(question);
        String cached = redis.get(cacheKey);
        if (cached != null) return cached;
    
        String answer = answerWithContext(question);
        redis.setex(cacheKey, 3600, answer); // cache for 1 hour
        return answer;
    }
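
A hash of an embedding only hits when two embeddings are bit-identical, so a cache that should match paraphrases needs a similarity threshold instead. The following is a sketch under assumptions: `SemanticCache` is a hypothetical in-memory class, the 0.95 threshold in the usage note is a placeholder to tune, and expiry is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// In-memory stand-in for a vector-indexed cache. A linear scan is fine for a
// sketch; at scale the lookup would be a nearest-neighbour query.
class SemanticCache {
    private static final class Entry {
        final double[] embedding;
        final String answer;
        Entry(double[] embedding, String answer) {
            this.embedding = embedding;
            this.answer = answer;
        }
    }

    private final List<Entry> entries = new ArrayList<>();
    private final double threshold;

    SemanticCache(double threshold) {
        this.threshold = threshold;
    }

    void put(double[] embedding, String answer) {
        entries.add(new Entry(embedding.clone(), answer));
    }

    // Returns the answer of the most similar cached question, if it clears the threshold.
    Optional<String> get(double[] queryEmbedding) {
        Entry best = null;
        double bestScore = threshold;
        for (Entry e : entries) {
            double score = cosine(queryEmbedding, e.embedding);
            if (score >= bestScore) {
                best = e;
                bestScore = score;
            }
        }
        return best == null ? Optional.empty() : Optional.of(best.answer);
    }

    private static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Create it with something like `new SemanticCache(0.95)` and tune from there. Set the threshold too low and users get confidently wrong cached answers to questions that merely look similar, so validate any threshold against your evaluation set before trusting the hit rate.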

    Structured Observability

    Every LLM interaction should produce a structured log entry:

    {
      "traceId": "abc-123",
      "userId": "u-456",
      "inputTokens": 1847,
      "outputTokens": 312,
      "latencyMs": 1240,
      "model": "gpt-4o",
      "cacheHit": false,
      "costUsd": 0.0124
    }

    Track these metrics in Grafana. Alert when cost per request exceeds your budget, when p95 latency exceeds 3 seconds, or when the error rate exceeds 1%.
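
The arithmetic behind those alerts is simple enough to sketch. `LlmAlerts` is a hypothetical helper, the per-million-token prices are parameters you would fill in from your provider's current price list, and in practice the percentile would be computed by your metrics backend, not in application code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LlmAlerts {

    // Cost of one request, with prices quoted per one million tokens.
    static double costUsd(long inputTokens, long outputTokens,
                          double inputPricePerMillion, double outputPricePerMillion) {
        return inputTokens / 1_000_000.0 * inputPricePerMillion
             + outputTokens / 1_000_000.0 * outputPricePerMillion;
    }

    // Nearest-rank p95: the smallest sample with at least 95% of samples at or below it.
    static long p95(long[] latenciesMs) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (95 * sorted.length + 99) / 100; // integer ceil of 0.95 * n, 1-based
        return sorted[rank - 1];
    }

    // Applies the three thresholds from the text: 3 s p95, 1% errors, and a cost budget.
    static List<String> check(long[] latenciesMs, int errors, int requests,
                              double avgCostUsd, double costBudgetUsd) {
        List<String> alerts = new ArrayList<>();
        if (p95(latenciesMs) > 3000) {
            alerts.add("latency");
        }
        if (requests > 0 && (double) errors / requests > 0.01) {
            alerts.add("error-rate");
        }
        if (avgCostUsd > costBudgetUsd) {
            alerts.add("cost");
        }
        return alerts;
    }
}
```

Computing `costUsd` at request time and writing it into the log entry means the cost dashboard is a plain sum over logs, with no later join against a price table.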


    Phase 4 — Day 90: Scale and Optimise

    Your system is stable. Now make it economical and scalable.

    Model Routing

    Not every question needs GPT-4o. Build a router that sends simple questions to a cheaper, faster model:

    public String routedAnswer(String question) {
        // Classify complexity — use a tiny model for this
        boolean isComplex = complexityClassifier.isComplex(question);
    
        return isComplex
            ? smartClient.answer(question)   // GPT-4o or Claude Sonnet
            : fastClient.answer(question);    // GPT-4o-mini or Claude Haiku
    }
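
The snippet above suggests a tiny model as the classifier. Before you have one, a rule-based stand-in is enough to prove out the routing plumbing. Everything here is an assumption for illustration: the class name, the hint words, and the 40-word cutoff are placeholders, not tuned values.

```java
import java.util.List;
import java.util.Locale;

// Heuristic stand-in for a model-based complexity classifier.
class HeuristicComplexityClassifier {
    // Hypothetical signals that a question needs multi-step reasoning.
    private static final List<String> COMPLEX_HINTS =
            List.of("compare", "why", "troubleshoot", "migrate", "step by step");
    private static final int LONG_QUESTION_WORDS = 40;

    static boolean isComplex(String question) {
        String q = question.toLowerCase(Locale.ROOT).trim();
        if (q.isEmpty()) {
            return false;
        }
        if (q.split("\\s+").length > LONG_QUESTION_WORDS) {
            return true; // long questions tend to carry several constraints
        }
        return COMPLEX_HINTS.stream().anyMatch(q::contains);
    }
}
```

Log the routing decision alongside the eval score for each question. That gives you labelled data to replace the heuristic with a trained classifier once the rules stop being good enough.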

    In practice, 60–70% of questions in most enterprise support applications can be handled by the cheaper model with no quality loss.
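
It is worth doing the arithmetic on what routing can save. The sketch below uses hypothetical names and takes per-request costs as inputs; plug in your own measured numbers.

```java
class RoutingCost {

    // Average cost per request when cheapFraction of traffic goes to the cheap model.
    static double blended(double cheapFraction, double cheapCost, double expensiveCost) {
        if (cheapFraction < 0 || cheapFraction > 1) {
            throw new IllegalArgumentException("cheapFraction must be in [0, 1]");
        }
        return cheapFraction * cheapCost + (1 - cheapFraction) * expensiveCost;
    }

    // Saving relative to sending every request to the expensive model (0.0 to 1.0).
    static double saving(double cheapFraction, double cheapCost, double expensiveCost) {
        return 1.0 - blended(cheapFraction, cheapCost, expensiveCost) / expensiveCost;
    }
}
```

The saving works out to `cheapFraction * (1 - cheapCost / expensiveCost)`, so it can never exceed the fraction of traffic you reroute. Routing 70% of requests to a model that costs a tenth as much saves 63%, not 70%, and the classifier's own cost comes out of that.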

    Evaluation Pipeline

    Every two weeks, run your evaluation set against the live system and score each answer. Track quality score over time. A drop in quality score is an early warning that something has regressed — a prompt change, a model update, or document drift.
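
"A drop in quality score" needs a definition before it can page anyone. One simple option, sketched here with hypothetical names and an assumed tolerance, is to compare the latest run against the mean of the earlier runs.

```java
import java.util.List;

class QualityTrend {

    // True when the latest score is more than `tolerance` below the mean of earlier runs.
    static boolean regressed(List<Double> scores, double tolerance) {
        if (scores.size() < 2) {
            return false; // no baseline to compare against yet
        }
        double latest = scores.get(scores.size() - 1);
        double baseline = scores.subList(0, scores.size() - 1).stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(latest);
        return baseline - latest > tolerance;
    }
}
```

With 20–30 questions, a single flipped answer moves the score by 3 to 5 percentage points, so choose a tolerance wider than that or grow the set before alerting on it.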

    flowchart TD
        U([Users]) --> GW[API Gateway\nRate Limit · Auth · TLS]
        GW --> SC{Semantic\nCache?}
        SC -->|Cache Hit| CR([Cached Response])
        SC -->|Miss| RT{Model Router\nComplexity Score}
        RT -->|Simple tasks| FM[GPT-4o-mini\nClaude Haiku]
        RT -->|Complex tasks| SM[GPT-4o\nClaude Sonnet]
        FM --> RAG[RAG Pipeline\npgvector Search]
        SM --> RAG
        RAG --> PB[Prompt Builder\n+ Prompt Cache]
        PB --> LLM[LLM API\nwith Circuit Breaker]
        LLM --> RESP[Response]
        RESP --> SC
        RESP --> U
        RESP --> OBS[OpenTelemetry\nGrafana · Alerts · Cost Dashboard]
        RESP --> EVAL[Eval Pipeline\nQuality Scoring]
        style GW fill:#0d2d3a,stroke:#00e5ff,color:#fff
        style SC fill:#1a2d1a,stroke:#2ecc71,color:#fff
        style RT fill:#1a1a3a,stroke:#9b59b6,color:#fff
        style OBS fill:#2d1a0d,stroke:#e67e22,color:#fff
        style EVAL fill:#1a2d1a,stroke:#2ecc71,color:#fff
        style CR fill:#1a2d0d,stroke:#2ecc71,color:#fff

    Figure 3: Day 90 production architecture — semantic cache, model routing, RAG, circuit breaker, observability


    What Not To Do

    These are the mistakes I see most often on first enterprise LLM projects:

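    Sharing a database with the LLM service. Your AI service will have different scaling characteristics than your transactional DB. Starting on pgvector inside an existing PostgreSQL instance is fine, but isolate the vector workload (its own database and connection pool at minimum) and plan to move it to its own instance before similarity searches start competing with transactional queries.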

    Skipping evals. "It looks good" is not an evaluation. You need numbers. Build the eval set before you write production code, not after the first incident.

    No fallback model. LLM providers have outages. OpenAI has status page incidents multiple times per quarter. A fallback is not optional.

    Logging just the response. Log input tokens, output tokens, latency, cost, and model. You will need this data when your LLM bill is higher than expected.

    Overcomplicated day-one architecture. Agents, multi-step chains, and custom tool orchestration before you have proven product-market fit is engineering debt disguised as ambition.


    Summary: The Four-Phase Checklist

    Phase      | Timeline   | What to build
    Prototype  | Day 1–2    | Single LLM call · system prompt · eval set · error handling
    RAG        | Day 3–7    | Ingestion pipeline · pgvector · context injection · grounding
    Hardening  | Day 8–30   | Circuit breaker · prompt cache · rate limiting · observability
    Scale      | Day 31–90  | Model routing · semantic cache · eval pipeline · cost dashboard

    The engineers who build the best enterprise AI systems are the ones who treat LLMs like any other external dependency: unreliable, expensive, and absolutely worth abstracting behind a well-designed service boundary.

    Start simple. Measure everything. Add complexity only when your data tells you to.


    Key Takeaways

    • Day 1: one LLM call, a system prompt, and an eval set. Nothing else. Validate the core value before adding infrastructure
    • Circuit breakers + retry with exponential backoff are not optional — LLM providers have outages multiple times per quarter
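    • Prompt caching (static prefix first, dynamic content last) plus a response cache can cut costs by 60–80% on repetitive traffic; implement it by Day 30
    • Model routing: send the 60–70% of queries that are simple to the cheap model and the rest to the expensive one. The saving can never exceed the fraction of traffic you reroute, so confirm quality on your eval set before widening the split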
    • Evaluation is the most important investment — without a quality score, you fly blind through every change
    • The four-phase checklist (Day 1 → 7 → 30 → 90) maps directly to prototype → RAG → hardening → optimisation
    • "It looks good" is not an evaluation. You need numbers before shipping to production
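
Retry with exponential backoff, mentioned in the second takeaway, pairs with the circuit breaker: retries absorb brief blips, the breaker handles sustained outages. Below is a minimal sketch with hypothetical names. It leaves out jitter, which you would want in production so that many clients do not retry in lockstep.

```java
import java.util.function.Supplier;

class Retry {

    // Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMillis.
    static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis << Math.min(attempt, 20); // bound the shift to limit overflow
        return Math.min(delay, maxMillis);
    }

    // Calls the supplier up to maxAttempts times, sleeping between failed attempts.
    static <T> T withBackoff(Supplier<T> call, int maxAttempts, long baseMillis, long maxMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException ex) {
                last = ex;
                if (attempt < maxAttempts - 1) {
                    try {
                        Thread.sleep(backoffMillis(attempt, baseMillis, maxMillis));
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("interrupted during backoff", ie);
                    }
                }
            }
        }
        throw last; // every attempt failed: surface the final error
    }
}
```

Retry only errors that are plausibly transient, such as timeouts, rate limits, and 5xx responses. Retrying a 400 just spends more money on the same failure.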

    Practice Exercises

    Exercise 1 — Starter (2 hours): Build the Day 1 prototype. Use Spring AI or LangChain to create a single-endpoint LLM service with a well-structured system prompt. Write an evaluation set of 20 questions with expected answers. Run your eval and get a baseline accuracy score before writing any more code.

    Exercise 2 — Intermediate (1 day): Add RAG to your Day 1 prototype. Ingest 50 documents into pgvector. Measure how accuracy improves vs. the baseline. Then add a circuit breaker with Resilience4j and test it by pointing your service at a non-existent endpoint. Verify the fallback model activates correctly.

    Exercise 3 — Advanced (2–3 days): Implement the full Day 90 architecture. Add semantic response caching with Redis, model routing based on a complexity classifier, and a bi-weekly eval pipeline that runs automatically. Deploy with a canary at 10% traffic and set up an alert that auto-rolls-back if error rate exceeds 0.5% during the canary period.
