LLM / RAG
    June 2026

    Enterprise AI Knowledge Platform

    Multi-tenant RAG platform serving 10,000+ enterprise users with hybrid search, sub-400ms p99 latency, and zero additional vector infrastructure cost.

    OpenAI, LangChain, pgvector, Kubernetes, Spring AI, Redis

    The Challenge

    A Fortune 500 client needed semantic search across 2M+ enterprise documents with strict requirements:

    • Sub-500ms response time at p99
    • Data residency — all data must stay within their AWS VPC
    • Multi-tenant isolation — 50+ business units, each with separate access controls
    • No hallucination tolerance — answers must be fully grounded in retrieved content

    Traditional keyword search was returning 40% irrelevant results, and a dedicated vector database would have added $80K/year in infrastructure and operational overhead.

    Architecture

    flowchart TD
        subgraph Client["Client Layer"]
            UI[Web App / API Consumers]:::ext
        end
        subgraph Gateway["API Gateway (Spring Boot)"]
            AUTH[Auth + Tenant Resolver]:::svc
            RL[Rate Limiter]:::svc
        end
        subgraph RAG["RAG Pipeline"]
            QE[Query Embedding\ntext-embedding-3-small]:::ai
            HS[Hybrid Search\nVector + BM25 Fusion]:::ai
            RR[Re-Ranker\nCross-Encoder]:::ai
            GEN[Generation\nGPT-4o with Citations]:::ai
        end
        subgraph Storage["Storage Layer"]
            PG@{ icon: "logos:postgresql", form: "square", label: "pgvector" }
            CACHE@{ icon: "logos:redis", form: "square", label: "Redis cache" }
        end
        subgraph Ingest["Ingestion Pipeline"]
            LOAD[Document Loader\nPDF, Confluence, S3]:::svc
            CHUNK[Semantic Chunker\n512 tokens, 64 overlap]:::svc
            EMBED[Batch Embedder]:::svc
            STORE[Vector Store Writer]:::svc
        end
        UI --> AUTH
        AUTH --> RL
        RL --> QE
        QE --> HS
        HS --> PG
        PG --> RR
        RR --> CACHE
        CACHE --> GEN
        GEN --> UI
        LOAD --> CHUNK --> EMBED --> STORE --> PG
        class PG,CACHE logo
        classDef ai fill:#241844,stroke:#a855f7,color:#fff
        classDef svc fill:#06303a,stroke:#22d3ee,color:#fff
        classDef ext fill:#222838,stroke:#94a3b8,color:#fff
        classDef logo fill:#0b1220,stroke:#475569,color:#e2e8f0
        style RAG fill:#0d2d3a,stroke:#06b6d4,color:#fff
        style Storage fill:#0d2d1a,stroke:#10b981,color:#fff
        style Ingest fill:#1a0d2d,stroke:#9333ea,color:#fff

    Key Design Decisions

    pgvector over Pinecone — saving $80K/year

    The team initially planned to use Pinecone. I proposed running pgvector on their existing PostgreSQL RDS instance instead:

    • Cost: $0 additional — already paying for RDS
    • Operations: Same team manages it, no new runbook
    • Performance: 2M vectors at p99 < 80ms with HNSW index
    • Trade-off accepted: Not auto-scalable past 5M vectors — acceptable for 12-month horizon

    Hybrid Search: 23 points better recall

    Pure vector search missed exact product codes and internal terminology. BM25 + vector fusion with Reciprocal Rank Fusion (RRF) improved recall from 71% to 94%.

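    // Hybrid search: dense (vector) + sparse (BM25) retrieval, fused with RRF.
    // Uses the Spring AI milestone-style API (SearchRequest.query(...).with...).
    import java.util.List;
    import org.springframework.ai.document.Document;
    import org.springframework.ai.vectorstore.SearchRequest;
    import org.springframework.ai.vectorstore.filter.FilterExpressionBuilder;

    public List<Document> hybridSearch(String query, String tenantId, int topK) {
        // Over-fetch from each retriever so fusion has enough candidates to rank
        int candidates = topK * 2;

        // Dense retrieval: the vector store embeds the query itself, so no separate
        // embed call is needed. The tenant filter is built with the expression
        // builder rather than string concatenation, which avoids filter injection.
        List<Document> vectorResults = vectorStore.similaritySearch(
            SearchRequest.query(query)
                .withFilterExpression(new FilterExpressionBuilder().eq("tenant_id", tenantId).build())
                .withTopK(candidates)
        );

        // Sparse retrieval (BM25), scoped to the same tenant
        List<Document> keywordResults = fullTextSearch.search(query, tenantId, candidates);

        // Reciprocal Rank Fusion merges both ranked lists into the final top K
        return reciprocalRankFusion(vectorResults, keywordResults, topK);
    }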
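
The fusion step itself is not shown above. A minimal sketch of Reciprocal Rank Fusion follows, working on document IDs rather than Spring AI Document objects so it stays self-contained. The class name RrfSketch, the fuse signature, and the constant K = 60 (a commonly used default) are illustrative assumptions, not the project's actual code. Each document scores the sum of 1 / (K + rank) over the lists it appears in.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RrfSketch {
    // Smoothing constant; 60 is a common default and an assumption here.
    static final int K = 60;

    /**
     * Fuses ranked lists of document IDs. Each list is ordered best-first
     * and is assumed to contain no duplicate IDs.
     */
    static List<String> fuse(List<List<String>> rankings, int topK) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (List<String> ranking : rankings) {
            for (int i = 0; i < ranking.size(); i++) {
                // Rank is 1-based, so the top hit contributes 1 / (K + 1).
                scores.merge(ranking.get(i), 1.0 / (K + i + 1), Double::sum);
            }
        }
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scores.entrySet());
        // Highest fused score first; the sort is stable, so ties keep first-seen order.
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());

        List<String> fused = new ArrayList<>();
        for (Map.Entry<String, Double> e : entries.subList(0, Math.min(topK, entries.size()))) {
            fused.add(e.getKey());
        }
        return fused;
    }
}
```

Because RRF uses only ranks, the cosine similarity from the vector search and the BM25 score never need to be normalized onto a common scale. A document that ranks well in both lists beats one that ranks first in only one of them.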

    Semantic chunking preserves context

    Fixed-size splitting destroyed meaning at boundaries. Semantic chunking at paragraph breaks with NLTK sentence tokenization improved answer quality by ~30%.
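
The idea can be sketched in Java as follows. This is an illustration under stated assumptions, not the project's chunker: it uses java.text.BreakIterator in place of NLTK for sentence splitting, counts whitespace-separated words as a stand-in for real tokenizer counts, and the class and method names are hypothetical. Whole paragraphs are packed into a chunk until the token budget is reached, oversized paragraphs fall back to sentences, and trailing units are carried into the next chunk as overlap.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SemanticChunkerSketch {
    // Whitespace word count stands in for a real tokenizer (assumption).
    static int tokens(String s) {
        String t = s.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
    }

    static List<String> sentences(String paragraph) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(paragraph);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = paragraph.substring(start, end).trim();
            if (!s.isEmpty()) out.add(s);
        }
        return out;
    }

    static List<String> chunk(String text, int maxTokens, int overlapTokens) {
        // Units are whole paragraphs; a paragraph over budget is split into sentences.
        List<String> units = new ArrayList<>();
        for (String p : text.split("\\n\\s*\\n")) {
            String para = p.trim();
            if (para.isEmpty()) continue;
            if (tokens(para) <= maxTokens) units.add(para);
            else units.addAll(sentences(para));
        }

        List<String> chunks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int size = 0;
        for (String unit : units) {
            int unitTokens = tokens(unit);
            if (!current.isEmpty() && size + unitTokens > maxTokens) {
                chunks.add(String.join(" ", current));
                // Seed the next chunk with trailing units that fit the overlap budget.
                List<String> carry = new ArrayList<>();
                int carried = 0;
                for (int i = current.size() - 1; i >= 0; i--) {
                    int t = tokens(current.get(i));
                    if (carried + t > overlapTokens) break;
                    carry.add(0, current.get(i));
                    carried += t;
                }
                // Drop the overlap if it would push the new chunk over budget.
                if (carried + unitTokens > maxTokens) { carry.clear(); carried = 0; }
                current = carry;
                size = carried;
            }
            // A single sentence longer than maxTokens still becomes its own chunk.
            current.add(unit);
            size += unitTokens;
        }
        if (!current.isEmpty()) chunks.add(String.join(" ", current));
        return chunks;
    }
}
```

With the settings shown in the architecture diagram, the call would be chunk(text, 512, 64). Chunk boundaries always fall between paragraphs or sentences, never inside one.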

    Scaling Bottlenecks

    As traffic grew from hundreds to 10,000+ daily users, three bottlenecks emerged — each solved without re-architecting:

    1. Embedding API rate limits. At peak, the OpenAI embedding endpoint throttled batch ingestion. Fix: moved embedding to an async queue (Kafka) with controlled concurrency, decoupling ingestion speed from API limits.

    2. pgvector index rebuild time. As the corpus grew past 1M vectors, HNSW index rebuilds locked the table for minutes. Fix: switched to concurrent index builds (CREATE INDEX CONCURRENTLY) and partitioned the vector table by tenant, cutting rebuild impact to near-zero.

    3. Cold-cache latency spikes. First queries after deploy were 4x slower while the Redis cache warmed. Fix: added a cache-warming job on deploy that pre-loads the top 500 historical queries, eliminating the cold-start penalty.
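
For the second fix, one detail matters in practice: PostgreSQL does not support CREATE INDEX CONCURRENTLY on a partitioned parent table, so the HNSW index has to be built on each tenant partition individually, and outside a transaction block. A minimal sketch of generating that DDL is shown below. The table name doc_chunks, the embedding column, list partitioning on tenant ID, the vector_cosine_ops operator class, and the allowed tenant ID pattern are all hypothetical choices for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;

final class TenantIndexDdl {
    // Tenant IDs end up inside SQL identifiers, so only a strict pattern is allowed.
    private static final Pattern SAFE_ID = Pattern.compile("[a-z0-9_]+");

    /** Returns the DDL for one tenant: its partition, then its HNSW index. */
    static List<String> statementsFor(String tenantId) {
        if (!SAFE_ID.matcher(tenantId).matches()) {
            throw new IllegalArgumentException("unsafe tenant id: " + tenantId);
        }
        String partition = "doc_chunks_" + tenantId;
        return List.of(
            // Assumes doc_chunks was created with PARTITION BY LIST (tenant_id).
            "CREATE TABLE IF NOT EXISTS " + partition
                + " PARTITION OF doc_chunks FOR VALUES IN ('" + tenantId + "')",
            // Built per partition; run with autocommit on, not inside a transaction.
            "CREATE INDEX CONCURRENTLY IF NOT EXISTS " + partition + "_embedding_hnsw"
                + " ON " + partition + " USING hnsw (embedding vector_cosine_ops)"
        );
    }
}
```

Each statement would be executed separately, for example through a JDBC Statement with autocommit enabled. Building per partition also keeps each build small, so one tenant's re-index does not affect the others.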

    Observability Setup

    Every query is logged with tenant_id, a latency breakdown (embed/search/generate), token counts, cache hit/miss, and a relevance score. A Grafana dashboard tracks p50/p95/p99 latency and cost-per-query trends.
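
As a rough illustration of what such a log entry and the percentile math could look like, here is a minimal sketch. The field names, the key=value format, and the nearest-rank percentile method are assumptions for the example. In the setup described above the percentiles are charted in Grafana, so the percentile helper only shows what those numbers mean.

```java
import java.util.Arrays;
import java.util.Locale;

final class QueryMetricsSketch {
    /** Formats one query's metrics as a key=value line (hypothetical format). */
    static String logLine(String tenantId, long embedMs, long searchMs, long generateMs,
                          int promptTokens, int completionTokens,
                          boolean cacheHit, double relevance) {
        return String.format(Locale.ROOT,
            "tenant_id=%s embed_ms=%d search_ms=%d generate_ms=%d total_ms=%d "
                + "prompt_tokens=%d completion_tokens=%d cache_hit=%b relevance=%.2f",
            tenantId, embedMs, searchMs, generateMs, embedMs + searchMs + generateMs,
            promptTokens, completionTokens, cacheHit, relevance);
    }

    /** Nearest-rank percentile: the smallest value with at least p% of samples at or below it. */
    static long percentile(long[] latenciesMs, int p) {
        if (latenciesMs.length == 0 || p < 1 || p > 100) {
            throw new IllegalArgumentException("need samples and 1 <= p <= 100");
        }
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        // Integer ceiling of p * n / 100 avoids floating-point rounding surprises.
        int rank = (p * sorted.length + 99) / 100;
        return sorted[rank - 1];
    }
}
```

Logging the three stage latencies separately is what makes a p99 regression attributable: a slow tail caused by generation looks very different from one caused by search.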

    The Solution

    The final architecture delivered enterprise-grade semantic search at a fraction of the projected cost. By choosing pgvector over a dedicated vector database, implementing hybrid search, and solving each scaling bottleneck with targeted fixes rather than re-platforming, the system scaled 100x in users while staying on the original infrastructure budget.

    Outcome

    Metric               Before            After
    Search relevance     71%               94%
    p99 latency          2,100ms           380ms
    Vector infra cost    $80K/yr planned   $0
    Daily active users   N/A               10,000+
    Hallucination rate   N/A               < 0.3%