Enterprise AI Knowledge Platform

The Challenge

A Fortune 500 client needed semantic search across 2M+ enterprise documents with strict requirements:

Sub-500ms response time at p99
Data residency — all data must stay within their AWS VPC
Multi-tenant isolation — 50+ business units, each with separate access controls
No hallucination tolerance — answers must be fully grounded in retrieved content

Traditional keyword search was returning 40% irrelevant results. A dedicated vector database would have added $80K/year in infrastructure and operational overhead.

Architecture

flowchart TD
    subgraph Client["Client Layer"]
        UI[Web App / API Consumers]:::ext
    end
    subgraph Gateway["API Gateway (Spring Boot)"]
        AUTH[Auth + Tenant Resolver]:::svc
        RL[Rate Limiter]:::svc
    end
    subgraph RAG["RAG Pipeline"]
        QE[Query Embedding\ntext-embedding-3-small]:::ai
        HS[Hybrid Search\nVector + BM25 Fusion]:::ai
        RR[Re-Ranker\nCross-Encoder]:::ai
        GEN[Generation\nGPT-4o with Citations]:::ai
    end
    subgraph Storage["Storage Layer"]
        PG@{ icon: "logos:postgresql", form: "square", label: "pgvector" }
        CACHE@{ icon: "logos:redis", form: "square", label: "Redis cache" }
    end
    subgraph Ingest["Ingestion Pipeline"]
        LOAD[Document Loader\nPDF, Confluence, S3]:::svc
        CHUNK[Semantic Chunker\n512 tokens, 64 overlap]:::svc
        EMBED[Batch Embedder]:::svc
        STORE[Vector Store Writer]:::svc
    end
    UI --> AUTH
    AUTH --> RL
    RL --> QE
    QE --> HS
    HS --> PG
    PG --> RR
    RR --> CACHE
    CACHE --> GEN
    GEN --> UI
    LOAD --> CHUNK --> EMBED --> STORE --> PG
    class PG,CACHE logo
    classDef ai fill:#241844,stroke:#a855f7,color:#fff
    classDef svc fill:#06303a,stroke:#22d3ee,color:#fff
    classDef ext fill:#222838,stroke:#94a3b8,color:#fff
    classDef logo fill:#0b1220,stroke:#475569,color:#e2e8f0
    style RAG fill:#0d2d3a,stroke:#06b6d4,color:#fff
    style Storage fill:#0d2d1a,stroke:#10b981,color:#fff
    style Ingest fill:#1a0d2d,stroke:#9333ea,color:#fff

Key Design Decisions

pgvector over Pinecone — saving $80K/year

The team initially planned Pinecone. I proposed pgvector on their existing PostgreSQL RDS instance:

Cost: $0 additional — already paying for RDS
Operations: Same team manages it, no new runbook
Performance: 2M vectors at p99 < 80ms with HNSW index
Trade-off accepted: Not auto-scalable past 5M vectors — acceptable for 12-month horizon
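
For readers who have not used pgvector, the index and query side of this decision can be sketched as below. This is a minimal illustration, not the project's actual schema: the table doc_chunks, its columns, and the class name are assumptions, and the m / ef_construction values shown are simply pgvector's defaults. The 1536 dimension matches the default output size of text-embedding-3-small.

```java
final class PgvectorSql {
    // Hypothetical schema: doc_chunks(id, tenant_id, content, embedding vector(1536)).
    static final String CREATE_HNSW_INDEX =
        "CREATE INDEX IF NOT EXISTS doc_chunks_embedding_hnsw "
      + "ON doc_chunks USING hnsw (embedding vector_cosine_ops) "
      + "WITH (m = 16, ef_construction = 64)";

    // <=> is pgvector's cosine-distance operator; smaller means more similar.
    static final String NEAREST_CHUNKS =
        "SELECT id, content FROM doc_chunks "
      + "WHERE tenant_id = ? "
      + "ORDER BY embedding <=> ?::vector LIMIT ?";

    /** Formats an embedding as a pgvector text literal such as [0.5,1.0]. */
    static String toVectorLiteral(float[] embedding) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < embedding.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(embedding[i]);
        }
        return sb.append(']').toString();
    }
}
```

With JDBC, the literal from toVectorLiteral would be bound as a string parameter and cast by the ?::vector in the query, while the tenant filter stays a bound parameter rather than concatenated text.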

Hybrid Search: 23 points better recall

Pure vector search missed exact product codes and internal terminology. BM25 + vector fusion with Reciprocal Rank Fusion (RRF) improved recall from 71% to 94%.

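// Hybrid search: dense (vector) and sparse (BM25) retrieval, fused with RRF
public List<Document> hybridSearch(String query, String tenantId, int topK) {
    // Over-fetch from each retriever so fusion has enough candidates to rank
    int candidates = topK * 2;

    // Dense retrieval: SearchRequest takes the raw query text, so the vector
    // store is expected to embed it and no separate embedding call is needed.
    // tenantId is concatenated into the filter, so it must come from the
    // authenticated tenant resolver, never from user input.
    List<Document> vectorResults = vectorStore.similaritySearch(
        SearchRequest.query(query)
            .withFilter("tenant_id = '" + tenantId + "'")
            .withTopK(candidates)
    );

    // Sparse retrieval (BM25), scoped to the same tenant
    List<Document> keywordResults = fullTextSearch.search(query, tenantId, candidates);

    // Reciprocal Rank Fusion merges both rankings and keeps the best topK
    return reciprocalRankFusion(vectorResults, keywordResults, topK);
}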
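
The reciprocalRankFusion step is not shown above. A minimal sketch of the technique follows, working on document IDs rather than Document objects to stay self-contained. The class name, the fuse signature, and the constant K = 60 (a commonly used RRF default) are assumptions for illustration, not the project's code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RrfSketch {
    // Smoothing constant; 60 is a commonly used default and an assumption here.
    static final int K = 60;

    /** Fuses ranked lists of document IDs (best first) and returns the topK fused IDs. */
    static List<String> fuse(List<List<String>> rankedLists, int topK) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranking : rankedLists) {
            for (int i = 0; i < ranking.size(); i++) {
                // Ranks are 1-based: each list contributes 1 / (K + rank).
                scores.merge(ranking.get(i), 1.0 / (K + i + 1), Double::sum);
            }
        }
        List<String> fused = new ArrayList<>(scores.keySet());
        // Highest fused score first; ties broken by ID for deterministic output.
        fused.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return new ArrayList<>(fused.subList(0, Math.min(topK, fused.size())));
    }
}
```

Because RRF only looks at ranks, it needs no score normalization between cosine distance and BM25, which is the main reason it suits this kind of fusion. A document that appears in both lists, even at middling ranks, tends to outrank one that appears near the top of only one.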

Semantic chunking preserves context

Fixed-size splitting destroyed meaning at boundaries. Semantic chunking at paragraph breaks with NLTK sentence tokenization improved answer quality by ~30%.
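
The paragraph-boundary idea can be sketched in a few lines. This is a simplified illustration under stated assumptions: it counts whitespace-separated words as a stand-in for model tokens, it omits the sentence-level tokenization step, and the class and parameter names are hypothetical. A paragraph larger than the budget is kept whole here; that is where sentence splitting would take over in a fuller version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ParagraphChunker {
    /**
     * Packs whole paragraphs into chunks targeting maxTokens words, seeding each
     * new chunk with the last overlapTokens words of the previous chunk.
     */
    static List<String> chunk(String text, int maxTokens, int overlapTokens) {
        List<String> chunks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int fresh = 0; // words in `current` that are not carried-over overlap
        for (String paragraph : text.strip().split("\\n\\s*\\n")) {
            if (paragraph.isBlank()) continue;
            List<String> words = Arrays.asList(paragraph.strip().split("\\s+"));
            // Close the current chunk only at a paragraph boundary.
            if (fresh > 0 && current.size() + words.size() > maxTokens) {
                chunks.add(String.join(" ", current));
                int from = Math.max(0, current.size() - overlapTokens);
                current = new ArrayList<>(current.subList(from, current.size()));
                fresh = 0;
            }
            current.addAll(words);
            fresh += words.size();
        }
        if (fresh > 0) chunks.add(String.join(" ", current));
        return chunks;
    }
}
```

With the settings from the architecture diagram this would be called as chunk(text, 512, 64), with a real tokenizer substituted for the word count.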

Scaling Bottlenecks

As traffic grew from hundreds to 10,000+ daily users, three bottlenecks emerged — each solved without re-architecting:

1. Embedding API rate limits. At peak, the OpenAI embedding endpoint throttled batch ingestion. Fix: moved embedding to an async queue (Kafka) with controlled concurrency, decoupling ingestion speed from API limits.
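
The "controlled concurrency" part of this fix can be illustrated without Kafka. The sketch below is an in-process stand-in, assuming a hypothetical embedFn in place of the real embedding client: a fixed-size worker pool drains a queue of batches, so no more than maxInFlight calls hit the API at once regardless of how fast batches arrive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class BoundedEmbedder {
    /**
     * Embeds every batch with at most maxInFlight concurrent calls to embedFn.
     * Results are returned in the same order as the input batches.
     */
    static <T, R> List<R> embedAll(List<T> batches, int maxInFlight, Function<T, R> embedFn) {
        // The pool's internal queue absorbs bursts; the pool size caps concurrency.
        ExecutorService workers = Executors.newFixedThreadPool(maxInFlight);
        try {
            List<Future<R>> pending = new ArrayList<>();
            for (T batch : batches) {
                Callable<R> task = () -> embedFn.apply(batch);
                pending.add(workers.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : pending) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while embedding", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("embedding call failed", e.getCause());
        } finally {
            workers.shutdown();
        }
    }
}
```

In a Kafka-based setup the same cap would typically come from the number of consumer instances or worker threads; retry with backoff on throttling responses is left out of this sketch.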

2. pgvector index rebuild time. As the corpus grew past 1M vectors, HNSW index rebuilds locked the table for minutes. Fix: partitioned the vector table by tenant and switched to concurrent index builds (CREATE INDEX CONCURRENTLY), run per partition because PostgreSQL does not support concurrent builds on a partitioned parent table. This cut rebuild impact to near-zero.
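
A sketch of how per-partition concurrent builds might be generated is shown below. The partition table names, the embedding column, and the helper itself are hypothetical. Two PostgreSQL constraints shape it: CREATE INDEX CONCURRENTLY must be issued against each partition individually, and it cannot run inside a transaction block, so over JDBC each statement would need to execute with autocommit enabled.

```java
import java.util.ArrayList;
import java.util.List;

final class PartitionIndexDdl {
    /** Builds one CREATE INDEX CONCURRENTLY statement per tenant partition. */
    static List<String> concurrentHnswIndexes(List<String> partitionTables) {
        List<String> ddl = new ArrayList<>();
        for (String table : partitionTables) {
            // Identifiers cannot be bound as parameters, so restrict them to a safe pattern.
            if (!table.matches("[a-z_][a-z0-9_]*")) {
                throw new IllegalArgumentException("unsafe table name: " + table);
            }
            ddl.add("CREATE INDEX CONCURRENTLY IF NOT EXISTS " + table + "_embedding_hnsw "
                  + "ON " + table + " USING hnsw (embedding vector_cosine_ops)");
        }
        return ddl;
    }
}
```

Building one partition at a time also keeps each rebuild small, which is where most of the reduction in impact would come from.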

3. Cold-cache latency spikes. First queries after deploy were 4x slower while the Redis cache warmed. Fix: added a cache-warming job on deploy that pre-loads the top 500 historical queries, eliminating the cold-start penalty.
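
The warming job reduces to two steps: pick the most frequent historical queries, then replay them through the normal search path so their results land in the cache. A minimal sketch follows, assuming query counts are already available as a map and that search is a hypothetical callback into the real pipeline.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.stream.Collectors;

final class CacheWarmer {
    /** Returns the `limit` most frequent queries, most frequent first (ties by query text). */
    static List<String> topQueries(Map<String, Long> queryCounts, int limit) {
        return queryCounts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                .thenComparing(Map.Entry.<String, Long>comparingByKey()))
            .limit(limit)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    /** Replays the top queries through `search` and returns how many were warmed. */
    static int warm(Map<String, Long> queryCounts, int limit, Consumer<String> search) {
        List<String> queries = topQueries(queryCounts, limit);
        queries.forEach(search);
        return queries.size();
    }
}
```

On deploy this would run as warm(counts, 500, ...) before the instance takes traffic. In a multi-tenant system the counts would likely need to be kept per tenant so that warmed entries respect the same isolation as live queries.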

Observability Setup

Every query is logged with tenant_id, a latency breakdown (embed/search/generate), token counts, cache hit/miss, and relevance score. A Grafana dashboard tracks p50/p95/p99 latency and cost-per-query trends.
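
One way to shape such a per-query log entry is sketched below. The field names, the split of token counts into prompt and completion, and the key=value line format are assumptions for illustration; the actual log schema is not shown in this write-up. It uses a Java record, so it assumes Java 16 or later.

```java
import java.util.Locale;

record QueryLog(String tenantId, long embedMs, long searchMs, long generateMs,
                int promptTokens, int completionTokens, boolean cacheHit, double relevance) {

    /** End-to-end latency as the sum of the three measured stages. */
    long totalMs() {
        return embedMs + searchMs + generateMs;
    }

    /** Renders one key=value line that a log shipper can parse into dashboard fields. */
    String toLogLine() {
        return String.format(Locale.ROOT,
            "tenant_id=%s embed_ms=%d search_ms=%d generate_ms=%d total_ms=%d "
                + "prompt_tokens=%d completion_tokens=%d cache_hit=%b relevance=%.3f",
            tenantId, embedMs, searchMs, generateMs, totalMs(),
            promptTokens, completionTokens, cacheHit, relevance);
    }
}
```

Keeping the stage timings as separate fields is what makes the latency breakdown possible on a dashboard; percentiles and cost per query would then be computed downstream from these raw values.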

The Solution

The final architecture delivered enterprise-grade semantic search at a fraction of the projected cost. By choosing pgvector over a dedicated vector database, implementing hybrid search, and solving each scaling bottleneck with targeted fixes rather than re-platforming, the system scaled 100x in users while staying on the original infrastructure budget.

Outcome

Metric	Before	After
Search recall	71%	94%
p99 latency	2,100ms	380ms
Vector infra cost	$80K/yr planned	$0
Daily active users	N/A	10,000+
Hallucination rate	N/A	< 0.3%