Enterprise AI Knowledge Platform
The Challenge
A Fortune 500 client needed semantic search across 2M+ enterprise documents with strict requirements:
- Sub-500ms response time at p99
- Data residency — all data must stay within their AWS VPC
- Multi-tenant isolation — 50+ business units, each with separate access controls
- No hallucination tolerance — answers must be fully grounded in retrieved content
The existing keyword search was returning 40% irrelevant results, and the dedicated vector database originally planned would have added $80K/year in infrastructure and operational overhead.
Architecture
Key Design Decisions
pgvector over Pinecone — saving $80K/year
The team initially planned Pinecone. I proposed pgvector on their existing PostgreSQL RDS instance:
- Cost: $0 additional — already paying for RDS
- Operations: Same team manages it, no new runbook
- Performance: 2M vectors at p99 < 80ms with HNSW index
- Trade-off accepted: Not auto-scalable past 5M vectors — acceptable for 12-month horizon
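For illustration, a pgvector setup along these lines could look like the sketch below. The table and column names (`chunks`, `tenant_id`, `embedding`, `content`) and the index name are hypothetical, and the `m` / `ef_construction` values are simply pgvector's documented defaults, not the values tuned on this project.

```java
import java.util.StringJoiner;

// Sketch only: table, column, and index names are hypothetical.
final class PgVectorSql {
    // HNSW index over cosine distance; m and ef_construction are pgvector's defaults.
    static final String CREATE_HNSW_INDEX =
        "CREATE INDEX chunks_embedding_hnsw "
      + "ON chunks USING hnsw (embedding vector_cosine_ops) "
      + "WITH (m = 16, ef_construction = 64)";

    // Tenant-scoped nearest-neighbour query; <=> is pgvector's cosine-distance operator.
    // All values are bound as parameters rather than concatenated into the SQL.
    static final String NEAREST_CHUNKS =
        "SELECT id, content FROM chunks "
      + "WHERE tenant_id = ? "
      + "ORDER BY embedding <=> ?::vector "
      + "LIMIT ?";

    // Formats an embedding as pgvector's text literal, e.g. [0.1,0.2,0.3].
    static String toVectorLiteral(float[] embedding) {
        StringJoiner joiner = new StringJoiner(",", "[", "]");
        for (float value : embedding) {
            joiner.add(Float.toString(value));
        }
        return joiner.toString();
    }
}
```

With a JDBC `PreparedStatement` built from `NEAREST_CHUNKS`, the tenant ID, the string from `toVectorLiteral(...)`, and the result limit would be bound as parameters 1 to 3.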
Hybrid Search: recall up 23 points
Pure vector search missed exact product codes and internal terminology. BM25 + vector fusion with Reciprocal Rank Fusion (RRF) improved recall from 71% to 94%.
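    // Hybrid search implementation: fuse dense (vector) and sparse (BM25) results with RRF
    public List<Document> hybridSearch(String query, String tenantId, int topK) {
        // tenantId is spliced into a filter string below, so reject anything that could
        // break out of the quotes (the allowed character set is an assumption)
        if (!tenantId.matches("[A-Za-z0-9_-]+")) {
            throw new IllegalArgumentException("Invalid tenant id");
        }
        // Over-fetch from each retriever so fusion has enough candidates to rank
        int candidates = topK * 2;

        // Dense retrieval: the request carries the raw query text, so no separate embedding call is made here
        List<Document> vectorResults = vectorStore.similaritySearch(
            SearchRequest.query(query)
                .withFilter("tenant_id = '" + tenantId + "'")
                .withTopK(candidates)
        );

        // Sparse retrieval (BM25), scoped to the same tenant
        List<Document> keywordResults = fullTextSearch.search(query, tenantId, candidates);

        // Reciprocal Rank Fusion
        return reciprocalRankFusion(vectorResults, keywordResults, topK);
    }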
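The `reciprocalRankFusion` helper called above is not shown. A minimal sketch of the technique follows, assuming each retriever returns a ranked list of unique document IDs; it works on `String` IDs rather than `Document` objects to stay self-contained. The constant `RRF_K = 60` is the value commonly used for RRF, not one stated for this project.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RankFusion {
    // Smoothing constant (assumption: the commonly used value of 60).
    static final int RRF_K = 60;

    // Fuses ranked ID lists: score(d) = sum over lists of 1 / (RRF_K + rank), with rank starting at 1.
    static List<String> reciprocalRankFusion(List<List<String>> rankedLists, int topK) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranked : rankedLists) {
            for (int i = 0; i < ranked.size(); i++) {
                scores.merge(ranked.get(i), 1.0 / (RRF_K + i + 1), Double::sum);
            }
        }
        List<String> fused = new ArrayList<>(scores.keySet());
        // Highest fused score first; ties are broken by ID so the output is deterministic.
        fused.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return new ArrayList<>(fused.subList(0, Math.min(topK, fused.size())));
    }
}
```

Because RRF uses only rank positions, the BM25 and cosine-similarity scores never need to be normalized against each other. A document that ranks well in both lists outranks one that tops a single list.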
Semantic chunking preserves context
Fixed-size splitting cut text mid-thought and destroyed meaning at chunk boundaries. Chunking at paragraph breaks, with NLTK sentence tokenization so that no sentence is split, improved answer quality by ~30%.
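The chunking idea can be sketched as follows. This is not the production pipeline: Java's built-in `BreakIterator` stands in for NLTK's sentence tokenizer, and `maxChars` is a hypothetical size budget.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SemanticChunker {
    // Produces chunks that never cross a paragraph break and never cut a sentence in half.
    // A single sentence longer than maxChars becomes its own (oversized) chunk.
    static List<String> chunk(String text, int maxChars) {
        List<String> chunks = new ArrayList<>();
        // Paragraphs are separated by one or more blank lines.
        for (String paragraph : text.split("\\R\\s*\\R")) {
            StringBuilder current = new StringBuilder();
            for (String sentence : sentences(paragraph.strip())) {
                // Flush the current chunk if adding this sentence would exceed the budget.
                if (current.length() > 0 && current.length() + 1 + sentence.length() > maxChars) {
                    chunks.add(current.toString());
                    current.setLength(0);
                }
                if (current.length() > 0) current.append(' ');
                current.append(sentence);
            }
            if (current.length() > 0) chunks.add(current.toString());
        }
        return chunks;
    }

    // Splits a paragraph into trimmed sentences using the JDK's sentence boundary rules.
    static List<String> sentences(String paragraph) {
        List<String> result = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(paragraph);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = paragraph.substring(start, end).strip();
            if (!sentence.isEmpty()) result.add(sentence);
        }
        return result;
    }
}
```

Sentences are packed greedily up to the budget, so short paragraphs stay whole and long ones split only at sentence boundaries.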
Scaling Bottlenecks
As traffic grew from hundreds to 10,000+ daily users, three bottlenecks emerged — each solved without re-architecting:
1. Embedding API rate limits. At peak, the OpenAI embedding endpoint throttled batch ingestion. Fix: moved embedding to an async queue (Kafka) with controlled concurrency, decoupling ingestion speed from API limits.
2. pgvector index rebuild time. As the corpus grew past 1M vectors, HNSW index rebuilds locked the table for minutes. Fix: switched to concurrent index builds (CREATE INDEX CONCURRENTLY) and partitioned the vector table by tenant, cutting rebuild impact to near-zero.
3. Cold-cache latency spikes. First queries after deploy were 4x slower while the Redis cache warmed. Fix: added a cache-warming job on deploy that pre-loads the top 500 historical queries, eliminating the cold-start penalty.
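The cache-warming fix in item 3 can be sketched as below. It assumes the historical queries are available as a plain list of strings and that replaying a query through the normal search path populates the cache. The trim-and-lowercase normalization is an assumption, and `search` is a hypothetical callback.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.stream.Collectors;

final class CacheWarmer {
    // Returns the `limit` most frequent queries in the log, most frequent first.
    // Ties are broken alphabetically so the result is deterministic.
    static List<String> topQueries(List<String> queryLog, int limit) {
        Map<String, Long> counts = queryLog.stream()
            .map(q -> q.trim().toLowerCase(Locale.ROOT)) // normalization is an assumption
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                .thenComparing(Map.Entry.<String, Long>comparingByKey()))
            .limit(limit)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    // Replays the top queries through the search path so their results are cached before traffic arrives.
    static void warm(List<String> queryLog, int limit, Consumer<String> search) {
        topQueries(queryLog, limit).forEach(search);
    }
}
```

A deploy hook could call `CacheWarmer.warm(historicalQueries, 500, searchService::search)`, where `searchService` is a placeholder for whatever serves live queries.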
Observability Setup
Every query is logged with tenant_id, a latency breakdown (embed/search/generate), token counts, cache hit/miss, and relevance score. A Grafana dashboard tracks p50/p95/p99 latency and cost-per-query trends.
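As a rough sketch, a per-query log record and the percentile calculation behind such a dashboard might look like this. The field names and millisecond units are assumptions based on the list above, and the nearest-rank method is one common percentile definition, not necessarily the one Grafana applies.

```java
import java.util.Arrays;

// One entry per query; field names and units (milliseconds) are assumptions.
record QueryLog(String tenantId, long embedMs, long searchMs, long generateMs,
                int promptTokens, int completionTokens, boolean cacheHit, double relevanceScore) {
    // End-to-end latency as the sum of the three measured stages.
    long totalMs() {
        return embedMs + searchMs + generateMs;
    }
}

final class LatencyStats {
    // Nearest-rank percentile: the smallest sample with at least p percent of samples at or below it.
    static long percentile(long[] samples, double p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

Logging the stage timings separately makes it possible to see whether a p99 regression comes from embedding, retrieval, or generation.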
The Solution
The final architecture delivered enterprise-grade semantic search at a fraction of the projected cost. By choosing pgvector over a dedicated vector database, implementing hybrid search, and solving each scaling bottleneck with targeted fixes rather than re-platforming, the system scaled 100x in users while staying on the original infrastructure budget.
Outcome
| Metric | Before | After |
|---|---|---|
| Retrieval recall | 71% | 94% |
| p99 latency | 2,100ms | 380ms |
| Vector infra cost | $80K/yr planned | $0 |
| Daily active users | N/A | 10,000+ |
| Hallucination rate | N/A | < 0.3% |