Cloud-Native Microservices Platform

The Challenge

A legacy monolith handling payment events was causing cascading failures during peak traffic. A single Black Friday spike brought down the entire system. The business needed:

Process 50M+ events/day without downtime
Sub-second latency for payment status updates
Independent deployability — 8 teams, no deployment coordination
Zero-downtime deployments for a 24/7 payment system

Architecture

flowchart TD
    subgraph Producers["Event Producers"]
        PS[Payment Service]
        OS[Order Service]
        NS[Notification Service]
    end
    subgraph Kafka["Kafka Cluster (Confluent)"]
        T1[payments.transactions.completed]
        T2[orders.status.updated]
        T3[notifications.email.queued]
        DLQ[Dead Letter Queues]
    end
    subgraph Consumers["Consumer Microservices"]
        LEDGER[Ledger Service]
        ANALYTICS[Analytics Service]
        FRAUD[Fraud Detection]
        NOTIFY[Notification Dispatcher]
    end
    subgraph Infra["Infrastructure (AWS)"]
        EKS[EKS Kubernetes]
        RDS[(RDS PostgreSQL)]
        REDIS[(ElastiCache Redis)]
        CW[CloudWatch + Grafana]
    end
    PS --> T1
    OS --> T2
    NS --> T3
    T1 --> LEDGER
    T1 --> ANALYTICS
    T1 --> FRAUD
    T2 --> ANALYTICS
    T3 --> NOTIFY
    T1 --> DLQ
    T2 --> DLQ
    LEDGER --> RDS
    ANALYTICS --> RDS
    FRAUD --> REDIS
    EKS --> LEDGER
    EKS --> ANALYTICS
    EKS --> FRAUD
    EKS --> NOTIFY
    style Kafka fill:#2d1a0d,stroke:#e67e22,color:#fff
    style Consumers fill:#0d2d3a,stroke:#06b6d4,color:#fff
    style Infra fill:#0d1a2d,stroke:#3b82f6,color:#fff

Key Design Decisions

Kafka partition strategy

Partitioned by payment_id: all events for a payment land in the same partition, which preserves their order. Each topic has 20 partitions, a count derived from the peak throughput target.
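As a rough illustration of why keying by payment_id preserves per-payment order, the sketch below maps each key to one of 20 partitions with a stable hash. It assumes CRC32 as a stand-in hash (Kafka's default partitioner uses murmur2, so real partition numbers will differ), and the payment IDs and statuses are made up for the example.

```python
import zlib

NUM_PARTITIONS = 20  # per-topic partition count stated in the text


def partition_for(payment_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a payment_id to a partition with a stable hash.

    CRC32 is an assumption made for this sketch; Kafka's default
    partitioner hashes the key bytes with murmur2 instead.
    """
    return zlib.crc32(payment_id.encode("utf-8")) % num_partitions


# Hypothetical events: (payment_id, status), listed in the order they were produced.
events = [
    ("pay_1001", "authorized"),
    ("pay_2002", "authorized"),
    ("pay_1001", "captured"),
    ("pay_2002", "failed"),
    ("pay_1001", "completed"),
]

# Group events by partition, appending in production order.
by_partition: dict[int, list[tuple[str, str]]] = {}
for payment_id, status in events:
    by_partition.setdefault(partition_for(payment_id), []).append((payment_id, status))
```

Because the mapping depends on the partition count, changing that count later would send some existing keys to a different partition, so the ordering guarantee holds only while the count stays fixed.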

Dead Letter Queue pattern

Every consumer has a .dlq topic. Failed messages (after 3 retries) land there with full error context. A separate monitor alerts on-call and enables manual replay without touching the main pipeline.
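The retry-then-dead-letter flow can be sketched roughly as follows. This is a minimal illustration, not the platform's actual consumer: the `<topic>.dlq` naming, the envelope fields, the backoff, and the `publish` callback (standing in for a real Kafka producer) are all assumptions made for the example.

```python
import json
import time
from typing import Callable, Optional

MAX_RETRIES = 3  # from the text: dead-letter after 3 retries


def consume_with_dlq(
    message: dict,
    topic: str,
    handler: Callable[[dict], None],
    publish: Callable[[str, str], None],
    backoff_s: float = 0.0,
) -> bool:
    """Return True if handled, False if the message was dead-lettered."""
    last_error: Optional[Exception] = None
    attempts = 0
    for attempt in range(1 + MAX_RETRIES):  # one initial try plus 3 retries
        attempts += 1
        try:
            handler(message)
            return True
        except Exception as exc:  # a real consumer would catch narrower errors
            last_error = exc
            if attempt < MAX_RETRIES:
                time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    # Hypothetical error-context envelope; field names are assumptions.
    envelope = {
        "source_topic": topic,
        "payload": message,
        "error_type": type(last_error).__name__,
        "error_message": str(last_error),
        "attempts": attempts,
        "failed_at": time.time(),
    }
    publish(f"{topic}.dlq", json.dumps(envelope))
    return False


# Usage with an in-memory list standing in for the producer.
sent: list[tuple[str, str]] = []


def always_fails(msg: dict) -> None:
    raise ValueError("ledger rejected amount")


handled = consume_with_dlq(
    {"payment_id": "pay_1001", "amount": 4200},
    "payments.transactions.completed",
    always_fails,
    lambda dlq_topic, value: sent.append((dlq_topic, value)),
)
```

Keeping the original payload and source topic in the envelope is what makes manual replay possible without touching the main pipeline.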

Consumer group configuration

Static group membership (group.instance.id) prevents rebalancing storms during Kubernetes pod restarts, as long as each pod rejoins within the session timeout. This saved roughly 8 minutes of stalled processing per rolling deployment.
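A consumer configuration along these lines might look like the sketch below. The property names (`group.id`, `group.instance.id`, `session.timeout.ms`, `bootstrap.servers`) are standard Kafka consumer settings; the service name, broker address, 60-second timeout, and the choice to derive the instance ID from the pod name are assumptions for illustration. The ID has to be unique per member and stable across restarts, which in practice means stable pod names (for example from a StatefulSet).

```python
import os


def consumer_config(
    group_id: str,
    pod_name: str,
    bootstrap_servers: str,
    session_timeout_ms: int = 60_000,  # assumed value, not from the source
) -> dict:
    """Build a consumer config dict that uses static group membership."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        # Stable per-pod identity: a restarted pod rejoins as the same member.
        "group.instance.id": pod_name,
        # The broker waits this long for the member to return before rebalancing.
        "session.timeout.ms": session_timeout_ms,
    }


# Kubernetes sets HOSTNAME to the pod name by default; "ledger-0" is a fallback
# for running the sketch locally.
config = consumer_config(
    group_id="ledger-service",
    pod_name=os.environ.get("HOSTNAME", "ledger-0"),
    bootstrap_servers="kafka:9092",
)
```

The session timeout needs to exceed the time a pod typically takes to restart, and the broker caps it through its own maximum session timeout setting.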

Deployment Pipeline

flowchart LR
    DEV[Developer Push] --> GH[GitHub Actions CI]
    GH --> TEST[Unit + Integration Tests]
    TEST --> BUILD[Docker Build + Push to ECR]
    BUILD --> STAGE[Deploy to Staging EKS]
    STAGE --> SMOKE[Smoke Tests]
    SMOKE --> PROD[Canary Deploy to Prod]
    PROD --> MONITOR[5min Canary Observation]
    MONITOR --> FULL[100% Traffic Rollout]
    style PROD fill:#1a2d0d,stroke:#10b981,color:#fff
    style MONITOR fill:#2d1a0d,stroke:#e67e22,color:#fff

The canary receives 5% of traffic for 5 minutes. Auto-rollback triggers if the error rate exceeds 0.1% or p95 latency exceeds 2x the baseline.
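The rollback rule can be expressed as a small decision function. The thresholds come from the text; the `CanarySample` shape and the way metrics are collected are assumptions for the sketch, and a real pipeline would read them from its monitoring stack.

```python
from dataclasses import dataclass

MAX_ERROR_RATE = 0.001      # 0.1% error-rate threshold from the text
MAX_P95_MULTIPLIER = 2.0    # p95 may not exceed 2x the baseline


@dataclass(frozen=True)
class CanarySample:
    """Hypothetical metrics gathered over the canary observation window."""
    requests: int
    errors: int
    p95_latency_ms: float


def should_rollback(sample: CanarySample, baseline_p95_ms: float) -> bool:
    """Return True if either rollback condition is met."""
    error_rate = sample.errors / sample.requests if sample.requests else 0.0
    too_many_errors = error_rate > MAX_ERROR_RATE
    too_slow = sample.p95_latency_ms > MAX_P95_MULTIPLIER * baseline_p95_ms
    return too_many_errors or too_slow


healthy = CanarySample(requests=10_000, errors=5, p95_latency_ms=180.0)
erroring = CanarySample(requests=10_000, errors=11, p95_latency_ms=180.0)
slow = CanarySample(requests=10_000, errors=0, p95_latency_ms=250.0)
```

With a 100 ms baseline, `healthy` passes while `erroring` (0.11% errors) and `slow` (250 ms against a 200 ms limit) would each trigger a rollback.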

Scaling Bottlenecks

Pushing the platform from thousands to millions of events per day exposed three hard limits — each demanded a targeted fix rather than throwing more hardware at it:

1. Consumer lag under burst traffic. During payment spikes, a single consumer group fell behind by minutes. Fix: scaled partitions to 20 per topic and matched consumer instances 1:1, enabling true parallel processing. Lag dropped from minutes to sub-second.

2. Database connection exhaustion. Each microservice held its own pool; at scale the shared Postgres hit max connections and rejected writes. Fix: introduced PgBouncer in transaction-pooling mode, collapsing thousands of app connections into a small server-side pool.

3. Rebalancing storms on deploy. Rolling Kubernetes deploys triggered Kafka consumer rebalances, stalling processing for minutes each release. Fix: static group membership (group.instance.id) plus longer session timeouts kept partitions assigned across pod restarts.
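The connection-exhaustion problem in item 2 comes down to simple arithmetic, sketched below. Every number here is hypothetical (service count, pods per service, pool sizes, and the Postgres connection limit are not given in the source); the point is only to show how per-pod pools multiply and how a pooler caps the server-side total.

```python
def direct_connections(services: int, pods_per_service: int, pool_size_per_pod: int) -> int:
    """Server connections when every pod holds its own pool to Postgres."""
    return services * pods_per_service * pool_size_per_pod


def fits(server_connections: int, max_connections: int, reserved: int = 0) -> bool:
    """Check whether the demand fits under the database's connection limit."""
    return server_connections <= max_connections - reserved


# Hypothetical fleet: 8 services x 25 pods x 10 connections per pod.
without_pooler = direct_connections(services=8, pods_per_service=25, pool_size_per_pod=10)

# Hypothetical limits: the database allows 500 connections, 10 kept for admin use.
MAX_CONNECTIONS = 500
RESERVED = 10

# With a pooler in front, Postgres only sees the pooler's server-side pool.
POOLER_SERVER_POOL = 50  # assumed size

direct_ok = fits(without_pooler, MAX_CONNECTIONS, RESERVED)
pooled_ok = fits(POOLER_SERVER_POOL, MAX_CONNECTIONS, RESERVED)
```

In this made-up scenario the direct setup asks for 2,000 connections against a budget of 490, while the pooled setup needs 50. One trade-off of transaction pooling is that a server connection goes back to the pool after each transaction, so application code cannot rely on session-level state carrying over between transactions.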

The Solution

The result is an event-driven platform that processes 2M+ events per hour at peak with sub-second consumer lag and zero-downtime deploys. Every scaling limit was solved by understanding the system's actual constraint (partitions, connection pooling, group membership) rather than by over-provisioning. The architecture now scales horizontally by adding partitions and matching consumers.

Outcome

Metric                 Legacy Monolith        New Platform
Peak throughput        ~500K events/hr        2M+ events/hr
Deployment time        4 hours coordinated    12 minutes per service
Incidents per month    8-12 (cascading)       0-1 (isolated)
Uptime                 98.2%                  99.95%
Team independence      0 (shared monolith)    8 teams fully independent