Cloud
    June 2026

    Cloud-Native Microservices Platform

    Event-driven microservices platform on AWS processing 50M+ events per day with sub-second latency and 99.95% uptime.

    AWS · Kafka · Java · Spring Boot · Terraform · Kubernetes

    The Challenge

    A legacy monolith handling payment events was causing cascading failures during peak traffic. A single Black Friday spike brought down the entire system. The business needed:

    • Process 50M+ events/day without downtime
    • Sub-second latency for payment status updates
    • Independent deployability — 8 teams, no deployment coordination
    • Zero-downtime deployments for a 24/7 payment system

    Architecture

    flowchart TD
        subgraph Producers["Event Producers"]
            PS[Payment Service]
            OS[Order Service]
            NS[Notification Service]
        end
        subgraph Kafka["Kafka Cluster (Confluent)"]
            T1[payments.transactions.completed]
            T2[orders.status.updated]
            T3[notifications.email.queued]
            DLQ[Dead Letter Queues]
        end
        subgraph Consumers["Consumer Microservices"]
            LEDGER[Ledger Service]
            ANALYTICS[Analytics Service]
            FRAUD[Fraud Detection]
            NOTIFY[Notification Dispatcher]
        end
        subgraph Infra["Infrastructure (AWS)"]
            EKS[EKS Kubernetes]
            RDS[(RDS PostgreSQL)]
            REDIS[(ElastiCache Redis)]
            CW[CloudWatch + Grafana]
        end
        PS --> T1
        OS --> T2
        NS --> T3
        T1 --> LEDGER
        T1 --> ANALYTICS
        T1 --> FRAUD
        T2 --> ANALYTICS
        T3 --> NOTIFY
        T1 --> DLQ
        T2 --> DLQ
        LEDGER --> RDS
        ANALYTICS --> RDS
        FRAUD --> REDIS
        EKS --> LEDGER
        EKS --> ANALYTICS
        EKS --> FRAUD
        EKS --> NOTIFY
        style Kafka fill:#2d1a0d,stroke:#e67e22,color:#fff
        style Consumers fill:#0d2d3a,stroke:#06b6d4,color:#fff
        style Infra fill:#0d1a2d,stroke:#3b82f6,color:#fff

    Key Design Decisions

    Kafka partition strategy

    Topics are partitioned by payment_id, so all events for a given payment land in the same partition and keep their order. Each topic has 20 partitions, a count derived from the peak throughput target.
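
The key-to-partition mapping can be sketched as below. This is an illustration under assumptions, not the platform's code: Kafka's built-in partitioner hashes the serialized key with murmur2, while this dependency-free sketch substitutes `String.hashCode()`, so the partition numbers it prints will not match a real cluster. The class name `PaymentPartitioner` and the sample ids are hypothetical.

```java
// Illustrative stand-in for key-based partitioning.
// Assumption: String.hashCode() replaces Kafka's murmur2 hash to keep the sketch self-contained.
class PaymentPartitioner {
    static final int PARTITIONS = 20; // partition count stated in the text

    // Maps a payment_id to a partition; the same id always yields the same partition,
    // which is what keeps all events for one payment in order.
    static int partitionFor(String paymentId) {
        // floorMod keeps the result in [0, PARTITIONS) even for negative hash codes
        return Math.floorMod(paymentId.hashCode(), PARTITIONS);
    }

    public static void main(String[] args) {
        String[] paymentIds = {"pay-1001", "pay-1002", "pay-1001"}; // hypothetical ids
        for (String id : paymentIds) {
            System.out.println(id + " -> partition " + partitionFor(id));
        }
    }
}
```

The first and third events share a key and therefore a partition, so a single consumer sees them in the order they were produced.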

    Dead Letter Queue pattern

    Every consumer has a .dlq topic. Failed messages (after 3 retries) land there with full error context. A separate monitor alerts on-call and enables manual replay without touching the main pipeline.
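
A minimal sketch of the retry-then-park flow, assuming "3 retries" means one initial attempt plus three retries. An in-memory list stands in for the .dlq topic, and the names `DlqRouter` and `DeadLetter` are hypothetical. A real consumer would publish the failed record to Kafka and would probably back off between attempts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class DlqRouter {
    static final int MAX_RETRIES = 3; // retry budget stated in the text

    // Hypothetical envelope: the failed payload plus its error context.
    static final class DeadLetter {
        final String payload;
        final String error;
        final int attempts;

        DeadLetter(String payload, String error, int attempts) {
            this.payload = payload;
            this.error = error;
            this.attempts = attempts;
        }
    }

    // In-memory stand-in for the consumer's .dlq topic.
    final List<DeadLetter> dlq = new ArrayList<>();

    // Returns true if the handler succeeded, false if the message was parked in the DLQ.
    boolean process(String payload, Consumer<String> handler) {
        RuntimeException last = null;
        int attempts = 0;
        // One initial attempt plus MAX_RETRIES retries.
        while (attempts <= MAX_RETRIES) {
            attempts++;
            try {
                handler.accept(payload);
                return true;
            } catch (RuntimeException e) {
                last = e; // remember the most recent failure for the error context
            }
        }
        dlq.add(new DeadLetter(payload, String.valueOf(last.getMessage()), attempts));
        return false;
    }
}
```

Because failures go to a separate sink, the main pipeline keeps consuming and parked messages can be replayed later.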

    Consumer group configuration

    Static group membership (group.instance.id) prevents rebalancing storms during Kubernetes pod restarts. Saved ~8 minutes of stalled processing per rolling deployment.
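
As a rough illustration, static membership comes down to giving each consumer a stable `group.instance.id`. The sketch below builds such properties with plain `java.util.Properties`. It assumes pods have stable names (for example from a StatefulSet), and the 60-second session timeout is a placeholder, not the platform's actual setting. The class and method names are hypothetical.

```java
import java.util.Properties;

class StaticMembershipConfig {
    // Builds consumer properties with an instance id that survives pod restarts.
    // Assumption: podName is stable across restarts (e.g. "ledger-0" from a StatefulSet).
    static Properties forPod(String groupId, String podName) {
        Properties props = new Properties();
        props.setProperty("group.id", groupId);
        // Same id after a restart lets the broker hand back the same partitions
        // without a rebalance, provided the pod returns within the session timeout.
        props.setProperty("group.instance.id", groupId + "-" + podName);
        props.setProperty("session.timeout.ms", "60000"); // assumed value, tune per deployment
        return props;
    }

    public static void main(String[] args) {
        System.out.println(forPod("ledger-service", "ledger-0"));
    }
}
```

These properties would be passed to the Kafka consumer alongside the usual bootstrap and deserializer settings.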

    Deployment Pipeline

    flowchart LR
        DEV[Developer Push] --> GH[GitHub Actions CI]
        GH --> TEST[Unit + Integration Tests]
        TEST --> BUILD[Docker Build + Push to ECR]
        BUILD --> STAGE[Deploy to Staging EKS]
        STAGE --> SMOKE[Smoke Tests]
        SMOKE --> PROD[Canary Deploy to Prod]
        PROD --> MONITOR[5min Canary Observation]
        MONITOR --> FULL[100% Traffic Rollout]
        style PROD fill:#1a2d0d,stroke:#10b981,color:#fff
        style MONITOR fill:#2d1a0d,stroke:#e67e22,color:#fff

    The canary takes 5% of production traffic for 5 minutes. An automatic rollback triggers if the error rate exceeds 0.1% or p95 latency exceeds 2x the baseline.
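
The rollback rule can be expressed as a small predicate. This is a sketch of the stated thresholds only. How the metrics are collected and how the rollback is executed are outside its scope, and the class and parameter names are hypothetical.

```java
class CanaryGate {
    static final double MAX_ERROR_RATE = 0.001;   // 0.1% error rate
    static final double MAX_LATENCY_FACTOR = 2.0; // 2x baseline p95

    // Returns true when the canary breaches either threshold and should be rolled back.
    static boolean shouldRollback(long errors, long requests,
                                  double canaryP95Ms, double baselineP95Ms) {
        // Treat an empty observation window as zero errors rather than dividing by zero.
        double errorRate = requests == 0 ? 0.0 : (double) errors / requests;
        boolean tooManyErrors = errorRate > MAX_ERROR_RATE;
        boolean tooSlow = canaryP95Ms > MAX_LATENCY_FACTOR * baselineP95Ms;
        return tooManyErrors || tooSlow;
    }

    public static void main(String[] args) {
        // 20 errors in 10,000 requests is 0.2%, above the 0.1% threshold.
        System.out.println(shouldRollback(20, 10_000, 120.0, 100.0));
    }
}
```

Either condition alone is enough to trigger a rollback, which matches the "or" in the rule above.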

    Scaling Bottlenecks

    Pushing the platform from thousands to millions of events per day exposed three hard limits — each demanded a targeted fix rather than throwing more hardware at it:

    1. Consumer lag under burst traffic. During payment spikes, a single consumer group fell behind by minutes. Fix: scaled partitions to 20 per topic and matched consumer instances 1:1, enabling true parallel processing. Lag dropped from minutes to sub-second.

    2. Database connection exhaustion. Each microservice held its own pool; at scale the shared Postgres hit max connections and rejected writes. Fix: introduced PgBouncer in transaction-pooling mode, collapsing thousands of app connections into a small server-side pool.

    3. Rebalancing storms on deploy. Rolling Kubernetes deploys triggered Kafka consumer rebalances, stalling processing for minutes each release. Fix: static group membership (group.instance.id) plus longer session timeouts kept partitions assigned across pod restarts.
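
The 1:1 partition-to-consumer matching from item 1 follows from how consumer groups work: each partition is read by at most one consumer in a group, so consumers beyond the partition count sit idle. The sketch below uses a simplified round-robin assignment to show this. It is not one of Kafka's actual assignors, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class PartitionAssignment {
    // Simplified round-robin: partition p goes to consumer (p % consumers).
    // Assumes consumers > 0.
    static List<List<Integer>> assign(int partitions, int consumers) {
        List<List<Integer>> result = new ArrayList<>();
        for (int c = 0; c < consumers; c++) {
            result.add(new ArrayList<>());
        }
        for (int p = 0; p < partitions; p++) {
            result.get(p % consumers).add(p);
        }
        return result;
    }

    // Counts consumers that received no partitions and therefore do no work.
    static long idleConsumers(int partitions, int consumers) {
        return assign(partitions, consumers).stream().filter(List::isEmpty).count();
    }

    public static void main(String[] args) {
        System.out.println("20 partitions, 5 consumers:  " + assign(20, 5));
        System.out.println("20 partitions, 25 consumers: idle = " + idleConsumers(20, 25));
    }
}
```

With 20 partitions, 20 consumers each own exactly one partition. Adding consumers past that point adds no throughput unless partitions are added too.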

    The Solution

    The result is an event-driven platform that processes millions of events per day with sub-second consumer lag and zero-downtime deploys. Every scaling limit was solved by understanding the system's actual constraint — partitions, connection pooling, group membership — rather than over-provisioning. The architecture now scales horizontally simply by adding partitions and matching consumers.

    Outcome

    Metric                Legacy Monolith        New Platform
    Peak throughput       ~500K events/hr        2M+ events/hr
    Deployment time       4 hours coordinated    12 minutes per service
    Incidents per month   8-12 (cascading)       0-1 (isolated)
    Uptime                98.2%                  99.95%
    Team independence     0 (shared monolith)    8 teams fully independent