Context
Sector: Commercial ticketing platform. Role: Software Engineer, DevOps. Environment: Mission-critical ticketing services subjected to massive concurrency bursts during on-sale events and multi-year platform-migration backfills.
Challenge
- Stabilize logging, streaming, and storage layers under burst-traffic load that overwhelmed the prior architecture.
- Reduce scale-time and infrastructure cost without sacrificing reliability during on-sale events.
- Diagnose and remediate cross-system bottlenecks across Kafka, downstream storage, and log-ingest paths.
Architecture
Compute & Deployment
- Migrated mission-critical ticketing services from EC2 to spot-instance Kubernetes.
- Established CI/CD pipelines for repeatable deployments and rollback.
- Reduced node scale-time from 15 minutes to approximately 2 minutes, most of which is now consumer-group rebalance time.
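Since the residual scale-time is dominated by consumer-group rebalance, a common mitigation (an illustrative sketch here, not a confirmed detail of this system) is Kafka static membership, so a replacement spot node rejoins under the same identity without triggering a full rebalance. All names and values below are assumptions:

```python
# Hypothetical Kafka consumer settings aimed at limiting rebalance
# churn when spot nodes are replaced. Values are illustrative only.

def consumer_config(pod_name: str) -> dict:
    """Build a consumer config using static membership so a restarted
    pod rejoins under its previous identity instead of forcing a
    full consumer-group rebalance."""
    return {
        "group.id": "ticketing-consumers",
        # Static membership: a member rejoining with the same
        # group.instance.id does not trigger a rebalance.
        "group.instance.id": pod_name,
        # How long the broker waits for a static member to return
        # before evicting it and rebalancing.
        "session.timeout.ms": 120_000,
        # Cooperative rebalancing keeps unaffected partition
        # assignments in place during membership changes.
        "partition.assignment.strategy": "cooperative-sticky",
    }

cfg = consumer_config("consumer-pod-0")
print(cfg["group.instance.id"])  # -> consumer-pod-0
```

Deriving `group.instance.id` from the pod name keeps identities stable across spot-instance replacement.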
Logging Pipeline
- Redesigned Elasticsearch cluster topology: replaced 4 xlarge nodes running embedded masters with 12 medium data nodes and 3 dedicated masters at equivalent cost, isolating cluster-state traffic from data ingest.
- Patched Fluent Bit to handle high-volume JSON log ingestion: expanded the file-storage cache from 500MB to 2GB, tightened the flush cadence, and authored a C source patch enlarging per-message buffer sizes for oversized JSON payloads.
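The buffer and cache changes above can be sketched as a Fluent Bit configuration fragment. Paths, the flush interval, and the tail buffer sizes are illustrative assumptions; only the 500MB-to-2GB cache expansion and the tightened flush cadence come from the work described:

```ini
# Illustrative Fluent Bit config; sizes and paths are assumptions.

[SERVICE]
    Flush            5
    storage.path     /var/log/flb-storage/
    storage.sync     normal

[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    storage.type      filesystem
    # Larger buffers so oversized single-line JSON records survive intact
    Buffer_Chunk_Size 512k
    Buffer_Max_Size   5M

[OUTPUT]
    Name                     es
    Match                    *
    # Cap on the on-disk backlog; expanded from 500M to 2G
    storage.total_limit_size 2G
```

Per-message buffer limits beyond these settings required the C source patch noted above.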
Streaming & Storage
- Diagnosed downstream DynamoDB write bottleneck causing Kafka Streams backpressure during on-sale bursts.
- Prototyped Cassandra and ScyllaDB as alternative storage backends; the designs were subsequently adopted by the team.
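The backpressure dynamic above can be shown with a minimal sketch. The rates are illustrative (the 700k/s inbound figure is 42M messages per minute expressed per second; the sink capacity is an assumed number, not a measured one): whenever the sink drains slower than records arrive, the in-flight backlog grows without bound, which is what Kafka Streams surfaces as backpressure.

```python
# Minimal sketch of why a slow DynamoDB-style sink backs up a Kafka
# Streams topology. All rates are illustrative assumptions.

def simulate_backpressure(inbound_per_s: int, sink_capacity_per_s: int,
                          seconds: int) -> int:
    """Return the backlog depth after `seconds` of sustained load."""
    queue = 0
    for _ in range(seconds):
        queue += inbound_per_s                    # records arriving from Kafka
        queue -= min(queue, sink_capacity_per_s)  # records the sink drains
    return queue

# At 700k records/s inbound (42M/min) against an assumed 500k/s sink,
# the backlog grows by 200k records every second: 12M after a minute.
print(simulate_backpressure(700_000, 500_000, 60))  # -> 12000000
```

Raising sink write capacity above the inbound rate (the motivation for the Cassandra/ScyllaDB prototypes) is what drives the steady-state backlog back to zero.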
Outcomes
- Sustained 42M messages per minute at peak during on-sale events and multi-year platform-migration backfills.
- Reduced node scale-time from 15 minutes to approximately 2 minutes.
- Reduced infrastructure cost by 40% via spot-instance Kubernetes migration.
- Eliminated the downstream storage bottleneck via the prototyped alternative storage architecture.