Context
Sector: Commercial ticketing platform. Role: Software Engineer, DevOps. Environment: Mission-critical ticketing services subjected to massive concurrency bursts during on-sale events and multi-year platform-migration backfills.
Challenge
- Stabilize logging, streaming, and storage layers under burst-traffic load that overwhelmed the prior architecture.
- Reduce scale-time and infrastructure cost without sacrificing reliability during on-sale events.
- Diagnose and remediate cross-system bottlenecks across Kafka, downstream storage, and log-ingest paths.
Architecture
Compute & Deployment
- Migrated mission-critical ticketing services from EC2 to spot-instance Kubernetes.
- Established CI/CD pipelines for repeatable deployments and rollback.
- Reduced node scale-time from 15 minutes to approximately 2 minutes, most of which is now consumer-group rebalance time.
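Since the residual scale-time is dominated by consumer-group rebalance, a common mitigation (an illustrative sketch here, not a confirmed detail of this system) is Kafka static membership, so a replacement spot node rejoins under the same identity without triggering a full rebalance. All names and values below are assumptions:

```python
# Hypothetical Kafka consumer settings aimed at limiting rebalance
# churn when spot nodes are replaced. Values are illustrative only.

def consumer_config(pod_name: str) -> dict:
    """Build a consumer config using static membership so a restarted
    pod rejoins under its previous identity instead of forcing a
    full consumer-group rebalance."""
    return {
        "group.id": "ticketing-consumers",
        # Static membership: a member rejoining with the same
        # group.instance.id does not trigger a rebalance.
        "group.instance.id": pod_name,
        # How long the broker waits for a static member to return
        # before evicting it and rebalancing.
        "session.timeout.ms": 120_000,
        # Cooperative rebalancing keeps unaffected partition
        # assignments in place during membership changes.
        "partition.assignment.strategy": "cooperative-sticky",
    }

cfg = consumer_config("consumer-pod-0")
print(cfg["group.instance.id"])  # -> consumer-pod-0
```

Deriving `group.instance.id` from the pod name keeps identities stable across spot-instance replacement.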
Logging Pipeline
- Redesigned Elasticsearch cluster topology: replaced 4 xlarge nodes running embedded masters with 12 medium data nodes and 3 dedicated masters at equivalent cost, isolating cluster-state traffic from data ingest.
- Patched Fluent Bit to handle high-volume JSON log ingestion: expanded the file-storage cache from 500MB to 2GB, tightened the flush cadence, and authored a C source patch enlarging per-message buffer sizes for oversized JSON payloads.
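The buffer and cache changes above can be sketched as a Fluent Bit configuration fragment. Paths, the flush interval, and the tail buffer sizes are illustrative assumptions; only the 500MB-to-2GB cache expansion and the tightened flush cadence come from the work described:

```ini
# Illustrative Fluent Bit config; sizes and paths are assumptions.

[SERVICE]
    Flush            5
    storage.path     /var/log/flb-storage/
    storage.sync     normal

[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    storage.type      filesystem
    # Larger buffers so oversized single-line JSON records survive intact
    Buffer_Chunk_Size 512k
    Buffer_Max_Size   5M

[OUTPUT]
    Name                     es
    Match                    *
    # Cap on the on-disk backlog; expanded from 500M to 2G
    storage.total_limit_size 2G
```

Per-message buffer limits beyond these settings required the C source patch noted above.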
Streaming & Storage
- Diagnosed downstream DynamoDB write bottleneck causing Kafka Streams backpressure during on-sale bursts.
- Prototyped Cassandra and ScyllaDB as alternative storage backends; the designs were subsequently adopted by the team.
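The backpressure dynamic above can be shown with a minimal sketch. The rates are illustrative (the 700k/s inbound figure is 42M messages per minute expressed per second; the sink capacity is an assumed number, not a measured one): whenever the sink drains slower than records arrive, the in-flight backlog grows without bound, which is what Kafka Streams surfaces as backpressure.

```python
# Minimal sketch of why a slow DynamoDB-style sink backs up a Kafka
# Streams topology. All rates are illustrative assumptions.

def simulate_backpressure(inbound_per_s: int, sink_capacity_per_s: int,
                          seconds: int) -> int:
    """Return the backlog depth after `seconds` of sustained load."""
    queue = 0
    for _ in range(seconds):
        queue += inbound_per_s                    # records arriving from Kafka
        queue -= min(queue, sink_capacity_per_s)  # records the sink drains
    return queue

# At 700k records/s inbound (42M/min) against an assumed 500k/s sink,
# the backlog grows by 200k records every second: 12M after a minute.
print(simulate_backpressure(700_000, 500_000, 60))  # -> 12000000
```

Raising sink write capacity above the inbound rate (the motivation for the Cassandra/ScyllaDB prototypes) is what drives the steady-state backlog back to zero.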
Outcomes
- Sustained 42M messages per minute at peak during on-sale events and multi-year platform-migration backfills.
- Reduced node scale-time from 15 minutes to approximately 2 minutes.
- Reduced infrastructure cost by 40% via spot-instance Kubernetes migration.
- Eliminated the downstream storage bottleneck via the prototyped alternative storage architecture.