Context

Sector: Commercial ticketing platform. Role: Software Engineer, DevOps. Environment: Mission-critical ticketing services subjected to massive concurrency bursts during on-sale events and multi-year platform-migration backfills.

Challenge

  • Stabilize logging, streaming, and storage layers under burst-traffic load that overwhelmed the prior architecture.
  • Reduce scale-time and infrastructure cost without sacrificing reliability during on-sale events.
  • Diagnose and remediate cross-system bottlenecks across Kafka, downstream storage, and log-ingest paths.

Architecture

Compute & Deployment

  • Migrated mission-critical ticketing services from EC2 to spot-instance Kubernetes.
  • Established CI/CD pipelines for repeatable deployments and rollback.
  • Reduced node scale-time from 15 minutes to approximately 2 minutes, with most of the remaining time spent on Kafka consumer-group rebalancing.
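
As noted above, the residual ~2 minutes of scale-time is dominated by Kafka consumer-group rebalancing. As an illustration of how that cost is typically managed when spot nodes churn (not necessarily the configuration used here), the sketch below pairs standard Kafka consumer option names with a simple scale-time budget; all concrete values are placeholder assumptions.

```python
# Hedged sketch: static group membership (KIP-345) is one common way to
# keep consumer-group rebalances cheap when a spot node is replaced. The
# option names below are standard Kafka consumer settings; the values and
# the group/pod names are hypothetical.
consumer_config = {
    "bootstrap.servers": "kafka:9092",    # placeholder broker address
    "group.id": "ticketing-consumers",    # hypothetical group name
    # A stable per-pod id lets a restarted consumer rejoin the group
    # without forcing a full rebalance.
    "group.instance.id": "ticketing-pod-0",
    # Give a replaced spot node roughly two minutes to return before the
    # broker evicts its static member and triggers a rebalance.
    "session.timeout.ms": 120_000,
}

def scale_time_budget(node_boot_s: int, pod_ready_s: int, rebalance_s: int) -> int:
    """Total seconds from scale-out trigger to consumers processing again."""
    return node_boot_s + pod_ready_s + rebalance_s

# Illustrative breakdown of a ~2-minute scale event, dominated by rebalance.
total = scale_time_budget(node_boot_s=40, pod_ready_s=20, rebalance_s=60)
```

The point of the budget function is that once node boot and pod scheduling are fast, rebalance time becomes the floor on scale-time, which matches the breakdown described above.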

Logging Pipeline

  • Redesigned the Elasticsearch cluster topology: replaced four xlarge nodes running embedded masters with twelve medium data nodes plus three dedicated master nodes at equivalent cost, isolating cluster-state traffic from data ingest.
  • Patched Fluent Bit to handle high-volume JSON log ingestion: expanded the file-storage cache from 500 MB to 2 GB, tightened the flush cadence, and authored a C source patch enlarging per-message buffer sizes for large JSON payloads.
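
The cache and flush tuning above can be sketched with a toy buffered shipper. This is a behavioral model only, not Fluent Bit's actual implementation or option names; the class, sizes, and record shapes are illustrative assumptions.

```python
import json

class BufferedShipper:
    """Toy model of a log shipper with a bounded byte cache and periodic
    flushes. It models only the tuning described above (bigger cache,
    shorter flush interval), not Fluent Bit internals."""

    def __init__(self, cache_limit_bytes: int, flush_interval_s: float):
        self.cache_limit_bytes = cache_limit_bytes
        self.flush_interval_s = flush_interval_s  # shorter = more frequent flushes
        self.buffer = []          # pending serialized records
        self.buffered_bytes = 0
        self.dropped = 0          # records refused because the cache was full
        self.shipped = 0          # records flushed downstream

    def ingest(self, record: dict) -> None:
        payload = json.dumps(record).encode()
        # When the cache is exhausted, new records are dropped -- the
        # failure mode a too-small cache produces under burst load.
        if self.buffered_bytes + len(payload) > self.cache_limit_bytes:
            self.dropped += 1
            return
        self.buffer.append(payload)
        self.buffered_bytes += len(payload)

    def flush(self) -> int:
        """Ship everything buffered; returns the number of records shipped."""
        n = len(self.buffer)
        self.shipped += n
        self.buffer.clear()
        self.buffered_bytes = 0
        return n

# Illustrative comparison: the same burst of large JSON records against a
# tiny cache (drops records) versus an enlarged cache (keeps everything).
small = BufferedShipper(cache_limit_bytes=100, flush_interval_s=1.0)
large = BufferedShipper(cache_limit_bytes=2 * 1024**2, flush_interval_s=1.0)
record = {"event": "ticket_purchase", "payload": "x" * 50}
for _ in range(10):
    small.ingest(record)
    large.ingest(record)
```

The enlarged per-message buffers from the C patch address the analogous problem one level down: a single oversized JSON record must fit its buffer, or it is truncated or dropped regardless of total cache size.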

Streaming & Storage

  • Diagnosed a downstream DynamoDB write bottleneck that caused Kafka Streams backpressure during on-sale bursts.
  • Prototyped Cassandra and ScyllaDB as alternative storage backends; the designs were subsequently adopted by the team.
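
The bottleneck above can be illustrated with a minimal queue model: when the backend's sustained write throughput is below the burst ingest rate, the intermediate buffer saturates and new writes are refused, which is what surfaces as Kafka Streams backpressure. The `BoundedSink` model and all throughput numbers are illustrative assumptions, not measurements from the actual system.

```python
from collections import deque

class BoundedSink:
    """Toy model of the write path: a storage backend with a fixed write
    budget per tick, fronted by a bounded buffer. Refused offers stand in
    for backpressure propagating to the producer."""

    def __init__(self, writes_per_tick: int, buffer_capacity: int):
        self.writes_per_tick = writes_per_tick
        self.buffer_capacity = buffer_capacity
        self.buffer = deque()
        self.written = 0
        self.rejected = 0  # offers refused => producer is backpressured

    def offer(self, item) -> bool:
        if len(self.buffer) >= self.buffer_capacity:
            self.rejected += 1
            return False
        self.buffer.append(item)
        return True

    def tick(self) -> None:
        """One unit of time: the backend drains its write budget."""
        for _ in range(min(self.writes_per_tick, len(self.buffer))):
            self.buffer.popleft()
            self.written += 1

def run(sink: BoundedSink, ingest_per_tick: int, ticks: int) -> None:
    for t in range(ticks):
        for i in range(ingest_per_tick):
            sink.offer((t, i))
        sink.tick()

# A sustained 400 writes/tick burst against a 300 writes/tick backend:
# the buffer saturates and offers start failing (backpressure).
slow = BoundedSink(writes_per_tick=300, buffer_capacity=500)
run(slow, ingest_per_tick=400, ticks=20)

# A backend whose budget exceeds the ingest rate (the role the
# Cassandra/ScyllaDB prototypes played, numbers illustrative) drains the
# buffer every tick and never rejects a write.
fast = BoundedSink(writes_per_tick=600, buffer_capacity=500)
run(fast, ingest_per_tick=400, ticks=20)
```

The model captures the diagnosis: no amount of buffer tuning fixes a sustained throughput deficit, which is why the remediation was a higher-throughput storage backend rather than a larger queue.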

Outcomes

  • Sustained 42M messages per minute at peak during on-sale events and multi-year platform-migration backfills.
  • Reduced node scale-time from 15 minutes to approximately 2 minutes.
  • Reduced infrastructure cost by 40% via spot-instance Kubernetes migration.
  • Eliminated the downstream storage bottleneck via the prototyped alternative storage architecture.