Context
The ad platform produces a high volume of interaction events — impressions, clicks, conversions — and the downstream systems need every single one. Out-of-order arrivals, retries from upstream, and infrastructure hiccups had been quietly corrupting the picture: some events lost, others double-counted. The problem wasn’t just data quality; incorrect event counts meant incorrect billing, and that has direct revenue consequences.
The design
Kafka as the durable backbone. Partitioned topics give parallel processing without sacrificing order within a partition. Replication handles node failures. The partitioning key was the ad impression ID, which meant all events for the same impression arrived at the same partition in order — a deliberate choice, since downstream aggregation depends on seeing click after impression, not before.
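A minimal sketch of why keying by impression ID keeps an impression's events together. The `partition_for` helper, the MD5-based hash, the partition count of 12, and the event field names are all assumptions for illustration. Real Kafka clients use their own hash functions, but the property shown is the same: equal keys always map to the same partition.

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition deterministically (illustrative hash, not Kafka's)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Hypothetical events; only the impression ID decides the partition.
events = [
    {"impression_id": "imp-1", "type": "impression"},
    {"impression_id": "imp-2", "type": "impression"},
    {"impression_id": "imp-1", "type": "click"},
]

partitions = {}
for event in events:
    p = partition_for(event["impression_id"], num_partitions=12)
    partitions.setdefault(p, []).append(event)
```

Both `imp-1` events end up in the same list, in the order they were produced, which is the guarantee the aggregation relies on.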
Idempotency keys and Redis deduplication. Every event carries a deterministic idempotency key derived from its source fields. Before a consumer writes anything to the database, it checks Redis: if the key is already present, the event is a duplicate and gets dropped. Kafka delivery stays at-least-once, but the database sees each event once, which makes the pipeline effectively exactly-once at the application layer. The Redis TTL on idempotency keys was set longer than the realistic redelivery window, so a genuine retry within that window is always caught; an event that arrives after its key has expired is treated as new.
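The deduplication check can be sketched roughly as follows. The field names, the `dedup:` prefix, and the 24-hour TTL are assumptions, not values from the pipeline. The sketch also swaps the separate check-then-write for a single `SET ... NX EX` call (via redis-py's `set(name, value, nx=True, ex=...)`), so two consumers cannot both claim the same key. `FakeRedis` is a stand-in so the example runs without a server.

```python
import hashlib
import json

def idempotency_key(event: dict,
                    fields=("impression_id", "event_type", "occurred_at")) -> str:
    """Derive a deterministic key from source fields (field names are hypothetical)."""
    canonical = json.dumps([event[f] for f in fields], separators=(",", ":"))
    return "dedup:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def is_first_delivery(client, event: dict, ttl_seconds: int = 86400) -> bool:
    """Atomically claim the key; True means the event was not seen within the TTL."""
    # SET key value NX EX ttl succeeds only if the key does not exist yet.
    return bool(client.set(idempotency_key(event), "1", nx=True, ex=ttl_seconds))

class FakeRedis:
    """In-memory stand-in for a Redis client (assumption; ignores expiry)."""
    def __init__(self):
        self._store = {}

    def set(self, name, value, nx=False, ex=None):
        if nx and name in self._store:
            return None  # mirrors redis-py: None when NX prevents the write
        self._store[name] = value
        return True
```

One caveat with claiming the key before the database write: if the write then fails, the key has to be released (or the write retried) before the event is redelivered, otherwise the retry would be dropped as a duplicate.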
Batched DB writes. Consumers accumulate events and flush in batches rather than writing one row at a time. In production this cut write amplification by roughly an order of magnitude. The trade-off is a small increase in latency between event arrival and DB visibility — acceptable for the analytics workload, not acceptable for the real-time serving path (which didn’t go through this pipeline).
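A rough sketch of the accumulate-and-flush pattern. The `BatchBuffer` class, its thresholds, and the `flush` callback are hypothetical; in a real consumer the callback would wrap a single multi-row insert into PostgreSQL.

```python
import time

class BatchBuffer:
    """Collect items and hand them to `flush` when the batch is full or too old."""

    def __init__(self, flush, max_size=500, max_age_seconds=1.0, clock=time.monotonic):
        self._flush = flush
        self._max_size = max_size
        self._max_age = max_age_seconds
        self._clock = clock
        self._items = []
        self._oldest = None  # arrival time of the first item in the current batch

    def add(self, item):
        if not self._items:
            self._oldest = self._clock()
        self._items.append(item)
        too_big = len(self._items) >= self._max_size
        too_old = self._clock() - self._oldest >= self._max_age
        if too_big or too_old:
            self.flush()

    def flush(self):
        """Write out whatever is buffered; a no-op when the buffer is empty."""
        if self._items:
            batch, self._items = self._items, []
            self._flush(batch)
```

The age check here only runs when a new item arrives, so a quiet topic could hold a partial batch longer than `max_age_seconds`; a production consumer would also call `flush()` from its poll loop or a timer. That age bound is where the added latency mentioned above comes from.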
Dead-letter queues. Events that fail processing — malformed payloads, schema mismatches, transient downstream errors — go to a DLQ rather than being silently dropped or blocking the main topic. The DLQ had its own consumer that ran on a slower cadence and alerted on volume spikes; most poison pills turned out to be schema issues caught during deploys.
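As an illustration of the routing step, here is a small sketch that wraps processing and forwards failures with some error context. The function name, the envelope fields, and the `handle` / `send_to_dlq` callables are assumptions; in the pipeline `send_to_dlq` would produce to the DLQ topic.

```python
import json

def process_or_dead_letter(raw: bytes, handle, send_to_dlq) -> bool:
    """Run handle(event); on any failure, forward the raw payload plus error context."""
    try:
        event = json.loads(raw)
        handle(event)
        return True
    except Exception as exc:  # broad on purpose: a bad event must not block the topic
        send_to_dlq({
            "payload": raw.decode("utf-8", errors="replace"),
            "error_type": type(exc).__name__,
            "error": str(exc),
        })
        return False
```

Keeping the original payload and the exception type in the envelope makes it easier to tell a schema problem from a transient downstream error when the DLQ consumer picks the event up later.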
Edge cases
Ordering within partitions versus across them. Partitioning by impression ID preserved intra-impression ordering, but it meant events for different impressions could arrive out of order relative to each other. Downstream aggregation queries had to be written to handle this: no assumption of global ordering, only local ordering within an impression’s event sequence. The first version of the analytics queries assumed global order and was wrong; fixing that required a query rewrite and a backfill.
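The ordering assumption can be illustrated in a few lines. This is a Python sketch of the logic, not the actual analytics queries, and the field names are hypothetical: events are grouped by impression first, and order is only consulted inside each group.

```python
from collections import defaultdict

def clicked_after_impression(events):
    """Return {impression_id: bool}: did a click follow the impression event?"""
    by_impression = defaultdict(list)
    for event in events:  # arrival order across impressions is arbitrary
        by_impression[event["impression_id"]].append(event["type"])

    result = {}
    for impression_id, types in by_impression.items():
        # Only the relative order within one impression's list is meaningful.
        seen_impression = False
        clicked = False
        for event_type in types:
            if event_type == "impression":
                seen_impression = True
            elif event_type == "click" and seen_impression:
                clicked = True
        result[impression_id] = clicked
    return result
```

Interleaving events from different impressions in any order gives the same result, as long as each impression's own sequence is preserved.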
Consumer lag during deploys. Restarting consumers during a deploy created a lag window: Kafka kept accumulating events while the new consumer started up. On a 50K events/sec pipeline, a 30-second restart window meant 1.5M events queued. The deploy pattern we landed on was drain-then-restart: signal the running consumer to stop accepting new messages, let it flush its current batch to the database, commit offsets, then allow the deploy to proceed. Consumer lag went from tens of seconds of events to near-zero on every deploy.
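The drain-then-restart sequence might look roughly like this. The `poll`, `write_batch`, and `commit` callables, the batch size, and the use of SIGTERM as the stop signal are assumptions standing in for the real Kafka client and deploy tooling.

```python
import signal
import threading

stop_requested = threading.Event()

def request_stop(signum, frame):
    """Signal handler: ask the loop to stop after the current message."""
    stop_requested.set()

def install_signal_handler():
    # Call once from the main thread of the consumer process.
    signal.signal(signal.SIGTERM, request_stop)

def run_consumer(poll, write_batch, commit, batch_size=500):
    """Poll until asked to stop, then flush the in-flight batch and commit."""
    batch = []
    while not stop_requested.is_set():
        message = poll()
        if message is None:
            continue
        batch.append(message)
        if len(batch) >= batch_size:
            write_batch(batch)
            commit()
            batch = []
    # Drain: no more polling; flush what is buffered, commit, and only then exit.
    if batch:
        write_batch(batch)
        commit()
```

Because offsets are committed only after the batch is written, a consumer that is killed mid-drain replays those events on restart, and the deduplication layer absorbs the repeats.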
DLQ poison pills. Some events failed repeatedly and would have blocked the DLQ consumer in a retry loop if we hadn’t set a maximum retry count. After N retries, events were moved to a separate “dead” queue and an alert fired. The alert cadence mattered: too noisy and engineers ignored it; too quiet and a systematic schema problem sat unaddressed for hours. We ended up alerting on rate rather than count — “more than 5 events/minute hitting dead queue” was the threshold that caught real problems without false positives.
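A small sketch of the two mechanisms described here: capping retries and alerting on rate rather than count. The class and function names are hypothetical, and `max_retries` is left as a required parameter because the actual value of N is not stated. The default threshold mirrors the "more than 5 events/minute" rule.

```python
from collections import deque

def next_destination(retry_count: int, max_retries: int) -> str:
    """Decide where a failed DLQ event goes next."""
    return "dead" if retry_count >= max_retries else "retry"

class DeadQueueRateAlert:
    """Fire when more than `threshold` events hit the dead queue within the window."""

    def __init__(self, threshold=5, window_seconds=60):
        self._threshold = threshold
        self._window = window_seconds
        self._times = deque()

    def record(self, now: float) -> bool:
        """Record one dead-queue event at time `now` (seconds); True means alert."""
        self._times.append(now)
        # Drop events that have fallen out of the sliding window.
        while self._times and now - self._times[0] >= self._window:
            self._times.popleft()
        return len(self._times) > self._threshold
```

With a sliding window, a slow trickle of dead events never trips the alert, while a burst of six within a minute does.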
Outcome
| Metric | Before | After |
|---|---|---|
| Peak throughput | Limited | 50K events/sec daily peak |
| Processing latency | Baseline | ~80% lower |
| Duplicate rate | High | Near zero |
| Data loss | Measurable | Negligible |
50K events/sec is the daily production peak, not a measured limit. The system was never load-tested past that, so the real ceiling is unknown and probably higher.
Stack
Kafka, Redis (deduplication), Python (consumers), PostgreSQL.