Kayvan Mazaheri

Ad Platform Rebuild — 10× revenue over two years

Moved Cafebazaar's ad placement from a brittle monolith to a microservices platform without downtime. Architecture, pricing, ad-placement A/B testing, ML-model A/B testing, and graceful ML fallbacks all contributed — the headline outcome was 10× ad revenue sustained over two years.

Stack: Microservices · Python · Go · Kafka · Redis · Kubernetes
Year: 2025

Context

Cafebazaar (Iran’s largest Android store, 51M users) ran ads in front of every user, every session. The ad system was a monolith that had grown past the point of being safe to change, and it was losing money on a regular cadence: services went down every other day, ad-fill rates suffered, and revenue suffered with them. The product team wanted to push the platform harder — more ad formats, smarter placement, sharper pricing. The platform first needed to stop falling over.

The underlying problem was compounding: unreliable infrastructure made engineers cautious about shipping, which made the backlog worse, which increased pressure to cut corners. Getting out of that loop required treating the platform as a product with its own roadmap, not just a cost center that kept the lights on.

What I owned

Full lifecycle, from architecture to weekly operation: the migration plan, the team (10+ engineers), the experimentation framework, and the production numbers. The role was tech lead with both architectural and operational accountability — I wrote the design docs, ran the incident reviews, and was the person the CTO called when the numbers moved unexpectedly.

What changed

Architecture. We ran a strangler-fig migration rather than a big-bang cutover. New service boundaries were carved out incrementally; traffic was moved in phases; the monolith shrank as the new services expanded. One counterintuitive decision: we also consolidated during the migration. The old system had more services than the team size could realistically maintain — someone’s definition of “microservices” had been applied too literally. The new shape was fewer, better-scoped services with clearer ownership and fewer cross-service calls on the hot path.

Event backbone. Ad events (impressions, clicks, conversions) moved to Kafka with partitioned topics for parallelism and Redis-tracked idempotency keys for deduplication. Batched DB writes downstream kept write amplification manageable. At-least-once delivery with idempotency keys gave us effectively exactly-once at the application layer. The pipeline held a daily peak of 50K events/sec without incident.
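A minimal sketch of the dedup-and-batch step described above, not the production code. The class names, the `event_id` field, and the batch size are hypothetical; the in-memory key store stands in for Redis, where the equivalent atomic check would be a `SET key value NX` with an expiry.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryKeyStore:
    """Stand-in for Redis (assumption): set_if_absent mimics `SET key 1 NX`."""
    _seen: set = field(default_factory=set)

    def set_if_absent(self, key: str) -> bool:
        # True only the first time a key is seen.
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


@dataclass
class BatchingDedupSink:
    """Drops redelivered events, then buffers the rest into batched writes."""
    store: InMemoryKeyStore
    batch_size: int = 3
    flushed: list = field(default_factory=list)  # each entry = one simulated DB write
    _pending: list = field(default_factory=list)

    def handle(self, event: dict) -> bool:
        """Return True if the event was accepted, False if it was a duplicate."""
        if not self.store.set_if_absent(event["event_id"]):
            return False  # at-least-once delivery handed us a repeat
        self._pending.append(event)
        if len(self._pending) >= self.batch_size:
            self.flush()
        return True

    def flush(self) -> None:
        # One write per batch instead of one write per event.
        if self._pending:
            self.flushed.append(list(self._pending))
            self._pending.clear()


sink = BatchingDedupSink(InMemoryKeyStore(), batch_size=2)
for event_id in ["a", "a", "b", "c"]:  # "a" is delivered twice
    sink.handle({"event_id": event_id, "type": "impression"})
sink.flush()
print(len(sink.flushed), "batched writes")
```

In this sketch the key is recorded before the batch is written, so a crash between the two would drop an event; a real pipeline would have to decide where to place that trade-off.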

Pricing. Doubled minimum ad pricing. The reason this is worth calling out: the old system couldn’t hold the change. Any significant pricing update required careful coordination and carried real revenue risk because you couldn’t confidently redeploy the system. Doing it cleanly was itself evidence that the migration worked.

Ad-placement A/B testing. Built the test rig that let us show more ads in slots where users didn’t punish us for it. Some positions that looked like strong real estate underperformed; some that looked too aggressive turned out to be fine. Before this, placement decisions were gut-feel. After, they were data-driven and defensible.

ML-model A/B testing. The same framework — different lever. We used it to run model variants for predicting ad performance in real time. Better predictions meant more relevant ads shown, better conversion rates, better unit economics for advertisers.

ML fallback path. Real-time ad serving has a latency budget. If the model doesn’t respond within it, the request can’t wait. Deterministic fallback estimates let the platform still make a decision when the model service is slow or degraded, so those requests keep earning some revenue instead of none.
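The latency-budget pattern can be sketched as follows. This is an illustration under assumptions, not the platform's serving code: `predict_with_budget`, `fallback_ctr`, the `historical_ctr` field, the 1% prior, and the 50 ms budget are all hypothetical names and values.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_executor = ThreadPoolExecutor(max_workers=4)


def fallback_ctr(ad: dict) -> float:
    """Deterministic estimate (assumption): historical CTR if known, else a fixed prior."""
    return ad.get("historical_ctr", 0.01)


def predict_with_budget(model_call, ad: dict, budget_s: float = 0.05):
    """Return (estimate, source); source is "model" or "fallback"."""
    future = _executor.submit(model_call, ad)
    try:
        return future.result(timeout=budget_s), "model"
    except FutureTimeout:
        future.cancel()  # best effort; a call that is already running is not interrupted
        return fallback_ctr(ad), "fallback"
    except Exception:
        # A failing model is treated the same as a slow one.
        return fallback_ctr(ad), "fallback"


def slow_model(ad: dict) -> float:
    time.sleep(0.5)  # simulates a degraded model service
    return 0.9


def fast_model(ad: dict) -> float:
    return 0.2


ad = {"id": "ad-1", "historical_ctr": 0.03}
print(predict_with_budget(fast_model, ad, budget_s=1.0))
print(predict_with_budget(slow_model, ad, budget_s=0.05))
```

A thread pool is used here only to keep the sketch self-contained. With a remote model service, the same budget would more naturally be a deadline on the RPC client.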

What was hard

Live migration without revenue impact. Cutting traffic between old and new services meant maintaining the same per-ad billing semantics throughout — a discrepancy in how an impression was counted between the two systems would have surfaced as a revenue anomaly. We ran a dual-write/dual-read phase where both systems processed every event, then reconciled downstream. The reconciliation reports were how we knew it was safe to flip: when the delta between old and new stayed under 0.1% across multiple billing cycles, we moved the traffic.
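The flip criterion reduces to a small check. The sketch below assumes each billing cycle is summarized as a pair of totals (old system, new system); the function names and the `min_cycles` parameter are hypothetical, and only the 0.1% threshold comes from the text above.

```python
def relative_delta(old_total: float, new_total: float) -> float:
    """Absolute difference between the two systems, relative to the old one."""
    if old_total == 0:
        return 0.0 if new_total == 0 else float("inf")
    return abs(new_total - old_total) / old_total


def safe_to_flip(cycles, threshold: float = 0.001, min_cycles: int = 2) -> bool:
    """True only if enough cycles were observed and every one stayed under the threshold."""
    return len(cycles) >= min_cycles and all(
        relative_delta(old, new) < threshold for old, new in cycles
    )


# (old_system_total, new_system_total) per billing cycle; illustrative numbers only
agreeing = [(100_000, 100_050), (98_000, 97_990)]
drifting = [(100_000, 100_050), (100_000, 100_200)]
print(safe_to_flip(agreeing), safe_to_flip(drifting))
```

Requiring a minimum number of cycles keeps an empty or single-cycle report from counting as agreement.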

Service-boundary calls. Splitting a monolith “up” into many services is the common move. Consolidating services “down” while a monolith migration is already in flight is rarer and harder — you’re making a structural decision about ownership at a moment when everyone is already mid-change. The shape we landed on was roughly: one service per domain concept (ad selection, bidding, event tracking, creative serving), with a deliberate rule against cross-service synchronous calls on the hot path. Async was fine; synchronous chained calls were a flag.

A/B-test isolation. Running placement experiments and model experiments simultaneously meant any observed change in revenue or fill rate was ambiguous unless the two experiments were cleanly separated. The framework handled this with exclusive assignment: a user in a placement experiment was excluded from model-experiment variants, and vice versa. Getting product, ML, and infra to accept that constraint was a design conversation, not just a code one.
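One common way to get exclusive assignment is hash-based bucketing, sketched below. This is an assumed technique for illustration, not a description of the framework's internals: the salts, the 50/50 splits, and the function names are hypothetical.

```python
import hashlib


def bucket(user_id: str, salt: str, buckets: int = 100) -> int:
    """Stable bucket in [0, buckets) derived from a salted hash of the user id."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % buckets


def assign(user_id: str) -> tuple[str, str]:
    """Return (experiment, variant); each user lands in exactly one experiment."""
    # First split: which experiment the user belongs to (mutually exclusive).
    experiment = "placement" if bucket(user_id, "exclusive-split") < 50 else "model"
    # Second split: salted by experiment name so it is independent of the first.
    variant = "treatment" if bucket(user_id, experiment) < 50 else "control"
    return experiment, variant


for uid in ["user-1", "user-2", "user-3"]:
    print(uid, assign(uid))
```

Because the assignment is a pure function of the user id, any service can recompute it without shared state, and a user never appears in both experiments.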

Outcome

Metric                   Before             After
Ad revenue               Baseline           10× over 2 years
Service incidents        ~Every other day   Rare, isolated
Peak event throughput    Limited            50K events/sec daily peak
Min ad pricing           Baseline           Doubled
Team velocity            Baseline           ~30% sprint-velocity lift

The honest version of the headline: services stopped going down every other day, pricing doubled, we got smart about which ads to show, and we built fallbacks so a slow model couldn’t kill revenue. That combination compounds — over two years it compounds to 10×.

Stack

Python, Go, Kafka, Redis, Kubernetes, PostgreSQL, in-house ML serving, Grafana + Prometheus + Sentry.