Real-Time Estimation — 700ms p99 → 25ms

Cut p99 latency on the real-time ad estimation service from 700ms to 25ms, on the same endpoint and under more traffic, through multi-tier caching, async refresh, request deduplication, graceful fallback when the model service was degraded, and batch inference.

Context

Real-time ad estimation called an ML model on every request — the hot path for ads served to every Cafebazaar user (51M total). The naive implementation gave us p99 ~700ms. That number was untenable: a decision has to happen before the page renders, and at that latency we were already dropping ad-fill opportunities. More traffic was coming. The fix needed to hold under load, not just in a controlled test.

The five levers

1. Multi-tier caching. Model predictions are cached at several TTL horizons, keyed by the ad and context features the model would have used. Volatile signals get short TTLs; stable ones get longer TTLs. Keeping predictions of different ages pays off because a day-old estimate for a low-traffic slot is still better than waiting 700ms for a fresh one. For most requests, the model service is no longer on the request path at all.
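A minimal sketch of the tiered lookup, not the production code. In-process dicts stand in for Redis, and the tier names, TTL values, and key layout are assumptions made for illustration. The freshest tier is consulted first; an entry that has aged out of it can still be served from a longer-lived tier.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Tier:
    name: str            # hypothetical label, e.g. "minute" or "day"
    ttl_seconds: float   # how long an entry in this tier stays servable
    # key -> (prediction, time it was stored)
    entries: Dict[str, Tuple[float, float]] = field(default_factory=dict)


class TieredCache:
    """Serve from the shortest-TTL tier that still holds a live entry."""

    def __init__(self, tiers: List[Tier], clock: Callable[[], float] = time.monotonic):
        # Shortest TTL first, so the freshest data wins when available.
        self.tiers = sorted(tiers, key=lambda t: t.ttl_seconds)
        self.clock = clock

    def put(self, key: str, value: float) -> None:
        now = self.clock()
        for tier in self.tiers:
            tier.entries[key] = (value, now)

    def get(self, key: str) -> Optional[Tuple[float, str]]:
        """Return (prediction, tier name), or None if every tier has expired."""
        now = self.clock()
        for tier in self.tiers:
            hit = tier.entries.get(key)
            if hit is not None and now - hit[1] <= tier.ttl_seconds:
                return hit[0], tier.name
        return None


def cache_key(ad_id: str, context: dict) -> str:
    # Assumed key layout: ad id plus context features in sorted order,
    # so the same features always map to the same key.
    features = ",".join(f"{k}={context[k]}" for k in sorted(context))
    return f"est:{ad_id}:{features}"
```

The tier name returned by `get` tells the caller how old the estimate is, which is enough to decide whether a refresh is worth scheduling.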

2. Async refresh before expiry. Cache misses are expensive. The standard approach — let the TTL expire and serve the next caller a cold miss — creates a latency cliff that shows up in p99. Instead, we refresh cache entries asynchronously just before they expire. The trade-off is a brief window where an entry might be slightly stale; for ad predictions, that’s acceptable. The cliff disappears.
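One way to sketch refresh-ahead, under assumptions: a thread pool does the background work, `loader` is a hypothetical stand-in for the model call, and the TTL and refresh margin are illustrative parameters. A read that lands inside the margin returns the cached value immediately and schedules at most one background refresh per key.

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, Tuple


class RefreshAheadCache:
    def __init__(self, loader: Callable[[str], float], ttl: float,
                 refresh_margin: float,
                 clock: Callable[[], float] = time.monotonic,
                 max_workers: int = 4):
        self.loader = loader                  # hypothetical model call
        self.ttl = ttl
        self.refresh_margin = refresh_margin  # refresh when this close to expiry
        self.clock = clock
        self._entries: Dict[str, Tuple[float, float]] = {}  # key -> (value, expires_at)
        self._refreshing: Dict[str, Future] = {}
        self._lock = threading.Lock()
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def _store(self, key: str) -> float:
        value = self.loader(key)  # slow call happens outside the lock
        with self._lock:
            self._entries[key] = (value, self.clock() + self.ttl)
        return value

    def _refresh(self, key: str) -> None:
        try:
            self._store(key)
        finally:
            with self._lock:
                self._refreshing.pop(key, None)

    def get(self, key: str) -> float:
        now = self.clock()
        with self._lock:
            entry = self._entries.get(key)
            if entry is not None and now < entry[1]:
                value, expires_at = entry
                near_expiry = expires_at - now <= self.refresh_margin
                if near_expiry and key not in self._refreshing:
                    # Serve the current value now; refresh in the background.
                    self._refreshing[key] = self._pool.submit(self._refresh, key)
                return value
        # Cold miss: the caller pays for the load. Refresh-ahead makes this rare.
        return self._store(key)
```

If the background refresh fails, the old entry simply stays until it expires and the next caller takes the cold-miss path; a real service would also want to log that failure.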

3. Request deduplication. Multiple in-flight requests for the same prediction within the same time window share a single model call. The first caller kicks off the inference; the rest wait on the same future and get the same result. This cuts both load and tail latency — during traffic spikes, the deduplication window absorbs bursts that would otherwise hammer the model service. Tuning the window size is non-trivial: too wide and latency rises; too narrow and you get no benefit.
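The sharing of one model call can be sketched with asyncio as below. This is an illustration, not the service's code: `infer` is a hypothetical coroutine for the model call, and the deduplication window here is simply the lifetime of the in-flight call, so the tunable window described above is not modelled.

```python
import asyncio
from typing import Awaitable, Callable, Dict


class SingleFlight:
    """Concurrent requests for the same key share one inference call."""

    def __init__(self, infer: Callable[[str], Awaitable[float]]):
        self._infer = infer  # hypothetical async model call
        self._in_flight: Dict[str, asyncio.Future] = {}

    async def predict(self, key: str) -> float:
        task = self._in_flight.get(key)
        if task is None:
            # First caller for this key starts the inference.
            task = asyncio.ensure_future(self._infer(key))
            self._in_flight[key] = task
            # Forget the key once the call finishes, whether it succeeded or not.
            task.add_done_callback(lambda _: self._in_flight.pop(key, None))
        # shield() keeps one caller's cancellation from cancelling the shared call.
        return await asyncio.shield(task)
```

Five concurrent `predict` calls for the same key result in a single call to `infer`; a call arriving after the result is delivered starts a new one.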

4. Graceful fallback. When the model service is degraded or slow, fall back to deterministic heuristic estimates rather than failing the request. The heuristic is calibrated against the model’s recent average output, so the fallback gives plausible fill rather than no fill. Revenue is salvaged; service availability holds. The key constraint is that the heuristic has to be cheap — a fallback that takes 400ms doesn’t help.
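A minimal sketch of the fallback path, assuming the model call is an async function and that degradation shows up as either a timeout or a connection error. The function names and the 20ms default budget are illustrative, not values from the service.

```python
import asyncio
from typing import Awaitable, Callable, Tuple


async def estimate_with_fallback(
    model_call: Callable[[dict], Awaitable[float]],  # hypothetical model client
    heuristic: Callable[[dict], float],              # cheap, deterministic estimate
    features: dict,
    timeout_s: float = 0.02,                         # assumed latency budget
) -> Tuple[float, str]:
    """Return (estimate, source), where source is "model" or "fallback"."""
    try:
        value = await asyncio.wait_for(model_call(features), timeout=timeout_s)
        return value, "model"
    except (asyncio.TimeoutError, ConnectionError):
        # Model is slow or unreachable: answer from the heuristic instead of failing.
        return heuristic(features), "fallback"
```

Returning the source alongside the estimate makes it straightforward to count fallback responses in monitoring.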

5. Batch inference. For requests where the latency budget was loose enough, we batched model calls instead of firing one per request. Batching reduces per-request overhead on the model service side, improves throughput, and lets the model service run at a more efficient operating point. It does not apply to the hot path itself, but it is useful for background scoring.
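For background scoring, batching can be as simple as the sketch below. `predict_batch` is a hypothetical function that takes a list of feature rows and returns one score per row; the batch size is an illustrative default.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[dict], size: int) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def score_in_batches(
    rows: Sequence[dict],
    predict_batch: Callable[[Sequence[dict]], List[float]],  # hypothetical model API
    batch_size: int = 64,
) -> List[float]:
    """Score all rows with one model call per batch instead of one per row."""
    scores: List[float] = []
    for batch in chunked(rows, batch_size):
        out = predict_batch(batch)
        if len(out) != len(batch):
            raise ValueError("model returned a different number of scores than rows")
        scores.extend(out)
    return scores
```

With 10 rows and a batch size of 4, the model is called 3 times rather than 10.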

What was hard

Calibrating the fallback heuristic. The goal was that the model service degrading wouldn’t produce a measurable shift in ad-fill or revenue. That requires the fallback to stay within a useful range — if it consistently over-predicts or under-predicts relative to the model, you either over-fill (poor user experience) or under-fill (lost revenue). We calibrated the heuristic offline against a rolling window of model outputs, and kept a monitor on fallback-period fill rates. The first few calibrations were off; it took three iteration cycles to get it stable enough that a 10-minute model outage didn’t register in the revenue dashboard.
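To make the calibration idea concrete, here is one possible scheme, offered as an assumption rather than a description of what we shipped: keep a rolling window of (raw heuristic, model output) pairs and scale the heuristic so its total matches the model's total over that window. The class name and window size are hypothetical.

```python
from collections import deque
from typing import Deque, Tuple


class RollingCalibrator:
    """Scale a raw heuristic so it tracks recent model outputs on average."""

    def __init__(self, window: int = 1000):
        # Only the most recent `window` pairs are kept.
        self._pairs: Deque[Tuple[float, float]] = deque(maxlen=window)

    def observe(self, heuristic_raw: float, model_output: float) -> None:
        self._pairs.append((heuristic_raw, model_output))

    def scale(self) -> float:
        raw_total = sum(h for h, _ in self._pairs)
        if raw_total == 0:
            return 1.0  # nothing to calibrate against yet
        return sum(m for _, m in self._pairs) / raw_total

    def calibrated(self, heuristic_raw: float) -> float:
        return heuristic_raw * self.scale()
```

A single global scale is the crudest version; calibrating per slot or per ad category would follow the same pattern with one calibrator per group.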

Cache warm-up after deploys. Fresh deployments started with empty caches, which produced latency spikes in the first few minutes as the model service absorbed the load that the cache would normally handle. We addressed this with a pre-warm step on deploy: before traffic shifted to the new instance, a shadow load generator ran through a sample of recent request patterns and seeded the cache. The spikes shrank from 3–4 minutes to under 30 seconds.
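The pre-warm step can be sketched as follows. The selection rule (most frequent recent keys first) and all names are assumptions for illustration; a plain dict stands in for the cache, and `compute` stands in for the model call.

```python
from collections import Counter
from typing import Callable, Dict, Iterable


def prewarm(
    cache: Dict[str, float],
    recent_keys: Iterable[str],        # e.g. keys replayed from a recent request log
    compute: Callable[[str], float],   # hypothetical model call
    top_n: int,
) -> int:
    """Seed the cache with the most frequently requested recent keys.

    Returns how many entries were actually computed and stored.
    """
    warmed = 0
    for key, _count in Counter(recent_keys).most_common(top_n):
        if key not in cache:
            cache[key] = compute(key)
            warmed += 1
    return warmed
```

Running this before the instance starts taking traffic means the first real requests for popular keys hit the cache instead of the model service.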

Outcome

Metric	Before	After
p99 response time	700 ms	25 ms
Endpoint	Same	Same
Traffic	Lower	Higher (more ads shown)
Service availability under model degradation	Drops with the model	Held via fallback

The comparison is apples-to-apples: same endpoint, more load. The reduction isn’t “old uncached worst case → new cached typical.” It’s the same call doing the same job, faster.

Stack

Python, Redis (multi-tier cache), in-house ML serving, Prometheus + Grafana.