Context
Real-time ad estimation called an ML model on every request — the hot path for ads served to every Cafebazaar user (51M total). The naive implementation gave us p99 ~700ms. That number was untenable: a decision has to happen before the page renders, and at that latency we were already dropping ad-fill opportunities. More traffic was coming. The fix needed to hold under load, not just in a controlled test.
The five levers
1. Multi-tier caching. Model predictions are cached at several TTL horizons, keyed by the ad and context features that the model would have used. Short TTLs for volatile signals, longer TTLs for stable ones — predictions over different age levels matter because a day-old estimate for a low-traffic slot is still better than waiting 700ms for a fresh one. The model service stops being on the common path entirely for most requests.
2. Async refresh before expiry. Cache misses are expensive. The standard approach — let the TTL expire and serve the next caller a cold miss — creates a latency cliff that shows up in p99. Instead, we refresh cache entries asynchronously just before they expire. The trade-off is a brief window where an entry might be slightly stale; for ad predictions, that’s acceptable. The cliff disappears.
3. Request deduplication. Multiple in-flight requests for the same prediction within the same time window share a single model call. The first caller kicks off the inference; the rest wait on the same future and get the same result. This cuts both load and tail latency — during traffic spikes, the deduplication window absorbs bursts that would otherwise hammer the model service. Tuning the window size is non-trivial: too wide and latency rises; too narrow and you get no benefit.
4. Graceful fallback. When the model service is degraded or slow, fall back to deterministic heuristic estimates rather than failing the request. The heuristic is calibrated against the model’s recent average output, so the fallback gives plausible fill rather than no fill. Revenue is salvaged; service availability holds. The key constraint is that the heuristic has to be cheap — a fallback that takes 400ms doesn’t help.
5. Batch inference. For requests where the latency budget was loose enough, we batched model calls instead of firing one per request. Reduces per-request overhead on the model service side, improves throughput, and lets the model service work at a more efficient operating point. Not applicable on the hot path itself, but useful for background scoring.
What was hard
Calibrating the fallback heuristic. The goal was that the model service degrading wouldn’t produce a measurable shift in ad-fill or revenue. That requires the fallback to stay within a useful range — if it consistently over-predicts or under-predicts relative to the model, you either over-fill (poor user experience) or under-fill (lost revenue). We calibrated the heuristic offline against a rolling window of model outputs, and kept a monitor on fallback-period fill rates. The first few calibrations were off; it took three iteration cycles to get it stable enough that a 10-minute model outage didn’t register in the revenue dashboard.
Cache warm-up after deploys. Fresh deployments started with empty caches, which produced latency spikes in the first few minutes as the model service absorbed the load that the cache would normally handle. We addressed this with a pre-warm step on deploy: before traffic shifted to the new instance, a shadow load generator ran through a sample of recent request patterns and seeded the cache. The spikes shrank from 3–4 minutes to under 30 seconds.
Outcome
| Before | After | |
|---|---|---|
| p99 response time | 700 ms | 25 ms |
| Endpoint | Same | Same |
| Traffic | Lower | Higher (more ads shown) |
| Service availability under model degradation | Drops with the model | Held via fallback |
The comparison is apples-to-apples: same endpoint, more load. The reduction isn’t “old uncached worst case → new cached typical.” It’s the same call doing the same job, faster.
Stack
Python, Redis (multi-tier cache), in-house ML serving, Prometheus + Grafana.