P2P Payment Fraud Detection at Cafebazaar

Rule-based fraud scoring on the peer-to-peer payment platform — cut manual review workload by ~70% by auto-deciding the obvious cases and prioritising the rest.

Context

The peer-to-peer payment platform was growing faster than manual review could keep up with. Moderators were backlogged, response times were slow, and bad transactions were slipping through — not because the signals were invisible, but because no one had wired them up. The problem had three parts: identify suspicious activity automatically, give moderators what they needed to act quickly on the rest, and make the system improve over time.

The system

Behavioural signals. The scoring model ran on four primary signals: transaction velocity (how many transactions from this account in the last N hours), amount patterns (unusual amounts relative to the account’s history), device fingerprinting (new or shared devices), and account age (new accounts have a higher base rate of fraud). Each signal was assigned a weight, and the weighted signals combined into a risk score between 0 and 100.
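
A minimal sketch of how such a weighted score could be computed, assuming each signal has already been normalised to a value between 0 and 1. The weights, signal keys, and function name are illustrative, not the production values.

```python
# Hypothetical weights; the write-up does not give the real ones.
# They sum to 1.0 so the combined score lands in the range 0..100.
WEIGHTS = {
    "velocity": 0.35,
    "amount": 0.25,
    "device": 0.25,
    "account_age": 0.15,
}


def risk_score(signals: dict[str, float]) -> tuple[float, dict[str, float]]:
    """Combine per-signal values in [0, 1] into a score in [0, 100].

    Returns the total score and each signal's contribution in points,
    so the breakdown can be shown alongside the score.
    """
    contributions = {}
    for name, weight in WEIGHTS.items():
        # Missing signals count as 0; out-of-range values are clamped.
        value = min(max(signals.get(name, 0.0), 0.0), 1.0)
        contributions[name] = 100 * weight * value
    return sum(contributions.values()), contributions
```

Returning the per-signal contributions together with the total keeps the score explainable: the same numbers that produce the score can be shown to a moderator.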

Risk score to action mapping. Low scores auto-approved. Medium scores applied soft limits (reduced transaction caps, step-up authentication). High scores held the transaction for manual review or, above an upper cutoff, rejected it outright. The thresholds between tiers were calibrated iteratively: the initial values were set from historical data, then adjusted based on moderator feedback.
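
The tiering can be sketched as a simple threshold lookup. The cutoff values and names below are assumptions for illustration; the actual thresholds were calibrated from historical data and are not stated here.

```python
from enum import Enum


class Action(Enum):
    APPROVE = "approve"
    SOFT_LIMIT = "soft_limit"  # reduced caps, step-up authentication
    REVIEW = "review"          # held for a moderator
    REJECT = "reject"


# Hypothetical cutoffs, not the production values.
SOFT_LIMIT_AT = 30
REVIEW_AT = 60
REJECT_AT = 90


def action_for(score: float) -> Action:
    """Map a risk score in [0, 100] to the action taken on the transaction."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score >= REJECT_AT:
        return Action.REJECT
    if score >= REVIEW_AT:
        return Action.REVIEW
    if score >= SOFT_LIMIT_AT:
        return Action.SOFT_LIMIT
    return Action.APPROVE
```

Keeping the cutoffs as named constants means a calibration change is a one-line edit that is easy to review and audit.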

Moderator dashboard. The review queue was prioritised by risk score, not by arrival time, so moderators always looked at the most suspicious transactions first. Each transaction surfaced a timeline of the account’s recent activity alongside the individual signal contributions to the score — so a moderator could see not just “risk score: 87” but “velocity spike in the last 2 hours, new device, third transaction today.” One-click approve, hold, and reject actions kept the review loop fast.
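
As a rough illustration of the queue ordering and the per-signal explanation, the sketch below sorts held transactions by score (oldest first on ties) and renders the largest contributions as a short summary. The `HeldTransaction` type and both function names are hypothetical; in a Django codebase the ordering would more likely live in a queryset.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class HeldTransaction:
    tx_id: str
    risk_score: float
    created_at: datetime
    contributions: dict[str, float]  # points each signal added to the score


def review_queue(held: list[HeldTransaction]) -> list[HeldTransaction]:
    """Highest risk first; among equal scores, the oldest transaction first."""
    return sorted(held, key=lambda tx: (-tx.risk_score, tx.created_at))


def explain(tx: HeldTransaction, top_n: int = 3) -> str:
    """Summarise the score with its largest non-zero signal contributions."""
    ranked = sorted(tx.contributions.items(), key=lambda kv: kv[1], reverse=True)
    parts = [f"{name} (+{points:.0f})" for name, points in ranked[:top_n] if points > 0]
    return f"risk score: {tx.risk_score:.0f} | " + ", ".join(parts)
```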

Feedback loop. Moderator decisions fed back into the rule system. When a moderator marked a transaction as fraudulent, the signals present in that transaction were logged; over time, the weight calibration was updated from the accumulated signal-decision pairs. The loop ran on a weekly cadence rather than in real time, which kept it stable and auditable.
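
The write-up does not specify the update rule, so the following is only one plausible shape for a weekly batch recalibration, not the method actually used. It assumes each reviewed transaction is stored as the set of signals that fired plus the moderator's verdict, nudges each weight towards that signal's share of observed precision, and caps the per-run change so that a single noisy week cannot swing the rules. All names and the `max_step` parameter are hypothetical.

```python
def recalibrate(
    weights: dict[str, float],
    reviews: list[tuple[set[str], bool]],
    max_step: float = 0.05,
) -> dict[str, float]:
    """Return new weights from (fired_signals, is_fraud) review pairs."""
    # Precision per signal: of the reviews where it fired, how many were fraud.
    precision = {}
    for name in weights:
        outcomes = [is_fraud for fired, is_fraud in reviews if name in fired]
        if outcomes:
            precision[name] = sum(outcomes) / len(outcomes)

    total_precision = sum(precision.values())
    if total_precision == 0:
        return dict(weights)  # nothing usable this week; keep weights as they are

    # Redistribute only the weight held by signals that were observed.
    observed_weight = sum(weights[name] for name in precision)
    updated = {}
    for name, current in weights.items():
        if name not in precision:
            updated[name] = current
            continue
        target = precision[name] / total_precision * observed_weight
        step = max(-max_step, min(max_step, target - current))
        updated[name] = current + step

    # Rescale so the weights keep their original total.
    scale = sum(weights.values()) / sum(updated.values())
    return {name: value * scale for name, value in updated.items()}
```

Running this as a scheduled batch job rather than on every decision matches the weekly cadence described above: each run produces one reviewable set of weight changes.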

Why rules, not ML

The temptation was to reach for an ML model — this is the kind of problem that looks like a classification task and has enough data to train on. The choice to stay with explicit rules was deliberate. Moderators needed to be able to challenge a decision: “this transaction was flagged — why?” is a question that comes up in customer disputes, and “the model assigned it 0.87” is not an answer that satisfies anyone. Rule-based scores explain themselves. They’re also easier to tune from feedback: if the velocity signal is producing too many false positives on a particular user segment, you adjust the weight for that segment and see the change immediately. With a model, you retrain, validate, shadow-deploy, and wait. For the scale and ML maturity of the team at the time, the rule-based approach was the right call — not a fallback, but the better fit.

What was hard

False-positive tuning. Legitimate transactions getting flagged caused real user friction: a payment held for review is a frustrated customer, sometimes a lost transaction. The initial thresholds were too aggressive, as the first week of production made clear. Relaxing them without opening a gap for fraud was ongoing work rather than a launch task: calibration ran monthly for the first six months, then quarterly. The moderator dashboard became the primary tool for this. When the review queue filled with obviously legitimate transactions, the thresholds needed adjusting.
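
One way to put numbers on that trade-off is to replay moderator verdicts against candidate review thresholds, as in the sketch below. The function and field names are illustrative. Verdicts exist only for transactions that were actually held, so this can evaluate raising the threshold but says nothing about scores below the current cutoff.

```python
def threshold_report(
    reviewed: list[tuple[float, bool]],
    thresholds: list[float],
) -> list[dict]:
    """Summarise each candidate threshold over (score, is_fraud) verdicts."""
    total_fraud = sum(1 for _, is_fraud in reviewed if is_fraud)
    rows = []
    for threshold in thresholds:
        held = [is_fraud for score, is_fraud in reviewed if score >= threshold]
        caught = sum(held)
        rows.append({
            "threshold": threshold,
            "held": len(held),
            # Share of held transactions that moderators judged legitimate.
            "false_positive_share": (len(held) - caught) / len(held) if held else 0.0,
            # Confirmed fraud that would no longer reach the review queue.
            "fraud_missed": total_fraud - caught,
        })
    return rows
```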

Signal weighting across user segments. Device fingerprinting is a strong signal for new accounts but almost meaningless for established users who occasionally log in from a new device. Account age means something different for a merchant account than for a personal one. The weighting had to vary by segment, and those variations had to be defensible — a moderator who disagreed with a decision should be able to understand why the system scored it the way it did, and the explanation should hold up. Getting this right required close collaboration with the moderator team, who had the domain knowledge we didn’t.
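
Segment-specific weighting can be sketched as a default weight table plus per-segment overrides. The segment keys, override values, and function name are assumptions made for this example; the real segments and values came out of the work with the moderator team.

```python
# Hypothetical defaults and overrides, for illustration only.
DEFAULT_WEIGHTS = {
    "velocity": 0.35,
    "amount": 0.25,
    "device": 0.25,
    "account_age": 0.15,
}

SEGMENT_OVERRIDES = {
    # Established personal accounts: a new device says very little.
    ("personal", "established"): {"device": 0.05},
    # Merchant accounts: account age is read differently.
    ("merchant", "new"): {"account_age": 0.05},
}


def weights_for(account_type: str, tenure: str) -> dict[str, float]:
    """Return weights for a segment, normalised to sum to 1.0.

    Unknown segments fall back to the defaults.
    """
    overrides = SEGMENT_OVERRIDES.get((account_type, tenure), {})
    merged = {**DEFAULT_WEIGHTS, **overrides}
    total = sum(merged.values())
    return {name: weight / total for name, weight in merged.items()}
```

Expressing segments as explicit overrides keeps each deviation from the default visible in one place, which helps when a moderator asks why a segment was scored differently.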

Outcome

Metric	Before	After
Manual review burden	~100% of transactions	~30% (high-risk only)
Moderator workload	Baseline	~70% reduction
Response time	Slow, backlogged	Near-instant for auto-decisions

Stack

Python, Django, PostgreSQL.