Results
Model: soccer-match-outcome @ smoke · Holdout: 135 970 matches (2024+, temporal split)
Model performance: holdout (2024+)
| Metric | Model | Bookmaker benchmark | Random baseline |
|---|---|---|---|
| Log-Loss | 1.006 | ~0.97 | ~1.099 |
| Brier Score | 0.601 | — | ~0.667 |
| ROC AUC (OVR) | 0.643 | — | 0.500 |
| Accuracy | 50.4% | ~53% | ~33% |
| ECE (calibration) | 0.004 | — | — |
The model beats all naive baselines (random, hard prior, soft prior) and is well calibrated
(ECE ≈ 0.004). It does not yet match the bookmaker benchmark on log-loss (1.006 vs. ~0.97), which is
consistent with its `smoke` alias status (the first promotion stage, before `candidate`).
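The random-baseline figures in the table can be checked by hand. The sketch below assumes a three-class outcome (home win, draw, away win), a uniform 1/3 prediction for every class, and the multi-class Brier score summed over classes; the label encoding and function names are illustrative, not taken from the project code.

```python
import math

def multiclass_log_loss(probs, labels, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += -math.log(min(max(p[y], eps), 1.0))
    return total / len(labels)

def multiclass_brier(probs, labels):
    """Mean over matches of the squared error summed across all classes."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((p_k - (1.0 if k == y else 0.0)) ** 2
                     for k, p_k in enumerate(p))
    return total / len(labels)

# Toy labels; the encoding 0 = home win, 1 = draw, 2 = away win is an assumption.
labels = [0, 1, 2, 0, 2, 0]
uniform = [[1 / 3, 1 / 3, 1 / 3]] * len(labels)

random_log_loss = multiclass_log_loss(uniform, labels)  # ln(3), about 1.0986
random_brier = multiclass_brier(uniform, labels)        # 2/3, about 0.6667
```

Under these assumptions the uniform predictor reproduces the ~1.099 and ~0.667 baseline values regardless of the label distribution, which is why they serve as fixed reference points.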
This is the v1 champion (`smoke` alias). See Promotion Policy for the log-loss gate required to advance to `candidate` and then `champion`.
Model comparison (MLflow experiment: matches_clf_v1)
| Model | Log-loss ↓ | Brier ↓ | ECE ↓ | ROC-AUC OvR ↑ | Notes |
|---|---|---|---|---|---|
| Marginal (class prior) | 1.0712 | — | n/a | n/a | Historical class frequencies; calibrated hard baseline |
| Elo-only (logistic regression) | — | — | — | — | 3 ELO features only |
| LogReg (full features) | 1.0053 (CV) | — | — | — | −6.2% vs baseline; CV on 10% data (test_v1.01) |
| XGBoost (tuned, uncalibrated) | 1.0063 (CV) | — | — | — | CV on 10% data (test_v1.01) |
| HGB (tuned) | 1.0074 (CV) | — | — | — | CV on 10% data (test_v1.01) |
| Best model (champion) — hold-out | 1.006 | 0.601 | 0.004 | 0.643 | Log-Loss −5.8% vs marginal baseline; 135 970 hold-out matches (2024+) |
CV values are from `test_v1.01` (10% data, `n_trials=20`). Hold-out values are from the `matches_clf_v1` experiment, `soccer-match-outcome @ smoke`. Full run history: MLflow UI → experiment `matches_clf_v1`.
Slice diagnostics
Performance varies meaningfully by match context:
| Slice | Best log-loss | Worst log-loss | Key insight |
|---|---|---|---|
| Elo gap — large (300+) | 0.632 | — | High-confidence matches predictable |
| Elo gap — even (<50) | — | 1.061 | Evenly-matched fixtures hardest |
| Region — Faroe Islands | 0.793 | — | Small leagues more predictable |
| Region — Nigeria | — | 0.938 | High-noise leagues near random |
Full breakdown: Holdout Analysis report.
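A per-slice log-loss like the one in the table above could be computed roughly as follows. This is a minimal sketch: the 50 and 300 thresholds come from the table, while the middle bucket, the tuple layout, and the toy predictions are assumptions made for illustration.

```python
import math
from collections import defaultdict

def elo_gap_bucket(elo_home, elo_away):
    """Bucket a fixture by absolute Elo difference (middle bucket is assumed)."""
    gap = abs(elo_home - elo_away)
    if gap < 50:
        return "even (<50)"
    if gap >= 300:
        return "large (300+)"
    return "medium (50-299)"

def log_loss_by_slice(rows):
    """rows: iterable of (slice_key, predicted_probs, true_label)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for key, probs, label in rows:
        sums[key] += -math.log(max(probs[label], 1e-15))
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Toy fixtures: (elo_home, elo_away, probs[home, draw, away], outcome index).
matches = [
    (1900, 1500, [0.80, 0.13, 0.07], 0),
    (1850, 1520, [0.75, 0.15, 0.10], 0),
    (1600, 1580, [0.40, 0.30, 0.30], 1),
    (1610, 1600, [0.38, 0.30, 0.32], 2),
]
rows = [(elo_gap_bucket(h, a), p, y) for h, a, p, y in matches]
per_slice = log_loss_by_slice(rows)
```

In this toy data the large-gap slice scores a lower log-loss than the even slice, mirroring the direction of the reported results; the magnitudes are not comparable to the real holdout.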
Calibration
Calibration is applied post hoc using a temporal split (no holdout leakage). Both sigmoid and isotonic methods are evaluated, and the variant with the lower ECE is registered.
| Model | Raw ECE | Calibrated ECE | Method chosen |
|---|---|---|---|
| XGBoost champion | 0.004 | 0.004 | isotonic |
Calibration evidence: Model Analysis report.
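The selection rule (keep whichever calibrator gives the lower ECE) can be sketched as below. The page does not say which ECE variant is used, so this assumes top-label ECE with 10 equal-width confidence bins; the function names and the toy probability vectors are hypothetical stand-ins for real calibrator outputs.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Top-label ECE with equal-width confidence bins (assumed definition)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)
        pred = p.index(conf)
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    n = len(labels)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        avg_acc = sum(a for _, a in members) / len(members)
        ece += (len(members) / n) * abs(avg_acc - avg_conf)
    return ece

def pick_calibrator(candidates, labels):
    """candidates: {method_name: calibrated_probs}. Returns (name, ece) with the lowest ECE."""
    scored = {name: expected_calibration_error(p, labels)
              for name, p in candidates.items()}
    best = min(scored, key=scored.get)
    return best, scored[best]

# Toy outputs: one overconfident candidate, one matching observed accuracy.
labels = [0, 0, 1, 2]
candidates = {
    "sigmoid": [[0.90, 0.05, 0.05]] * 4,   # 90% confident, 50% correct
    "isotonic": [[0.50, 0.25, 0.25]] * 4,  # 50% confident, 50% correct
}
best_method, best_ece = pick_calibrator(candidates, labels)
```

To avoid leakage, the labels passed to `pick_calibrator` would need to come from a calibration window that precedes the holdout, in line with the temporal split described above.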
ROI simulation
Flat-stake simulation on holdout matches matched to Fonbet closing odds.
| Metric | Value |
|---|---|
| Holdout matches | 135 970 |
| ROI % (flat-stake, all regions) | Near break-even (−0.70%) |
| ROI % (value bets, filtered regions) | Positive in regions with consistent edge |
Region-level filtering is the primary lever: dropping regions with negative ROI concentrates bets where the model has consistent edge. Full analysis: Live Betting Strategy report.
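For reference, a flat-stake ROI with a value-bet filter can be sketched as follows. It assumes decimal odds and defines a value bet as one where model probability times odds exceeds 1; the threshold, the tuple layout, and the toy bets are illustrative and not taken from the Live Betting Strategy report.

```python
def flat_stake_roi(bets, stake=1.0):
    """bets: iterable of (decimal_odds, won). Returns ROI in percent."""
    staked = 0.0
    returned = 0.0
    for odds, won in bets:
        staked += stake
        if won:
            returned += stake * odds  # decimal odds include the returned stake
    return 100.0 * (returned - staked) / staked if staked else 0.0

def value_bets(candidates, min_edge=0.0):
    """Keep selections whose expected edge (prob * odds - 1) exceeds min_edge."""
    return [(odds, won) for prob, odds, won in candidates
            if prob * odds - 1.0 > min_edge]

# Toy selections: (model_probability, decimal_odds, bet_won).
candidates = [
    (0.55, 2.0, True),
    (0.30, 3.0, False),
    (0.40, 2.2, False),
    (0.25, 5.0, False),
    (0.60, 1.8, True),
]
roi_all = flat_stake_roi([(odds, won) for _, odds, won in candidates])
roi_value = flat_stake_roi(value_bets(candidates))
```

Region-level filtering would follow the same pattern with an extra grouping key. Regions should be chosen on an earlier period than the one used to report ROI; otherwise the filtered figure is selected in-sample.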
System scale
| Dimension | Value |
|---|---|
| Training data | Multi-league, multi-season historical fixtures |
| Holdout set | 135 970 matches (2024+) |
| DVC pipeline stages | 20 |
| Feature groups | Elo ratings, rolling match stats, league position, H2H, rest days |
| MLflow experiments | Tracked across tune → final_train → ablation (218 runs) |
| Model aliases | smoke → candidate → champion (manual gate at champion) |
| API latency (sync) | Sub-second; p50 ~126 ms on local dev (WSL/Docker) |
| Test suite | 564 passing tests (unit, property, service, contract, load) |
| Infrastructure cost | <€30/month (self-hosted single VPS) |
Validation integrity
| Check | Status |
|---|---|
| Temporal split enforced (no future leakage) | ✅ |
| Train / serve feature parity (`select_model_features()`) | ✅ |
| Bookmaker odds not used as input features | ✅ |
| Calibration (ECE 0.004) | ✅ |
| Holdout fully outside training window | ✅ |
| DVC-tracked lineage (data → features → model) | ✅ |
See Validation Strategy for full details.
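Two of the checks above, the temporal split and the exclusion of bookmaker odds from the inputs, can be illustrated with a short sketch. The cutoff date, the field names, and the banned prefixes are assumptions (the page only states "2024+"), and this does not reproduce the project's `select_model_features()`.

```python
from datetime import date

HOLDOUT_START = date(2024, 1, 1)  # assumed cutoff; the page only states "2024+"

def temporal_split(matches, cutoff=HOLDOUT_START):
    """Matches strictly before the cutoff train the model; the rest form the holdout."""
    train = [m for m in matches if m["date"] < cutoff]
    holdout = [m for m in matches if m["date"] >= cutoff]
    return train, holdout

def find_odds_features(feature_names, banned_prefixes=("odds_", "bookmaker_")):
    """Return feature names that look like bookmaker odds (prefixes are hypothetical)."""
    return [name for name in feature_names if name.startswith(banned_prefixes)]

# Toy fixtures with hypothetical field names.
matches = [
    {"match_id": 1, "date": date(2022, 5, 14)},
    {"match_id": 2, "date": date(2023, 12, 31)},
    {"match_id": 3, "date": date(2024, 1, 1)},
    {"match_id": 4, "date": date(2024, 8, 17)},
]
train, holdout = temporal_split(matches)
leaked = find_odds_features(["elo_home", "elo_away", "rest_days_home"])
```

A check of this kind is cheap enough to run as a unit test on every pipeline execution: the latest training date must precede the earliest holdout date, and `leaked` must be empty.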
Serving performance
| Endpoint | p50 | p95 | p99 | RPS | Notes |
|---|---|---|---|---|---|
| `GET /predict/{match_id}` | 126 ms | 442 ms | 460 ms | 18 | 3 concurrent, 60 req, local dev (WSL/Docker) |
| `GET /healthcheck/` | 8 ms | 254 ms | — | 100 | 5 concurrent, 100 req |
Production single-node K8s ceiling: ~5 RPS (1 Gunicorn worker, Celery concurrency=1). To generate the full Locust report, run `locust -f tests/load/locustfile.py --host http://localhost:8000`.
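When reading the p50/p95/p99 columns, it helps to recall how a latency percentile is derived from raw samples. The sketch below uses the nearest-rank definition on made-up latencies; Locust's own percentile calculation may differ in detail, so treat this as an illustration rather than a reproduction of the report.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: smallest sample with at least q% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical request latencies in milliseconds.
latencies_ms = [100, 110, 120, 126, 130, 140, 150, 300, 440, 460]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```

With only 60 requests, as in the `/predict` run above, the p95 and p99 values rest on the three slowest responses or fewer, so tail figures from small runs should be read as rough indications.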
Related
- Analysis Reports — full Quarto-generated evidence
- Validation Strategy — temporal split and leakage prevention
- Implementation Status — what is operational today
Interpretation
To be filled in after the prod run completes. Template:
"Model beats Elo-only baseline by N on Brier score; calibration ECE is acceptable above 0.4 confidence. Draw class is hardest to predict (lowest recall). Underperforms on lower-tier leagues — see error analysis. ROI simulation shows the model has N% edge over uniform prior; draw bets have negative edge."
See Promotion Policy for the thresholds used to gate Staging → Production. See Baseline & Success Metrics for the full metrics specification.