
Results

Model: soccer-match-outcome @ smoke · Holdout: 135 970 matches (2024+, temporal split)


Model performance — holdout (2024+)

| Metric | Model | Bookmaker benchmark | Random baseline |
| --- | --- | --- | --- |
| Log-Loss | 1.006 | ~0.97 | ~1.099 |
| Brier Score | 0.601 | | ~0.667 |
| ROC AUC (OvR) | 0.643 | | 0.500 |
| Accuracy | 50.4% | ~53% | ~33% |
| ECE (calibration) | 0.004 | | |

The model beats all naive baselines (random, hard prior, soft prior) and is well calibrated (ECE ≈ 0.004). It does not yet match the bookmaker benchmark on log-loss (1.006 vs. ~0.97), which is consistent with its smoke alias status (the first promotion stage, before candidate).

This is the best v1 model to date, currently registered under the smoke alias. See Promotion Policy for the log-loss gate required to advance to candidate and then champion.
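The random-baseline column can be checked by hand: a uniform three-way forecast assigns probability 1/3 to every outcome, which gives a log-loss of ln 3 ≈ 1.099 and a multiclass Brier score of 2/3 ≈ 0.667. The sketch below computes both metrics in plain Python. The function names and the class encoding are illustrative assumptions, not taken from the project code.

```python
import math

def log_loss(probs, labels):
    """Mean negative log-probability assigned to the true class."""
    eps = 1e-15  # guard against log(0)
    return -sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels)) / len(labels)

def brier_score(probs, labels):
    """Multiclass Brier score: squared error summed over classes, averaged over matches."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((p_k - (1.0 if k == y else 0.0)) ** 2 for k, p_k in enumerate(p))
    return total / len(labels)

# Assumed encoding: 0 = home win, 1 = draw, 2 = away win.
labels = [0, 1, 2, 0, 2, 1]
uniform = [[1 / 3, 1 / 3, 1 / 3]] * len(labels)

uniform_log_loss = log_loss(uniform, labels)
uniform_brier = brier_score(uniform, labels)
print(round(uniform_log_loss, 3), round(uniform_brier, 3))  # 1.099 0.667
```

For a uniform forecast both values are independent of the labels, which is why they serve as a fixed floor for comparison.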


Model comparison (MLflow experiment: matches_clf_v1)

Model Log-loss ↓ Brier ↓ ECE ↓ ROC-AUC OvR ↑ Notes
Marginal (class prior) 1.0712 n/a n/a Historical class frequencies; calibrated hard baseline
Elo-only (logistic regression) 3 ELO features only
LogReg (full features) 1.0053 (CV) −6.2% vs baseline; CV on 10% data (test_v1.01)
XGBoost (tuned, uncalibrated) 1.0063 (CV) CV on 10% data (test_v1.01)
HGB (tuned) 1.0074 (CV) CV on 10% data (test_v1.01)
Best model (champion) — hold-out 1.006 0.601 0.004 0.643 Log-Loss −5.8% vs marginal baseline; 135 970 hold-out matches (2024+)

CV values are from test_v1.01 (10% data, n_trials=20). Hold-out values are from matches_clf_v1 experiment, soccer-match-outcome @ smoke. Full run history: MLflow UI → experiment matches_clf_v1.


Slice diagnostics

Performance varies meaningfully by match context:

| Slice | Best log-loss | Worst log-loss | Key insight |
| --- | --- | --- | --- |
| Elo gap: large (300+) | 0.632 | | High-confidence matches predictable |
| Elo gap: even (<50) | | 1.061 | Evenly-matched fixtures hardest |
| Region: Faroe Islands | 0.793 | | Small leagues more predictable |
| Region: Nigeria | | 0.938 | High-noise leagues near random |

Full breakdown: Holdout Analysis report.
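A slice diagnostic of this kind amounts to grouping holdout predictions by a context key and averaging the log-loss within each group. The following is a minimal sketch under stated assumptions: the bucket edges below 50 and at 300+ follow the table, while the middle bucket, the function names, and the toy data are invented for illustration.

```python
import math
from collections import defaultdict

def elo_gap_bucket(gap):
    """Bucket the absolute Elo gap. The middle bucket is an assumption."""
    gap = abs(gap)
    if gap < 50:
        return "even (<50)"
    if gap >= 300:
        return "large (300+)"
    return "medium (50-299)"

def log_loss_by_slice(rows):
    """rows: iterable of (slice_key, prob_of_true_class). Returns {slice_key: mean log-loss}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for key, p_true in rows:
        sums[key] -= math.log(max(p_true, 1e-15))
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Toy data: (Elo gap, probability the model assigned to the outcome that occurred).
matches = [(320, 0.70), (-410, 0.55), (10, 0.36), (-35, 0.33), (120, 0.45)]
per_slice = log_loss_by_slice((elo_gap_bucket(g), p) for g, p in matches)

for name, value in sorted(per_slice.items()):
    print(f"{name}: {value:.3f}")
```

Small slices produce noisy estimates, so a per-slice match count is worth reporting next to each value.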


Calibration

Calibration is applied post hoc on a temporal split, so no holdout data leaks into the calibrator. Both sigmoid and isotonic methods are evaluated, and the one with the lower ECE is registered.

| Model | Raw ECE | Calibrated ECE | Method chosen |
| --- | --- | --- | --- |
| XGBoost champion | 0.004 | 0.004 | isotonic |

Calibration evidence: Model Analysis report.
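The selection rule described above (evaluate both calibrators, keep the one with the lower ECE) can be sketched as follows. This assumes top-label ECE with ten equal-width confidence bins; the project may use a different ECE variant or binning, and the function names and toy probabilities are hypothetical.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Top-label ECE: count-weighted mean of |accuracy - confidence| over confidence bins."""
    bins = [[0, 0.0, 0] for _ in range(n_bins)]  # [count, confidence sum, correct count]
    for p, y in zip(probs, labels):
        conf = max(p)
        pred = p.index(conf)
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += int(pred == y)
    n = len(labels)
    ece = 0.0
    for count, conf_sum, correct in bins:
        if count:
            ece += (count / n) * abs(correct / count - conf_sum / count)
    return ece

def pick_calibrator(candidates, labels):
    """candidates: {method_name: calibrated probabilities}. Returns (best_name, all_scores)."""
    scores = {name: expected_calibration_error(p, labels) for name, p in candidates.items()}
    return min(scores, key=scores.get), scores

# Toy calibration-split data, invented for illustration only.
labels = [0, 0, 1, 2]
candidates = {
    "sigmoid": [[0.8, 0.1, 0.1]] * 4,     # overconfident: 80% confidence, 50% accuracy
    "isotonic": [[0.5, 0.25, 0.25]] * 4,  # confidence matches accuracy
}
best, scores = pick_calibrator(candidates, labels)
print(best, scores)
```

The comparison has to run on data that neither the model nor the calibrator was fitted on, otherwise the more flexible method tends to look better than it is.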


ROI simulation

Flat-stake simulation on holdout matches matched to Fonbet closing odds.

| Metric | Value |
| --- | --- |
| Holdout matches | 135 970 |
| ROI % (flat-stake, all regions) | Near break-even (−0.70%) |
| ROI % (value bets, filtered regions) | Positive in regions with consistent edge |

Region-level filtering is the primary lever: dropping regions with negative ROI concentrates bets where the model has consistent edge. Full analysis: Live Betting Strategy report.
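A flat-stake simulation with region filtering can be sketched in a few lines. The value-bet rule (bet when model probability times decimal odds exceeds 1), the record layout, and all numbers below are assumptions for illustration; they are not the project's actual strategy or results.

```python
from collections import defaultdict

def flat_stake_roi(bets, stake=1.0):
    """bets: list of (decimal_odds, won). Returns ROI as a percentage of the total staked."""
    if not bets:
        return 0.0
    profit = sum(stake * (odds - 1.0) if won else -stake for odds, won in bets)
    return 100.0 * profit / (stake * len(bets))

def is_value_bet(model_prob, decimal_odds, min_edge=0.0):
    """Positive expected value when model_prob * odds - 1 exceeds min_edge."""
    return model_prob * decimal_odds - 1.0 > min_edge

# Toy records: (region, model probability, closing decimal odds, bet won?)
records = [
    ("A", 0.55, 2.10, True),
    ("A", 0.40, 3.00, False),
    ("A", 0.60, 1.90, True),
    ("B", 0.50, 2.20, False),
    ("B", 0.45, 2.50, False),
    ("B", 0.40, 2.80, True),
]

by_region = defaultdict(list)
for region, prob, odds, won in records:
    if is_value_bet(prob, odds):
        by_region[region].append((odds, won))

roi_by_region = {region: flat_stake_roi(bets) for region, bets in by_region.items()}
kept_regions = sorted(r for r, roi in roi_by_region.items() if roi > 0)
print(roi_by_region, kept_regions)
```

Selecting regions on the same matches used to report ROI gives an optimistic figure. A fairer estimate picks the regions on an earlier period and measures ROI on a later one.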


System scale

| Dimension | Value |
| --- | --- |
| Training data | Multi-league, multi-season historical fixtures |
| Holdout set | 135 970 matches (2024+) |
| DVC pipeline stages | 20 |
| Feature groups | Elo ratings, rolling match stats, league position, H2H, rest days |
| MLflow experiments | Tracked across tune → final_train → ablation (218 runs) |
| Model aliases | smoke → candidate → champion (manual gate at champion) |
| API latency (sync) | Sub-second; p50 ~126 ms on local dev (WSL/Docker) |
| Test suite | 564 passing tests (unit, property, service, contract, load) |
| Infrastructure cost | <€30/month (self-hosted single VPS) |

Validation integrity

Check Status
Temporal split enforced (no future leakage)
Train / serve feature parity (select_model_features())
Bookmaker odds not used as input features
Calibration (ECE 0.004)
Holdout fully outside training window
DVC-tracked lineage (data → features → model)

See Validation Strategy for full details.
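Two of the checks above lend themselves to simple automated assertions: the temporal split and the exclusion of bookmaker odds from the feature set. The sketch below is illustrative only. The 1 January 2024 cutoff is inferred from the "2024+" label, and the record layout, the `odds_` prefix, and the feature names are hypothetical.

```python
from datetime import date

HOLDOUT_START = date(2024, 1, 1)  # assumed cutoff for the "2024+" holdout

def temporal_split(matches, cutoff=HOLDOUT_START):
    """Everything before the cutoff is training data; everything on or after is holdout."""
    train = [m for m in matches if m["date"] < cutoff]
    holdout = [m for m in matches if m["date"] >= cutoff]
    return train, holdout

def has_future_leakage(train, holdout):
    """True if any training match is dated on or after the earliest holdout match."""
    if not train or not holdout:
        return False
    return max(m["date"] for m in train) >= min(m["date"] for m in holdout)

def find_forbidden_features(feature_names, forbidden_prefixes=("odds_",)):
    """Return feature names that look like bookmaker odds (hypothetical naming scheme)."""
    return [name for name in feature_names if name.startswith(forbidden_prefixes)]

matches = [
    {"id": 1, "date": date(2023, 12, 30)},
    {"id": 2, "date": date(2023, 12, 31)},
    {"id": 3, "date": date(2024, 1, 1)},
    {"id": 4, "date": date(2024, 3, 15)},
]
train, holdout = temporal_split(matches)
leak = has_future_leakage(train, holdout)
forbidden = find_forbidden_features(["elo_diff", "rest_days_home", "h2h_home_wins"])
print(len(train), len(holdout), leak, forbidden)
```

Checks like these run cheaply in a test suite, so a regression in the split logic fails a build instead of inflating a metric.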


Serving performance

Endpoint p50 p95 p99 RPS Notes
GET /predict/{match_id} 126 ms 442 ms 460 ms 18 3 concurrent, 60 req, local dev (WSL/Docker)
GET /healthcheck/ 8 ms 254 ms 100 5 concurrent, 100 req

Production single-node K8s ceiling: ~5 RPS (1 Gunicorn worker, Celery concurrency=1). To generate the full Locust report, run: locust -f tests/load/locustfile.py --host http://localhost:8000
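As a rough sanity check on the table, Little's law relates concurrency, throughput, and mean latency: with N requests in flight and throughput X, the mean time per request is about N / X. The sketch assumes a closed loop with no think time between requests, which may not match the actual Locust configuration, so the results are approximate.

```python
def implied_mean_latency_ms(concurrency, rps):
    """Little's law for a closed loop without think time: mean latency = concurrency / throughput."""
    return 1000.0 * concurrency / rps

predict_ms = implied_mean_latency_ms(concurrency=3, rps=18)
health_ms = implied_mean_latency_ms(concurrency=5, rps=100)
print(f"/predict mean ~ {predict_ms:.0f} ms, /healthcheck mean ~ {health_ms:.0f} ms")
```

An implied mean of about 167 ms for /predict sits between the reported p50 (126 ms) and p95 (442 ms), which is what a right-skewed latency distribution would produce.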



Interpretation

To be filled in after the production run completes. Template:

"Model beats Elo-only baseline by N on Brier score; calibration ECE is acceptable above 0.4 confidence. Draw class is hardest to predict (lowest recall). Underperforms on lower-tier leagues — see error analysis. ROI simulation shows the model has N% edge over uniform prior; draw bets have negative edge."


See Promotion Policy for the thresholds used to gate Staging → Production. See Baseline & Success Metrics for the full metrics specification.