Results

Model: soccer-match-outcome @ smoke · Holdout: 135 970 matches (2024+, temporal split)

Model performance: holdout (2024+)

Metric	Model	Bookmaker benchmark	Random baseline
Log-Loss	1.006	~0.97	~1.099
Brier Score	0.601	—	~0.667
ROC AUC (OVR)	0.643	—	0.500
Accuracy	50.4%	~53%	~33%
ECE (calibration)	0.004	—	—

The model beats all naive baselines (random, hard prior, soft prior) and is well calibrated (ECE ≈ 0.004). It does not yet match the bookmaker benchmark on log-loss (1.006 vs ~0.97), which is consistent with its `smoke` alias status (the first promotion stage, before `candidate`).
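
The random-baseline column can be reproduced by hand. The sketch below assumes three outcome classes (home, draw, away) and the multiclass Brier score summed over classes; the source does not state the Brier definition, but the ~0.667 in the table matches this form.

```python
import math

def uniform_baseline(n_classes: int) -> tuple[float, float]:
    """Return (log-loss, multiclass Brier score) for a uniform 1/K prediction."""
    p = 1.0 / n_classes
    # The true class always receives probability 1/K, so log-loss is -ln(1/K).
    log_loss = -math.log(p)
    # Brier summed over classes: (1 - p)^2 for the true class,
    # p^2 for each of the K - 1 other classes.
    brier = (1.0 - p) ** 2 + (n_classes - 1) * p ** 2
    return log_loss, brier

ll, brier = uniform_baseline(3)
print(f"log-loss={ll:.3f}, brier={brier:.3f}")  # log-loss=1.099, brier=0.667
```

The same uniform predictor is right one time in three, which is where the ~33% accuracy baseline comes from.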

This is the best v1 model so far, registered under the `smoke` alias. See Promotion Policy for the log-loss gate required to advance to `candidate` and then `champion`.

Model comparison (MLflow experiment: `matches_clf_v1`)

Model	Log-loss ↓	Brier ↓	ECE ↓	ROC-AUC OvR ↑	Notes
Marginal (class prior)	1.0712	—	n/a	n/a	Historical class frequencies; calibrated hard baseline
Elo-only (logistic regression)	—	—	—	—	3 Elo features only
LogReg (full features)	1.0053 (CV)	—	—	—	−6.2% vs baseline; CV on 10% data (test_v1.01)
XGBoost (tuned, uncalibrated)	1.0063 (CV)	—	—	—	CV on 10% data (test_v1.01)
HGB (tuned)	1.0074 (CV)	—	—	—	CV on 10% data (test_v1.01)
Best model (champion) — hold-out	1.006	0.601	0.004	0.643	Log-Loss −5.8% vs marginal baseline; 135 970 hold-out matches (2024+)

CV values are from test_v1.01 (10% data, n_trials=20). Hold-out values are from matches_clf_v1 experiment, soccer-match-outcome @ smoke. Full run history: MLflow UI → experiment matches_clf_v1.
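
The percentages in the Notes column read as signed relative changes in log-loss against the marginal baseline. A minimal sketch of that arithmetic, assuming both values are measured on the same evaluation split (the helper name `relative_change` is illustrative, not from the project):

```python
def relative_change(value: float, baseline: float) -> float:
    """Signed relative change of `value` against `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline

marginal = 1.0712   # Marginal (class prior) log-loss from the table
logreg_cv = 1.0053  # LogReg (full features), CV log-loss from the table

change = relative_change(logreg_cv, marginal)
print(f"{change:+.1f}%")  # -6.2%
```

A negative value means a lower (better) log-loss than the baseline.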

Slice diagnostics

Performance varies meaningfully by match context:

Slice	Best log-loss	Worst log-loss	Key insight
Elo gap — large (300+)	0.632	—	High-confidence matches predictable
Elo gap — even (<50)	—	1.061	Evenly-matched fixtures hardest
Region — Faroe Islands	0.793	—	Small leagues more predictable
Region — Nigeria	—	0.938	High-noise leagues near random
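
Per-slice numbers like these come from averaging the per-match log-loss within each group. The following is a sketch only: the input rows are made up, the middle bucket is an assumption (the table names only the `<50` and `300+` edges), and `elo_gap_bucket` and `log_loss_by_slice` are hypothetical helpers rather than project functions.

```python
import math
from collections import defaultdict

def elo_gap_bucket(gap: float) -> str:
    """Bucket a match by absolute Elo gap (middle bucket is an assumption)."""
    gap = abs(gap)
    if gap < 50:
        return "even (<50)"
    if gap >= 300:
        return "large (300+)"
    return "medium (50-299)"

def log_loss_by_slice(rows):
    """rows: iterable of (slice_key, class_probs, true_class_index)."""
    totals = defaultdict(lambda: [0.0, 0])  # slice -> [sum of losses, count]
    for key, probs, y in rows:
        # Clip to avoid log(0) on an overconfident miss.
        p = min(max(probs[y], 1e-15), 1.0)
        totals[key][0] += -math.log(p)
        totals[key][1] += 1
    return {key: total / n for key, (total, n) in totals.items()}

# Made-up predictions: (Elo gap, [P(home), P(draw), P(away)], true class index)
matches = [
    (350, [0.80, 0.13, 0.07], 0),
    (320, [0.75, 0.15, 0.10], 0),
    (10, [0.38, 0.30, 0.32], 1),
    (-30, [0.36, 0.29, 0.35], 2),
]
rows = [(elo_gap_bucket(gap), probs, y) for gap, probs, y in matches]
per_slice = log_loss_by_slice(rows)

# Print slices from best (lowest) to worst log-loss.
for name, value in sorted(per_slice.items(), key=lambda kv: kv[1]):
    print(f"{name}: {value:.3f}")
```

The same grouping works for regions by swapping the bucket function for a region key.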

Full breakdown: Holdout Analysis report.

Calibration

Calibration is applied post hoc on a temporal split, so no holdout data leaks into the calibrator. Both sigmoid and isotonic methods are evaluated, and the one with the lower ECE is registered.

Model	Raw ECE	Calibrated ECE	Method chosen
XGBoost champion	0.004	0.004	isotonic
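
For reference, a minimal sketch of how an ECE value and the "lower ECE wins" rule can be computed. It assumes top-label ECE with 10 equal-width confidence bins; the source does not say which ECE variant or bin count the project uses, and the sigmoid figure in the usage lines is made up for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Top-label ECE: bin-weighted mean of |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 falls in the last bin
        bins[idx].append((conf, hit))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece

def pick_calibrator(ece_by_method):
    """Return the method name with the lowest ECE."""
    return min(ece_by_method, key=ece_by_method.get)

# Toy check: 10 predictions at 0.6 confidence, 6 of them correct.
conf = [0.6] * 10
hits = [1] * 6 + [0] * 4
ece = expected_calibration_error(conf, hits)

# 0.004 is the isotonic value from the table; the sigmoid value is hypothetical.
chosen = pick_calibrator({"sigmoid": 0.006, "isotonic": 0.004})
print(chosen)  # isotonic
```

In the toy check the predicted confidence equals the observed hit rate, so the ECE is zero up to floating-point error.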

Calibration evidence: Model Analysis report.

ROI simulation

Flat-stake simulation on holdout matches matched to Fonbet closing odds.

Metric	Value
Holdout matches	135 970
ROI % (flat-stake, all regions)	Near break-even (−0.70%)
ROI % (value bets, filtered regions)	Positive in regions with consistent edge

Region-level filtering is the primary lever: dropping regions with negative ROI concentrates bets where the model has consistent edge. Full analysis: Live Betting Strategy report.
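
The flat-stake arithmetic and the region filter can be sketched as follows. This is an illustration under assumptions, not the project's simulator: odds are taken to be decimal odds, the records are made up, and `is_value_bet` uses a common expected-value rule that the source does not confirm.

```python
from collections import defaultdict

def flat_stake_roi(bets, stake=1.0):
    """bets: iterable of (decimal_odds, won). Returns ROI in percent."""
    staked = 0.0
    returned = 0.0
    for odds, won in bets:
        staked += stake
        if won:
            returned += stake * odds  # decimal odds already include the stake
    if staked == 0:
        return 0.0
    return 100.0 * (returned - staked) / staked

def is_value_bet(model_prob, odds, margin=0.0):
    """Hypothetical value rule: expected return per unit stake exceeds 1 + margin."""
    return model_prob * odds > 1.0 + margin

def roi_by_region(records, stake=1.0):
    """records: iterable of (region, decimal_odds, won)."""
    grouped = defaultdict(list)
    for region, odds, won in records:
        grouped[region].append((odds, won))
    return {region: flat_stake_roi(bets, stake) for region, bets in grouped.items()}

# Made-up settled bets for two placeholder regions.
records = [
    ("A", 2.0, True), ("A", 2.0, False), ("A", 2.5, True),
    ("B", 3.0, False), ("B", 1.5, True),
]
overall = flat_stake_roi([(odds, won) for _, odds, won in records])
by_region = roi_by_region(records)
kept = sorted(region for region, roi in by_region.items() if roi > 0)
print(overall, by_region, kept)
```

If regions are selected on the same matches that are then used to report ROI, the filtered figure will be optimistic; choosing regions on an earlier period avoids that.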

System scale

Dimension	Value
Training data	Multi-league, multi-season historical fixtures
Holdout set	135 970 matches (2024+)
DVC pipeline stages	20
Feature groups	Elo ratings, rolling match stats, league position, H2H, rest days
MLflow experiments	Tracked across tune → final_train → ablation (218 runs)
Model aliases	`smoke → candidate → champion` (manual gate at `champion`)
API latency (sync)	Sub-second; p50 ~126 ms on local dev (WSL/Docker)
Test suite	564 passing tests (unit, property, service, contract, load)
Infrastructure cost	<€30/month (self-hosted single VPS)

Validation integrity

Check	Status
Temporal split enforced (no future leakage)	✅
Train / serve feature parity (`select_model_features()`)	✅
Bookmaker odds not used as input features	✅
Calibration (ECE 0.004)	✅
Holdout fully outside training window	✅
DVC-tracked lineage (data → features → model)	✅

See Validation Strategy for full details.
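
Two of these checks are easy to express as assertions. The sketch below assumes that "2024+" means a kickoff date on or after 2024-01-01, that each match is a dict with a `date` key, and that odds-derived columns share an `odds_` prefix; all three are illustrative assumptions, not the project's schema.

```python
from datetime import date

CUTOFF = date(2024, 1, 1)  # assumption: holdout starts on this date

def temporal_split(matches, cutoff=CUTOFF):
    """Split matches into (train, holdout) by date; never shuffles."""
    train = [m for m in matches if m["date"] < cutoff]
    holdout = [m for m in matches if m["date"] >= cutoff]
    return train, holdout

def assert_no_future_leakage(train, holdout):
    """Fail if any training match is dated at or after the earliest holdout match."""
    if train and holdout:
        latest_train = max(m["date"] for m in train)
        earliest_holdout = min(m["date"] for m in holdout)
        assert latest_train < earliest_holdout, "training window overlaps holdout"

def assert_no_odds_features(feature_names, banned_prefix="odds_"):
    """Fail if any feature name looks odds-derived (prefix is hypothetical)."""
    leaked = [name for name in feature_names if name.startswith(banned_prefix)]
    assert not leaked, f"odds-derived features present: {leaked}"

matches = [
    {"id": 1, "date": date(2023, 11, 4)},
    {"id": 2, "date": date(2023, 12, 30)},
    {"id": 3, "date": date(2024, 1, 1)},
    {"id": 4, "date": date(2024, 3, 9)},
]
train, holdout = temporal_split(matches)
assert_no_future_leakage(train, holdout)
assert_no_odds_features(["elo_home", "elo_away", "rest_days_home"])
```

Running checks like these in the test suite keeps the guarantees from silently regressing when the feature set changes.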

Serving performance

Endpoint	p50	p95	p99	RPS	Notes
`GET /predict/{match_id}`	126 ms	442 ms	460 ms	18	3 concurrent, 60 req, local dev (WSL/Docker)
`GET /healthcheck/`	8 ms	254 ms	—	100	5 concurrent, 100 req

Production single-node K8s ceiling: ~5 RPS (1 Gunicorn worker, Celery concurrency=1). To generate the full Locust report, run `locust -f tests/load/locustfile.py --host http://localhost:8000`.
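
For readers unfamiliar with the p50/p95/p99 columns, here is a sketch of a nearest-rank percentile over raw latency samples. The samples are made up, and Locust's own report may use a different estimation method, so small differences from its output are expected.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: smallest sample with at least q% of values <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]

# Made-up latencies in milliseconds, not measurements from the project.
latencies_ms = [110, 118, 120, 126, 126, 131, 140, 210, 442, 460]
summary = {f"p{q}": percentile(latencies_ms, q) for q in (50, 95, 99)}
print(summary)  # {'p50': 126, 'p95': 460, 'p99': 460}
```

With only a few dozen requests, as in the 60-request run above, p95 and p99 are set by one or two slow responses, which is why the two values sit so close together.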

Analysis Reports — full Quarto-generated evidence
Validation Strategy — temporal split and leakage prevention
Implementation Status — what is operational today

Interpretation

To be filled in after the production run completes. Template:

"Model beats Elo-only baseline by N on Brier score; calibration ECE is acceptable above 0.4 confidence. Draw class is hardest to predict (lowest recall). Underperforms on lower-tier leagues — see error analysis. ROI simulation shows the model has N% edge over uniform prior; draw bets have negative edge."

See Promotion Policy for the thresholds used to gate promotion between aliases (`smoke → candidate → champion`). See Baseline & Success Metrics for the full metrics specification.