Machine Learning Subsystem
Purpose
This section documents the ML subsystem of SoccerPredictAI: problem framing, validation discipline, feature logic, training lifecycle, experiment tracking, model contracts, and registry promotion.
It is not a second architecture overview and does not repeat data engineering concerns.
- For system-level design, boundaries, and deployment topology, see Architecture.
- For datasets, lineage, contracts, and reproducibility boundary, see Data.
- For inference modes, API schemas, and serving behaviour, see Serving.
- For implementation readiness of each component, see Status.
Scope
The ML subsystem covers everything from versioned feature artifacts to a promoted, serving-ready model version:
- what the model predicts and why this is an ML problem,
- how success is defined and measured,
- why temporal validation is mandatory and how it is enforced,
- how features are designed to prevent leakage and maintain offline/online parity,
- how training is orchestrated reproducibly via DVC,
- how experiments are tracked and traced via MLflow,
- what the model interface contract is,
- how models move from training into serving via the registry,
- what the current limitations are.
Design principles
| Principle | What it means in practice |
|---|---|
| Reproducibility by default | Same Git commit + DVC dataset version + params.yaml → identical results |
| Validation over optimisation | A correct temporal split takes priority over any marginal metric gain |
| Leakage is a critical bug | Any feature that encodes future information invalidates the experiment |
| Explicit contracts | Model input/output schemas are versioned alongside model artifacts |
| Offline/online parity | Feature logic is shared; no ad-hoc transformations at inference |
| Registry as handoff | The only path from training to serving is through the MLflow registry |
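
As a rough illustration of the "validation over optimisation" and "leakage is a critical bug" principles, the sketch below splits matches at a cutoff date and provides a check that rejects any split in which a training match is not strictly earlier than every test match. The `Match` record, its field names, and both helper functions are hypothetical and not taken from the project code; the real split is performed by the `split_data` stage.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Match:
    # Hypothetical record; the field names are assumptions for illustration.
    match_id: int
    played_on: date


def temporal_split(matches, cutoff):
    """Return (train, test): train strictly before cutoff, test on or after it."""
    ordered = sorted(matches, key=lambda m: m.played_on)
    train = [m for m in ordered if m.played_on < cutoff]
    test = [m for m in ordered if m.played_on >= cutoff]
    return train, test


def assert_temporal_order(train, test):
    """Raise ValueError unless every training match precedes every test match."""
    if not train or not test:
        return
    latest_train = max(m.played_on for m in train)
    earliest_test = min(m.played_on for m in test)
    if latest_train >= earliest_test:
        raise ValueError(
            f"temporal leakage: train ends {latest_train}, test starts {earliest_test}"
        )


matches = [
    Match(1, date(2023, 8, 12)),
    Match(2, date(2024, 5, 19)),
    Match(3, date(2023, 12, 2)),
    Match(4, date(2024, 9, 1)),
]
train, test = temporal_split(matches, cutoff=date(2024, 1, 1))
assert_temporal_order(train, test)  # passes: the split respects time order
print([m.match_id for m in train])  # [1, 3]
print([m.match_id for m in test])   # [2, 4]
```

A check like `assert_temporal_order` can also be run against any split produced elsewhere, for example a shuffled split, which it would reject.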
ML lifecycle
flowchart LR
A[Versioned Features\ndata/features/] --> B[Temporal Split\nsplit_data]
B --> C[Tune\ntune_xgb]
B --> D[Baseline Models\nclassification_models]
C --> E[Final Train\nfinal_train]
D --> E
E --> F[Evaluate on\nheld-out test]
F --> G[MLflow Run\nlogged + traced]
G --> H[Registry\nregister_model]
H --> I[Serving\nFastAPI + Celery]
Each step is a DVC stage. Execution order is determined by the DAG in dvc.yaml.
MLflow is the observability and traceability layer across all training stages.
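
To make "execution order is determined by the DAG" concrete, the sketch below derives a valid run order from stage dependencies with a topological sort. The stage names are taken from the flowchart above; the dependency edges are an assumption that mirrors it and are not read from the real dvc.yaml. DVC resolves the order itself from each stage's declared dependencies and outputs, so this only illustrates the idea using Python's standard library.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on.
# Assumption: edges mirror the flowchart, not the project's actual dvc.yaml.
stage_deps = {
    "split_data": set(),
    "tune_xgb": {"split_data"},
    "classification_models": {"split_data"},
    "final_train": {"tune_xgb", "classification_models"},
    "register_model": {"final_train"},
}

# static_order() yields every stage after all of its dependencies.
order = list(TopologicalSorter(stage_deps).static_order())
print(order)
```

`tune_xgb` and `classification_models` depend only on `split_data`, so either may come first; the only guarantee is that both precede `final_train`.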
Pages in this section
| Page | Covers |
|---|---|
| Problem | Prediction task, target construction, success definition |
| Baseline | Naive baselines, bookmaker benchmark, promotion gate |
| Validation | Temporal splits, leakage prevention, property tests |
| Features | Implemented feature families, parity rules, excluded types |
| Training Pipeline | DVC orchestration, stages, determinism |
| Tuning | Optuna search, time-aware CV, best-params flow |
| MLflow | Experiment tracking, run structure, lineage |
| Model Contract | Input/output schema, breaking changes, versioning |
| Model Registry | Lifecycle stages, promotion gate, rollback, hard gates, checklist |
| Limitations | Current limitations, justified future improvements |
Analysis Reports
Quarto-generated reports are the authoritative source of evidence for model behaviour, feature quality, and experiment results. They are generated directly from pipeline code and data, not written by hand.
Principle: when a report and a documentation page disagree, the report is correct.
Reports are rendered by running:
quarto render reports/qmd/ # all reports
quarto render reports/qmd/05_holdout_analysis.qmd # single report
| # | Report | What it shows |
|---|---|---|
| - | Pipeline Summary | Entry point; links to all reports and includes a Mermaid diagram of the pipeline |
| 01 | EDA & Preprocessing | Raw data overview, target analysis, class balance, league distribution |
| 02 | Feature Engineering | Elo calibration evidence, rolling stat signal analysis, temporal drift |
| 03 | Experiment Studies v1.01–v1.05 | 218 MLflow runs across 7 studies covering learning curve, window size, model ablation, and feature ablation |
| 04 | Model Analysis | Champion hyperparameters, calibration curves, SHAP importance, tuning history |
| 05 | Holdout Analysis | Hold-out metrics, error analysis by Elo gap / league / season, ROI simulation |
| 06 | Live Inference & Odds | Walkthrough of the batch_inference DVC stage; live Fonbet odds pipeline (three-stage Airflow DAG) |
| 07 | Live Betting Strategy | Flat-stake and fractional-Kelly simulation on Fonbet odds; region-by-region ROI breakdown |
The reports map to documentation pages as follows:
| # | Report | Corresponding doc page |
|---|---|---|
| 02 | Feature Engineering | Feature Engineering (the page covers design; the report holds the evidence) |
| 03 | Experiment Studies | Tuning, Training Pipeline |
| 04 | Model Analysis | Model Registry |
| 05 | Holdout Analysis | Baseline & Success Metrics, Portfolio Results |
| 06 | Live Inference | Training Pipeline (batch_inference stage) |
| 07 | Live Betting | Portfolio Results (ROI section) |