Machine Learning Subsystem

Purpose

This section documents the ML subsystem of SoccerPredictAI: problem framing, validation discipline, feature logic, training lifecycle, experiment tracking, model contracts, and registry promotion.

It is not a second architecture overview and does not repeat data engineering concerns.

For system-level design, boundaries, and deployment topology, see Architecture.
For datasets, lineage, contracts, and reproducibility boundary, see Data.
For inference modes, API schemas, and serving behaviour, see Serving.
For implementation readiness of each component, see Status.

Scope

The ML subsystem covers everything from versioned feature artifacts to a promoted, serving-ready model version:

what the model predicts and why this is an ML problem,
how success is defined and measured,
why temporal validation is mandatory and how it is enforced,
how features are designed to prevent leakage and maintain offline/online parity,
how training is orchestrated reproducibly via DVC,
how experiments are tracked and traced via MLflow,
what the model interface contract is,
how models move from training into serving via the registry,
what the current limitations are.

Design principles

Principle	What it means in practice
Reproducibility by default	Same Git commit + DVC dataset version + `params.yaml` → identical results
Validation over optimisation	A correct temporal split takes priority over any marginal metric gain
Leakage is a critical bug	Any feature that encodes future information invalidates the experiment
Explicit contracts	Model input/output schemas are versioned alongside model artifacts
Offline/online parity	Feature logic is shared; no ad-hoc transformations at inference
Registry as handoff	The only path from training to serving is through the MLflow registry
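
To make the "Validation over optimisation" and "Leakage is a critical bug" principles concrete, a temporal split can be sketched in a few lines. This is a minimal illustration, not the project's `split_data` stage: the `Match` record, its `kickoff` field, and the `temporal_split` helper are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Match:
    """Hypothetical match record; field names are illustrative only."""
    match_id: int
    kickoff: date


def temporal_split(matches: list[Match], cutoff: date) -> tuple[list[Match], list[Match]]:
    """Put matches before `cutoff` in train and the rest in test (no shuffling)."""
    ordered = sorted(matches, key=lambda m: m.kickoff)
    train = [m for m in ordered if m.kickoff < cutoff]
    test = [m for m in ordered if m.kickoff >= cutoff]
    # Guard against leakage: every training match must precede every test match.
    if train and test:
        assert train[-1].kickoff < test[0].kickoff, "temporal leakage: train overlaps test"
    return train, test


# Made-up example data, deliberately out of chronological order.
matches = [
    Match(1, date(2023, 8, 12)),
    Match(2, date(2024, 5, 19)),
    Match(3, date(2023, 12, 26)),
    Match(4, date(2024, 9, 1)),
]
train, test = temporal_split(matches, cutoff=date(2024, 1, 1))
train_ids = [m.match_id for m in train]  # matches played before the cutoff
test_ids = [m.match_id for m in test]    # matches played on or after the cutoff
```

A random shuffle would mix seasons across the two sets, which is the failure mode the temporal split exists to rule out.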

ML lifecycle

flowchart LR
    A[Versioned Features\ndata/features/] --> B[Temporal Split\nsplit_data]
    B --> C[Tune\ntune_xgb]
    B --> D[Baseline Models\nclassification_models]
    C --> E[Final Train\nfinal_train]
    D --> E
    E --> F[Evaluate on\nheld-out test]
    F --> G[MLflow Run\nlogged + traced]
    G --> H[Registry\nregister_model]
    H --> I[Serving\nFastAPI + Celery]

Each step is a DVC stage. Execution order is determined by the DAG in `dvc.yaml`. MLflow is the observability and traceability layer across all training stages.
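
The idea that execution order follows from the DAG can be illustrated with a topological sort. The sketch below uses Python's standard `graphlib` module; the dependency mapping is an assumption read off the flowchart above, not parsed from the real `dvc.yaml`, which may define more stages or edges.

```python
from graphlib import TopologicalSorter

# Assumed dependencies, read off the flowchart (stage -> stages it depends on).
# The authoritative graph lives in dvc.yaml.
stage_deps = {
    "split_data": set(),
    "tune_xgb": {"split_data"},
    "classification_models": {"split_data"},
    "final_train": {"tune_xgb", "classification_models"},
    "register_model": {"final_train"},
}

# static_order() yields each stage only after all of its dependencies.
execution_order = list(TopologicalSorter(stage_deps).static_order())
print(execution_order)
```

`tune_xgb` and `classification_models` depend only on `split_data`, so their relative order is not fixed; any order that places both before `final_train` is valid.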

Pages in this section

Page	Covers
Problem	Prediction task, target construction, success definition
Baseline	Naive baselines, bookmaker benchmark, promotion gate
Validation	Temporal splits, leakage prevention, property tests
Features	Implemented feature families, parity rules, excluded types
Training Pipeline	DVC orchestration, stages, determinism
Tuning	Optuna search, time-aware CV, best-params flow
MLflow	Experiment tracking, run structure, lineage
Model Contract	Input/output schema, breaking changes, versioning
Model Registry	Lifecycle stages, promotion gate, rollback, hard gates, checklist
Limitations	Current limitations, justified future improvements

Analysis Reports

Quarto-generated reports are the authoritative source of evidence for model behaviour, feature quality, and experiment results. They are generated directly from pipeline code and data, not written by hand.

Principle: when a report and a documentation page disagree, the report is correct.

Reports are rendered by running:

quarto render reports/qmd/              # all reports
quarto render reports/qmd/05_holdout_analysis.qmd  # single report

#	Report	What it shows
—	Pipeline Summary	Entry point; links to all reports, with a Mermaid diagram of the pipeline
01	EDA & Preprocessing	Raw data overview, target analysis, class balance, league distribution
02	Feature Engineering	Elo calibration evidence, rolling stat signal analysis, temporal drift
03	Experiment Studies v1.01–v1.05	218 MLflow runs; 7 studies, including learning curve, window size, model ablation, and feature ablation
04	Model Analysis	Champion hyperparameters, calibration curves, SHAP importance, tuning history
05	Holdout Analysis	Hold-out metrics, error analysis by Elo gap / league / season, ROI simulation
06	Live Inference & Odds	`batch_inference` DVC stage walkthrough, live Fonbet odds pipeline (Airflow 3-stage DAG)
07	Live Betting Strategy	Flat-stake and fractional-Kelly simulation on Fonbet odds; region-by-region ROI breakdown
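
For readers unfamiliar with the staking terms in report 07: a flat stake bets the same amount every time, while fractional Kelly bets a fixed fraction of the Kelly-optimal share of the bankroll. The sketch below uses the standard Kelly formula for a single bet at decimal odds. The function names, the `fraction=0.25` default, and the example numbers are illustrative assumptions, not values taken from the report.

```python
def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Full-Kelly share of bankroll for one bet; 0.0 when the bet has no edge."""
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must be a probability in [0, 1]")
    if decimal_odds <= 1.0:
        raise ValueError("decimal_odds must be greater than 1")
    b = decimal_odds - 1.0  # net profit per unit staked on a win
    f_star = (b * p_win - (1.0 - p_win)) / b
    return max(f_star, 0.0)  # never stake on a negative-edge bet


def fractional_kelly_stake(bankroll: float, p_win: float,
                           decimal_odds: float, fraction: float = 0.25) -> float:
    """Stake a fixed fraction of the full-Kelly amount (hypothetical default 0.25)."""
    return bankroll * fraction * kelly_fraction(p_win, decimal_odds)


# Made-up numbers: model probability 0.5 against decimal odds of 2.5.
stake = fractional_kelly_stake(bankroll=1000.0, p_win=0.5, decimal_odds=2.5)
# A bet whose implied probability exceeds the model's gets no stake.
no_edge = kelly_fraction(p_win=0.4, decimal_odds=2.0)
```

Scaling the full-Kelly stake down trades some expected growth for lower variance, which matters when the model's probabilities are imperfectly calibrated.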

Report	Corresponding doc page
02 — Feature Engineering	Feature Engineering (the page covers design; the report provides the evidence)
03 — Experiment Studies	Tuning, Training Pipeline
04 — Model Analysis	Model Registry
05 — Holdout Analysis	Baseline & Success Metrics, Portfolio Results
06 — Live Inference	Training Pipeline (`batch_inference` stage)
07 — Live Betting	Portfolio Results (ROI section)