
Machine Learning Subsystem

Purpose

This section documents the ML subsystem of SoccerPredictAI: problem framing, validation discipline, feature logic, training lifecycle, experiment tracking, model contracts, and registry promotion.

It is not a second architecture overview and does not repeat data engineering concerns.

  • For system-level design, boundaries, and deployment topology, see Architecture.
  • For datasets, lineage, contracts, and reproducibility boundary, see Data.
  • For inference modes, API schemas, and serving behaviour, see Serving.
  • For implementation readiness of each component, see Status.

Scope

The ML subsystem covers everything from versioned feature artifacts to a promoted, serving-ready model version:

  • what the model predicts and why this is an ML problem,
  • how success is defined and measured,
  • why temporal validation is mandatory and how it is enforced,
  • how features are designed to prevent leakage and maintain offline/online parity,
  • how training is orchestrated reproducibly via DVC,
  • how experiments are tracked and traced via MLflow,
  • what the model interface contract is,
  • how models move from training into serving via the registry,
  • what the current limitations are.

Design principles

| Principle | What it means in practice |
|---|---|
| Reproducibility by default | Same git commit + DVC dataset version + params.yaml → identical results |
| Validation over optimisation | A correct temporal split takes priority over any marginal metric gain |
| Leakage is a critical bug | Any feature that encodes future information invalidates the experiment |
| Explicit contracts | Model input/output schemas are versioned alongside model artifacts |
| Offline/online parity | Feature logic is shared; no ad-hoc transformations at inference |
| Registry as handoff | The only path from training to serving is through the MLflow registry |
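
The validation and leakage principles above can be made concrete with a small check. What follows is a minimal sketch, not the project's split_data stage: the `Match` record, its field names, and the cutoff dates are hypothetical, and only the standard library is used. It splits matches strictly by kickoff date, then asserts that no row in an earlier split is dated on or after the first row of a later split.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Match:
    # Hypothetical record; the real feature schema is not shown on this page.
    match_id: int
    kickoff: date


def temporal_split(matches, train_end, valid_end):
    """Split by date: train < train_end <= valid < valid_end <= test."""
    if train_end >= valid_end:
        raise ValueError("train_end must be earlier than valid_end")
    ordered = sorted(matches, key=lambda m: m.kickoff)
    train = [m for m in ordered if m.kickoff < train_end]
    valid = [m for m in ordered if train_end <= m.kickoff < valid_end]
    test = [m for m in ordered if m.kickoff >= valid_end]
    return train, valid, test


def assert_no_temporal_overlap(earlier, later):
    """Fail if any row in `earlier` is dated on or after the first row of `later`."""
    if earlier and later:
        latest_earlier = max(m.kickoff for m in earlier)
        first_later = min(m.kickoff for m in later)
        assert latest_earlier < first_later, "temporal leakage between splits"


# Toy data: four matches per year, 2020 through 2023 (illustrative only).
matches = [Match(i, date(2020 + i // 4, 1 + (i % 4) * 3, 1)) for i in range(16)]
train, valid, test = temporal_split(matches, date(2022, 1, 1), date(2023, 1, 1))
assert_no_temporal_overlap(train, valid)
assert_no_temporal_overlap(valid, test)
```

A check of this shape is cheap enough to run as a property test on every split, which is one way to treat leakage as a bug rather than as a modelling choice.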

ML lifecycle

```mermaid
flowchart LR
    A[Versioned Features\ndata/features/] --> B[Temporal Split\nsplit_data]
    B --> C[Tune\ntune_xgb]
    B --> D[Baseline Models\nclassification_models]
    C --> E[Final Train\nfinal_train]
    D --> E
    E --> F[Evaluate on\nheld-out test]
    F --> G[MLflow Run\nlogged + traced]
    G --> H[Registry\nregister_model]
    H --> I[Serving\nFastAPI + Celery]
```

Each step is a DVC stage. Execution order is determined by the DAG in dvc.yaml. MLflow is the observability and traceability layer across all training stages.
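
To illustrate what "execution order is determined by the DAG" means, here is a minimal sketch that derives a valid run order from stage dependencies using Python's standard `graphlib`. DVC resolves this order itself from the stage definitions in dvc.yaml; the dependency map below is an assumption read off the flowchart, not copied from the real dvc.yaml. The evaluation and MLflow logging nodes carry no stage name in the diagram, so they are omitted here.

```python
from graphlib import TopologicalSorter

# Stage -> set of upstream stages it depends on.
# Edges are an assumption taken from the flowchart, not from dvc.yaml.
STAGE_DEPS = {
    "split_data": set(),
    "tune_xgb": {"split_data"},
    "classification_models": {"split_data"},
    "final_train": {"tune_xgb", "classification_models"},
    "register_model": {"final_train"},
}


def execution_order(deps):
    """Return one valid order in which every stage runs after its dependencies."""
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    return list(TopologicalSorter(deps).static_order())


order = execution_order(STAGE_DEPS)
print(order)
```

The relative order of tune_xgb and classification_models is not fixed, because neither depends on the other; any order that respects the edges is valid.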


Pages in this section

Page Covers
Problem Prediction task, target construction, success definition
Baseline Naive baselines, bookmaker benchmark, promotion gate
Validation Temporal splits, leakage prevention, property tests
Features Implemented feature families, parity rules, excluded types
Training Pipeline DVC orchestration, stages, determinism
Tuning Optuna search, time-aware CV, best-params flow
MLflow Experiment tracking, run structure, lineage
Model Contract Input/output schema, breaking changes, versioning
Model Registry Lifecycle stages, promotion gate, rollback, hard gates, checklist
Limitations Current limitations, justified future improvements

Analysis Reports

Quarto-generated reports are the authoritative source of evidence for model behaviour, feature quality, and experiment results. They are generated directly from pipeline code and data, not written by hand.

Principle: when a report and a documentation page disagree, the report is correct.

Reports are rendered by running:

```shell
quarto render reports/qmd/              # all reports
quarto render reports/qmd/05_holdout_analysis.qmd  # single report
```

| # | Report | What it shows |
|---|---|---|
| | Pipeline Summary | Entry point; all reports linked with pipeline mermaid |
| 01 | EDA & Preprocessing | Raw data overview, target analysis, class balance, league distribution |
| 02 | Feature Engineering | ELO calibration evidence, rolling stat signal analysis, temporal drift |
| 03 | Experiment Studies v1.01–v1.05 | 218 MLflow runs; 7 studies: learning curve, window size, model ablation, feature ablation |
| 04 | Model Analysis | Champion hyperparameters, calibration curves, SHAP importance, tuning history |
| 05 | Holdout Analysis | Hold-out metrics, error analysis by Elo gap / league / season, ROI simulation |
| 06 | Live Inference & Odds | batch_inference DVC stage walkthrough, live Fonbet odds pipeline (Airflow 3-stage DAG) |
| 07 | Live Betting Strategy | Flat-stake and fractional-Kelly simulation on Fonbet odds; region-by-region ROI breakdown |

| Report | Corresponding doc page |
|---|---|
| 02 Feature Engineering | Feature Engineering (design; the report holds the evidence) |
| 03 Experiment Studies | Tuning, Training Pipeline |
| 04 Model Analysis | Model Registry |
| 05 Holdout Analysis | Baseline & Success Metrics, Portfolio Results |
| 06 Live Inference | Training Pipeline (batch_inference stage) |
| 07 Live Betting | Portfolio Results (ROI section) |
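
Report 07 compares flat-stake and fractional-Kelly staking. For readers unfamiliar with the second term, the sketch below shows the textbook Kelly formula for a single binary bet, f* = (b·p − q) / b with b = decimal odds − 1 and q = 1 − p, scaled by a multiplier. It is an illustration under stated assumptions, not the report's simulation code: the function names and the 0.25 default multiplier are hypothetical.

```python
def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Full-Kelly fraction of bankroll for one binary bet; 0.0 when there is no edge."""
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must be in [0, 1]")
    if decimal_odds <= 1.0:
        raise ValueError("decimal_odds must be greater than 1")
    b = decimal_odds - 1.0  # net profit per unit staked if the bet wins
    f = (b * p_win - (1.0 - p_win)) / b
    return max(f, 0.0)  # never stake on a negative-edge bet


def fractional_kelly_stake(bankroll: float, p_win: float, decimal_odds: float,
                           multiplier: float = 0.25) -> float:
    """Stake = bankroll x multiplier x full-Kelly fraction (multiplier is hypothetical)."""
    return bankroll * multiplier * kelly_fraction(p_win, decimal_odds)


def flat_stake(unit: float) -> float:
    """Flat staking ignores the edge and always bets the same unit."""
    return unit
```

With a model probability of 0.6 at decimal odds 2.0, the full-Kelly fraction is 0.2 of the bankroll, so a quarter-Kelly stake is 5% of it; when the model probability implies no edge at the offered odds, the stake is zero.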