Training Pipeline (DVC)

Purpose

This page documents the end-to-end ML pipeline, from raw data ingestion through feature engineering, model training, registry promotion, and batch inference, as it is defined in `dvc.yaml`.

Data sources and ETL internals are described in Data. Model registry lifecycle is in Model Registry.

Pipeline at a glance

The pipeline is defined in dvc.yaml as 22 DVC stages organised into six logical groups:

| Group | Stages | Purpose |
| --- | --- | --- |
| Ingestion | `load_data_from_sources`, `load_odds_fdco` | Pull versioned data from MinIO and external odds source |
| Validation & Prep | `validate_raw`, `export_metadata`, `preprocessing`, `validate_finished`, `validate_future` | Clean data, extract lookups, GE contract checks |
| Features | `generate_features_meta`, `feature_engineering`, `validate_features` | ELO + rolling stats, feature metadata, GE feature check |
| Training | `split_data`, `classification_models`, `tune_xgb`, `tune_logreg`, `tune_hgb`, `select_model`, `final_train` | Time-based split, Optuna tuning, model selection, calibrated final train |
| Registry | `register_model`, `promote_model` | Register to MLflow, auto-gate `candidate` alias |
| Serving & Analysis | `batch_inference`, `analysis`, `monitor_drift` | Pre-compute predictions, error/ROI analysis, Evidently drift |

Run `dvc dag` to print the live dependency graph.

DAG diagram

flowchart TD
    subgraph ing ["Ingestion"]
        SRC([load_data_from_sources])
        ODD([load_odds_fdco])
    end

    subgraph prep ["Validation & Prep"]
        VR[validate_raw]
        EM[export_metadata]
        PP[preprocessing]
        VFI[validate_finished]
        VFU[validate_future]
    end

    subgraph feat ["Features"]
        GFM[generate_features_meta]
        FE[feature_engineering]
        VFF[validate_features]
    end

    subgraph train ["Training"]
        SD[split_data]
        CM[classification_models]
        TX[tune_xgb]
        TH[tune_hgb]
        TL[tune_logreg ⚠️]
        SEL[select_model]
        FT[final_train]
    end

    subgraph reg ["Registry"]
        RM[register_model]
        PM[promote_model]
    end

    subgraph srv ["Serving & Analysis"]
        BI[batch_inference]
        AN[analysis]
        MD[monitor_drift]
    end

    SRC --> VR --> PP
    SRC --> EM
    PP --> VFI & VFU
    PP --> FE
    GFM --> VFF
    FE --> VFF --> SD
    SD --> CM & TX & TH & TL
    TX & TH & TL --> SEL
    SEL & CM --> FT
    FT --> RM --> PM
    PP & FE --> BI
    BI --> AN & MD

    classDef validate fill:#e8f4e8,stroke:#4a7c4a,color:#1a3a1a
    classDef ml fill:#e8e8f4,stroke:#4a4a7c,color:#1a1a3a
    classDef disabled fill:#f4f0e8,stroke:#9a8060,color:#6a5030,stroke-dasharray:4 4
    classDef registry fill:#f0e8f4,stroke:#6a4a7c,color:#3a1a4a

    class VR,VFI,VFU,VFF validate
    class CM,TX,TH,SEL,FT ml
    class TL disabled
    class RM,PM registry

⚠️ `tune_logreg` remains in the DAG, but the current production config sets `tuning_logreg.enabled: false`, so the stage only writes a placeholder output and `select_model` leaves it out of the comparison. See Production configuration below.
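A minimal sketch of how a placeholder candidate might be skipped during selection. The record layout (`enabled`, `cv_logloss`) and the `pick_best` helper are hypothetical illustrations, not the project's actual schema, and the numbers are made up.

```python
from typing import Any, Mapping, Optional


def pick_best(candidates: Mapping[str, Mapping[str, Any]]) -> Optional[str]:
    """Return the enabled candidate with the lowest CV log-loss.

    Records flagged `enabled: False` (placeholder outputs) are ignored.
    Returns None when no candidate is eligible.
    """
    eligible = {
        name: float(rec["cv_logloss"])
        for name, rec in candidates.items()
        if rec.get("enabled", True) and rec.get("cv_logloss") is not None
    }
    if not eligible:
        return None
    # Lower log-loss is better, so take the minimum.
    return min(eligible, key=eligible.get)


# Hypothetical tuner outputs; values are illustrative only.
results = {
    "xgb": {"enabled": True, "cv_logloss": 0.98},
    "hgb": {"enabled": True, "cv_logloss": 0.99},
    "logreg": {"enabled": False, "cv_logloss": None},  # placeholder output
}
winner = pick_best(results)
```

Keeping the disabled stage in the graph while filtering it at selection time means re-enabling it is a one-line parameter change rather than a DAG edit.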

Stage reference

Ingestion & validation

| Stage | Role | Key outputs |
| --- | --- | --- |
| `load_data_from_sources` | Pull match data from MinIO (WhoScored.com scrape) | `data/raw/match.parquet`, `data/raw/match_raw.parquet` |
| `load_odds_fdco` | Download football-data.co.uk historical odds | `data/raw/odds_fdco.parquet` |
| `validate_raw` | Great Expectations check on raw match data | `data/evaluation/ge_raw.json` |
| `export_metadata` | Extract team/tournament/region/season lookup maps | `data/metadata/*.json` (6 lookup files) |
| `preprocessing` | Clean matches, clip score outliers, partition into finished vs future | `data/interim/finished.parquet`, `data/interim/future.parquet` |
| `validate_finished` | GE check on finished-match data | `data/evaluation/ge_finished.json` |
| `validate_future` | GE check on upcoming-match data | `data/evaluation/ge_future.json` |

Feature engineering

| Stage | Role | Key outputs |
| --- | --- | --- |
| `generate_features_meta` | Build feature column metadata from params (no data dependency) | `data/features/features_meta.parquet` |
| `feature_engineering` | Compute ELO ratings + rolling stats on finished matches | `data/features/features.parquet` |
| `validate_features` | GE check on feature matrix | `data/evaluation/ge_features.json` |
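For readers unfamiliar with ELO, the sketch below shows the textbook rating update for a single match. It is not the project's implementation: the K-factor of 20, the 400-point scale, and the absence of a home-advantage term are assumptions chosen for illustration.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard ELO expectation for side A against side B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    home: float, away: float, home_score: float, k: float = 20.0
) -> Tuple[float, float]:
    """Return post-match (home, away) ratings.

    home_score is 1.0 for a home win, 0.5 for a draw, 0.0 for a home loss.
    k=20.0 is an assumed K-factor, not a value taken from params.yaml.
    """
    delta = k * (home_score - expected_score(home, away))
    # Zero-sum update: what one side gains, the other loses.
    return home + delta, away - delta


new_home, new_away = update_elo(1500.0, 1500.0, home_score=1.0)
```

When ratings are used as model features, the value attached to a match should be the rating before that match is played, otherwise the outcome leaks into its own predictor.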

Training path

| Stage | Role | Key outputs |
| --- | --- | --- |
| `split_data` | Time-based train/test split (test ≥ 2024-01-01) + walk-forward CV folds (2016–2024) | `data/processed/dataset.parquet`, `data/splits/{train,test,folds}.parquet` |
| `classification_models` | Baseline exploration: runs multiple classifier types at 1% data fraction; logs to MLflow | `data/models/run_id.json` |
| `tune_xgb` | Optuna search over XGBoost hyperparameters using walk-forward CV (10% data, 20 trials) | `data/models/xgb_best_params.json` |
| `tune_hgb` | Optuna search over HistGradientBoosting (10% data, 20 trials) | `data/models/hgb_best_params.json` |
| `tune_logreg` | Optuna search over LogisticRegression; disabled in production (`tuning_logreg.enabled: false`) | `data/models/logreg_best_params.json` (placeholder) |
| `select_model` | Compare tuned candidates by CV log-loss; writes winner to `best_model.json` | `data/models/best_model.json` |
| `final_train` | Train winner on full train set; hold out test set for one-shot evaluation; apply isotonic calibration | MLflow run; `data/models/final_run_id.json` |
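The time-based split and walk-forward folds performed by `split_data` can be pictured with the sketch below. It assumes an expanding training window (all matches before the validation year) and treats the end year as exclusive, as the `params.yaml` comment suggests; the function names are hypothetical and the real stage may differ in both respects.

```python
from datetime import date
from typing import List, Sequence, Tuple


def holdout_split(
    match_dates: Sequence[date], test_start: date
) -> Tuple[List[int], List[int]]:
    """Rows on or after test_start go to the test set, the rest to train."""
    train_idx: List[int] = []
    test_idx: List[int] = []
    for i, d in enumerate(match_dates):
        (test_idx if d >= test_start else train_idx).append(i)
    return train_idx, test_idx


def walk_forward_folds(
    match_dates: Sequence[date], start_year: int, end_year: int
) -> List[Tuple[int, List[int], List[int]]]:
    """One fold per validation year in [start_year, end_year).

    Each fold trains on everything strictly before the validation year
    (expanding window, an assumption) and validates on that year.
    Years with no training or no validation rows are skipped.
    """
    folds = []
    for year in range(start_year, end_year):
        train = [i for i, d in enumerate(match_dates) if d.year < year]
        valid = [i for i, d in enumerate(match_dates) if d.year == year]
        if train and valid:
            folds.append((year, train, valid))
    return folds


dates = [date(2015, 5, 1), date(2016, 5, 1), date(2017, 5, 1), date(2024, 2, 1)]
train_idx, test_idx = holdout_split(dates, date(2024, 1, 1))
folds = walk_forward_folds(dates, start_year=2016, end_year=2024)
```

Because every fold validates on data later than anything it trained on, CV scores reflect the same forward-in-time setting the model faces in production.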

Registry

| Stage | Role | Key outputs |
| --- | --- | --- |
| `register_model` | Create/update MLflow Registry entry; assign initial `smoke` alias | `data/models/registered_model.json` |
| `promote_model` | Quality gate: promote to `candidate` alias if `final.logloss ≤ current_candidate + 0.002` | `data/models/promoted_model.json` |

Promotion to the `champion` alias is manual-only; see Model Registry.
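The `promote_model` gate reduces to a single comparison. The sketch below is an illustration under one assumption not stated above: when no `candidate` exists yet, the new model is promoted. `should_promote` is a hypothetical name.

```python
from typing import Optional


def should_promote(
    new_logloss: float,
    candidate_logloss: Optional[float],
    tolerance: float = 0.002,
) -> bool:
    """Apply the rule final.logloss <= current_candidate + tolerance.

    A new model may be up to `tolerance` worse than the current candidate
    and still be promoted, which avoids blocking on metric noise.
    """
    if candidate_logloss is None:
        # Assumption: with no existing candidate there is nothing to beat.
        return True
    return new_logloss <= candidate_logloss + tolerance
```

For example, against a current candidate at 0.980, a new model at 0.981 passes the gate, while one at 0.983 does not.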

Serving & analysis

| Stage | Role | Key outputs |
| --- | --- | --- |
| `batch_inference` | Load `champion` model; compute features for upcoming + recent matches; upload to MinIO | `data/predictions/match_features.parquet`, `data/predictions/predictions.parquet` |
| `analysis` | Error analysis (per-segment log-loss, calibration) + flat-stake ROI simulation against `odds_fdco` | `reports/error_analysis/`, `reports/roi_analysis/`, `data/analysis/` |
| `monitor_drift` | Evidently dataset drift report on `match_features.parquet`; writes Prometheus textfile | `reports/drift/latest.json`, `reports/drift/metrics.prom` |
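The flat-stake ROI simulation in `analysis` follows simple arithmetic: stake the same amount on every selected bet and divide net profit by the total staked. The sketch below assumes decimal odds and a hypothetical `(odds, won)` input shape; how the real stage selects which bets to place is not covered here.

```python
from typing import Iterable, Tuple


def flat_stake_roi(bets: Iterable[Tuple[float, bool]], stake: float = 1.0) -> float:
    """Return ROI = (total returned - total staked) / total staked.

    Each bet is (decimal_odds, won). A winning bet returns stake * odds
    (stake included); a losing bet returns nothing.
    """
    total_staked = 0.0
    total_returned = 0.0
    for odds, won in bets:
        total_staked += stake
        if won:
            total_returned += stake * odds
    if total_staked == 0:
        return 0.0  # no bets placed
    return (total_returned - total_staked) / total_staked


# Illustrative bets only: two wins at 2.0 and 1.5, two losses.
roi = flat_stake_roi([(2.0, True), (3.0, False), (1.5, True), (4.0, False)])
```

Here 4 units are staked and 3.5 returned, so the ROI is -12.5%.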

Production configuration

All stage behaviour is driven by params.yaml (generated by Hydra from conf/config.yaml). DVC tracks per-stage param consumption, so dvc params diff shows exactly what changed.

# params.yaml — relevant ML sections

temporal:
  test_start: "2024-01-01"     # holdout lower bound
  folds_start_year: 2016       # first CV fold validation year
  folds_end_year: 2024         # exclusive upper bound

classification:
  target_col: outcome_1x2
  frac: 0.01                   # 1% sample — exploratory baseline only
  side: [home, diff, away]
  cat_cols: [sex]
  include_elo: false           # ELO excluded from the baseline exploration stage
  include_rest_days: false
  include_h2h: false

features_selected:             # feature set used for tuning + final_train
  side: [home, diff, away]
  window_sizes: [1, 2, 3, 5, 7, 10, 12]
  include_elo: true            # ELO enabled for tuned models
  include_rest_days: false     # rest-days features implemented but not currently active
  include_h2h: false           # H2H features implemented but not currently active
  class_weight: {0: 1.0, 1: 1.25, 2: 1.0}

tuning:
  n_trials: 20
  frac: 0.1                    # 10% sample for Optuna walk-forward CV

tuning_logreg:
  enabled: false               # LogReg tuning disabled; XGB + HGB are the active candidates

select_model:
  experiment_name: matches_clf_v1.0_select

final_train:
  calibration:
    enabled: true
    method: isotonic
    calib_frac: 0.15           # 15% of train set reserved for calibration
    min_calib_samples: 100

register_model:
  model_name: soccer-match-outcome
  model_stage: smoke           # initial alias; never serving-ready on its own

promote_model:
  metric: final.logloss
  tolerance: 0.002
  candidate_alias: candidate

inference:
  model_name: soccer-match-outcome
  model_stage: champion        # batch_inference always loads the champion alias

Feature nuance: include_rest_days and include_h2h are false in the production config. Both feature families are fully implemented in src/features/stats_matches.py and can be re-enabled via params.yaml. They are excluded from the current production model because ablation showed marginal or noisy contribution at the current data scale.

Experiment overlays (Hydra)

Pipeline scale is controlled by Hydra overlays in conf/experiment/. They override only the keys they need — the rest falls through to conf/config.yaml.

| Overlay | Trigger | Key overrides | Purpose |
| --- | --- | --- | --- |
| `experiment=smoke` | CI `train:smoke` job | `frac=0.001`, `n_trials=2`, `model_stage=ci-smoke` | Wiring check only: confirms all 22 stages execute without error; toy model never used by serving |
| `experiment=test` | CI `train:test` job | `tuning.frac=0.2`, `n_trials=20` | Reduced-scale real-data run; promotes to `smoke` → `candidate` |
| (base) | `dvc repro` default | Full `params.yaml` | Production run |

Usage:

dvc exp run -S 'experiment=smoke'   # CI wiring check
dvc exp run -S 'experiment=test'    # reduced-scale training
dvc repro                           # full production run
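The fall-through behaviour of an overlay can be pictured as a recursive dictionary merge. The sketch below uses plain dicts rather than Hydra itself, and the overlay shown is a trimmed, illustrative subset of keys rather than the contents of the real overlay files in `conf/experiment/`.

```python
from copy import deepcopy
from typing import Any, Dict


def merge_overlay(base: Dict[str, Any], overlay: Dict[str, Any]) -> Dict[str, Any]:
    """Return base with overlay applied; keys absent from overlay fall through."""
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse so sibling keys inside a nested section are preserved.
            merged[key] = merge_overlay(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


base = {
    "tuning": {"n_trials": 20, "frac": 0.1},
    "register_model": {"model_name": "soccer-match-outcome", "model_stage": "smoke"},
}
# Hypothetical overlay: only the keys it needs to change.
smoke_overlay = {
    "tuning": {"n_trials": 2},
    "register_model": {"model_stage": "ci-smoke"},
}
effective = merge_overlay(base, smoke_overlay)
```

In this example `tuning.frac` and `register_model.model_name` keep their base values because the overlay never mentions them.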

Calibration

final_train splits the training set: 85% trains the model, 15% calibrates it.

train_ids
  ├─ 85% → model fit  (XGBoost / HGB with tuned params)
  └─ 15% → isotonic regression calibration (min 100 samples required)

Calibration is applied post-fit via CalibratedClassifierCV(method="isotonic"). The calibrated model is logged to MLflow as a single artifact — the serving layer loads the calibrated wrapper, not the raw classifier.

If the calibration split contains fewer than `min_calib_samples` rows (100 in the production config), calibration is skipped and a warning is logged. This guard mainly matters for very small smoke runs.
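The 85/15 split and the minimum-sample guard can be sketched as follows. Two assumptions are made for illustration: `train_ids` are already in chronological order so the tail serves as the calibration slice, and a skipped calibration means the model is fit on all training rows. `split_for_calibration` is a hypothetical helper, not the project's function.

```python
from typing import List, Optional, Sequence, Tuple


def split_for_calibration(
    train_ids: Sequence[int],
    calib_frac: float = 0.15,
    min_calib_samples: int = 100,
) -> Tuple[List[int], Optional[List[int]]]:
    """Return (fit_ids, calib_ids); calib_ids is None when calibration is skipped."""
    n_calib = int(round(len(train_ids) * calib_frac))
    if n_calib < min_calib_samples:
        # Too few rows for a stable isotonic fit: skip calibration entirely.
        return list(train_ids), None
    cut = len(train_ids) - n_calib
    return list(train_ids[:cut]), list(train_ids[cut:])


fit_ids, calib_ids = split_for_calibration(list(range(1000)))
small_fit, small_calib = split_for_calibration(list(range(200)))
```

With 1,000 training rows the split is 850/150; with 200 rows the 15% slice would hold only 30 rows, so the guard triggers and no calibration set is produced.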

Reproducibility contract

Given the same:

- git commit
- DVC dataset version (`dvc pull`)
- `params.yaml`

dvc repro produces identical results. This is enforced by DVC content-addressing, explicit random seed management in training code, and the Hydra config snapshot logged to each MLflow run.

dvc params diff and dvc status will show any deviation from the last locked state.
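One way to picture the contract is as a fingerprint over the three inputs: if the fingerprint matches, the run should match. The sketch below is purely illustrative and is not how DVC tracks state internally; `run_fingerprint` and its arguments are hypothetical.

```python
import hashlib
import json
from typing import Any, Mapping


def run_fingerprint(
    git_commit: str, data_version: str, params: Mapping[str, Any]
) -> str:
    """Hash the three contract inputs into one stable hex digest."""
    canonical = json.dumps(
        {"git_commit": git_commit, "data_version": data_version, "params": params},
        sort_keys=True,          # key order must not change the digest
        separators=(",", ":"),   # no whitespace variation
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Placeholder identifiers, for illustration only.
fp_a = run_fingerprint("abc123", "data-v1", {"tuning": {"n_trials": 20, "frac": 0.1}})
fp_b = run_fingerprint("abc123", "data-v1", {"tuning": {"frac": 0.1, "n_trials": 20}})
fp_c = run_fingerprint("abc123", "data-v1", {"tuning": {"n_trials": 2, "frac": 0.1}})
```

Here `fp_a` and `fp_b` are equal because only key order differs, while `fp_c` differs because a parameter value changed.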

CI integration

The `train:smoke` GitLab CI job runs the full 22-stage pipeline with `experiment=smoke` overrides (toy data, 2 Optuna trials) to verify graph wiring on every relevant push. The `train:test` job runs a reduced-scale real-data pipeline, registers the result under the `smoke` alias and, if it passes the quality gate, promotes it to `candidate`.

Contract tests in tests/contract/test_pipeline_contracts.py validate that stage input/output schemas match their declared contracts independent of a full DVC run.
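A schema contract check of this kind can be very small. The sketch below compares column-to-dtype mappings and is only an illustration of the idea: the `check_contract` helper, the `match_id` column, and the dtypes shown are hypothetical, and the real tests may use a different mechanism.

```python
from typing import Dict, List


def check_contract(actual: Dict[str, str], expected: Dict[str, str]) -> List[str]:
    """Return a list of violations; an empty list means the contract holds.

    Extra columns in `actual` are tolerated; missing or mistyped ones are not.
    """
    problems: List[str] = []
    for column, dtype in expected.items():
        if column not in actual:
            problems.append(f"missing column: {column}")
        elif actual[column] != dtype:
            problems.append(f"{column}: expected {dtype}, got {actual[column]}")
    return problems


# Hypothetical contract and observed schema.
expected_schema = {"match_id": "int64", "outcome_1x2": "int8"}
observed_schema = {"match_id": "int64", "outcome_1x2": "float64", "extra": "object"}
violations = check_contract(observed_schema, expected_schema)
```

Because the check needs only schemas, it can run in seconds without executing any DVC stage.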

Running the pipeline

# Full pipeline (data ingestion → inference → analysis)
dvc repro

# Training path only (assumes data artifacts are current)
dvc repro split_data tune_xgb tune_hgb select_model final_train register_model promote_model

# Inference + analysis only
dvc repro batch_inference analysis monitor_drift

# Inspect param changes since last run
dvc params diff

# View the live dependency graph
dvc dag

Features — feature families, leakage prevention, offline/online parity
Tuning — Optuna walk-forward CV details
MLflow — experiment tracking and run tagging
Model Registry — alias lifecycle, promotion policy, rollback
Validation Strategy — Great Expectations suites at each stage
Data: Versioning — DVC artifact versioning
Architecture: Data & ML Flow