
Training Pipeline (DVC)

Purpose

Document the end-to-end ML pipeline — from raw data ingestion through feature engineering, model training, registry promotion, and batch inference — as it is actually defined in dvc.yaml.

Data sources and ETL internals are described in Data. Model registry lifecycle is in Model Registry.


Pipeline at a glance

The pipeline is defined in dvc.yaml as 22 DVC stages organised into six logical groups:

| Group | Stages | Purpose |
| --- | --- | --- |
| Ingestion | load_data_from_sources, load_odds_fdco | Pull versioned data from MinIO and external odds source |
| Validation & Prep | validate_raw, export_metadata, preprocessing, validate_finished, validate_future | Clean data, extract lookups, GE contract checks |
| Features | generate_features_meta, feature_engineering, validate_features | ELO + rolling stats, feature metadata, GE feature check |
| Training | split_data, classification_models, tune_xgb, tune_logreg, tune_hgb, select_model, final_train | Time-based split, Optuna tuning, model selection, calibrated final train |
| Registry | register_model, promote_model | Register to MLflow, auto-gate candidate alias |
| Serving & Analysis | batch_inference, analysis, monitor_drift | Pre-compute predictions, error/ROI analysis, Evidently drift |

Run dvc dag for the live dependency graph.


DAG diagram

flowchart TD
    subgraph ing ["Ingestion"]
        SRC([load_data_from_sources])
        ODD([load_odds_fdco])
    end

    subgraph prep ["Validation & Prep"]
        VR[validate_raw]
        EM[export_metadata]
        PP[preprocessing]
        VFI[validate_finished]
        VFU[validate_future]
    end

    subgraph feat ["Features"]
        GFM[generate_features_meta]
        FE[feature_engineering]
        VFF[validate_features]
    end

    subgraph train ["Training"]
        SD[split_data]
        CM[classification_models]
        TX[tune_xgb]
        TH[tune_hgb]
        TL[tune_logreg ⚠️]
        SEL[select_model]
        FT[final_train]
    end

    subgraph reg ["Registry"]
        RM[register_model]
        PM[promote_model]
    end

    subgraph srv ["Serving & Analysis"]
        BI[batch_inference]
        AN[analysis]
        MD[monitor_drift]
    end

    SRC --> VR --> PP
    SRC --> EM
    PP --> VFI & VFU
    PP --> FE
    GFM --> VFF
    FE --> VFF --> SD
    SD --> CM & TX & TH & TL
    TX & TH & TL --> SEL
    SEL & CM --> FT
    FT --> RM --> PM
    PP & FE --> BI
    BI --> AN & MD

    classDef validate fill:#e8f4e8,stroke:#4a7c4a,color:#1a3a1a
    classDef ml fill:#e8e8f4,stroke:#4a4a7c,color:#1a1a3a
    classDef disabled fill:#f4f0e8,stroke:#9a8060,color:#6a5030,stroke-dasharray:4 4
    classDef registry fill:#f0e8f4,stroke:#6a4a7c,color:#3a1a4a

    class VR,VFI,VFU,VFF validate
    class CM,TX,TH,SEL,FT ml
    class TL disabled
    class RM,PM registry

⚠️ tune_logreg is in the DAG but runs with tuning_logreg.enabled: false in the current production config — it produces a placeholder output and is excluded from select_model. See Production configuration below.
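
The exclusion of a disabled tuner from select_model can be pictured with a small sketch. It assumes each tuning stage reports an `enabled` flag and a CV log-loss; those field names, the function name, and the numbers are made up for illustration and are not taken from the project's JSON files.

```python
def select_best(candidates):
    """Pick the candidate with the lowest CV log-loss.

    candidates: mapping of model name -> dict with 'enabled' and 'cv_logloss'.
    Disabled entries (placeholders) are ignored. Field names are hypothetical.
    """
    active = {
        name: info["cv_logloss"]
        for name, info in candidates.items()
        if info.get("enabled", True) and info.get("cv_logloss") is not None
    }
    if not active:
        raise ValueError("no active candidates to select from")
    # Lower log-loss is better.
    return min(active, key=active.get)


# Illustrative values only; logreg mimics the disabled placeholder.
winner = select_best({
    "xgb": {"enabled": True, "cv_logloss": 0.982},
    "hgb": {"enabled": True, "cv_logloss": 0.979},
    "logreg": {"enabled": False, "cv_logloss": None},
})
```

With the placeholder filtered out, only XGB and HGB compete, which matches the production configuration described on this page.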


Stage reference

Ingestion & validation

| Stage | Role | Key outputs |
| --- | --- | --- |
| load_data_from_sources | Pull match data from MinIO (WhoScored.com scrape) | data/raw/match.parquet, data/raw/match_raw.parquet |
| load_odds_fdco | Download football-data.co.uk historical odds | data/raw/odds_fdco.parquet |
| validate_raw | Great Expectations check on raw match data | data/evaluation/ge_raw.json |
| export_metadata | Extract team/tournament/region/season lookup maps | data/metadata/*.json (6 lookup files) |
| preprocessing | Clean matches, clip score outliers, partition into finished vs future | data/interim/finished.parquet, data/interim/future.parquet |
| validate_finished | GE check on finished-match data | data/evaluation/ge_finished.json |
| validate_future | GE check on upcoming-match data | data/evaluation/ge_future.json |

Feature engineering

| Stage | Role | Key outputs |
| --- | --- | --- |
| generate_features_meta | Build feature column metadata from params (no data dependency) | data/features/features_meta.parquet |
| feature_engineering | Compute ELO ratings + rolling stats on finished matches | data/features/features.parquet |
| validate_features | GE check on feature matrix | data/evaluation/ge_features.json |
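
This page does not spell out the rating update used by feature_engineering. As a rough illustration of what an ELO feature involves, the standard Elo update is sketched below. The K-factor of 20, the function names, and the absence of a home-advantage term are assumptions, not values taken from the pipeline.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(home: float, away: float, home_result: float, k: float = 20.0):
    """Return updated (home, away) ratings after one match.

    home_result is 1.0 for a home win, 0.5 for a draw, 0.0 for an away win.
    k=20 is an assumed K-factor, not the project's value.
    """
    exp_home = expected_score(home, away)
    delta = k * (home_result - exp_home)
    # Zero-sum update: what the home side gains, the away side loses.
    return home + delta, away - delta


# Two equally rated teams, home side wins: ratings move by k/2 each way.
new_home, new_away = elo_update(1500.0, 1500.0, home_result=1.0)
```

When ratings are used as model inputs, the value attached to a match would normally be the rating before that match is played, so that the result does not leak into its own features.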

Training path

| Stage | Role | Key outputs |
| --- | --- | --- |
| split_data | Time-based train/test split (test ≥ 2024-01-01) + walk-forward CV folds (2016–2024) | data/processed/dataset.parquet, data/splits/{train,test,folds}.parquet |
| classification_models | Baseline exploration: runs multiple classifier types at 1% data fraction; logs to MLflow | data/models/run_id.json |
| tune_xgb | Optuna search over XGBoost hyperparameters using walk-forward CV (10% data, 20 trials) | data/models/xgb_best_params.json |
| tune_hgb | Optuna search over HistGradientBoosting (10% data, 20 trials) | data/models/hgb_best_params.json |
| tune_logreg | Optuna search over LogisticRegression; disabled in production (tuning_logreg.enabled: false) | data/models/logreg_best_params.json (placeholder) |
| select_model | Compare tuned candidates by CV log-loss; writes winner to best_model.json | data/models/best_model.json |
| final_train | Train winner on full train set; hold out test set for one-shot evaluation; apply isotonic calibration | MLflow run; data/models/final_run_id.json |
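
The split_data row combines a fixed holdout date with walk-forward CV folds. A minimal sketch of that idea follows, assuming one fold per validation year from the start year up to (but excluding) the end year, with each fold training on all earlier matches. The function name, the (match_id, date) input shape, and the expanding-window choice are assumptions; the real stage may build its folds differently.

```python
from datetime import date


def walk_forward_folds(matches, start_year, end_year, test_start):
    """Yield (val_year, train_ids, val_ids) for an expanding-window scheme.

    matches: iterable of (match_id, match_date) pairs.
    Matches on or after test_start are held out and never enter a fold.
    """
    pool = [(mid, d) for mid, d in matches if d < test_start]
    for year in range(start_year, end_year):  # end_year is exclusive
        train_ids = [mid for mid, d in pool if d.year < year]
        val_ids = [mid for mid, d in pool if d.year == year]
        if train_ids and val_ids:
            yield year, train_ids, val_ids


# Toy data: one match per year from 2014 to 2024.
toy = [(i, date(2014 + i, 6, 1)) for i in range(11)]
folds = list(walk_forward_folds(toy, 2016, 2024, date(2024, 1, 1)))
```

Each fold validates only on matches that come after everything it trained on, which is the property that makes the CV log-loss comparable to the one-shot holdout evaluation.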

Registry

| Stage | Role | Key outputs |
| --- | --- | --- |
| register_model | Create/update MLflow Registry entry; assign initial smoke alias | data/models/registered_model.json |
| promote_model | Quality gate: promote to candidate alias if final.logloss ≤ current_candidate + 0.002 | data/models/promoted_model.json |

champion promotion is manual-only — see Model Registry.
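
The promote_model gate reduces to a single comparison. The sketch below restates it in Python; the function name is hypothetical, and promoting when no candidate exists yet is an assumption of this sketch rather than documented behaviour.

```python
def should_promote(new_logloss, candidate_logloss, tolerance=0.002):
    """Return True when the new model may take the candidate alias.

    candidate_logloss is None when no candidate exists yet; promoting in
    that case is an assumption of this sketch.
    """
    if candidate_logloss is None:
        return True
    # Lower log-loss is better; tolerance permits a small regression.
    return new_logloss <= candidate_logloss + tolerance


slightly_worse = should_promote(1.001, 1.000)  # within tolerance
clearly_worse = should_promote(1.010, 1.000)   # outside tolerance
```

The tolerance means a new model does not have to beat the current candidate outright; it only has to avoid being worse by more than 0.002 log-loss.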

Serving & analysis

| Stage | Role | Key outputs |
| --- | --- | --- |
| batch_inference | Load champion model; compute features for upcoming + recent matches; upload to MinIO | data/predictions/match_features.parquet, data/predictions/predictions.parquet |
| analysis | Error analysis (per-segment log-loss, calibration) + flat-stake ROI simulation against odds_fdco | reports/error_analysis/, reports/roi_analysis/, data/analysis/ |
| monitor_drift | Evidently dataset drift report on match_features.parquet; writes Prometheus textfile | reports/drift/latest.json, reports/drift/metrics.prom |
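
For the flat-stake ROI simulation mentioned in the analysis row, the arithmetic can be sketched as follows. It assumes decimal odds and a unit stake on every selected bet; the function name and the example bets are illustrative, and how the real stage chooses which outcomes to bet on is not covered here.

```python
def flat_stake_roi(bets, stake=1.0):
    """ROI of staking the same amount on every bet.

    bets: iterable of (decimal_odds, won) pairs.
    Returns (total_return - total_staked) / total_staked, or 0.0 with no bets.
    """
    bets = list(bets)
    if not bets:
        return 0.0
    staked = stake * len(bets)
    # A winning bet returns stake * odds; a losing bet returns nothing.
    returned = sum(stake * odds for odds, won in bets if won)
    return (returned - staked) / staked


# Illustrative bets only: two winners at 2.10 and 1.80, two losers.
roi = flat_stake_roi([(2.10, True), (3.40, False), (1.80, True), (2.50, False)])
```

Here 4 units are staked and 3.90 come back, so the ROI is -2.5%. Because the stake is constant, the result does not depend on the stake size.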

Production configuration

All stage behaviour is driven by params.yaml (generated by Hydra from conf/config.yaml). DVC tracks per-stage param consumption, so dvc params diff shows exactly what changed.

# params.yaml — relevant ML sections

temporal:
  test_start: "2024-01-01"     # holdout lower bound
  folds_start_year: 2016       # first CV fold validation year
  folds_end_year: 2024         # exclusive upper bound

classification:
  target_col: outcome_1x2
  frac: 0.01                   # 1% sample — exploratory baseline only
  side: [home, diff, away]
  cat_cols: [sex]
  include_elo: false           # ELO excluded from the baseline exploration stage
  include_rest_days: false
  include_h2h: false

features_selected:             # feature set used for tuning + final_train
  side: [home, diff, away]
  window_sizes: [1, 2, 3, 5, 7, 10, 12]
  include_elo: true            # ELO enabled for tuned models
  include_rest_days: false     # rest-days features implemented but not currently active
  include_h2h: false           # H2H features implemented but not currently active
  class_weight: {0: 1.0, 1: 1.25, 2: 1.0}

tuning:
  n_trials: 20
  frac: 0.1                    # 10% sample for Optuna walk-forward CV

tuning_logreg:
  enabled: false               # LogReg tuning disabled; XGB + HGB are the active candidates

select_model:
  experiment_name: matches_clf_v1.0_select

final_train:
  calibration:
    enabled: true
    method: isotonic
    calib_frac: 0.15           # 15% of train set reserved for calibration
    min_calib_samples: 100

register_model:
  model_name: soccer-match-outcome
  model_stage: smoke           # initial alias; never serving-ready on its own

promote_model:
  metric: final.logloss
  tolerance: 0.002
  candidate_alias: candidate

inference:
  model_name: soccer-match-outcome
  model_stage: champion        # batch_inference always loads the champion alias

Feature nuance: include_rest_days and include_h2h are false in the production config. Both feature families are fully implemented in src/features/stats_matches.py and can be re-enabled via params.yaml. They are excluded from the current production model because ablation showed marginal or noisy contribution at the current data scale.


Experiment overlays (Hydra)

Pipeline scale is controlled by Hydra overlays in conf/experiment/. They override only the keys they need — the rest falls through to conf/config.yaml.

| Overlay | Trigger | Key overrides | Purpose |
| --- | --- | --- | --- |
| experiment=smoke | CI train:smoke job | frac=0.001, n_trials=2, model_stage=ci-smoke | Wiring check only: confirms all 22 stages execute without error; toy model never used by serving |
| experiment=test | CI train:test job | tuning.frac=0.2, n_trials=20 | Reduced-scale real-data run; promotes to smoke and, optionally, candidate |
| (base) | dvc repro default | Full params.yaml | Production run |

Usage:

dvc exp run -S 'experiment=smoke'   # CI wiring check
dvc exp run -S 'experiment=test'    # reduced-scale training
dvc repro                           # full production run
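
The "override only what you need" behaviour of an overlay can be illustrated with a plain-Python deep merge. This is a sketch of the semantics only, not Hydra's implementation; the function name is hypothetical, and placing the smoke values under the `tuning` key is an assumption made for the example.

```python
import copy


def apply_overlay(base, overlay):
    """Return a new dict: overlay keys win, untouched keys fall through."""
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse so sibling keys in nested sections survive.
            merged[key] = apply_overlay(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


base = {
    "tuning": {"n_trials": 20, "frac": 0.1},
    "promote_model": {"tolerance": 0.002},
}
smoke = {"tuning": {"n_trials": 2, "frac": 0.001}}
params = apply_overlay(base, smoke)
```

After the merge, `tuning` carries the smoke values while `promote_model` is inherited unchanged from the base, and the base dict itself is not mutated.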


Calibration

final_train splits the training set: 85% trains the model, 15% calibrates it.

train_ids
  ├─ 85% → model fit  (XGBoost / HGB with tuned params)
  └─ 15% → isotonic regression calibration (min 100 samples required)

Calibration is applied post-fit via CalibratedClassifierCV(method="isotonic"). The calibrated model is logged to MLflow as a single artifact — the serving layer loads the calibrated wrapper, not the raw classifier.

If min_calib_samples is not reached, calibration is skipped and a warning is logged. This guard is relevant for very small smoke runs.
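
The split and its guard can be sketched without any ML library. The helper below is hypothetical: it assumes the ids arrive in the desired order and takes the tail as the calibration slice, and it assumes that the whole train set is used for fitting when calibration is skipped. How final_train actually selects the calibration rows is not described on this page.

```python
def split_for_calibration(train_ids, calib_frac=0.15, min_calib_samples=100):
    """Split ids into (fit_ids, calib_ids); calib_ids is None when skipped.

    Assumes train_ids is already ordered; the tail becomes the calibration
    slice. Fitting on the full set when skipping is an assumption.
    """
    train_ids = list(train_ids)
    n_calib = round(len(train_ids) * calib_frac)
    if n_calib == 0 or n_calib < min_calib_samples:
        # Too few samples to calibrate reliably: skip calibration.
        return train_ids, None
    return train_ids[:-n_calib], train_ids[-n_calib:]


fit_ids, calib_ids = split_for_calibration(range(1000))      # 850 / 150
small_fit, small_calib = split_for_calibration(range(200))   # 30 < 100: skipped
```

With 200 training rows, the 15% slice would hold only 30 samples, so the guard triggers. This is the situation the smoke overlay is likely to hit.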


Reproducibility contract

Given the same:

- git commit,
- DVC dataset version (dvc pull),
- params.yaml,

dvc repro produces identical results. This is enforced by DVC content-addressing, explicit random seed management in training code, and the Hydra config snapshot logged to each MLflow run.

dvc params diff and dvc status will show any deviation from the last locked state.
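
One way to think about the contract is that the three inputs together form a single run identity. The helper below is a hypothetical illustration of that idea and is not part of DVC, which records dependency and parameter hashes in dvc.lock; the function name and the 16-character truncation are arbitrary choices.

```python
import hashlib
import json


def run_fingerprint(git_commit, data_version, params):
    """Hash the three reproducibility inputs into one short identifier.

    json.dumps(sort_keys=True) makes the result independent of dict order.
    """
    payload = json.dumps(
        {"git": git_commit, "data": data_version, "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# Same inputs in a different key order give the same fingerprint.
fp_a = run_fingerprint("abc123", "v1", {"tuning": {"n_trials": 20, "frac": 0.1}})
fp_b = run_fingerprint("abc123", "v1", {"tuning": {"frac": 0.1, "n_trials": 20}})
```

Two runs with equal fingerprints are expected to produce the same outputs; a changed parameter, commit, or dataset version yields a different fingerprint.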


CI integration

The train:smoke GitLab CI job runs the full 22-stage pipeline with experiment=smoke overrides (toy data, 2 Optuna trials) to verify graph wiring on every relevant push. The train:test job runs a reduced-scale real-data pipeline and promotes the result to the smoke and optionally candidate aliases.

Contract tests in tests/contract/test_pipeline_contracts.py validate that stage input/output schemas match their declared contracts independent of a full DVC run.
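
A schema contract check of this kind can be sketched in a few lines. The function, the column names, and the dtype strings below are illustrative assumptions; the actual tests in tests/contract/test_pipeline_contracts.py may be structured differently.

```python
def check_contract(actual_schema, contract):
    """Return a list of human-readable violations (empty means pass).

    Both arguments map column name -> dtype string.
    Extra columns in actual_schema are allowed in this sketch.
    """
    problems = []
    for column, expected in contract.items():
        if column not in actual_schema:
            problems.append(f"missing column: {column}")
        elif actual_schema[column] != expected:
            problems.append(
                f"{column}: expected {expected}, got {actual_schema[column]}"
            )
    return problems


# Hypothetical contract for a predictions output; names are illustrative.
contract = {"match_id": "int64", "p_home": "float64"}
violations = check_contract({"match_id": "int64", "p_home": "float32"}, contract)
```

Because the check only needs a column-to-dtype mapping, it can run against a small fixture file instead of a full DVC run, which keeps the contract tests fast.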


Running the pipeline

# Full pipeline (data ingestion → inference → analysis)
dvc repro

# Training path only (assumes data artifacts are current)
dvc repro split_data tune_xgb tune_hgb select_model final_train register_model promote_model

# Inference + analysis only
dvc repro batch_inference analysis monitor_drift

# Inspect param changes since last run
dvc params diff

# View the live dependency graph
dvc dag