Training Pipeline (DVC)¶
Purpose¶
Document the end-to-end ML pipeline — from raw data ingestion through feature engineering,
model training, registry promotion, and batch inference — as it is actually defined in dvc.yaml.
Data sources and ETL internals are described in Data. Model registry lifecycle is in Model Registry.
Pipeline at a glance¶
The pipeline is defined in dvc.yaml as 22 DVC stages organised into six logical groups:
| Group | Stages | Purpose |
|---|---|---|
| Ingestion | load_data_from_sources, load_odds_fdco | Pull versioned data from MinIO and external odds source |
| Validation & Prep | validate_raw, export_metadata, preprocessing, validate_finished, validate_future | Clean data, extract lookups, GE contract checks |
| Features | generate_features_meta, feature_engineering, validate_features | ELO + rolling stats, feature metadata, GE feature check |
| Training | split_data, classification_models, tune_xgb, tune_logreg, tune_hgb, select_model, final_train | Time-based split, Optuna tuning, model selection, calibrated final train |
| Registry | register_model, promote_model | Register to MLflow, auto-gate candidate alias |
| Serving & Analysis | batch_inference, analysis, monitor_drift | Pre-compute predictions, error/ROI analysis, Evidently drift |
Run `dvc dag` for the live dependency graph.
DAG diagram¶
flowchart TD
subgraph ing ["Ingestion"]
SRC([load_data_from_sources])
ODD([load_odds_fdco])
end
subgraph prep ["Validation & Prep"]
VR[validate_raw]
EM[export_metadata]
PP[preprocessing]
VFI[validate_finished]
VFU[validate_future]
end
subgraph feat ["Features"]
GFM[generate_features_meta]
FE[feature_engineering]
VFF[validate_features]
end
subgraph train ["Training"]
SD[split_data]
CM[classification_models]
TX[tune_xgb]
TH[tune_hgb]
TL[tune_logreg ⚠️]
SEL[select_model]
FT[final_train]
end
subgraph reg ["Registry"]
RM[register_model]
PM[promote_model]
end
subgraph srv ["Serving & Analysis"]
BI[batch_inference]
AN[analysis]
MD[monitor_drift]
end
SRC --> VR --> PP
SRC --> EM
PP --> VFI & VFU
PP --> FE
GFM --> VFF
FE --> VFF --> SD
SD --> CM & TX & TH & TL
TX & TH & TL --> SEL
SEL & CM --> FT
FT --> RM --> PM
PP & FE --> BI
BI --> AN & MD
classDef validate fill:#e8f4e8,stroke:#4a7c4a,color:#1a3a1a
classDef ml fill:#e8e8f4,stroke:#4a4a7c,color:#1a1a3a
classDef disabled fill:#f4f0e8,stroke:#9a8060,color:#6a5030,stroke-dasharray:4 4
classDef registry fill:#f0e8f4,stroke:#6a4a7c,color:#3a1a4a
class VR,VFI,VFU,VFF validate
class CM,TX,TH,SEL,FT ml
class TL disabled
class RM,PM registry
⚠️ `tune_logreg` is in the DAG but runs with `tuning_logreg.enabled: false` in the current production config: it produces a placeholder output and is excluded from `select_model`. See Production configuration below.
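The dependency structure in the diagram can also be checked programmatically. The sketch below transcribes the diagram's edges into a plain dictionary and derives a valid execution order with the standard-library `graphlib` module. It is an illustration of the graph as drawn here, not a reading of `dvc.yaml`; the helper names `execution_order` and `ancestors` are hypothetical.

```python
from graphlib import TopologicalSorter

# Edges transcribed from the flowchart: stage -> set of upstream stages.
UPSTREAM = {
    "load_data_from_sources": set(),
    "load_odds_fdco": set(),  # drawn without edges in the diagram
    "validate_raw": {"load_data_from_sources"},
    "export_metadata": {"load_data_from_sources"},
    "preprocessing": {"validate_raw"},
    "validate_finished": {"preprocessing"},
    "validate_future": {"preprocessing"},
    "generate_features_meta": set(),
    "feature_engineering": {"preprocessing"},
    "validate_features": {"generate_features_meta", "feature_engineering"},
    "split_data": {"validate_features"},
    "classification_models": {"split_data"},
    "tune_xgb": {"split_data"},
    "tune_hgb": {"split_data"},
    "tune_logreg": {"split_data"},
    "select_model": {"tune_xgb", "tune_hgb", "tune_logreg"},
    "final_train": {"select_model", "classification_models"},
    "register_model": {"final_train"},
    "promote_model": {"register_model"},
    "batch_inference": {"preprocessing", "feature_engineering"},
    "analysis": {"batch_inference"},
    "monitor_drift": {"batch_inference"},
}


def execution_order(upstream):
    """Return one valid order in which every stage follows its dependencies."""
    return list(TopologicalSorter(upstream).static_order())


def ancestors(stage, upstream):
    """Return every stage that must run before `stage` (transitively)."""
    seen = set()
    stack = [stage]
    while stack:
        for dep in upstream[stack.pop()]:
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen
```

Calling `ancestors("promote_model", UPSTREAM)` lists what a training-only run depends on; in the graph as drawn, `batch_inference` is not among them.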
Stage reference¶
Ingestion & validation¶
| Stage | Role | Key outputs |
|---|---|---|
| load_data_from_sources | Pull match data from MinIO (WhoScored.com scrape) | data/raw/match.parquet, data/raw/match_raw.parquet |
| load_odds_fdco | Download football-data.co.uk historical odds | data/raw/odds_fdco.parquet |
| validate_raw | Great Expectations check on raw match data | data/evaluation/ge_raw.json |
| export_metadata | Extract team/tournament/region/season lookup maps | data/metadata/*.json (6 lookup files) |
| preprocessing | Clean matches, clip score outliers, partition into finished vs future | data/interim/finished.parquet, data/interim/future.parquet |
| validate_finished | GE check on finished-match data | data/evaluation/ge_finished.json |
| validate_future | GE check on upcoming-match data | data/evaluation/ge_future.json |
Feature engineering¶
| Stage | Role | Key outputs |
|---|---|---|
| generate_features_meta | Build feature column metadata from params (no data dependency) | data/features/features_meta.parquet |
| feature_engineering | Compute ELO ratings + rolling stats on finished matches | data/features/features.parquet |
| validate_features | GE check on feature matrix | data/evaluation/ge_features.json |
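To make the ELO part of `feature_engineering` concrete, here is a minimal sketch of the standard ELO update, recording each team's rating before the match so that the feature never sees the result it is meant to predict. The K-factor of 20, the initial rating of 1500, the absence of a home-advantage term, and the function names are all assumptions for illustration; the project's actual parameters are not stated on this page.

```python
def expected_score(rating_a, rating_b):
    """Standard ELO expectation for side A against side B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(home, away, result, k=20.0):
    """Return updated (home, away) ratings.

    result is from the home side's view: 1.0 win, 0.5 draw, 0.0 loss.
    k=20 is an assumed K-factor, not a value taken from the project.
    """
    delta = k * (result - expected_score(home, away))
    return home + delta, away - delta


def pre_match_ratings(matches, initial=1500.0, k=20.0):
    """Walk matches in chronological order and emit pre-match ratings.

    matches: iterable of (home_team, away_team, result).
    Ratings are recorded BEFORE the update, so each row only reflects
    information available at kick-off.
    """
    ratings = {}
    rows = []
    for home_team, away_team, result in matches:
        home = ratings.get(home_team, initial)
        away = ratings.get(away_team, initial)
        rows.append((home_team, away_team, home, away))
        ratings[home_team], ratings[away_team] = update_elo(home, away, result, k)
    return rows
```

The same "record first, update afterwards" ordering applies to rolling statistics: a window over the previous N matches should exclude the match being described.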
Training path¶
| Stage | Role | Key outputs |
|---|---|---|
| split_data | Time-based train/test split (test ≥ 2024-01-01) + walk-forward CV folds (validation years 2016 to 2023; folds_end_year 2024 is exclusive) | data/processed/dataset.parquet, data/splits/{train,test,folds}.parquet |
| classification_models | Baseline exploration: runs multiple classifier types at 1% data fraction; logs to MLflow | data/models/run_id.json |
| tune_xgb | Optuna search over XGBoost hyperparameters using walk-forward CV (10% data, 20 trials) | data/models/xgb_best_params.json |
| tune_hgb | Optuna search over HistGradientBoosting (10% data, 20 trials) | data/models/hgb_best_params.json |
| tune_logreg | Optuna search over LogisticRegression; disabled in production (tuning_logreg.enabled: false) | data/models/logreg_best_params.json (placeholder) |
| select_model | Compare tuned candidates by CV log-loss; writes winner to best_model.json | data/models/best_model.json |
| final_train | Train winner on full train set; hold out test set for one-shot evaluation; apply isotonic calibration | MLflow run; data/models/final_run_id.json |
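The walk-forward folds produced by `split_data` can be pictured with the sketch below. It assumes one fold per calendar year with an expanding training window (everything before the validation year); the actual fold construction in the project may differ, and `walk_forward_folds` is a hypothetical name. The defaults mirror `folds_start_year` and `folds_end_year` from params.yaml, with the end year treated as exclusive.

```python
from datetime import date


def walk_forward_folds(match_dates, start_year=2016, end_year=2024):
    """Build (validation_year, train_indices, validation_indices) tuples.

    Assumption: each fold validates on one calendar year and trains on
    all matches strictly before that year. end_year is exclusive.
    """
    folds = []
    for year in range(start_year, end_year):
        val_start = date(year, 1, 1)
        val_end = date(year + 1, 1, 1)
        train_idx = [i for i, d in enumerate(match_dates) if d < val_start]
        val_idx = [i for i, d in enumerate(match_dates) if val_start <= d < val_end]
        # Skip folds that would have nothing to train on or validate against.
        if train_idx and val_idx:
            folds.append((year, train_idx, val_idx))
    return folds
```

Because every training index precedes its validation year, no fold can learn from matches played after the ones it is scored on, and matches from 2024 onward stay reserved for the holdout test set.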
Registry¶
| Stage | Role | Key outputs |
|---|---|---|
| register_model | Create/update MLflow Registry entry; assign initial smoke alias | data/models/registered_model.json |
| promote_model | Quality gate: promote to candidate alias if final.logloss ≤ current_candidate + 0.002 | data/models/promoted_model.json |
Promotion to the champion alias is manual-only; see Model Registry.
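The `promote_model` quality gate reduces to a single comparison. The sketch below shows that rule with the tolerance from params.yaml; the function name is hypothetical, and the behaviour when no candidate exists yet (promote unconditionally) is an assumption, since this page does not describe that case.

```python
def should_promote(new_logloss, candidate_logloss, tolerance=0.002):
    """Decide whether a freshly trained model may take the candidate alias.

    Lower log-loss is better. The new model passes if it is no more than
    `tolerance` worse than the current candidate.
    """
    if candidate_logloss is None:
        # Assumption: with no existing candidate there is nothing to beat.
        return True
    return new_logloss <= candidate_logloss + tolerance
```

The tolerance lets a model that is marginally worse, within noise, still replace the candidate, which avoids blocking promotion on run-to-run variance.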
Serving & analysis¶
| Stage | Role | Key outputs |
|---|---|---|
| batch_inference | Load champion model; compute features for upcoming + recent matches; upload to MinIO | data/predictions/match_features.parquet, data/predictions/predictions.parquet |
| analysis | Error analysis (per-segment log-loss, calibration) + flat-stake ROI simulation against odds_fdco | reports/error_analysis/, reports/roi_analysis/, data/analysis/ |
| monitor_drift | Evidently dataset drift report on match_features.parquet; writes Prometheus textfile | reports/drift/latest.json, reports/drift/metrics.prom |
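For readers unfamiliar with flat-stake ROI, the metric reported by `analysis` can be sketched as follows. This shows only the arithmetic, assuming decimal odds and a fixed stake per bet; how the project decides which matches to bet on is not covered here, and `flat_stake_roi` is a hypothetical name.

```python
def flat_stake_roi(bets, stake=1.0):
    """Return profit divided by total amount staked.

    bets: iterable of (decimal_odds, won) pairs.
    A winning bet returns stake * (odds - 1) in profit; a losing bet
    loses the stake.
    """
    staked = 0.0
    profit = 0.0
    for odds, won in bets:
        staked += stake
        profit += (stake * (odds - 1.0)) if won else -stake
    # With no bets placed there is no exposure, so report 0.0.
    return profit / staked if staked else 0.0
```

For example, two wins at odds 2.0 and 1.5 and two losses give a profit of 1.0 + 0.5 - 2.0 = -0.5 on 4.0 staked, an ROI of -12.5%.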
Production configuration¶
All stage behaviour is driven by params.yaml (generated by Hydra from conf/config.yaml).
DVC tracks per-stage param consumption, so dvc params diff shows exactly what changed.
# params.yaml — relevant ML sections
temporal:
  test_start: "2024-01-01"     # holdout lower bound
  folds_start_year: 2016       # first CV fold validation year
  folds_end_year: 2024         # exclusive upper bound
classification:
  target_col: outcome_1x2
  frac: 0.01                   # 1% sample, exploratory baseline only
  side: [home, diff, away]
  cat_cols: [sex]
  include_elo: false           # ELO excluded from the baseline exploration stage
  include_rest_days: false
  include_h2h: false
features_selected:             # feature set used for tuning + final_train
  side: [home, diff, away]
  window_sizes: [1, 2, 3, 5, 7, 10, 12]
  include_elo: true            # ELO enabled for tuned models
  include_rest_days: false     # rest-days features implemented but not currently active
  include_h2h: false           # H2H features implemented but not currently active
class_weight: {0: 1.0, 1: 1.25, 2: 1.0}
tuning:
  n_trials: 20
  frac: 0.1                    # 10% sample for Optuna walk-forward CV
tuning_logreg:
  enabled: false               # LogReg tuning disabled; XGB + HGB are the active candidates
select_model:
  experiment_name: matches_clf_v1.0_select
final_train:
  calibration:
    enabled: true
    method: isotonic
    calib_frac: 0.15           # 15% of train set reserved for calibration
    min_calib_samples: 100
register_model:
  model_name: soccer-match-outcome
  model_stage: smoke           # initial alias; never serving-ready on its own
promote_model:
  metric: final.logloss
  tolerance: 0.002
  candidate_alias: candidate
inference:
  model_name: soccer-match-outcome
  model_stage: champion        # batch_inference always loads the champion alias
Feature nuance: `include_rest_days` and `include_h2h` are `false` in the production config. Both feature families are fully implemented in `src/features/stats_matches.py` and can be re-enabled via `params.yaml`. They are excluded from the current production model because ablation showed marginal or noisy contribution at the current data scale.
Experiment overlays (Hydra)¶
Pipeline scale is controlled by Hydra overlays in conf/experiment/. They override only
the keys they need — the rest falls through to conf/config.yaml.
| Overlay | Trigger | Key overrides | Purpose |
|---|---|---|---|
| experiment=smoke | CI train:smoke job | frac=0.001, n_trials=2, model_stage=ci-smoke | Wiring check only: confirms all 22 stages execute without error; toy model never used by serving |
| experiment=test | CI train:test job | tuning.frac=0.2, n_trials=20 | Reduced-scale real-data run; promotes to smoke → candidate |
| (base) | dvc repro default | Full params.yaml | Production run |
Usage:
dvc exp run -S 'experiment=smoke' # CI wiring check
dvc exp run -S 'experiment=test' # reduced-scale training
dvc repro # full production run
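The "override only what you need" behaviour of an overlay can be illustrated with a recursive dictionary merge. This is a simplified model of the idea, not Hydra's composition logic. The placement of the smoke overrides under `tuning` and `register_model` is an assumption made for the example, since the table lists them without full key paths.

```python
import copy


def apply_overlay(base, overlay):
    """Return a new config where overlay values replace base values.

    Nested dictionaries are merged key by key; any key the overlay does
    not mention falls through from the base unchanged.
    """
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overlay(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Base values taken from the production configuration on this page.
BASE = {
    "tuning": {"n_trials": 20, "frac": 0.1},
    "register_model": {"model_name": "soccer-match-outcome", "model_stage": "smoke"},
}

# Hypothetical key paths for the smoke overlay (assumed for illustration).
SMOKE = {
    "tuning": {"n_trials": 2, "frac": 0.001},
    "register_model": {"model_stage": "ci-smoke"},
}
```

`apply_overlay(BASE, SMOKE)` changes the three overridden values and leaves `model_name` as it was, which is the fall-through behaviour described above.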
Calibration¶
final_train splits the training set: 85% trains the model, 15% calibrates it.
train_ids
├─ 85% → model fit (XGBoost / HGB with tuned params)
└─ 15% → isotonic regression calibration (min 100 samples required)
Calibration is applied post-fit via CalibratedClassifierCV(method="isotonic").
The calibrated model is logged to MLflow as a single artifact — the serving layer
loads the calibrated wrapper, not the raw classifier.
If min_calib_samples is not reached, calibration is skipped and a warning is logged.
This guard is relevant for very small smoke runs.
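The split and its guard can be sketched as below. The sketch covers only how the training ids are divided and when calibration is skipped; fitting the model and wrapping it in `CalibratedClassifierCV` are left out. Taking the calibration slice from the tail of the id list and rounding the slice size are assumptions, and `split_for_calibration` is a hypothetical name.

```python
import logging

logger = logging.getLogger(__name__)


def split_for_calibration(train_ids, calib_frac=0.15, min_calib_samples=100):
    """Split training ids into (fit_ids, calib_ids).

    Returns calib_ids=None when the calibration slice would be smaller
    than min_calib_samples, in which case all ids are used for fitting.
    Assumption: the calibration slice is the tail of the id sequence.
    """
    ids = list(train_ids)
    n_calib = round(len(ids) * calib_frac)
    if n_calib < min_calib_samples:
        logger.warning(
            "Calibration skipped: %d samples available, %d required",
            n_calib,
            min_calib_samples,
        )
        return ids, None
    cut = len(ids) - n_calib
    return ids[:cut], ids[cut:]
```

With 1,000 training ids the split is 850 for fitting and 150 for calibration; with 200 ids the slice would hold only 30 samples, so calibration is skipped.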
Reproducibility contract¶
Given the same:
- git commit,
- DVC dataset version (dvc pull),
- params.yaml,
dvc repro produces identical results. This is enforced by DVC content-addressing,
explicit random seed management in training code, and the Hydra config snapshot logged
to each MLflow run.
dvc params diff and dvc status will show any deviation from the last locked state.
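Conceptually, a params diff compares two nested configurations key by key. The sketch below flattens nested dictionaries into dotted paths and reports the entries whose values differ. It illustrates the idea only and is not DVC's implementation; `flatten` and `params_diff` are hypothetical names.

```python
def flatten(params, prefix=""):
    """Flatten nested dicts into {'section.key': value} form."""
    flat = {}
    for key, value in params.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def params_diff(old, new):
    """Return {path: (old_value, new_value)} for every changed path.

    A path present on only one side is reported with None on the other.
    """
    a, b = flatten(old), flatten(new)
    return {
        path: (a.get(path), b.get(path))
        for path in sorted(a.keys() | b.keys())
        if a.get(path) != b.get(path)
    }
```

An empty result means the two parameter sets are identical, which is one of the three conditions of the reproducibility contract.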
CI integration¶
The train:smoke GitLab CI job runs the full 22-stage pipeline with experiment=smoke
overrides (toy data, 2 Optuna trials) to verify graph wiring on every relevant push.
The train:test job runs a reduced-scale real-data pipeline and promotes the result to
the smoke and optionally candidate aliases.
Contract tests in tests/contract/test_pipeline_contracts.py validate that stage
input/output schemas match their declared contracts independent of a full DVC run.
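As a rough picture of what a schema contract check involves, the sketch below compares a stage output's column-to-dtype mapping against a declared contract. It is not the content of `tests/contract/test_pipeline_contracts.py`; the function name, the contract format, and the `match_id` column are hypothetical. Only `outcome_1x2` is taken from this page (it is the configured `target_col`).

```python
def check_contract(actual_schema, contract):
    """Return a list of human-readable contract violations.

    actual_schema / contract: dicts mapping column name -> dtype string.
    Extra columns in actual_schema are allowed; missing or mistyped
    contract columns are reported.
    """
    problems = []
    for column, expected in contract.items():
        if column not in actual_schema:
            problems.append(f"missing column: {column}")
        elif actual_schema[column] != expected:
            problems.append(
                f"{column}: expected {expected}, got {actual_schema[column]}"
            )
    return problems


# Hypothetical contract for illustration only.
EXAMPLE_CONTRACT = {"match_id": "int64", "outcome_1x2": "int64"}
```

A test can then assert that `check_contract(schema, EXAMPLE_CONTRACT)` returns an empty list, using a small fixture instead of a full DVC run.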
Running the pipeline¶
# Full pipeline (data ingestion → inference → analysis)
dvc repro
# Training path only (assumes data artifacts are current)
dvc repro split_data tune_xgb tune_hgb select_model final_train register_model promote_model
# Inference + analysis only
dvc repro batch_inference analysis monitor_drift
# Inspect param changes since last run
dvc params diff
# View the live dependency graph
dvc dag
Related¶
- Features — feature families, leakage prevention, offline/online parity
- Tuning — Optuna walk-forward CV details
- MLflow — experiment tracking and run tagging
- Model Registry — alias lifecycle, promotion policy, rollback
- Validation Strategy — Great Expectations suites at each stage
- Data: Versioning — DVC artifact versioning
- Architecture: Data & ML Flow