Quickstart — Reproducible Golden Path

This page shows how to reproduce the ML training pipeline locally from a clean checkout.

What this proves: dvc repro from any clean checkout gives deterministic results — same model, same metrics, tracked in MLflow.

Not covered here: live API demo → Demo Walkthrough; full local environment setup → Local Dev Runbook.


Prerequisites

  • Python 3.13
  • pdm (dependency management)
  • git
  • dvc
  • Access to DVC remote storage (read-only for demo)

1. Clone the repository

git clone <repository-url>
cd soccer

2. Install dependencies

Dependencies are managed via PDM with environment-specific groups.

# Install all dependencies
pdm install

# OR: Create conda environment
make env-install

This installs:

  • Data access and storage utilities
  • ML libraries (scikit-learn, XGBoost, MLflow)
  • Pipeline orchestration tools (DVC)
  • Development utilities (ruff, pytest)

3. Pull versioned datasets

All datasets are versioned using DVC.

dvc pull

This restores:

  • Raw parquet files (data/raw/)
  • Processed datasets (data/interim/)
  • Feature tables (data/features/)
  • Train/test splits (data/splits/)

What happens: DVC downloads data files from remote storage (MinIO S3) using content-addressed hashes.
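To make "content-addressed" concrete, here is a minimal sketch of the idea: a file's identity is derived from its bytes, so identical content always maps to the same key. MD5 is used because it is DVC's default file hash; the function names and the two-character prefix layout in cache_key are illustrative assumptions, not DVC's internal API.

```python
import hashlib
from pathlib import Path


def content_address(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes (end of file).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_key(digest: str) -> str:
    """Hypothetical storage key: the first two hex characters act as a directory prefix."""
    return f"{digest[:2]}/{digest[2:]}"
```

Because the key depends only on the bytes, two files with the same content share one stored object, and any change to the content yields a different key.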

Expected output:

A  data/raw/matches.parquet
A  data/interim/matches_finished.parquet
A  data/features/features.parquet
...

Expected time: 1–3 minutes depending on remote latency.

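Access requirement: Read-only credentials for the DVC remote are required (run dvc remote list to see the configured remotes). If you don’t have credentials, skip dvc pull and run dvc repro --no-commit to recompute the data from source; --no-commit only tells DVC not to save the outputs to its cache, so the pipeline still needs access to its upstream data sources.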


4. Run the ML pipeline

The full ML pipeline is orchestrated via DVC pipelines.

# Run full pipeline
dvc repro

# OR: Force re-run of all stages, ignoring cached results
dvc repro --force

Pipeline stages (see dvc.yaml; 22 stages total, grouped into the 18 steps below):

  1. load_data_from_sources — fetch raw match data from PostgreSQL
  2. load_odds_fdco — load historical odds (Football-Data.co.uk)
  3. validate_raw — Great Expectations suite on raw data
  4. export_metadata — extract match metadata
  5. preprocessing — clean and filter matches
  6. validate_finished / validate_future — schema checks per split
  7. generate_features_meta / feature_engineering — time-windowed rolling stats
  8. validate_features — feature schema validation
  9. split_data — temporal train/test splits + walk-forward CV folds
  10. classification_models — baseline model training
  11. tune_xgb / tune_logreg / tune_hgb — Optuna hyperparameter search per model family
  12. select_model — compare CV log-loss across tuned models; write best_model.json
  13. final_train — retrain winner on full training data
  14. register_model — register to MLflow Model Registry with candidate alias
  15. promote_model — apply the promotion gate; promotion to candidate is automatic, promotion to champion is manual
  16. batch_inference — run predictions over all matches; write predictions.parquet
  17. analysis — error analysis and slice diagnostics
  18. monitor_drift — Evidently drift report vs champion reference; writes reports/drift/latest.json
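The walk-forward CV folds mentioned for split_data can be sketched as follows. This is a simplified illustration, not the project's implementation: the function name, the n_folds and min_train parameters, and the equal-size fold scheme are all assumptions. The property it shows is that every validation fold lies strictly after its training window in time.

```python
from datetime import date
from typing import Iterator


def walk_forward_folds(
    dates: list[date], n_folds: int, min_train: int
) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train_idx, val_idx) pairs with an expanding training window."""
    # Sort row indices chronologically so positions in `order` follow time.
    order = sorted(range(len(dates)), key=lambda i: dates[i])
    fold_size = (len(order) - min_train) // n_folds
    if fold_size < 1:
        raise ValueError("not enough rows for the requested folds")
    for k in range(n_folds):
        cut = min_train + k * fold_size
        # Train on everything before the cut, validate on the next block.
        yield order[:cut], order[cut : cut + fold_size]
```

This sketch cuts by row position, so rows that share a date could land on both sides of a boundary; a production split would more likely cut on date boundaries to avoid that.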

Execution characteristics:

  • Deterministic: Same input → same output
  • Cached: Only re-runs changed stages
  • Traceable: All outputs tracked in dvc.lock
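The "cached" behaviour can be pictured with a small sketch: a stage runs only when the fingerprint of its inputs differs from the one recorded after the previous run. This is a simplification under stated assumptions; DVC records per-dependency hashes in dvc.lock, whereas this sketch collapses them into a single fingerprint, and the names fingerprint, run_if_changed, and the in-memory lock dictionary are hypothetical.

```python
import hashlib
import json
from typing import Callable


def fingerprint(deps: dict[str, bytes], params: dict) -> str:
    """Hash dependency contents and parameters into one stage fingerprint."""
    digest = hashlib.sha256()
    for name in sorted(deps):  # sorted so the result does not depend on dict order
        digest.update(name.encode())
        digest.update(hashlib.sha256(deps[name]).digest())
    digest.update(json.dumps(params, sort_keys=True).encode())
    return digest.hexdigest()


def run_if_changed(
    stage: str,
    deps: dict[str, bytes],
    params: dict,
    lock: dict[str, str],
    run: Callable[[], None],
) -> bool:
    """Run the stage only when its fingerprint differs from the recorded one."""
    current = fingerprint(deps, params)
    if lock.get(stage) == current:
        return False  # inputs unchanged: reuse the cached result
    run()
    lock[stage] = current  # record the fingerprint only after a successful run
    return True
```

Changing either a dependency's bytes or a parameter value changes the fingerprint, which is why edits to data, code, or params.yaml trigger a re-run while an untouched stage is skipped.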

Expected output:

Running stage 'preprocessing':    > python src/pipelines/preprocess.py
Running stage 'feature_engineering': > python src/pipelines/feature_engineering.py
Running stage 'split_data':       > python src/pipelines/split_data.py
Running stage 'classification_baseline': ...
Running stage 'classification_models':  ...

Expected time: 5–20 minutes on first run; subsequent runs use DVC cache (seconds if unchanged).

Pipeline execution depends on:

  • Data versions (DVC tracked)
  • Code versions (Git tracked)
  • Configuration (params.yaml)

5. Inspect experiment results

Start the MLflow UI:

mlflow ui --port 5001

Open browser: http://localhost:5001

What to inspect:

  • Experiments: Browse all training runs
  • Parameters: Hyperparameters logged automatically
  • Metrics: Log-loss (primary), Brier score, ECE, ROC-AUC OvR, Accuracy
  • Artifacts: Confusion matrices, model files, plots
  • Runs comparison: Compare multiple models side-by-side
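For readers unfamiliar with the first two metrics in the list above, the following sketch shows how multiclass log-loss and a multiclass Brier score are commonly computed. It is not the project's metric code, which is not shown on this page; the function signatures are assumptions, and the Brier score here sums squared errors over classes, which is one of several conventions in use.

```python
import math


def log_loss(y_true: list[int], probs: list[list[float]], eps: float = 1e-15) -> float:
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip to avoid log(0) when a model assigns zero probability.
        p = min(max(row[label], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(y_true)


def brier_score(y_true: list[int], probs: list[list[float]]) -> float:
    """Mean squared error against one-hot labels, summed over classes."""
    total = 0.0
    for label, row in zip(y_true, probs):
        total += sum(
            (p - (1.0 if k == label else 0.0)) ** 2 for k, p in enumerate(row)
        )
    return total / len(y_true)
```

Both metrics are lower-is-better and reward calibrated probabilities rather than only correct top predictions, which is a plausible reason for treating log-loss as the primary metric.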

Example workflow:

  1. Navigate to the "Experiments" tab
  2. Click the matches_clf experiment
  3. Select multiple runs
  4. Click "Compare"
  5. View metric differences and charts

6. Verify reproducibility

# Check DVC status (should be clean after dvc repro)
dvc status

# View pipeline DAG
dvc dag

Key insight: Checking out the same Git commit, restoring the same DVC-tracked data, and running dvc repro produces identical outputs.
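One way to check the "identical outputs" claim independently of DVC is to hash every output file from two runs and compare the results. The sketch below assumes the outputs of each run have been copied into separate directories; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 digest of its contents."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_manifests(a: dict[str, str], b: dict[str, str]) -> list[str]:
    """Return paths that are missing on one side or whose contents differ."""
    return sorted(path for path in a.keys() | b.keys() if a.get(path) != b.get(path))
```

An empty diff means the two runs are byte-for-byte identical. Byte equality is stricter than logical equality, so any output that embeds a timestamp or run identifier would show up as a difference even when the results agree.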


What this demonstrates

Step        What it proves
dvc pull    Content-addressed versioning: any dataset version is restorable by hash
dvc repro   Deterministic pipeline: same input + code → same output
mlflow ui   Full experiment traceability: parameters, metrics, and artifacts are logged automatically
dvc dag     Explicit dependency tracking: stages and their inputs/outputs are declared

Currently supported in this path

Step                                                              Status
dvc pull: restore versioned datasets                              ✅ Operational
dvc repro: run full ML pipeline                                   ✅ Operational
mlflow ui: inspect experiments                                    ✅ Operational
pytest tests/: run test suite (564 tests, make test recommended)  ✅ Operational
Grafana dashboard                                                 📋 Not yet deployed


Troubleshooting

DVC remote access error: "No credentials found"

# Check configured remotes
dvc remote list

# Check whether a credentials file exists (without printing its secrets)
ls -l ~/.aws/credentials  # or check the DVC remote config

# Pull from a specific remote (replace myremote with a name from dvc remote list)
dvc pull --remote myremote

MLflow UI: no experiments visible

# Check tracking URI
echo $MLFLOW_TRACKING_URI

# Start with local filesystem store
mlflow ui --backend-store-uri ./mlruns --port 5001

DVC pipeline stage fails

# Show detailed error output
dvc repro --verbose 2>&1 | tail -50

# Force rerun a single stage
dvc repro -f preprocessing

# Reset pipeline state completely (use with caution)
rm -f dvc.lock && dvc repro

pdm install fails on a dependency

# Check Python version
python --version  # should be 3.13

# Use conda environment as baseline
make env-install
conda activate soccer
pdm install

Wrong results after dvc repro

# Verify all stage outputs are clean
dvc status

# Check params.yaml has not been modified
git diff params.yaml