Quickstart — Reproducible Golden Path

This page shows how to reproduce the ML training pipeline locally from a clean checkout.

What this proves: dvc repro from any clean checkout gives deterministic results — same model, same metrics, tracked in MLflow.

Not covered here: live API demo → Demo Walkthrough; full local environment setup → Local Dev Runbook.

Prerequisites

Python 3.13
pdm (dependency management)
git
dvc
Access to DVC remote storage (read-only for demo)
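The prerequisites above can be checked before cloning. The following is a minimal sketch, assuming the tools are expected on `PATH` under the names `pdm`, `git`, and `dvc`; the helper names `find_missing_tools` and `python_matches` are hypothetical and not part of this repository.

```python
import shutil
import sys

# Tools named in the prerequisites list above.
REQUIRED_TOOLS = ("pdm", "git", "dvc")
# Python version the project targets, per the prerequisites list.
REQUIRED_PYTHON = (3, 13)


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def python_matches(required=REQUIRED_PYTHON, actual=None):
    """Return True when the major.minor version equals the required one."""
    actual = sys.version_info[:2] if actual is None else tuple(actual[:2])
    return actual == tuple(required)


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    if not python_matches():
        print("Expected Python 3.13, found", sys.version.split()[0])
    if not missing and python_matches():
        print("All prerequisites found")
```

This only confirms that the executables exist; it does not verify access to the DVC remote.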

1. Clone the repository

git clone <repository-url>
cd soccer

2. Install dependencies

Dependencies are managed via PDM with environment-specific groups.

# Install all dependencies
pdm install

# OR: Create conda environment
make env-install

This installs:

Data access and storage utilities
ML libraries (scikit-learn, XGBoost, MLflow)
Pipeline orchestration tools (DVC)
Development utilities (ruff, pytest)

3. Pull versioned datasets

All datasets are versioned using DVC.

dvc pull

This restores:

Raw parquet files (data/raw/)
Processed datasets (data/interim/)
Feature tables (data/features/)
Train/test splits (data/splits/)

What happens: DVC downloads data files from remote storage (MinIO S3) using content-addressed hashes.
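To illustrate what "content-addressed" means here, the sketch below hashes a file and derives a storage key from the digest. DVC uses MD5 by default, but the `object_key` layout is a simplified, hypothetical illustration and not DVC's exact cache or remote layout; the file names are placeholders.

```python
import hashlib
import tempfile
from pathlib import Path


def content_hash(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5(usedforsecurity=False)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def object_key(hex_digest):
    """Hypothetical store key: first two hex characters as a directory prefix."""
    return f"{hex_digest[:2]}/{hex_digest[2:]}"


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "a.parquet"
    second = Path(tmp) / "b.parquet"
    first.write_bytes(b"same bytes")
    second.write_bytes(b"same bytes")
    # Identical content yields the same key regardless of file name.
    same_key = object_key(content_hash(first)) == object_key(content_hash(second))
    # Changing the content changes the key.
    second.write_bytes(b"different bytes")
    changed_key = object_key(content_hash(first)) != object_key(content_hash(second))
```

Because the key depends only on the bytes, a given dataset version can be fetched by its hash no matter when or where it was produced.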

Expected output:

A  data/raw/matches.parquet
A  data/interim/matches_finished.parquet
A  data/features/features.parquet
...

Expected time: 1–3 minutes depending on remote latency.

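Access requirement: read-only DVC remote credentials are required (run dvc remote list to check the configured remotes). If you don’t have credentials, skip dvc pull and run dvc repro --no-commit to recompute the data from source instead; --no-commit runs the stages without storing their outputs in the DVC cache. This path needs access to the upstream sources that the first pipeline stages read from.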

4. Run the ML pipeline

The full ML pipeline is orchestrated via DVC pipelines.

# Run full pipeline
dvc repro

# OR: Force re-run of all stages, ignoring cached results
dvc repro --force

Pipeline stages (see dvc.yaml — 22 stages total):

load_data_from_sources — fetch raw match data from PostgreSQL
load_odds_fdco — load historical odds (Football-Data.co.uk)
validate_raw — Great Expectations suite on raw data
export_metadata — extract match metadata
preprocessing — clean and filter matches
validate_finished / validate_future — schema checks per split
generate_features_meta / feature_engineering — time-windowed rolling stats
validate_features — feature schema validation
split_data — temporal train/test splits + walk-forward CV folds
classification_models — baseline model training
tune_xgb / tune_logreg / tune_hgb — Optuna hyperparameter search per model family
select_model — compare CV log-loss across tuned models; write best_model.json
final_train — retrain winner on full training data
register_model — register to MLflow Model Registry with candidate alias
promote_model — gate promotion; auto-promote to candidate; champion is manual
batch_inference — run predictions over all matches; write predictions.parquet
analysis — error analysis and slice diagnostics
monitor_drift — Evidently drift report vs champion reference; writes reports/drift/latest.json
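The split_data stage above mentions walk-forward CV folds. The sketch below shows one common variant, an expanding training window over time-ordered rows. The project's actual fold scheme and sizes are not stated on this page, so the function `walk_forward_folds` and its parameters are assumptions for illustration only.

```python
def walk_forward_folds(n_samples, n_folds, min_train):
    """Yield (train_indices, test_indices) over time-ordered rows.

    Each fold trains on everything before its test block, so the
    training window grows and never includes later matches.
    """
    if n_folds < 1 or min_train < 1 or min_train >= n_samples:
        raise ValueError("need n_folds >= 1 and 1 <= min_train < n_samples")
    test_size = (n_samples - min_train) // n_folds
    if test_size < 1:
        raise ValueError("not enough rows for the requested number of folds")
    for fold in range(n_folds):
        test_start = min_train + fold * test_size
        # The last fold absorbs any remainder rows.
        test_end = n_samples if fold == n_folds - 1 else test_start + test_size
        yield list(range(test_start)), list(range(test_start, test_end))


folds = list(walk_forward_folds(n_samples=10, n_folds=3, min_train=4))
for train_idx, test_idx in folds:
    print(f"train 0..{train_idx[-1]}  test {test_idx[0]}..{test_idx[-1]}")
```

The key property is that every training index precedes every test index in the same fold, which avoids leaking future matches into training.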

Execution characteristics:

Deterministic: Same input → same output
Cached: Only re-runs changed stages
Traceable: All outputs tracked in dvc.lock
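The caching behaviour listed above can be pictured as comparing a fingerprint of each stage's dependencies with a recorded one. This is a simplified sketch, not DVC's implementation: it ignores propagation through the DAG, and the stage contents and parameter values are hypothetical.

```python
import hashlib


def fingerprint(dependencies):
    """Hash a mapping of dependency name -> content bytes into one digest."""
    digest = hashlib.sha256()
    for name in sorted(dependencies):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(hashlib.sha256(dependencies[name]).digest())
    return digest.hexdigest()


def stages_to_run(stages, lock):
    """Return names of stages whose fingerprint differs from the lock record."""
    return [name for name, deps in stages.items() if lock.get(name) != fingerprint(deps)]


# Hypothetical stage dependencies (file contents and params as raw bytes).
stages = {
    "preprocessing": {"data/raw/matches.parquet": b"raw-v1", "params": b"min_date: 2015"},
    "split_data": {"data/features/features.parquet": b"features-v1", "params": b"folds: 5"},
}
# Record the current state, as a lock file would after a successful run.
lock = {name: fingerprint(deps) for name, deps in stages.items()}
unchanged = stages_to_run(stages, lock)

# Editing one stage's parameters invalidates only that stage in this sketch.
stages["split_data"]["params"] = b"folds: 3"
after_edit = stages_to_run(stages, lock)
```

In a real pipeline a re-run stage may also change its outputs, which in turn invalidates downstream stages.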

Expected output:

Running stage 'preprocessing':    > python src/pipelines/preprocess.py
Running stage 'feature_engineering': > python src/pipelines/feature_engineering.py
Running stage 'split_data':       > python src/pipelines/split_data.py
Running stage 'classification_baseline': ...
Running stage 'classification_models':  ...

Expected time: 5–20 minutes on first run; subsequent runs use DVC cache (seconds if unchanged).

Pipeline execution depends on:

Data versions (DVC tracked)
Code versions (Git tracked)
Configuration (params.yaml)

5. Inspect experiment results

Start the MLflow UI:

mlflow ui --port 5001

Open browser: http://localhost:5001

What to inspect:

Experiments: Browse all training runs
Parameters: Hyperparameters logged automatically
Metrics: Log-loss (primary), Brier score, ECE, ROC-AUC OvR, Accuracy
Artifacts: Confusion matrices, model files, plots
Runs comparison: Compare multiple models side-by-side
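For readers unfamiliar with the metrics listed above, here is a dependency-free sketch of log-loss, a multiclass Brier score, and ECE. The exact definitions the project logs (for example the Brier convention or the ECE binning) are not stated on this page, so treat these as common textbook forms; the three-class example data is hypothetical.

```python
import math


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        total -= math.log(min(max(row[label], eps), 1.0))
    return total / len(y_true)


def brier_score(y_true, probs):
    """Multiclass Brier score: mean squared distance to the one-hot target."""
    total = 0.0
    for label, row in zip(y_true, probs):
        total += sum((p - (1.0 if k == label else 0.0)) ** 2 for k, p in enumerate(row))
    return total / len(y_true)


def expected_calibration_error(y_true, probs, n_bins=10):
    """ECE over top-class confidence with equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for label, row in zip(y_true, probs):
        confidence = max(row)
        predicted = row.index(confidence)
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, 1.0 if predicted == label else 0.0))
    ece = 0.0
    for bucket in bins:
        if bucket:
            mean_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(hit for _, hit in bucket) / len(bucket)
            ece += len(bucket) / len(y_true) * abs(accuracy - mean_conf)
    return ece


# Hypothetical labels and predicted probabilities for three outcome classes.
y_true = [0, 2, 1, 0]
probs = [[0.6, 0.25, 0.15], [0.2, 0.3, 0.5], [0.3, 0.4, 0.3], [0.35, 0.4, 0.25]]
print(log_loss(y_true, probs), brier_score(y_true, probs), expected_calibration_error(y_true, probs))
```

Lower is better for all three; a uniform three-class prediction gives a log-loss of ln 3, which is a useful baseline when reading the MLflow charts.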

Example workflow:

Navigate to "Experiments" tab
Click on matches_clf experiment
Select multiple runs
Click "Compare"
View metric differences and charts
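The same comparison can be expressed in a few lines, which is roughly what the select_model stage is described as doing: pick the model family with the lowest CV log-loss. This is a sketch under assumptions; the function `select_best` is hypothetical and the per-fold numbers are made up for illustration, not results from this project.

```python
from statistics import mean


def select_best(cv_log_loss):
    """Return (model_name, mean_log_loss) with the lowest mean CV log-loss."""
    if not cv_log_loss:
        raise ValueError("no runs to compare")
    means = {name: mean(scores) for name, scores in cv_log_loss.items()}
    best = min(means, key=means.get)
    return best, means[best]


# Hypothetical per-fold log-loss values, not results from this project.
runs = {
    "xgb": [1.00, 0.98, 1.02],
    "logreg": [1.01, 1.03, 1.02],
    "hgb": [0.99, 1.01, 1.03],
}
best_name, best_score = select_best(runs)
print(best_name, round(best_score, 4))
```

Averaging over folds rather than comparing a single split reduces the chance of picking a model that was lucky on one time window.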

6. Verify reproducibility

# Check DVC status (should be clean after dvc repro)
dvc status

# View pipeline DAG
dvc dag

Key insight: checking out the same Git commit and running dvc repro produces identical outputs, because the commit pins the code, params.yaml, and the data hashes recorded in dvc.lock.
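One way to check this claim independently of DVC is to hash every output file in two checkouts and compare the results. The sketch below assumes outputs live under a single directory; `output_manifest` and `diff_manifests` are hypothetical helpers, and the temporary directories merely stand in for two real checkouts.

```python
import hashlib
import tempfile
from pathlib import Path


def output_manifest(root):
    """Map each file under root (relative POSIX path) to its SHA-256 digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # read_bytes loads the whole file; fine for a sketch, chunk for large data.
        manifest[path.relative_to(root).as_posix()] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest


def diff_manifests(first, second):
    """Return sorted paths that are missing from one side or differ in content."""
    return sorted(p for p in first.keys() | second.keys() if first.get(p) != second.get(p))


with tempfile.TemporaryDirectory() as run_a, tempfile.TemporaryDirectory() as run_b:
    for run in (run_a, run_b):
        out = Path(run) / "splits"
        out.mkdir()
        (out / "train.parquet").write_bytes(b"identical output")
    identical = diff_manifests(output_manifest(run_a), output_manifest(run_b))

    # Simulate a non-deterministic stage by changing one output.
    (Path(run_b) / "splits" / "train.parquet").write_bytes(b"drifted output")
    drifted = diff_manifests(output_manifest(run_a), output_manifest(run_b))
```

An empty diff means the two runs produced byte-identical files; any listed path points at a stage worth investigating.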

What this demonstrates

Step	What it proves
`dvc pull`	Content-addressed versioning — any dataset version restorable by hash
`dvc repro`	Deterministic pipeline — same input + code → same output
`mlflow ui`	Full experiment traceability — parameters, metrics, and artifacts logged automatically
`dvc dag`	Explicit dependency tracking — stages and their inputs/outputs are declared

Currently supported in this path

Step	Status
`dvc pull` — restore versioned datasets	✅ Operational
`dvc repro` — run full ML pipeline	✅ Operational
`mlflow ui` — inspect experiments	✅ Operational
`pytest tests/` — run test suite (564 tests, `make test` recommended)	✅ Operational
Grafana dashboard	✅ Deployed — two dashboards in Helm chart (`soccer-api.json`, `soccer-ml-quality.json`)

Where to go next

Demo Walkthrough — live API walkthrough and interview script
Architecture Overview — system design and C4 diagrams
Implementation Status — full component readiness matrix

Troubleshooting

DVC remote access error: "No credentials found"

# Check configured remotes
dvc remote list

# Check that a credentials file exists (without printing its contents)
ls -l ~/.aws/credentials  # or check DVC remote config

# Pull from a specific remote (use a name shown by `dvc remote list`)
dvc pull --remote <remote-name>

MLflow UI: no experiments visible

# Check tracking URI
echo "$MLFLOW_TRACKING_URI"

# Start with local filesystem store
mlflow ui --backend-store-uri ./mlruns --port 5001

DVC pipeline stage fails

# Show detailed error output
dvc repro --verbose 2>&1 | tail -50

# Force rerun a single stage
dvc repro -f preprocessing

# Reset pipeline state completely (use with caution)
rm -f dvc.lock && dvc repro

`pdm install` fails on a dependency

# Check Python version
python --version  # should be 3.13

# Use conda environment as baseline
make env-install
conda activate soccer
pdm install

Wrong results after `dvc repro`

# Verify all stage outputs are clean
dvc status

# Check params.yaml has not been modified
git diff params.yaml