Data Contracts & Versioning

This page covers two complementary concerns that together guarantee reproducibility and correctness for the ML pipeline:

- Contracts: what "valid data" means at each stage boundary (Great Expectations)
- Versioning: how datasets are content-addressed and experiments made reproducible (DVC)

Data Contracts & Quality Gates

Purpose

Given an unstable upstream source, data validity cannot be assumed. Data contracts formalize what "valid data" means at each stage boundary. A failed contract stops the pipeline — no downstream stage runs on data that has not passed its gate.

Status: ✅ Implemented — all four GE suites are active DVC stage gates.

Implemented gates

| DVC stage | Suite module | Dataset validated |
| --- | --- | --- |
| `validate_raw` | `src/data_quality/raw.py` | `data/raw/match_raw.parquet` |
| `validate_finished` | `src/data_quality/finished.py` | `data/interim/finished.parquet` |
| `validate_future` | `src/data_quality/future.py` | `data/interim/future.parquet` |
| `validate_features` | `src/data_quality/features.py` | `data/features/features.parquet` |

What each suite checks

`validate_raw` validates `match_raw.parquet` before any preprocessing:

- Required columns present: `id`, `homeTeamId`, `awayTeamId`, `startTimeUtc`, `regionId`, `tournamentId`, `seasonId`, `stageId`, `sex`, `status`
- Not-null on: `id`, `homeTeamId`, `awayTeamId`, `startTimeUtc`, `status`
- `startTimeUtc` in range: 1990-01-01 to 2035-12-31
- `status` values from the known API code set
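The raw-stage rules can be pictured with a small plain-Python sketch. This is a stand-in for illustration, not the Great Expectations suite in `src/data_quality/raw.py`: the function name `check_raw`, the row-as-dict representation, the inclusive date bounds, and the `KNOWN_STATUSES` values are all assumptions.

```python
from datetime import date

REQUIRED_COLUMNS = {
    "id", "homeTeamId", "awayTeamId", "startTimeUtc", "regionId",
    "tournamentId", "seasonId", "stageId", "sex", "status",
}
NOT_NULL_COLUMNS = ("id", "homeTeamId", "awayTeamId", "startTimeUtc", "status")
MIN_DATE, MAX_DATE = date(1990, 1, 1), date(2035, 12, 31)
# Hypothetical placeholder codes; the real set comes from the upstream API.
KNOWN_STATUSES = {1, 2, 3}


def check_raw(rows):
    """Return a list of violations; an empty list means the contract passed."""
    violations = []
    for i, row in enumerate(rows):
        # A renamed or dropped upstream column shows up here first.
        missing = REQUIRED_COLUMNS.difference(row)
        if missing:
            violations.append(f"row {i}: missing columns {sorted(missing)}")
            continue
        nulls = [c for c in NOT_NULL_COLUMNS if row[c] is None]
        if nulls:
            violations.append(f"row {i}: null values in {nulls}")
            continue
        # startTimeUtc is assumed to be a datetime; bounds are inclusive.
        if not MIN_DATE <= row["startTimeUtc"].date() <= MAX_DATE:
            violations.append(f"row {i}: startTimeUtc out of range")
        if row["status"] not in KNOWN_STATUSES:
            violations.append(f"row {i}: unknown status {row['status']!r}")
    return violations
```

Returning a list of violations rather than raising on the first one keeps the check a pure function and lets the caller report every problem in a single run.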

`validate_finished` / `validate_future` validate preprocessing output:

- Schema integrity after column stripping and type casting
- `outcome_1x2` ∈ {0, 1, 2} for finished matches
- No future-match rows in the finished set (temporal constraint)
- Score columns within plausible clipped range
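As a rough illustration of the outcome and temporal constraints on the finished set, the sketch below uses plain Python rather than Great Expectations. The function name `check_finished` and the assumption that the kickoff column is still called `startTimeUtc` at this stage are hypothetical.

```python
from datetime import datetime, timezone

VALID_OUTCOMES = {0, 1, 2}


def check_finished(rows, now):
    """Return violations for finished-match rows; empty means the gate passes."""
    violations = []
    for i, row in enumerate(rows):
        if row["outcome_1x2"] not in VALID_OUTCOMES:
            violations.append(f"row {i}: outcome_1x2 {row['outcome_1x2']!r} not in 0/1/2")
        # Temporal constraint: a finished match cannot start after "now".
        if row["startTimeUtc"] > now:
            violations.append(f"row {i}: kickoff is in the future")
    return violations


# Example: the second row breaks both rules.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
rows = [
    {"outcome_1x2": 1, "startTimeUtc": datetime(2024, 5, 1, tzinfo=timezone.utc)},
    {"outcome_1x2": 3, "startTimeUtc": datetime(2024, 7, 1, tzinfo=timezone.utc)},
]
problems = check_finished(rows, now)
```

Passing `now` in as an argument, instead of reading the clock inside the function, keeps the check deterministic.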

`validate_features` validates the engineered feature matrix:

- Rate columns (`win_mean`, `draw_mean`, `loss_mean`) bounded within [0.0, 1.0] with `mostly=0.99`
- Coverage columns in [0.0, 1.0]
- Goals rolling averages non-negative
- H2H columns allow high null rates (many team pairs have no head-to-head history)
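The `mostly=0.99` threshold on the rate columns can be read as "at least 99% of the non-null values must lie in range". The sketch below illustrates that reading in plain Python. It is not the Great Expectations implementation, and the helper names are hypothetical.

```python
def fraction_in_range(values, low=0.0, high=1.0):
    """Share of non-null values inside [low, high]; nulls are ignored."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 1.0  # nothing to violate the bound
    return sum(low <= v <= high for v in non_null) / len(non_null)


def passes_mostly(values, mostly=0.99, low=0.0, high=1.0):
    """True when the in-range share reaches the `mostly` threshold."""
    return fraction_in_range(values, low, high) >= mostly
```

Under this reading, a column with 995 in-range values and 5 outliers passes, while one with 98 in-range values and 2 outliers fails.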

Blocking vs non-blocking

All currently implemented checks are blocking: any expectation failure causes the DVC stage to exit non-zero, stopping the pipeline.
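A blocking gate comes down to mapping "any violation" onto a non-zero exit code, which DVC treats as a failed stage. The following is a minimal sketch under that assumption; `run_gate` is a hypothetical helper, not a function from the repository.

```python
import sys


def run_gate(suite_name, violations):
    """Return the process exit code for a gate: 0 on success, 1 on any failure."""
    if violations:
        for message in violations:
            print(f"[{suite_name}] {message}", file=sys.stderr)
        return 1
    return 0
```

A stage script would then end with something like `sys.exit(run_gate("validate_raw", violations))`, so that `dvc repro` halts before any downstream stage runs.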

Statistical distribution-drift checks are 📋 Planned: drift monitoring via Evidently is not yet integrated (see Monitoring).

Contract as code

GE suite modules in `src/data_quality/` are:

- versioned in Git alongside pipeline code,
- deterministic pure functions (no IO, no side effects),
- invoked by DVC stages in `dvc.yaml`.

Schema drift consequence

When WhoScored changes its output structure:

1. Scraping produces records that fail the `validate_raw` suite.
2. `dvc repro` stops at the `validate_raw` stage.
3. No downstream processing, feature engineering, or training runs.
4. Operator reviews the change and either updates the GE suite or fixes the scraper.

Contract ownership

Contracts live in src/data_quality/ and are reviewed as code changes.
A contract change that widens or removes a check must be justified.
Breaking schema changes (column renames, type changes) require updating both the contract suite and the downstream preprocessing code before running `dvc repro`.

Dataset Versioning & Reproducibility

What is versioned and by what tool

| Artifact | Tool | Where stored | What is tracked |
| --- | --- | --- | --- |
| Raw parquet snapshots (`data/raw/`) | DVC | MinIO (S3-compatible) | Content hash → Git `.dvc` pointer |
| Interim datasets (`data/interim/`) | DVC | MinIO | Content hash → Git `.dvc` pointer |
| Feature matrix (`data/features/`) | DVC | MinIO | Content hash → Git `.dvc` pointer |
| Pipeline stage definitions | Git | Repository | `dvc.yaml` stage graph |
| Parameters (`params.yaml`) | Git | Repository | Input to DVC stages; tracked as dep |
| ML experiment runs | MLflow | MLflow Tracking Server | Metrics, params, artifacts per run |
| Registered models | MLflow Registry | MLflow + MinIO | Model artifacts under versioned alias |

DVC manages data. Git manages code, config, and DVC pointer files. MLflow manages experiment runs and model lifecycle.

How DVC versioning works

Every parquet file tracked by DVC has its content hash recorded in a pointer file committed to Git (a `.dvc` file, or `dvc.lock` for outputs of stages declared in `dvc.yaml`). The MinIO remote is configured separately, typically in `.dvc/config`. When `dvc repro` evaluates a stage:

1. It hashes the stage's declared dependencies and outputs.
2. If every hash matches what is recorded in Git, the stage is skipped.
3. If any hash differs (or is missing), the stage re-runs and the recorded hashes are updated.

A given Git commit uniquely identifies the dataset state for that pipeline run.
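The skip-or-rerun decision can be illustrated with a few lines of Python. This is a simplified sketch, not DVC's implementation: it hashes single files with MD5 (DVC's default file hash) and ignores directory outputs and the cache. The names `content_hash` and `stage_is_up_to_date` are hypothetical.

```python
import hashlib


def content_hash(path, chunk_size=1 << 20):
    """MD5 hex digest of a file's bytes, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_is_up_to_date(recorded, paths):
    """True when every path's current hash equals the recorded one.

    `recorded` maps path -> hash, standing in for what is committed to Git.
    A missing entry counts as a mismatch, so the stage would re-run.
    """
    return all(recorded.get(p) == content_hash(p) for p in paths)
```

Because the hash depends only on file contents, rewriting a file with identical bytes leaves the stage up to date, while changing a single byte forces a re-run.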

Reproducibility guarantee

Given a Git commit hash (which fixes `params.yaml` and the DVC pointer files) and access to the MinIO remote, any past experiment can be reproduced:

```bash
git checkout <commit>
dvc pull
dvc repro
```

This holds because:

- all pipeline inputs (parquet files, params) are content-addressed,
- all feature and preprocessing logic is deterministic (pure functions, no random state in data stages),
- all randomness in model training is seeded via `params.yaml`.
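To illustrate the seeding point, the sketch below drives a shuffle-based split entirely from a parameters dictionary, so two runs with the same parameters yield the same split. The keys `seed` and `val_fraction` and the function `train_val_split` are hypothetical and not taken from the project's `params.yaml`.

```python
import random


def train_val_split(ids, params):
    """Deterministically split ids into (train, val) using a seeded RNG."""
    rng = random.Random(params["seed"])  # local RNG; no global state touched
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * params["val_fraction"])
    cut = len(shuffled) - n_val
    return shuffled[:cut], shuffled[cut:]
```

Using a local `random.Random` instance instead of the module-level functions avoids hidden coupling to whatever else seeded or consumed the global generator.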

How DVC and MLflow connect

Each MLflow experiment run records the DVC dataset version, training parameters, and metrics. A registered model can therefore be traced back to the exact data it was trained on by following the chain:

```text
MLflow model version
    → MLflow run ID
    → params.yaml commit
    → DVC .dvc pointer files at that commit
    → MinIO content-addressed parquet
```
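One way to make this chain queryable is to attach the Git commit and dataset hashes to each run as tags. The sketch below only builds the tag dictionary. The function name `lineage_tags` and the `dvc.<path>` key scheme are assumptions, and how the project actually records the dataset version is not specified here.

```python
def lineage_tags(git_commit, dataset_hashes):
    """Build a flat tag dict linking a run to its code commit and data hashes.

    `dataset_hashes` maps dataset path -> content hash (for example, as
    recorded by DVC at `git_commit`).
    """
    tags = {"git_commit": git_commit}
    for path, digest in sorted(dataset_hashes.items()):
        tags[f"dvc.{path}"] = digest
    return tags
```

Inside an active MLflow run, a dictionary like this could be passed to `mlflow.set_tags(...)`, after which the run can be matched to a commit and its pointer files without inspecting artifacts.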

Restore semantics

```bash
# Restore to a specific git commit
git checkout <commit>
dvc pull               # download exact data matching that commit from MinIO

# Verify pipeline state
dvc status             # should report "Data and pipelines are up to date"

# Reproduce the full pipeline from that point
dvc repro
```

What is NOT versioned by DVC

- PostgreSQL data: the raw parquet export is the version boundary.
- GE validation reports: HTML artifacts are produced per run but not tracked as versioned datasets.
- Metadata JSON files (`data/metadata/`): regenerated on each preprocessing run.

Backfills

Backfills are an operator-driven manual process. See Runbook: Backfills for the full procedure and safety invariants.

A backfill changes data in PostgreSQL. On the next dvc repro, load_data_from_sources produces a parquet file with a different content hash → DVC updates the .dvc pointer → downstream stages re-run → new MLflow experiment run produced. The previous DVC version remains in MinIO (rollback path).

Safety invariants: never mutate existing DVC-tracked datasets in-place; always re-run downstream ML stages explicitly; always validate that GE contracts pass on backfilled data.

- Ingestion Pipeline: where data originates
- Canonical Datasets & Lineage: dataset schemas
- Architecture: Data & ML Flow (gate positions in the pipeline)
- Architecture: Failure Modes (what happens when a gate fails)