
Data Contracts & Versioning

This page covers two complementary concerns that together guarantee reproducibility and correctness for the ML pipeline:

  1. Contracts — what "valid data" means at each stage boundary (Great Expectations)
  2. Versioning — how datasets are content-addressed and experiments made reproducible (DVC)

Data Contracts & Quality Gates

Purpose

Given an unstable upstream source, data validity cannot be assumed. Data contracts formalize what "valid data" means at each stage boundary. A failed contract stops the pipeline — no downstream stage runs on data that has not passed its gate.

Status: ✅ Implemented — all four GE suites are active DVC stage gates.

Implemented gates

| DVC stage | Suite module | Dataset validated |
|---|---|---|
| validate_raw | src/data_quality/raw.py | data/raw/match_raw.parquet |
| validate_finished | src/data_quality/finished.py | data/interim/finished.parquet |
| validate_future | src/data_quality/future.py | data/interim/future.parquet |
| validate_features | src/data_quality/features.py | data/features/features.parquet |

What each suite checks

validate_raw validates match_raw.parquet before any preprocessing:

  • Required columns present: id, homeTeamId, awayTeamId, startTimeUtc, regionId, tournamentId, seasonId, stageId, sex, status
  • Not-null on: id, homeTeamId, awayTeamId, startTimeUtc, status
  • startTimeUtc in range: 1990-01-01 to 2035-12-31
  • status values from the known API code set
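The raw-stage checks listed above can be sketched in plain Python. This is only an illustration of the contract's logic, not the Great Expectations suite in src/data_quality/raw.py: the function name check_raw, the row-as-dict representation, the assumption that startTimeUtc is a datetime, and the known_statuses argument (the real API code set is not listed on this page) are all hypothetical.

```python
from datetime import date, datetime

REQUIRED_COLUMNS = {
    "id", "homeTeamId", "awayTeamId", "startTimeUtc", "regionId",
    "tournamentId", "seasonId", "stageId", "sex", "status",
}
NOT_NULL_COLUMNS = ("id", "homeTeamId", "awayTeamId", "startTimeUtc", "status")
MIN_DATE = date(1990, 1, 1)
MAX_DATE = date(2035, 12, 31)


def check_raw(rows, known_statuses):
    """Return a list of violations; an empty list means the contract passed.

    rows: iterable of dicts, one per match record (assumed representation).
    known_statuses: set of accepted status codes (supplied by the caller).
    """
    violations = []
    for i, row in enumerate(rows):
        missing = REQUIRED_COLUMNS - set(row)
        if missing:
            violations.append(f"row {i}: missing columns {sorted(missing)}")
            continue  # later checks need every column to exist
        nulls = [c for c in NOT_NULL_COLUMNS if row[c] is None]
        if nulls:
            violations.append(f"row {i}: null in {nulls}")
            continue  # later checks need non-null values
        start: datetime = row["startTimeUtc"]
        if not MIN_DATE <= start.date() <= MAX_DATE:
            violations.append(f"row {i}: startTimeUtc out of range")
        if row["status"] not in known_statuses:
            violations.append(f"row {i}: unknown status {row['status']!r}")
    return violations
```

Columns that are required but not in the not-null list (for example regionId) may hold nulls without producing a violation, which mirrors the distinction the suite draws between presence and completeness.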

validate_finished / validate_future validate the preprocessing output:

  • Schema integrity after column stripping and type casting
  • outcome_1x2 ∈ {0, 1, 2} for finished matches
  • No future-match rows in the finished set (temporal constraint)
  • Score columns within plausible clipped range

validate_features validates the engineered feature matrix:

  • Rate columns (win_mean, draw_mean, loss_mean) bounded within [0.0, 1.0] with mostly=0.99
  • Coverage columns in [0.0, 1.0]
  • Goals rolling averages non-negative
  • H2H columns allow high null rates (many team pairs have no head-to-head history)
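To make the mostly=0.99 threshold concrete, here is a minimal sketch of how such a tolerance can be evaluated. It assumes the fraction is computed over non-null values only, which is how Great Expectations documents the mostly argument. The helper names are hypothetical and the code does not call Great Expectations.

```python
import math


def fraction_within(values, low, high):
    """Fraction of non-null values inside [low, high]; None and NaN are ignored."""
    non_null = [v for v in values if v is not None and not math.isnan(v)]
    if not non_null:
        return 1.0  # no values, so nothing can violate the bound
    hits = sum(1 for v in non_null if low <= v <= high)
    return hits / len(non_null)


def passes_mostly(values, low=0.0, high=1.0, mostly=0.99):
    """True when at least `mostly` of the non-null values fall inside the bounds."""
    return fraction_within(values, low, high) >= mostly
```

Under this reading, a rate column with one out-of-range value in a hundred still passes, while two in a hundred fails the gate.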

Blocking vs non-blocking

All currently implemented checks are blocking: any expectation failure causes the DVC stage to exit non-zero, stopping the pipeline.
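A blocking gate comes down to the exit code of the stage command: DVC treats a non-zero exit as a failed stage and stops the run. The sketch below shows one way a stage script could map validation results to an exit code. The function name run_gate and the shape of the violations list are assumptions for illustration, not the project's actual entry point.

```python
import sys


def run_gate(violations, stage_name):
    """Return the exit code for a gate: 0 when clean, 1 when any check failed."""
    if violations:
        for violation in violations:
            print(f"[{stage_name}] FAILED: {violation}", file=sys.stderr)
        return 1
    print(f"[{stage_name}] all checks passed")
    return 0


# In a stage script the result would be handed to the interpreter, e.g.:
#     sys.exit(run_gate(violations, "validate_raw"))
```

Returning the code rather than calling sys.exit inside the function keeps the gate logic testable without terminating the process.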

Statistical distribution-drift checks are 📋 Planned: Evidently integration for drift monitoring is not implemented yet (see Monitoring).

Contract as code

GE suite modules in src/data_quality/ are:

  • versioned in Git alongside pipeline code,
  • deterministic pure functions (no IO, no side effects),
  • invoked by DVC stages in dvc.yaml.

Schema drift consequence

When WhoScored changes its output structure:

  1. Scraping produces records that fail the validate_raw suite.
  2. dvc repro stops at the validate_raw stage.
  3. No downstream processing, feature engineering, or training runs.
  4. The operator reviews the change and either updates the GE suite or fixes the scraper.

Contract ownership

  • Contracts live in src/data_quality/ and are reviewed as code changes.
  • A contract change that widens or removes a check must be justified.
  • Breaking schema changes (column renames, type changes) require updating both the contract suite and downstream preprocessing code before dvc repro.

Dataset Versioning & Reproducibility

What is versioned and by what tool

| Artifact | Tool | Where stored | What is tracked |
|---|---|---|---|
| Raw parquet snapshots (data/raw/) | DVC | MinIO (S3-compatible) | Content hash → Git .dvc pointer |
| Interim datasets (data/interim/) | DVC | MinIO | Content hash → Git .dvc pointer |
| Feature matrix (data/features/) | DVC | MinIO | Content hash → Git .dvc pointer |
| Pipeline stage definitions | Git | Repository | dvc.yaml stage graph |
| Parameters (params.yaml) | Git | Repository | Input to DVC stages; tracked as dep |
| ML experiment runs | MLflow | MLflow Tracking Server | Metrics, params, artifacts per run |
| Registered models | MLflow Registry | MLflow + MinIO | Model artifacts under versioned alias |

DVC manages data. Git manages code, config, and DVC pointer files. MLflow manages experiment runs and model lifecycle.

How DVC versioning works

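Every parquet file tracked by DVC has its content hash recorded in a pointer file committed to Git: a .dvc file for data added directly, or dvc.lock for outputs of stages defined in dvc.yaml. The MinIO remote location is configured separately in the DVC config, not in the pointer. When dvc repro reaches a stage:

  1. It computes hashes of the stage's declared dependencies, tracked parameters, and outputs.
  2. If they all match the hashes recorded in Git, the stage is skipped.
  3. If any hash differs (or is missing), the stage re-runs and the recorded hashes are updated.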

A given Git commit uniquely identifies the dataset state for that pipeline run.
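The idea behind content addressing can be shown with a short sketch: hash the bytes of each tracked file and compare the result with what was recorded earlier. This is not DVC's implementation. The function names are hypothetical, and MD5 is used only because it is DVC's default file hash.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Hash a file's contents in chunks so large parquet files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_rerun(recorded, current):
    """Compare two {path: hash} mappings; any missing or changed entry forces a re-run."""
    return recorded != current
```

Because the hash depends only on file contents, renaming or re-downloading an identical file does not change its identity, while a single changed byte does.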

Reproducibility guarantee

Given a Git commit hash (which pins the code, dvc.yaml, params.yaml, and the DVC pointer files) and access to the MinIO remote, any past experiment can be reproduced:

git checkout <commit>
dvc pull
dvc repro

This holds because:

  • all pipeline inputs (parquet files, params) are content-addressed,
  • all feature and preprocessing logic is deterministic (pure functions, no random state in data stages),
  • all randomness in model training is seeded via params.yaml.
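The last point, seeding training randomness from params.yaml, might look like the sketch below. It assumes the parameters have already been parsed into a dict and that the seed lives under a train.seed key. Both the key layout and the helper names are hypothetical, since this page does not show the actual params.yaml structure.

```python
import random


def make_rng(params):
    """Build a dedicated random generator from the seed in the parsed params."""
    seed = params["train"]["seed"]  # hypothetical key layout
    return random.Random(seed)


def shuffled_indices(n, params):
    """Return a permutation of range(n) that is identical for identical params."""
    rng = make_rng(params)
    indices = list(range(n))
    rng.shuffle(indices)
    return indices
```

Using a dedicated generator instead of the module-level random state keeps the result independent of any other code that draws random numbers in the same process.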

How DVC and MLflow connect

Each MLflow experiment run records the DVC dataset version, training parameters, and metrics. A registered model can therefore be traced back to the exact data it was trained on by following this chain:

MLflow model version
    → MLflow run ID
    → params.yaml commit
    → DVC .dvc pointer files at that commit
    → MinIO content-addressed parquet
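One way to make this chain queryable is to attach the Git commit and dataset hashes to each run as tags. The sketch below only builds the tag mapping. The tag names and the lineage_tags helper are assumptions, and this page does not state how the project actually records the dataset version.

```python
def lineage_tags(git_commit, data_hashes):
    """Flatten provenance into string tags: one for the commit, one per dataset hash."""
    tags = {"git_commit": git_commit}
    for path, digest in sorted(data_hashes.items()):
        tags[f"data_hash.{path}"] = digest
    return tags


# With MLflow installed and a run active, the mapping could be attached with:
#     mlflow.set_tags(lineage_tags(commit, hashes))
```

Keeping the tag construction separate from the MLflow call means the lineage record can be inspected or tested without a tracking server.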

Restore semantics

# Restore to a specific git commit
git checkout <commit>
dvc pull               # download exact data matching that commit from MinIO

# Verify pipeline state
dvc status             # should report "Data and pipelines are up to date"

# Reproduce the full pipeline from that point
dvc repro

What is NOT versioned by DVC

  • PostgreSQL data — the raw parquet export is the version boundary.
  • GE validation reports — HTML artifacts are produced per run but not tracked as versioned datasets.
  • Metadata JSON files (data/metadata/) — regenerated on each preprocessing run.

Backfills

Backfills are an operator-driven manual process. See Runbook: Backfills for the full procedure and safety invariants.

A backfill changes data in PostgreSQL. On the next dvc repro:

  1. load_data_from_sources produces a parquet file with a different content hash.
  2. DVC updates the .dvc pointer.
  3. Downstream stages re-run.
  4. A new MLflow experiment run is produced.

The previous DVC version remains in MinIO and serves as the rollback path.

Safety invariants: never mutate existing DVC-tracked datasets in-place; always re-run downstream ML stages explicitly; always validate that GE contracts pass on backfilled data.