Data Contracts & Versioning¶
This page covers two complementary concerns that together guarantee reproducibility and correctness for the ML pipeline:
- Contracts — what "valid data" means at each stage boundary (Great Expectations)
- Versioning — how datasets are content-addressed and experiments made reproducible (DVC)
Data Contracts & Quality Gates¶
Purpose¶
Given an unstable upstream source, data validity cannot be assumed. Data contracts formalize what "valid data" means at each stage boundary. A failed contract stops the pipeline — no downstream stage runs on data that has not passed its gate.
Status: ✅ Implemented — all four GE suites are active DVC stage gates.
Implemented gates¶
| DVC stage | Suite module | Dataset validated |
|---|---|---|
| validate_raw | src/data_quality/raw.py | data/raw/match_raw.parquet |
| validate_finished | src/data_quality/finished.py | data/interim/finished.parquet |
| validate_future | src/data_quality/future.py | data/interim/future.parquet |
| validate_features | src/data_quality/features.py | data/features/features.parquet |
What each suite checks¶
validate_raw — validates match_raw.parquet before any preprocessing:
- Required columns present: id, homeTeamId, awayTeamId, startTimeUtc, regionId, tournamentId, seasonId, stageId, sex, status
- Not-null on: id, homeTeamId, awayTeamId, startTimeUtc, status
- startTimeUtc in range: 1990-01-01 to 2035-12-31
- status values from the known API code set
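The four checks above can be sketched in plain Python. This illustrates the contract's logic only and is not the Great Expectations suite in src/data_quality/raw.py: the function name `check_raw`, the rows-as-dicts representation, and the codes in `KNOWN_STATUS` are assumptions made for the example.

```python
from datetime import datetime, timezone

REQUIRED = ["id", "homeTeamId", "awayTeamId", "startTimeUtc", "regionId",
            "tournamentId", "seasonId", "stageId", "sex", "status"]
NOT_NULL = ["id", "homeTeamId", "awayTeamId", "startTimeUtc", "status"]
MIN_START = datetime(1990, 1, 1, tzinfo=timezone.utc)
MAX_START = datetime(2035, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
KNOWN_STATUS = {1, 6}  # hypothetical codes; the real set comes from the API


def check_raw(rows):
    """Return a list of human-readable violations (empty list means pass)."""
    errors = []
    for i, row in enumerate(rows):
        missing = [c for c in REQUIRED if c not in row]
        if missing:
            errors.append(f"row {i}: missing columns {missing}")
            continue
        nulls = [c for c in NOT_NULL if row[c] is None]
        if nulls:
            errors.append(f"row {i}: null in {nulls}")
            continue
        if not MIN_START <= row["startTimeUtc"] <= MAX_START:
            errors.append(f"row {i}: startTimeUtc out of range")
        if row["status"] not in KNOWN_STATUS:
            errors.append(f"row {i}: unknown status {row['status']!r}")
    return errors
```

An empty return value corresponds to a passing gate; any entry corresponds to a failed expectation.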
validate_finished / validate_future — validates preprocessing output:
- Schema integrity after column stripping and type casting
- outcome_1x2 ∈ {0, 1, 2} for finished matches
- No future-match rows in the finished set (temporal constraint)
- Score columns within plausible clipped range
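Two of these checks, the outcome domain and the temporal constraint, are sketched below in plain Python. It assumes that a `startTimeUtc` column survives preprocessing and that rows are dicts; `check_finished` is a hypothetical name, not the project's suite. The reference time is passed in explicitly so the function stays deterministic.

```python
from datetime import datetime, timezone

VALID_OUTCOMES = {0, 1, 2}


def check_finished(rows, now):
    """Return violations for the finished set; an empty list means pass."""
    errors = []
    for i, row in enumerate(rows):
        if row["outcome_1x2"] not in VALID_OUTCOMES:
            errors.append(f"row {i}: outcome_1x2={row['outcome_1x2']!r}")
        # Temporal constraint: a finished match cannot start after `now`.
        if row["startTimeUtc"] > now:
            errors.append(f"row {i}: kickoff is in the future")
    return errors
```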
validate_features — validates the engineered feature matrix:
- Rate columns (win_mean, draw_mean, loss_mean) bounded within [0.0, 1.0] with mostly=0.99
- Coverage columns in [0.0, 1.0]
- Goals rolling averages non-negative
- H2H columns allow high null rates (many team pairs have no head-to-head history)
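In Great Expectations, `mostly` is the minimum fraction of non-null values that must satisfy an expectation. The plain-Python sketch below mimics that rule for a bounded rate column. It is a simplified stand-in written for illustration, and `within_bounds_mostly` is a hypothetical helper, not a GE call.

```python
def within_bounds_mostly(values, low, high, mostly=1.0):
    """True if at least `mostly` of the non-null values lie in [low, high]."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return True  # nothing to evaluate, so nothing can fail
    passing = sum(1 for v in non_null if low <= v <= high)
    return passing / len(non_null) >= mostly
```

With `mostly=0.99`, one out-of-range value in a hundred is tolerated and two are not. Nulls are excluded from the ratio, which is why sparse H2H columns can still pass a bounds check.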
Blocking vs non-blocking¶
All currently implemented checks are blocking: any expectation failure causes the DVC stage to exit non-zero, stopping the pipeline.
Distribution drift checks (statistical) are 📋 Planned. Evidently integration for drift monitoring is a planned capability — see Monitoring.
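A blocking gate comes down to mapping expectation results to a process exit code, since DVC treats a non-zero exit as a failed stage. A minimal sketch, assuming the results are available as a name-to-boolean mapping (the function name and that shape are illustrative, not taken from the project):

```python
import sys


def gate_exit_code(results):
    """Map expectation results (name -> passed?) to a process exit code."""
    failed = sorted(name for name, ok in results.items() if not ok)
    for name in failed:
        print(f"FAILED: {name}", file=sys.stderr)
    return 1 if failed else 0
```

A stage entry point would pass the returned value to `sys.exit`, so that a single failed expectation stops `dvc repro`.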
Contract as code¶
GE suite modules in src/data_quality/ are:
- versioned in Git alongside pipeline code,
- deterministic pure functions (no IO, no side effects),
- invoked by DVC stages in dvc.yaml.
Schema drift consequence¶
When WhoScored changes its output structure:
1. Scraping produces records that fail the validate_raw suite.
2. dvc repro stops at the validate_raw stage.
3. No downstream processing, feature engineering, or training runs.
4. Operator reviews the change and either updates the GE suite or fixes the scraper.
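For step 4, comparing the expected and observed column sets is often enough to see what changed upstream. The helper below is a hypothetical sketch for that review, not part of the project:

```python
def schema_diff(expected, actual):
    """Compare expected vs. observed column names."""
    expected, actual = set(expected), set(actual)
    return {
        "missing": sorted(expected - actual),
        "unexpected": sorted(actual - expected),
    }
```

A rename shows up as one entry on each side. For example, if the source started emitting `home_team_id` instead of `homeTeamId`, the diff would list `homeTeamId` as missing and `home_team_id` as unexpected.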
Contract ownership¶
- Contracts live in src/data_quality/ and are reviewed as code changes.
- A contract change that widens or removes a check must be justified.
- Breaking schema changes (column renames, type changes) require updating both the contract suite and downstream preprocessing code before dvc repro.
Dataset Versioning & Reproducibility¶
What is versioned and by what tool¶
| Artifact | Tool | Where stored | What is tracked |
|---|---|---|---|
| Raw parquet snapshots (data/raw/) | DVC | MinIO (S3-compatible) | Content hash → Git .dvc pointer |
| Interim datasets (data/interim/) | DVC | MinIO | Content hash → Git .dvc pointer |
| Feature matrix (data/features/) | DVC | MinIO | Content hash → Git .dvc pointer |
| Pipeline stage definitions | Git | Repository | dvc.yaml stage graph |
| Parameters (params.yaml) | Git | Repository | Input to DVC stages; tracked as dep |
| ML experiment runs | MLflow | MLflow Tracking Server | Metrics, params, artifacts per run |
| Registered models | MLflow Registry | MLflow + MinIO | Model artifacts under versioned alias |
DVC manages data. Git manages code, config, and DVC pointer files. MLflow manages experiment runs and model lifecycle.
How DVC versioning works¶
Every parquet file tracked by DVC has a corresponding .dvc pointer file in Git recording the content hash and the MinIO remote path. When a DVC stage runs:
- It computes a hash of its declared outputs.
- If the hash matches what is in the Git-tracked .dvc file, the stage is skipped.
- If the hash differs (or is missing), the stage re-runs and updates the .dvc file.
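The hash comparison can be illustrated with the standard library. DVC's default content hash is MD5. The sketch below simplifies what DVC does internally, and `file_md5` and `is_up_to_date` are hypothetical names used only here.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Hash a file's bytes in chunks so large parquet files fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_up_to_date(path, recorded_hash):
    """True if the file's current content matches the recorded hash."""
    return file_md5(path) == recorded_hash
```

Because the hash depends only on the bytes, rewriting a file with identical content leaves it "up to date", while any change to the data produces a new version.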
A given Git commit uniquely identifies the dataset state for that pipeline run.
Reproducibility guarantee¶
Given a Git commit hash (which pins the code, params.yaml, and the .dvc pointer files) and access to the MinIO remote, any past experiment can be reproduced.
This holds because:
- all pipeline inputs (parquet files, params) are content-addressed,
- all feature and preprocessing logic is deterministic (pure functions, no random state in data stages),
- all randomness in model training is seeded via params.yaml.
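The last point depends on every random step taking its seed explicitly. A minimal sketch of a seeded train/test split, assuming the seed is read from params.yaml by the caller (the function and its parameters are illustrative, not the project's training code):

```python
import random


def split_ids(ids, test_fraction, seed):
    """Shuffle with a private, seeded RNG and split into train/test lists."""
    rng = random.Random(seed)      # local RNG: no dependence on global state
    shuffled = sorted(ids)         # canonical order before shuffling
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

Sorting before shuffling makes the result independent of the input order, so the same seed and the same set of ids always give the same split.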
How DVC and MLflow connect¶
Each MLflow experiment run records the DVC dataset version, training parameters, and metrics. A registered model can therefore be traced back to the exact data it was trained on:
MLflow model version
→ MLflow run ID
→ params.yaml commit
→ DVC .dvc pointer files at that commit
→ MinIO content-addressed parquet
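One way to make this chain queryable is to attach the Git commit and dataset hashes to each run as tags. The helper below only builds the tag dictionary. The tag names are assumptions, and how the project actually records the dataset version is not shown on this page.

```python
def lineage_tags(git_commit, dataset_hashes):
    """Build flat string tags linking a training run to its code and data."""
    tags = {"git_commit": git_commit}
    for path, content_hash in sorted(dataset_hashes.items()):
        tags[f"dvc.{path}"] = content_hash
    return tags
```

Inside an active run, such a dictionary could be passed to `mlflow.set_tags`.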
Restore semantics¶
```bash
# Restore to a specific git commit
git checkout <commit>
dvc pull     # download exact data matching that commit from MinIO

# Verify pipeline state
dvc status   # should report "Data and pipelines are up to date"

# Reproduce the full pipeline from that point
dvc repro
```
What is NOT versioned by DVC¶
- PostgreSQL data: the raw parquet export is the version boundary.
- GE validation reports: HTML artifacts are produced per run but not tracked as versioned datasets.
- Metadata JSON files (data/metadata/): regenerated on each preprocessing run.
Backfills¶
Backfills are an operator-driven manual process. See Runbook: Backfills for the full procedure and safety invariants.
A backfill changes data in PostgreSQL. On the next dvc repro, load_data_from_sources produces a parquet file with a different content hash, so DVC updates the .dvc pointer, downstream stages re-run, and a new MLflow experiment run is produced. The previous DVC version remains in MinIO and serves as the rollback path.
Safety invariants: never mutate existing DVC-tracked datasets in-place; always re-run downstream ML stages explicitly; always validate that GE contracts pass on backfilled data.
Related¶
- Ingestion Pipeline — where data originates
- Canonical Datasets & Lineage — dataset schemas
- Architecture: Data & ML Flow — gate positions in the pipeline
- Architecture: Failure Modes — what happens when a gate fails