Model Registry & Promotion

Status: ✅ Implemented — Registration and candidate promotion automated via DVC pipeline; candidate → champion gate is manual.

Purpose

Document how models move from training into serving, what the lifecycle stages mean, how promotion is gated, and how rollback works.

Role of the registry

The MLflow Model Registry is the single handoff point between training and serving. The serving layer (PredictionService) loads the model from the registry by name and alias. No model reaches production without passing through this boundary.

Alias scheme (4 levels)

Alias	Meaning	Gate	Who sets it	CI job
`ci-smoke`	Toy model (`frac=0.001`, `n_trials=2`) — pipeline wiring check only; never used by serving	None — always assigned	`register_model` DVC stage	`train:smoke`
`smoke`	Real-data model; full feature set, reduced trials — lifecycle entry point	None — always assigned	`register_model` DVC stage	`train:test`
`candidate`	Passed quality gate; ready for manual review	`final.logloss ≤ current_candidate + tolerance` (`promote_model.tolerance`)	`promote_model` DVC stage	`train:test`
`champion`	Currently serving live predictions	Manual sign-off (see Promotion Policy)	Developer / scheduled DAG	—

ci-smoke is set by the experiment=smoke Hydra overlay (conf/experiment/smoke.yaml). All other aliases use the base config (conf/config.yaml).

A single model version can carry multiple aliases simultaneously (e.g. a new version becomes smoke immediately, then candidate after the gate passes, then champion after manual review).
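As a rough picture of this alias model, the sketch below treats the registry as a plain alias-to-version mapping and lists every alias a given version holds. It is not MLflow's API; the version numbers and the helper name `aliases_for_version` are made up for illustration.

```python
# Illustrative only: alias -> model version, roughly as a registry would track it.
# All version numbers here are hypothetical.
aliases = {"ci-smoke": 3, "smoke": 7, "candidate": 7, "champion": 5}


def aliases_for_version(alias_map: dict[str, int], version: int) -> list[str]:
    """Return every alias currently pointing at `version`, sorted by name."""
    return sorted(name for name, v in alias_map.items() if v == version)


# Version 7 has entered the lifecycle and passed the candidate gate,
# while an older version is still the champion.
print(aliases_for_version(aliases, 7))
```

Each alias points at exactly one version, so moving `champion` to version 7 would simply overwrite the `champion` entry and leave the other aliases untouched.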

Stage 1: Registration (`register_model`)

The register_model DVC stage is the final automated step after final_train. It reads data/models/final_run_id.json, creates or updates the registered model, and assigns the initial alias to the new version:

ci-smoke — when run from train:smoke CI job (experiment=smoke, toy data)
smoke — when run from train:test CI job (real data, lifecycle entry point)

This operation is idempotent: re-running with the same run ID is safe.

The model name and initial alias are controlled via params.yaml (register_model.model_name, register_model.model_stage).
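One way such an idempotent registration step could look is sketched below. This is not the project's actual stage code: the helper names, the `"model"` artifact path, and the `{"run_id": ...}` JSON layout are assumptions. `client` is expected to behave like `mlflow.MlflowClient`.

```python
import json
from pathlib import Path


def read_run_id(path: str = "data/models/final_run_id.json") -> str:
    # Assumption: the file holds {"run_id": "..."}; the real schema may differ.
    return json.loads(Path(path).read_text())["run_id"]


def register_once(client, model_name: str, run_id: str, alias: str,
                  artifact_path: str = "model") -> str:
    """Register the run's model if needed, then point `alias` at that version."""
    # Create the registered model on first use.
    if not client.search_registered_models(filter_string=f"name='{model_name}'"):
        client.create_registered_model(model_name)

    # Reuse the version already created from this run, if any (idempotency).
    versions = client.search_model_versions(f"name='{model_name}'")
    match = next((mv for mv in versions if mv.run_id == run_id), None)
    if match is None:
        match = client.create_model_version(
            name=model_name,
            source=f"runs:/{run_id}/{artifact_path}",  # hypothetical artifact path
            run_id=run_id,
        )

    # Re-pointing an alias at the same version is a no-op.
    client.set_registered_model_alias(model_name, alias, match.version)
    return str(match.version)
```

Looking up an existing version by run ID before creating one is what makes a second run with the same run ID harmless in this sketch.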

Stage 2: Candidate promotion (`promote_model`)

The promote_model DVC stage runs after register_model. It fetches final.logloss for the new version and compares it to the current candidate alias:

new_logloss ≤ current_candidate_logloss + tolerance  →  sets 'candidate' alias
new_logloss  > current_candidate_logloss + tolerance  →  logs warning, no change

If no current candidate exists (fresh registry), promotion always proceeds. A gate failure does not fail the DVC pipeline — it is an expected outcome.

Parameters (in params.yaml under promote_model):

Parameter	Default	Meaning
`metric`	`final.logloss`	MLflow metric key to compare
`tolerance`	`0.005`	Max allowed degradation vs current candidate
`candidate_alias`	`candidate`	Alias name to assign on pass

Result is written to data/models/promoted_model.json.
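The gate rule for this stage reduces to a single comparison. A minimal sketch follows, assuming the default `tolerance` from the parameter table; the function name `passes_candidate_gate` is illustrative and not taken from the codebase.

```python
from typing import Optional


def passes_candidate_gate(new_logloss: float,
                          current_logloss: Optional[float],
                          tolerance: float = 0.005) -> bool:
    """Return True if the new version may take the `candidate` alias."""
    if current_logloss is None:
        # Fresh registry: there is no candidate to compare against.
        return True
    # Lower log-loss is better; allow at most `tolerance` degradation.
    return new_logloss <= current_logloss + tolerance
```

With a current candidate at 0.980, a new version at 0.981 passes, while one at 0.990 does not. In the second case the stage would only log a warning and leave the alias where it is.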

Stage 3: Champion promotion (manual)

Promoting from candidate to champion requires manual approval. This is a deliberate quality gate — not a missing feature.

Hard gates (all three must pass)

Gate	Threshold	Rationale
Log-loss ≤ `logreg_full` baseline	log-loss must not exceed logistic regression on full features	Ensures the proposed champion is at least as good as the simplest meaningful baseline
Brier ≤ current champion − 0.002	challenger Brier at least 0.002 lower than champion	Forces a meaningful improvement, not just noise
ECE ≤ 0.05	calibration error ≤ 5%	Ensures probabilities are usable for downstream betting simulation and API consumers

# Inspect a run's logged metrics with the MLflow CLI (repeat for the candidate and champion run IDs)
mlflow runs describe --run-id <final_run_id>

Key metrics: final.logloss, final.brier, final.ece.
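Once the three metrics have been read for both runs, the hard gates can be checked mechanically before moving on to the manual review. The sketch below is an illustration under assumptions: the `Metrics` container and the function name are hypothetical, and the thresholds are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    logloss: float  # final.logloss
    brier: float    # final.brier
    ece: float      # final.ece


def champion_hard_gates(candidate: Metrics, champion: Metrics,
                        logreg_full_logloss: float) -> dict[str, bool]:
    """Evaluate each hard gate; all values must be True to proceed."""
    return {
        # Must not be worse than the logistic-regression baseline.
        "logloss_vs_baseline": candidate.logloss <= logreg_full_logloss,
        # Must beat the current champion's Brier score by at least 0.002.
        "brier_vs_champion": candidate.brier <= champion.brier - 0.002,
        # Calibration error must stay at or below 5%.
        "ece": candidate.ece <= 0.05,
    }
```

Returning a per-gate dictionary instead of a single boolean makes it easy to see which gate blocked a promotion; `all(result.values())` gives the overall verdict.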

Manual review checklist

[ ] All hard gates pass (logloss, brier, ECE)
[ ] Calibration curve artifact (calibration_curves.png) — no severe over/under-confidence in any class
[ ] Confusion matrix (confusion_matrix_final.png) — no unexpected degradation on specific outcome classes
[ ] Segment metrics (segment_metrics.csv) — no severe performance drop in major leagues vs previous champion
[ ] Error analysis report (reports/error_analysis/) — no newly introduced systematic error pattern
[ ] Holdout slice is strictly 2024+ (no train/test contamination)
[ ] Calibration used temporal split (not random split)
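The holdout item in the checklist can be partly verified with a small date check. The following is a sketch under stated assumptions: it reads "2024+" as "on or after 2024-01-01", it assumes match dates are available as `datetime.date` values, and the function name is hypothetical.

```python
from datetime import date


def holdout_is_clean(train_dates: list[date], holdout_dates: list[date],
                     holdout_start: date = date(2024, 1, 1)) -> bool:
    """True if the holdout starts at `holdout_start` or later and follows all training data."""
    if not train_dates or not holdout_dates:
        return False  # nothing to verify; treat as a failed check
    earliest_holdout = min(holdout_dates)
    # No holdout match before the cutoff, and no training match at or after
    # the first holdout match (which would leak future information).
    return earliest_holdout >= holdout_start and max(train_dates) < earliest_holdout
```

A check like this only covers the date boundary; whether calibration used a temporal split still has to be confirmed from the pipeline configuration.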

Why `champion` promotion is NOT automated

Dataset size — the football match dataset is small enough that a single bad season can produce gate-passing metrics by chance. Human review of segment metrics provides a sanity check that automated thresholds cannot.
Calibration matters for downstream use — the API serves raw probabilities. A model with ECE just below 0.05 but with structural miscalibration in specific outcome classes should be caught in manual review.
Operational safety — for a portfolio project with a live demo, a human checkpoint before swapping the champion model is appropriate.

Promotion procedure

Verify all hard gates pass.
Complete the manual review checklist.
Run:

import mlflow
client = mlflow.MlflowClient()
candidate = client.get_model_version_by_alias("soccer-match-outcome", "candidate")
client.set_registered_model_alias("soccer-match-outcome", "champion", candidate.version)
print(f"champion → version {candidate.version}")

Restart the serving worker to reload the new model:

kubectl rollout restart deployment/soccer-worker -n soccer

Monitor the first 24h in Prometheus for latency regressions.

Serving coupling

PredictionService in src/app/services/predict.py loads the model as:

mlflow.pyfunc.load_model(f"models:/{model_name}@{stage}")

stage is read from settings.mlflow.model_stage (inference.model_stage in params.yaml). Currently set to smoke (CI / dev environment). Change to candidate for staging, champion for production.

Changing the alias in the registry takes effect on the next worker restart; no redeployment is needed. Each worker process lazy-loads the model once and caches it in memory, so a running worker keeps serving the version it loaded until it is restarted.
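The once-per-process loading behaviour can be pictured with a small wrapper. This is a simplified sketch and not the actual `PredictionService` code: the class name `LazyModel` is hypothetical, and the loader is injected so the example runs without a registry.

```python
from typing import Any, Callable, Optional


class LazyModel:
    """Resolve `models:/<name>@<alias>` on first use and keep the result in memory."""

    def __init__(self, model_name: str, alias: str, loader: Callable[[str], Any]):
        self.uri = f"models:/{model_name}@{alias}"
        self._loader = loader
        self._model: Optional[Any] = None

    def get(self) -> Any:
        # The alias is resolved only on the first call; later alias changes in
        # the registry are not seen until the process restarts.
        if self._model is None:
            self._model = self._loader(self.uri)
        return self._model
```

In a real worker the loader would presumably be `mlflow.pyfunc.load_model`, for example `LazyModel("soccer-match-outcome", "champion", mlflow.pyfunc.load_model)`. The sketch ignores thread safety, which a multi-threaded worker would need to handle.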

Rollback

Rollback is a registry operation only — no retraining required:

Re-assign the champion alias to any previous version in the MLflow UI or CLI.
Restart workers (or wait for cache invalidation).

This is safe because all prior model versions remain stored as MLflow artifacts in MinIO.
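For a scripted rollback, the alias reassignment might look like the sketch below. The function name is hypothetical, and `client` is assumed to behave like `mlflow.MlflowClient`.

```python
def rollback_champion(client, model_name: str, target_version: str,
                      alias: str = "champion") -> tuple[str, str]:
    """Point `alias` back at `target_version`; return (previous, new) versions."""
    previous = client.get_model_version_by_alias(model_name, alias).version
    # Fail early if the target version does not exist in the registry.
    client.get_model_version(model_name, target_version)
    client.set_registered_model_alias(model_name, alias, target_version)
    return str(previous), str(target_version)
```

Against a real registry this would be called as `rollback_champion(mlflow.MlflowClient(), "soccer-match-outcome", "<version>")`, followed by a worker restart. Keeping the returned previous version in a log makes it straightforward to undo the rollback itself.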

Implementation status

Aspect	Status
Automated registration (`ci-smoke` / `smoke` alias) via `register_model` DVC stage	✅ Implemented
Automated candidate gate (`candidate` alias) via `promote_model` DVC stage	✅ Implemented
Manual `candidate → champion` gate	📋 Manual approval required
Rollback via registry alias reassignment	✅ Supported

MLflow — experiment tracking
Model Contract — breaking change policy
Training Pipeline — pipeline stage sequence
Serving — how the serving layer loads models