
Model Registry & Promotion

Status: ✅ Implemented — Registration and candidate promotion automated via DVC pipeline; candidate → champion gate is manual.

Purpose

Document how models move from training into serving, what the lifecycle stages mean, how promotion is gated, and how rollback works.


Role of the registry

The MLflow Model Registry is the single handoff point between training and serving. The serving layer (PredictionService) loads the model from the registry by name and alias. No model reaches production without passing through this boundary.


Alias scheme (4 levels)

| Alias | Meaning | Gate | Who sets it | CI job |
|---|---|---|---|---|
| ci-smoke | Toy model (frac=0.001, n_trials=2); pipeline wiring check only, never used by serving | None (always assigned) | register_model DVC stage | train:smoke |
| smoke | Real-data model with full feature set and reduced trials; lifecycle entry point | None (always assigned) | register_model DVC stage | train:test |
| candidate | Passed quality gate; ready for manual review | final.logloss ≤ current candidate + tolerance | promote_model DVC stage | train:test |
| champion | Currently serving live predictions | Manual sign-off (see Promotion Policy) | Developer / scheduled DAG | |

ci-smoke is set by the experiment=smoke Hydra overlay (conf/experiment/smoke.yaml). All other aliases use the base config (conf/config.yaml).

A single model version can carry multiple aliases simultaneously (e.g. a new version becomes smoke immediately, then candidate after the gate passes, then champion after manual review).
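The relationship between aliases and versions can be sketched by inverting an alias-to-version mapping so that each version lists the aliases it carries. This is only an illustration: it assumes the mapping has the shape MLflow exposes as `RegisteredModel.aliases`, the helper name `aliases_by_version` is hypothetical, and the version numbers are made up.

```python
from collections import defaultdict


def aliases_by_version(alias_to_version):
    """Group aliases by the model version they point to."""
    grouped = defaultdict(list)
    for alias, version in alias_to_version.items():
        grouped[str(version)].append(alias)
    # Sort alias names so the output is stable and easy to read.
    return {version: sorted(names) for version, names in grouped.items()}


# Hypothetical registry state: version 7 entered as smoke and then passed
# the candidate gate, while version 5 is still the champion.
example = {"smoke": "7", "candidate": "7", "champion": "5", "ci-smoke": "6"}
print(aliases_by_version(example))
```

Against a live registry, the input would presumably come from `mlflow.MlflowClient().get_registered_model("soccer-match-outcome").aliases`.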


Stage 1: Registration (register_model)

The register_model DVC stage is the final automated step after final_train. It reads data/models/final_run_id.json, creates or updates the registered model, and assigns the initial alias to the new version:

  • ci-smoke: when run from the train:smoke CI job (experiment=smoke, toy data)
  • smoke: when run from the train:test CI job (real data, lifecycle entry point)

This operation is idempotent: re-running with the same run ID is safe.

The model name and initial alias are controlled via params.yaml (register_model.model_name, register_model.model_stage).
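One way such an idempotent registration step could look is sketched below. It is not the project's actual stage code. The `run_id` JSON key, the `model` artifact path, and both helper names are assumptions made for illustration, and the sketch assumes the registered model already exists. The `client` argument is expected to behave like `mlflow.MlflowClient`.

```python
import json
from pathlib import Path


def load_run_id(path="data/models/final_run_id.json"):
    """Read the run ID written by final_train (the 'run_id' key is assumed)."""
    return json.loads(Path(path).read_text())["run_id"]


def register_once(client, model_name, run_id, alias, artifact_path="model"):
    """Register the run's model unless a version for this run already exists,
    then point `alias` at that version and return the version."""
    existing = client.search_model_versions(
        f"name = '{model_name}' and run_id = '{run_id}'"
    )
    if existing:
        # Re-run with the same run ID: reuse the version created earlier.
        version = existing[0].version
    else:
        version = client.create_model_version(
            name=model_name,
            source=f"runs:/{run_id}/{artifact_path}",  # assumed artifact path
            run_id=run_id,
        ).version
    client.set_registered_model_alias(model_name, alias, version)
    return version
```

Looking up an existing version by run ID before creating one is what makes a second run with the same ID a no-op apart from re-asserting the alias.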


Stage 2: Candidate promotion (promote_model)

The promote_model DVC stage runs after register_model. It fetches final.logloss for the new version and compares it to the same metric on the version that currently holds the candidate alias:

new_logloss ≤ current_candidate_logloss + tolerance  →  sets 'candidate' alias
new_logloss  > current_candidate_logloss + tolerance  →  logs warning, no change

If no current candidate exists (fresh registry), promotion always proceeds. A gate failure does not fail the DVC pipeline — it is an expected outcome.
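The gate rule above can be written as a small pure function. This is a minimal sketch of the comparison only, not the stage's actual code. The function name is hypothetical, the default tolerance mirrors the documented params.yaml default, and the log-loss values in the example are invented.

```python
from typing import Optional


def passes_candidate_gate(
    new_logloss: float,
    current_candidate_logloss: Optional[float],
    tolerance: float = 0.005,
) -> bool:
    """Return True when the new version should receive the candidate alias."""
    if current_candidate_logloss is None:
        # Fresh registry: there is no candidate to compare against.
        return True
    # Lower log-loss is better; allow at most `tolerance` degradation.
    return new_logloss <= current_candidate_logloss + tolerance


print(passes_candidate_gate(0.9620, 0.9600))  # degraded by 0.002
print(passes_candidate_gate(0.9700, 0.9600))  # degraded by 0.010
print(passes_candidate_gate(0.9700, None))    # no current candidate
```

Returning a boolean instead of raising matches the policy that a gate failure is an expected outcome, so the caller can log a warning and exit cleanly.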

Parameters (in params.yaml under promote_model):

| Parameter | Default | Meaning |
|---|---|---|
| metric | final.logloss | MLflow metric key to compare |
| tolerance | 0.005 | Max allowed degradation vs current candidate |
| candidate_alias | candidate | Alias name to assign on pass |

Result is written to data/models/promoted_model.json.


Stage 3: Champion promotion (manual)

Promoting from candidate to champion requires manual approval. This is a deliberate quality gate — not a missing feature.

Hard gates (all three must pass)

| Gate | Threshold | Rationale |
|---|---|---|
| Log-loss ≤ logreg_full baseline | Log-loss must not exceed that of logistic regression on the full feature set | Ensures the proposed champion is at least as good as the simplest meaningful baseline |
| Brier ≤ current champion − 0.002 | Challenger Brier score at least 0.002 lower than the champion's | Forces a meaningful improvement, not just noise |
| ECE ≤ 0.05 | Expected calibration error of at most 5% | Ensures probabilities are usable for downstream betting simulation and API consumers |

# Inspect the logged metrics of a run with the MLflow CLI
mlflow runs describe --run-id <final_run_id>

Key metrics: final.logloss, final.brier, final.ece.
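For a quick local check, the three hard gates can be evaluated together once the metrics have been read from MLflow. The sketch below is illustrative rather than part of the pipeline. The `Metrics` container and `check_hard_gates` are hypothetical names, and the numbers in the example are invented.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    logloss: float
    brier: float
    ece: float


def check_hard_gates(candidate, champion, baseline_logloss,
                     min_brier_gain=0.002, max_ece=0.05):
    """Evaluate the three hard gates; returns a dict of gate name -> bool."""
    return {
        "logloss_vs_baseline": candidate.logloss <= baseline_logloss,
        "brier_vs_champion": candidate.brier <= champion.brier - min_brier_gain,
        "ece": candidate.ece <= max_ece,
    }


# Invented metric values, for illustration only.
gates = check_hard_gates(
    candidate=Metrics(logloss=0.950, brier=0.190, ece=0.030),
    champion=Metrics(logloss=0.960, brier=0.195, ece=0.040),
    baseline_logloss=0.980,
)
print(gates, "-> all pass:", all(gates.values()))
```

Reporting each gate separately makes it clear which threshold blocked a promotion. Passing all three is still only the precondition for the manual review that follows.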

Manual review checklist

  • [ ] All hard gates pass (logloss, brier, ECE)
  • [ ] Calibration curve artifact (calibration_curves.png) — no severe over/under-confidence in any class
  • [ ] Confusion matrix (confusion_matrix_final.png) — no unexpected degradation on specific outcome classes
  • [ ] Segment metrics (segment_metrics.csv) — no severe performance drop in major leagues vs previous champion
  • [ ] Error analysis report (reports/error_analysis/) — no newly introduced systematic error pattern
  • [ ] Holdout slice is strictly 2024+ (no train/test contamination)
  • [ ] Calibration used temporal split (not random split)

Why champion promotion is NOT automated

  1. Dataset size — the football match dataset is small enough that a single bad season can produce gate-passing metrics by chance. Human review of segment metrics provides a sanity check that automated thresholds cannot.
  2. Calibration matters for downstream use — the API serves raw probabilities. A model with ECE just below 0.05 but with structural miscalibration in specific outcome classes should be caught in manual review.
  3. Operational safety — for a portfolio project with a live demo, a human checkpoint before swapping the champion model is appropriate.
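The second point can be made concrete with a per-class calibration check. How final.ece is computed in this project is not stated here, so the sketch below uses a generic one-vs-rest binned ECE with equal-width bins. The function name and the sample data are hypothetical.

```python
def binned_ece(probs, hits, n_bins=10):
    """Expected calibration error for a single outcome class.

    probs: predicted probability of the class, one value per match
    hits:  1 if the class actually occurred in that match, else 0
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, hits):
        # Clamp p == 1.0 into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        hit_rate = sum(y for _, y in members) / len(members)
        # Weight each bin's gap by the share of samples it holds.
        ece += len(members) / total * abs(mean_conf - hit_rate)
    return ece


# Invented example: the class is predicted at 0.3 in ten matches
# but occurs in only one of them.
print(binned_ece([0.3] * 10, [1] + [0] * 9))
```

Running this separately for each outcome class can expose one badly calibrated class that an aggregate ECE below 0.05 would hide.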

Promotion procedure

  1. Verify all hard gates pass.
  2. Complete the manual review checklist.
  3. Run:
import mlflow
client = mlflow.MlflowClient()
candidate = client.get_model_version_by_alias("soccer-match-outcome", "candidate")
client.set_registered_model_alias("soccer-match-outcome", "champion", candidate.version)
print(f"champion → version {candidate.version}")
  4. Restart the serving worker to reload the new model:
kubectl rollout restart deployment/soccer-worker -n soccer
  5. Monitor the first 24h in Prometheus for latency regressions.

Serving coupling

PredictionService in src/app/services/predict.py loads the model as:

mlflow.pyfunc.load_model(f"models:/{model_name}@{stage}")

Here stage is the registry alias, read from settings.mlflow.model_stage (inference.model_stage in params.yaml). It is currently set to smoke (CI / dev environment); set it to candidate for staging and champion for production.

The model is lazy-loaded once per worker process and cached in memory, so reassigning an alias in the registry takes effect on the next worker restart; no redeployment is needed.
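The load-once behaviour can be sketched with a cached getter. This is not the actual PredictionService code. `make_model_getter` is a hypothetical helper, and the loader is passed in so the sketch runs without a registry.

```python
from functools import lru_cache


def make_model_getter(loader, model_name, alias):
    """Return a zero-argument function that loads the model on first call
    and returns the same cached object on every later call."""

    @lru_cache(maxsize=1)
    def get_model():
        return loader(f"models:/{model_name}@{alias}")

    return get_model


# Stand-in loader that records which URIs were requested.
requested = []
get_model = make_model_getter(
    lambda uri: requested.append(uri) or object(),
    "soccer-match-outcome",
    "champion",
)
first, second = get_model(), get_model()
print(first is second, requested)
```

In the service, `mlflow.pyfunc.load_model` would take the place of the stand-in loader. Because the alias is resolved only on the first call, a process that has already loaded a model keeps serving it after the alias moves, which is why a restart is required.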


Rollback

Rollback is a registry operation only — no retraining required:

  1. Re-assign the champion alias to any previous version in the MLflow UI or CLI.
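  2. Restart the workers so each process reloads the model; it is cached in memory per worker process, so the reassigned alias is not picked up until then.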

This is safe because all prior model versions remain stored as MLflow artifacts in MinIO.
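A scripted rollback might look like the sketch below. The `rollback_champion` helper is hypothetical, and `client` is expected to behave like `mlflow.MlflowClient`, whose `get_model_version_by_alias`, `get_model_version`, and `set_registered_model_alias` methods are used here.

```python
def rollback_champion(client, model_name, target_version, alias="champion"):
    """Point `alias` at `target_version`; return (previous, new) versions."""
    previous = client.get_model_version_by_alias(model_name, alias).version
    # Look the target up first so a non-existent version fails before
    # the alias is touched.
    target = client.get_model_version(model_name, str(target_version)).version
    client.set_registered_model_alias(model_name, alias, target)
    return previous, target
```

A call such as `rollback_champion(mlflow.MlflowClient(), "soccer-match-outcome", "3")` would move champion back to version 3, where the version number is only an example. Recording the previous version in the return value makes the rollback itself easy to undo.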


Implementation status

| Aspect | Status |
|---|---|
| Automated registration (smoke alias) via register_model DVC stage | ✅ Implemented |
| Automated candidate gate (candidate alias) via promote_model DVC stage | ✅ Implemented |
| Manual candidate → champion gate | 📋 Manual approval required |
| Rollback via registry alias reassignment | ✅ Supported |