Model Registry & Promotion
Status: ✅ Implemented — Registration and candidate promotion automated via DVC pipeline; candidate → champion gate is manual.
Purpose
Document how models move from training into serving, what the lifecycle stages mean, how promotion is gated, and how rollback works.
Role of the registry
The MLflow Model Registry is the single handoff point between training and serving.
The serving layer (PredictionService) loads the model from the registry by name and alias.
No model reaches production without passing through this boundary.
Alias scheme (4 levels)
| Alias | Meaning | Gate | Who sets it | CI job |
|---|---|---|---|---|
| ci-smoke | Toy model (frac=0.001, n_trials=2); pipeline wiring check only, never used by serving | None (always assigned) | register_model DVC stage | train:smoke |
| smoke | Real-data model with full feature set and reduced trials; lifecycle entry point | None (always assigned) | register_model DVC stage | train:test |
| candidate | Passed quality gate; ready for manual review | final.logloss ≤ current_candidate + tolerance | promote_model DVC stage | train:test |
| champion | Currently serving live predictions | Manual sign-off (see Promotion Policy) | Developer / scheduled DAG | none |
ci-smoke is set by the experiment=smoke Hydra overlay (conf/experiment/smoke.yaml).
All other aliases use the base config (conf/config.yaml).
A single model version can carry multiple aliases simultaneously (e.g. a new version
becomes smoke immediately, then candidate after the gate passes, then champion
after manual review).
Stage 1: Registration (register_model)
The register_model DVC stage is the final automated step after final_train.
It reads data/models/final_run_id.json, creates or updates the registered model,
and assigns the initial alias to the new version:
- ci-smoke: when run from the train:smoke CI job (experiment=smoke, toy data)
- smoke: when run from the train:test CI job (real data, lifecycle entry point)
This operation is idempotent: re-running with the same run ID is safe.
The model name and initial alias are controlled via params.yaml
(register_model.model_name, register_model.model_stage).
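The idempotent behaviour described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's stage code: the helper name `register_run`, the `run_id` key inside the JSON file, and the `model` artifact path are all hypothetical. The client is passed in, so the function works with `mlflow.MlflowClient()` or with a stub.

```python
import json
from pathlib import Path


def register_run(client, run_id_file, model_name, alias, artifact_path="model"):
    """Register the run's model once and point `alias` at that version.

    Assumptions (not confirmed by this page): the JSON file holds a
    "run_id" key, and the model was logged under `artifact_path`.
    `client` is expected to behave like mlflow.MlflowClient.
    """
    run_id = json.loads(Path(run_id_file).read_text())["run_id"]

    # Create the registered model only if it does not exist yet.
    if not client.search_registered_models(filter_string=f"name='{model_name}'"):
        client.create_registered_model(model_name)

    # Reuse an existing version for this run so re-runs create no duplicates.
    existing = [
        mv for mv in client.search_model_versions(f"name='{model_name}'")
        if mv.run_id == run_id
    ]
    if existing:
        version = existing[0].version
    else:
        created = client.create_model_version(
            name=model_name,
            source=f"runs:/{run_id}/{artifact_path}",
            run_id=run_id,
        )
        version = created.version

    # Re-setting an alias that already points at this version changes nothing.
    client.set_registered_model_alias(model_name, alias, version)
    return version
```

Calling it twice with the same run ID file returns the same version both times, which is the property the stage relies on.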
Stage 2: Candidate promotion (promote_model)
The promote_model DVC stage runs after register_model.
It fetches final.logloss for the new version and compares it to the current
candidate alias:
new_logloss ≤ current_candidate_logloss + tolerance → sets 'candidate' alias
new_logloss > current_candidate_logloss + tolerance → logs warning, no change
If no current candidate exists (fresh registry), promotion always proceeds.
A gate failure does not fail the DVC pipeline — it is an expected outcome.
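The gate rule above can be written as a small pure function. The sketch below is illustrative only: the function names and the shape of the returned dictionary are assumptions, and the 0.005 default mirrors the `tolerance` default in params.yaml.

```python
from typing import Optional


def passes_candidate_gate(new_logloss: float,
                          current_logloss: Optional[float],
                          tolerance: float = 0.005) -> bool:
    """Return True when the new version should receive the candidate alias."""
    if current_logloss is None:
        # Fresh registry: there is no candidate to compare against.
        return True
    return new_logloss <= current_logloss + tolerance


def run_gate(new_logloss, current_logloss, tolerance=0.005):
    """Report the outcome without raising, so a failed gate does not fail the pipeline."""
    promoted = passes_candidate_gate(new_logloss, current_logloss, tolerance)
    if not promoted:
        print(f"WARNING: gate failed ({new_logloss:.4f} > "
              f"{current_logloss:.4f} + {tolerance}); candidate unchanged")
    # Hypothetical result shape; the real promoted_model.json schema may differ.
    return {"promoted": promoted, "new_logloss": new_logloss}
```

Returning a result instead of raising matches the stated design: a rejected version is an expected outcome, not a pipeline error.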
Parameters (in params.yaml under promote_model):
| Parameter | Default | Meaning |
|---|---|---|
| metric | final.logloss | MLflow metric key to compare |
| tolerance | 0.005 | Max allowed degradation vs current candidate |
| candidate_alias | candidate | Alias name to assign on pass |
Result is written to data/models/promoted_model.json.
Stage 3: Champion promotion (manual)
Promoting from candidate to champion requires manual approval. This is a deliberate
quality gate — not a missing feature.
Hard gates (all three must pass)
| Gate | Threshold | Rationale |
|---|---|---|
| Log-loss ≤ logreg_full baseline | log-loss must not exceed logistic regression on full features | Ensures the proposed champion is at least as good as the simplest meaningful baseline |
| Brier ≤ current champion − 0.002 | challenger Brier at least 0.002 lower than champion | Forces a meaningful improvement, not just noise |
| ECE ≤ 0.05 | calibration error ≤ 5% | Ensures probabilities are usable for downstream betting simulation and API consumers |
Key metrics: final.logloss, final.brier, final.ece.
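As a worked illustration of the three hard gates, the check could look like the sketch below. The `Metrics` container and the function name are hypothetical; only the thresholds (0.002 Brier improvement, ECE ≤ 0.05, log-loss no higher than the logreg_full baseline) come from the table.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Challenger metrics, corresponding to final.logloss, final.brier, final.ece."""
    logloss: float
    brier: float
    ece: float


def hard_gate_results(challenger: Metrics,
                      champion_brier: float,
                      baseline_logloss: float,
                      min_brier_gain: float = 0.002,
                      max_ece: float = 0.05) -> dict:
    """Evaluate each hard gate separately so a reviewer can see which one failed."""
    return {
        "logloss_vs_baseline": challenger.logloss <= baseline_logloss,
        "brier_vs_champion": challenger.brier <= champion_brier - min_brier_gain,
        "ece": challenger.ece <= max_ece,
    }


def all_hard_gates_pass(results: dict) -> bool:
    """Champion promotion may proceed to manual review only if every gate holds."""
    return all(results.values())
```

Keeping the per-gate results, instead of a single boolean, makes the first item of the manual review checklist easy to verify.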
Manual review checklist
- [ ] All hard gates pass (logloss, brier, ECE)
- [ ] Calibration curve artifact (calibration_curves.png): no severe over/under-confidence in any class
- [ ] Confusion matrix (confusion_matrix_final.png): no unexpected degradation on specific outcome classes
- [ ] Segment metrics (segment_metrics.csv): no severe performance drop in major leagues vs previous champion
- [ ] Error analysis report (reports/error_analysis/): no newly introduced systematic error pattern
- [ ] Holdout slice is strictly 2024+ (no train/test contamination)
- [ ] Calibration used temporal split (not random split)
Why champion promotion is NOT automated
- Dataset size — the football match dataset is small enough that a single bad season can produce gate-passing metrics by chance. Human review of segment metrics provides a sanity check that automated thresholds cannot.
- Calibration matters for downstream use — the API serves raw probabilities. A model with ECE just below 0.05 but with structural miscalibration in specific outcome classes should be caught in manual review.
- Operational safety — for a portfolio project with a live demo, a human checkpoint before swapping the champion model is appropriate.
Promotion procedure
- Verify all hard gates pass.
- Complete the manual review checklist.
- Run:
import mlflow
client = mlflow.MlflowClient()
candidate = client.get_model_version_by_alias("soccer-match-outcome", "candidate")
client.set_registered_model_alias("soccer-match-outcome", "champion", candidate.version)
print(f"champion → version {candidate.version}")
- Restart the serving worker to reload the new model.
- Monitor the first 24h in Prometheus for latency regressions.
Serving coupling
PredictionService in src/app/services/predict.py loads the model from the registry by name and alias.
The alias (stage) is read from settings.mlflow.model_stage (inference.model_stage in params.yaml).
It is currently set to smoke (CI / dev environment).
Change it to candidate for staging and to champion for production.
Changing the alias in the registry takes effect on next worker restart without redeployment. The model is lazy-loaded once per worker process and cached in memory.
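The lazy-load-and-cache behaviour can be illustrated with the sketch below. It is not the actual PredictionService code: `LazyModel` and `model_uri` are hypothetical names, and loading through `mlflow.pyfunc` assumes the model was logged with a pyfunc-compatible flavor. The `models:/<name>@<alias>` URI form is MLflow's standard way to address a registered model by alias.

```python
from typing import Any, Callable, Optional


def model_uri(name: str, alias: str) -> str:
    """Build an MLflow registry URI that resolves a model by alias."""
    return f"models:/{name}@{alias}"


def _mlflow_loader(uri: str) -> Any:
    # Imported lazily so this module can be imported without MLflow installed.
    import mlflow.pyfunc
    return mlflow.pyfunc.load_model(uri)


class LazyModel:
    """Load the model on first use and keep it for the life of the process."""

    def __init__(self, name: str, alias: str,
                 loader: Optional[Callable[[str], Any]] = None):
        self.uri = model_uri(name, alias)
        self._loader = loader or _mlflow_loader
        self._model: Any = None

    def get(self) -> Any:
        if self._model is None:
            # The alias is resolved here, once. Later registry changes are
            # only picked up by a new process, i.e. after a worker restart.
            self._model = self._loader(self.uri)
        return self._model
```

Because the alias is resolved only on the first `get()` call, moving the alias in the registry does not affect workers that are already running, which is why a restart is part of both promotion and rollback.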
Rollback
Rollback is a registry operation only — no retraining required:
- Re-assign the champion alias to any previous version in the MLflow UI or CLI.
- Restart workers (or wait for cache invalidation).
This is safe because all prior model versions remain stored as MLflow artifacts in MinIO.
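The same alias reassignment can also be done from Python. The helper below is a hedged sketch: `rollback_champion` is a hypothetical name, and `client` is expected to behave like `mlflow.MlflowClient`, whose `get_model_version_by_alias`, `get_model_version`, and `set_registered_model_alias` methods it calls.

```python
def rollback_champion(client, model_name, target_version, alias="champion"):
    """Point `alias` back at an earlier version and return the version it replaced.

    `client` is expected to behave like mlflow.MlflowClient.
    """
    previous = client.get_model_version_by_alias(model_name, alias).version
    # Look the target up first: this raises if the version does not exist,
    # before the alias has been touched.
    client.get_model_version(model_name, str(target_version))
    client.set_registered_model_alias(model_name, alias, str(target_version))
    return previous
```

Returning the replaced version gives the operator a record of what to restore if the rollback itself needs to be undone. Workers still need a restart afterwards to pick up the change.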
Implementation status
| Aspect | Status |
|---|---|
| Automated registration (smoke alias) via register_model DVC stage | ✅ Implemented |
| Automated candidate gate (candidate alias) via promote_model DVC stage | ✅ Implemented |
| Manual candidate → champion gate | 📋 Manual approval required |
| Rollback via registry alias reassignment | ✅ Supported |
Related
- MLflow — experiment tracking
- Model Contract — breaking change policy
- Training Pipeline — pipeline stage sequence
- Serving — how the serving layer loads models
- Status