Monitoring Overview
Status: ✅ Operational — Prometheus metrics, Evidently drift detection, ML quality monitor, and Grafana dashboards deployed. AlertManager alert rules pending.
Current Observability Architecture
flowchart LR
subgraph Services ["Runtime Services"]
API[FastAPI\nInference Service]
CW[Celery Workers]
RMQ[RabbitMQ]
end
subgraph Collection ["Metrics Collection ✅"]
PROM[Prometheus\nGET /metrics]
KSM[kube-state-metrics]
NE[node-exporter]
end
subgraph Dashboards ["Visualization ✅ Operational"]
GRAF[Grafana]
end
subgraph MLQuality ["ML Quality ✅ Implemented"]
EV[Evidently\nDrift Detection]
end
API -->|/metrics endpoint| PROM
CW -->|queue/worker stats| PROM
KSM --> PROM
NE --> PROM
PROM -->|scrape| GRAF
API -.->|prediction logs| EV
Implemented: Prometheus Metrics
The inference service exports 9 metrics from GET /metrics:
| Metric | Type | What it measures |
|---|---|---|
| `http_requests_total` | Counter | HTTP request volume (by method, path, status code) |
| `http_request_duration_seconds` | Histogram | End-to-end HTTP request latency |
| `prediction_requests_total` | Counter | Prediction tasks dispatched to the Celery ml queue (source="sync") |
| `prediction_duration_seconds` | Histogram | End-to-end prediction latency including Celery queue roundtrip |
| `inference_duration_seconds` | Histogram | Pure ML inference time inside the Celery worker |
| `prediction_confidence` | Histogram | Model predicted probability per outcome class |
| `model_info` | Gauge | Loaded model metadata (name, version, stage labels); value=1 when loaded |
| `model_registered_at_seconds` | Gauge | Unix timestamp when model version was last loaded by the worker |
| `model_feature_drift_score` | Gauge | Evidently dataset drift score; refreshed by GET /monitoring/drift |
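The metrics above can also be read programmatically. The following is a minimal Python sketch, not the project's own code: it parses the Prometheus text exposition format returned by `GET /metrics` and derives two signals discussed on this page, whether `model_info` reports a loaded model and the share of 5xx responses. The function names and the sample payload are illustrative assumptions, and the label names `method` and `path` are inferred from the table's description (only `status_code` appears verbatim on this page).

```python
import re

# One sample line: metric name, optional {labels}, then the value.
_SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Hypothetical payload for illustration; label values are made up.
SAMPLE = '''\
# HELP http_requests_total HTTP request volume
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/healthcheck/",status_code="200"} 90
http_requests_total{method="GET",path="/monitoring/drift",status_code="500"} 10
model_info{name="example",version="1",stage="Production"} 1
'''


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        match = _SAMPLE_LINE.match(line)
        if match:
            name, raw_labels, value = match.groups()
            labels = dict(_LABEL.findall(raw_labels or ""))
            samples.append((name, labels, float(value)))
    return samples


def model_loaded(samples):
    """True if any model_info sample has the value 1."""
    return any(name == "model_info" and value == 1.0
               for name, _, value in samples)


def error_ratio(samples):
    """Share of http_requests_total whose status_code label starts with 5."""
    total = errors = 0.0
    for name, labels, value in samples:
        if name == "http_requests_total":
            total += value
            if labels.get("status_code", "").startswith("5"):
                errors += value
    return errors / total if total else 0.0
```

Because counters accumulate from process start, `error_ratio` here is a lifetime ratio; a rate over a time window would need two scrapes and the difference between them.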
Operational access today (no Grafana required):
# Raw Prometheus metrics
curl http://api.soccer.dmitryivanov.dev/metrics
# Service health
curl http://api.soccer.dmitryivanov.dev/healthcheck/
# Celery queue depth
curl http://api.soccer.dmitryivanov.dev/monitoring/celery/queues
# Active workers
curl http://api.soccer.dmitryivanov.dev/monitoring/celery/workers
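The same checks can be scripted. This is a hedged sketch using only the Python standard library: `TRIAGE_PATHS` mirrors the curl commands above, while `http_status`, `triage`, and the injectable `fetch` parameter are hypothetical names introduced for illustration. It only tests for HTTP 200 and does not interpret response bodies.

```python
import urllib.error
import urllib.request

# Paths taken from the curl commands above.
TRIAGE_PATHS = (
    "/healthcheck/",
    "/metrics",
    "/monitoring/celery/queues",
    "/monitoring/celery/workers",
)


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, or timeout


def triage(base_url, fetch=http_status):
    """Map each triage path to True when it answers with HTTP 200."""
    base = base_url.rstrip("/")
    return {path: fetch(base + path) == 200 for path in TRIAGE_PATHS}
```

Calling `triage("http://api.soccer.dmitryivanov.dev")` would return one boolean per endpoint; passing a custom `fetch` makes the function testable without network access.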
Full metrics reference: Metrics reference · Coverage matrix
Operational monitoring today
Two Grafana dashboards are deployed in the SoccerPredictAI folder:
- Soccer — ML Quality & Betting: rolling log-loss, ECE, hit-rate, prediction drift (fed by the `soccer_ml_monitor_quality_01` DAG).
- SoccerPredictAI: request rate, latency, error rate, queue depth, model version, drift score.
Manual triage endpoints remain useful for direct inspection:
- Health check: `GET /healthcheck/` (service up and model loaded).
- Metrics scrape: `GET /metrics` (point-in-time counters and gauges).
- Queue inspection: `GET /monitoring/celery/queues` (inference queue depth).
- Model load state: `model_info == 0` means workers are running but no model is loaded.
See Runbooks: Alerts for incident response procedures.
Implemented: Evidently Drift Detection
Feature drift detection is operational via three paths:
- src/pipelines/monitor_drift.py — DVC stage (monitor_drift) that writes reports/drift/latest.json
- airflow/dags/ml_monitor_drift_01.py — scheduled Airflow DAG
- GET /monitoring/drift — REST endpoint that reads reports/drift/latest.json and refreshes model_feature_drift_score Prometheus gauge
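As a rough illustration of the REST path, the sketch below reads the report file and pulls out a numeric score that could then be written to the `model_feature_drift_score` gauge. The schema of `reports/drift/latest.json` is not described on this page, so the key name `drift_score` and both function names are hypothetical placeholders; adjust them to the real report layout.

```python
import json
from pathlib import Path


def extract_drift_score(report, key="drift_score"):
    """Return report[key] as a float, or None if it is missing or non-numeric.

    `key` is a hypothetical field name, not the confirmed report schema.
    """
    if not isinstance(report, dict):
        return None
    value = report.get(key)
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    return float(value)


def read_drift_score(report_path="reports/drift/latest.json", key="drift_score"):
    """Load the latest drift report and return its score, or None."""
    path = Path(report_path)
    if not path.is_file():
        return None  # the monitor_drift stage has not written a report yet
    try:
        report = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None  # unreadable file or malformed JSON
    return extract_drift_score(report, key)
```

Returning `None` instead of raising lets a caller leave the gauge unchanged when no valid report exists, which is one possible design choice rather than the documented behavior.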
See Evidently Integration for design details.
Planned: Phase 2 (Current Priority)
- [ ] AlertManager rule: `model_info == 0` (fires if workers lose the model)
Planned: Phase 3
- [ ] Drift metrics exported to Prometheus
- [ ] PostgreSQL query latency via `pg_exporter`
- [ ] Log aggregation (Loki or ELK)
- [ ] On-call escalation policy
Alert Rules (Planned)
Status: 📋 Planned — AlertManager is not deployed. No active alert rules or notification channels exist today.
Service health
| Condition | Intended severity |
|---|---|
| `model_info == 0` | P1: model not available |
| Sustained `http_request_duration_seconds` p99 > 500 ms | P2: latency SLO breach |
| Elevated `http_requests_total{status_code=~"5.."}` rate | P2: error spike |
| Service unavailable (`/healthcheck/` failing) | P1 |
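The p99 condition refers to a quantile estimated from histogram buckets. As a rough illustration of how such an estimate works, the sketch below interpolates linearly inside cumulative buckets, similar in spirit to PromQL's `histogram_quantile`. The bucket boundaries and counts are made-up sample values, not the service's real configuration, and `breaches_slo` is a hypothetical helper. A "sustained" breach would additionally require evaluating over a time window, which is not shown.

```python
import math

# Hypothetical cumulative buckets: (upper bound in seconds, count <= bound).
EXAMPLE_BUCKETS = [
    (0.1, 900),
    (0.25, 980),
    (0.5, 995),
    (1.0, 1000),
    (math.inf, 1000),
]


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        raise ValueError("need at least one finite bucket plus +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == math.inf:
                # The quantile lies beyond the last finite bucket.
                return buckets[-2][0]
            span = count - lower_count
            if span == 0:
                return upper_bound
            fraction = (rank - lower_count) / span
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return buckets[-2][0]


def breaches_slo(buckets, threshold_seconds=0.5):
    """True when the estimated p99 exceeds the 500 ms threshold."""
    return histogram_quantile(0.99, buckets) > threshold_seconds
```

With `EXAMPLE_BUCKETS`, the 990th of 1000 observations falls in the 0.25 to 0.5 s bucket, giving an estimate of about 0.417 s, so the condition would not be met.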
Async pipeline
| Condition | Intended severity |
|---|---|
| `celery_queue_depth` growing without bound | P2 |
| No workers detected via `GET /monitoring/celery/workers` | P1: no workers processing |
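"Growing without bound" needs an operational definition before it can become a rule. One possible heuristic, sketched here under the assumption that queue depth is sampled at regular intervals, is to flag the queue when the most recent readings strictly increase. The function name and the default window of five samples are illustrative choices, not values from this project.

```python
def is_growing(depths, min_samples=5):
    """True when the last `min_samples` depth readings strictly increase."""
    if min_samples < 2:
        raise ValueError("min_samples must be at least 2")
    recent = list(depths)[-min_samples:]
    if len(recent) < min_samples:
        return False  # not enough history to call it a trend
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

For example, `is_growing([3, 5, 8, 13, 21])` is true, while a single dip such as `[3, 5, 4, 13, 21]` resets the signal. A stricter or more tolerant rule (for instance, comparing window averages) may suit bursty traffic better.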
Alert routing (intended): P1/P2 → notify operator with runbook link. P3/P4 → log only.
Incident Response
Detection is manual today — via GET /metrics, GET /healthcheck/, and GET /monitoring/celery/*.
Severity levels
| Level | Definition | Response Time |
|---|---|---|
| P1 | API down, no predictions served | Immediate (< 15 min) |
| P2 | Degraded latency or partial failure | < 1 hour |
| P3 | Data pipeline failure, no model impact yet | < 4 hours |
| P4 | Non-critical warning, cosmetic issues | Next business day |
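The severity table and the intended alert routing can be encoded as a small lookup. This is only a sketch of how the policy might be represented in code; `RESPONSE_TARGET` and `route` are hypothetical names, and since AlertManager is not deployed, nothing here reflects an existing configuration.

```python
from datetime import timedelta

# Response targets from the severity table. P4 ("next business day") has no
# fixed duration, so it is represented as None.
RESPONSE_TARGET = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
    "P3": timedelta(hours=4),
    "P4": None,
}


def route(severity):
    """Return the intended action for a severity: 'notify' or 'log'."""
    if severity not in RESPONSE_TARGET:
        raise ValueError(f"unknown severity: {severity!r}")
    # Intended routing: P1/P2 notify the operator, P3/P4 are logged only.
    return "notify" if severity in ("P1", "P2") else "log"
```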
Response process
- Detect: `GET /healthcheck/`, `GET /metrics`, `GET /monitoring/celery/queues`
- Diagnose: Prometheus metrics, `kubectl logs`, queue depth, `model_info` gauge
- Mitigate: restart pod, roll back model alias, clear queue
- Resolve: confirm metrics return to baseline
- Document: record what happened, what changed
Related
- Metrics Reference — full metric definitions and labels
- Dashboard Definitions — planned Grafana panel specs
- Evidently Integration — drift detection design
- Coverage Matrix — what is and isn't covered
- Troubleshooting
Coverage Matrix
This is the authoritative source of truth for what observability is actually in place today.
Last updated: May 27, 2026
| Layer | Tool | Status | What it covers |
|---|---|---|---|
| API request rate | Prometheus middleware | ✅ Implemented | http_requests_total counter |
| API latency | Prometheus histogram | ✅ Implemented | http_request_duration_seconds |
| Prediction volume | Prometheus counter | ✅ Implemented | prediction_requests_total |
| Prediction latency | Prometheus histogram | ✅ Implemented | prediction_duration_seconds |
| Inference latency (worker) | Prometheus histogram | ✅ Implemented | inference_duration_seconds |
| Prediction confidence distribution | Prometheus histogram | ✅ Implemented | prediction_confidence |
| Model metadata | Prometheus gauge | ✅ Implemented | model_info (name, version, stage labels) |
| Model load timestamp | Prometheus gauge | ✅ Implemented | model_registered_at_seconds |
| Feature drift score | Prometheus gauge | ✅ Implemented | model_feature_drift_score — refreshed by GET /monitoring/drift |
| Celery queue depth | `GET /monitoring/celery/queues` | ✅ Implemented | Per-queue message count (REST only, not Prometheus-scraped) |
| Celery active workers | `GET /monitoring/celery/workers` | ✅ Implemented | Worker ping status (REST only) |
| Task status polling | `GET /monitoring/task_status/{id}` | ✅ Implemented | Celery result backend |
| Service liveness | `GET /healthcheck/` | ✅ Implemented | Memory + status |
| Feature drift (Evidently) | DVC stage + Airflow DAG + REST | ✅ Implemented | src/pipelines/monitor_drift.py; GET /monitoring/drift; model_feature_drift_score gauge |
| ML quality metrics (rolling) | Airflow DAG + node-exporter | ✅ Implemented | soccer_ml_monitor_quality_01 DAG; log-loss, ECE, hit-rate |
| Grafana dashboards | Grafana | ✅ Implemented | Two dashboards deployed in the SoccerPredictAI folder |
| AlertManager alert rules | AlertManager | 📋 Planned | No active rules or notification channels yet; runbooks written |
| PostgreSQL metrics | pg_exporter | 📋 Planned | Not yet configured |
| Prediction drift (Evidently) | Evidently | 📋 Planned | Requires ground truth feedback loop |
| Model performance monitoring | MLflow + Evidently | 📋 Planned | Ground truth lag ~90 min after match |
| Log aggregation | ELK / Loki | 📋 Planned | stdout today |
Gaps and planned work
Priority 1 (next sprint):
- Deploy Grafana with a soccer-api dashboard (data is already being collected).
- Configure AlertManager rule for model_info == 0 (model not loaded).
Priority 2:
- Integrate Evidently for feature drift detection (offline batch report first).
- Add pg_exporter sidecar for PostgreSQL query latency.
Priority 3:
- Real-time ground truth feedback loop (match result arrives ~90 min after kick-off).
- Automated retraining trigger on confirmed drift signal.