Monitoring Overview¶

Status: ✅ Operational. Prometheus metrics, Evidently drift detection, the ML quality monitor, and Grafana dashboards are deployed. Alert rule files are deployed as ConfigMaps, but Alertmanager routing and receivers are not yet configured.

Current Observability Architecture¶

flowchart LR
    subgraph Services ["Runtime Services"]
        API[FastAPI\nInference Service]
        CW[Celery Workers]
        RMQ[RabbitMQ]
    end

    subgraph Collection ["Metrics Collection ✅"]
        PROM[Prometheus\nGET /metrics]
        KSM[kube-state-metrics]
        NE[node-exporter]
    end

    subgraph Dashboards ["Visualization ✅ Operational"]
        GRAF[Grafana]
    end

    subgraph MLQuality ["ML Quality ✅ Implemented"]
        EV[Evidently\nDrift Detection]
    end

    API -->|/metrics endpoint| PROM
    CW -->|queue/worker stats| PROM
    KSM --> PROM
    NE --> PROM
    PROM -->|PromQL queries| GRAF
    API -.->|prediction logs| EV

Implemented: Prometheus Metrics¶

The inference service exports 10 metrics from GET /metrics:

Metric	Type	What it measures
`http_requests_total`	Counter	HTTP request volume (by method, path, status code)
`http_request_duration_seconds`	Histogram	End-to-end HTTP request latency
`prediction_requests_total`	Counter	Prediction tasks dispatched to the Celery `ml` queue (`source="sync"`)
`prediction_duration_seconds`	Histogram	End-to-end prediction latency including Celery queue roundtrip
`inference_duration_seconds`	Histogram	Pure ML inference time inside the Celery worker
`prediction_confidence`	Histogram	Model predicted probability per outcome class
`model_info`	Gauge	Loaded model metadata (name, version, stage labels); value=1 when loaded
`model_registered_at_seconds`	Gauge	Unix timestamp when model version was last loaded by the worker
`model_feature_drift_score`	Gauge	Evidently dataset drift score; refreshed by `GET /monitoring/drift`
`prediction_timeouts_total`	Counter	Total sync prediction requests that timed out waiting for the ML worker (504 responses)

Operational access today (no Grafana required):

# Raw Prometheus metrics
curl http://api.soccer.dmitryivanov.dev/metrics

# Service health
curl http://api.soccer.dmitryivanov.dev/healthcheck/

# Celery queue depth
curl http://api.soccer.dmitryivanov.dev/monitoring/celery/queues

# Active workers
curl http://api.soccer.dmitryivanov.dev/monitoring/celery/workers
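
The raw `/metrics` output can also be inspected programmatically. The following is a minimal sketch, not the project's own tooling: it parses Prometheus text exposition output with the standard library and checks the `model_info` gauge. The function names `parse_samples` and `model_loaded`, the `method` and `path` label names, and all label values in the example payload are assumptions made for illustration.

```python
import re

# One sample line: metric name, optional {labels}, value, optional timestamp.
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)(?:\s+-?\d+)?$'
)
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Parse text exposition output into (name, labels, value) tuples.

    Simplified on purpose: escaped characters in label values are not unescaped.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue  # ignore lines this simple parser does not understand
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def model_loaded(samples):
    """Return True if any model_info sample reports the value 1."""
    return any(name == "model_info" and value == 1.0 for name, _, value in samples)


# Hypothetical scrape output; label values are made up for illustration.
EXAMPLE = '''
# HELP model_info Loaded model metadata
# TYPE model_info gauge
model_info{name="example-model",version="3",stage="Production"} 1.0
http_requests_total{method="GET",path="/healthcheck/",status_code="200"} 42.0
'''

samples = parse_samples(EXAMPLE)
print(model_loaded(samples))
```

In practice the text would come from the `curl` call above (or any HTTP client) rather than from an inline string.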

Full metrics reference: Metrics reference · Coverage matrix

Operational monitoring today¶

Two Grafana dashboards are deployed in the SoccerPredictAI folder:

Soccer — ML Quality & Betting — rolling log-loss, ECE, hit-rate, prediction drift (fed by soccer_ml_monitor_quality_01 DAG).
SoccerPredictAI — request rate, latency, error rate, queue depth, model version, drift score.

Manual triage endpoints remain useful for direct inspection:

Health check: GET /healthcheck/ — service up and model loaded.
Metrics scrape: GET /metrics — point-in-time counters and gauges.
Queue inspection: GET /monitoring/celery/queues — inference queue depth.
Model load state: model_info == 0 means workers are running but no model is loaded.

See Runbooks: Alerts for incident response procedures.

Implemented: Evidently Drift Detection¶

Feature drift detection is operational via three paths:

- src/pipelines/monitor_drift.py: DVC stage (monitor_drift) that writes reports/drift/latest.json
- airflow/dags/ml_monitor_drift_01.py: scheduled Airflow DAG
- GET /monitoring/drift: REST endpoint that reads reports/drift/latest.json and refreshes the model_feature_drift_score Prometheus gauge
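
The REST path reads a JSON report and turns it into a gauge value. The sketch below illustrates that read step only, under stated assumptions: the report schema is not documented on this page, so the `drift_score` key and the helper name `read_drift_score` are hypothetical. In a real service the returned number would presumably be written to the `model_feature_drift_score` gauge.

```python
import json
import tempfile
from pathlib import Path


def read_drift_score(report_path, key="drift_score"):
    """Return the numeric drift score stored in a JSON report, or None.

    The key name is hypothetical; adjust it to the real report schema.
    """
    path = Path(report_path)
    if not path.exists():
        return None  # no report has been written yet
    try:
        report = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return None  # report is present but not valid JSON
    value = report.get(key) if isinstance(report, dict) else None
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return float(value)
    return None


# Demonstration against a temporary file standing in for reports/drift/latest.json.
with tempfile.TemporaryDirectory() as tmp:
    report_file = Path(tmp) / "latest.json"
    report_file.write_text(json.dumps({"drift_score": 0.25}), encoding="utf-8")
    score = read_drift_score(report_file)
    missing = read_drift_score(Path(tmp) / "absent.json")

print(score, missing)
```

Returning `None` instead of raising keeps a missing or malformed report from failing the endpoint, which is one reasonable choice for a monitoring path; the actual endpoint may behave differently.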

See Evidently Integration for design details.

Planned: Phase 2 (Current Priority)¶

[ ] AlertManager rule: model_info == 0 — fires if workers lose the model

Planned: Phase 3¶

[x] Drift metrics exported to Prometheus (`model_feature_drift_score` gauge)
[ ] PostgreSQL query latency via pg_exporter
[ ] Log aggregation (Loki or ELK)
[ ] On-call escalation policy

Alert Rules¶

Status: 🚧 Partial — 6 rule files written and deployed as Prometheus ConfigMaps via Helm (api-error-rate, api-latency, celery-backlog, drift, ml-quality, model-staleness). Alertmanager is not configured — no routing or receivers exist.

Service health¶

Condition	Intended severity
`model_info == 0`	P1 — model not available
Sustained `http_request_duration_seconds` p99 > 500ms	P2 — latency SLO breach
Elevated `http_requests_total{status_code=~"5.."}` rate	P2 — error spike
Service unavailable (`/healthcheck/` failing)	P1
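
The p99 condition is evaluated from histogram buckets rather than from raw request timings. As a rough illustration of how a quantile is estimated from cumulative buckets, here is a standalone sketch that uses linear interpolation inside the matching bucket, the same general idea as PromQL's `histogram_quantile()`. The bucket boundaries and counts are made-up example data, and the 0.5 second threshold is taken from the table above.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: iterable of (upper_bound, cumulative_count) pairs that
    includes a final (math.inf, total) bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must include a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, no meaningful quantile
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == math.inf:
                # Target lies in the +Inf bucket: report the highest finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Interpolate linearly between the bucket's lower and upper bound.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Made-up cumulative counts for http_request_duration_seconds buckets (seconds).
example_buckets = [(0.1, 90), (0.25, 97), (0.5, 99), (1.0, 100), (math.inf, 100)]

p99 = histogram_quantile(0.99, example_buckets)
slo_breached = p99 > 0.5  # 500 ms threshold from the table above
print(p99, slo_breached)
```

Because the estimate is interpolated, its precision depends on where the bucket boundaries sit relative to the 500 ms threshold.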

Async pipeline¶

Condition	Intended severity
`celery_queue_depth` growing without bound	P2
No workers detected via `GET /monitoring/celery/workers`	P1 — no workers processing
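
"Growing without bound" needs a concrete definition before it can become a rule. One possible reading, sketched below as an assumption rather than the deployed `celery-backlog` rule, is that queue depth never decreases across a window of consecutive samples and rises by a minimum amount overall. The function name, the window size, and the threshold are all hypothetical.

```python
def backlog_growing(depths, window=5, min_increase=1):
    """Return True if the last `window` samples never decrease and
    the depth rises by at least `min_increase` over that window."""
    if window < 2:
        raise ValueError("window must cover at least two samples")
    if len(depths) < window:
        return False  # not enough history to judge
    recent = depths[-window:]
    non_decreasing = all(later >= earlier for earlier, later in zip(recent, recent[1:]))
    return non_decreasing and (recent[-1] - recent[0]) >= min_increase


# Made-up queue depth samples, e.g. collected by polling the queues endpoint.
print(backlog_growing([3, 5, 8, 13, 21]))  # steadily rising
print(backlog_growing([3, 5, 4, 6, 9]))    # dips once, so workers are draining
```

A window-based check like this tolerates short bursts; a fixed depth threshold would be simpler but would fire on every traffic spike.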

Alert routing (intended): P1/P2 → notify operator with runbook link. P3/P4 → log only.

Incident Response¶

Detection is manual today — via GET /metrics, GET /healthcheck/, and GET /monitoring/celery/*.

Severity levels¶

Level	Definition	Response Time
P1	API down, no predictions served	Immediate (< 15 min)
P2	Degraded latency or partial failure	< 1 hour
P3	Data pipeline failure, no model impact yet	< 4 hours
P4	Non-critical warning, cosmetic issues	Next business day
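
The severity table and the intended routing (P1/P2 notify, P3/P4 log only) can be captured as data so that tooling and documentation stay in sync. This is a minimal sketch under that assumption; the `Severity` class, the `SEVERITIES` mapping, and the `route` function are illustrative names, not part of the existing codebase.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Severity:
    level: str
    response_within: Optional[timedelta]  # None means "next business day"
    notify_operator: bool


# Values mirror the severity table and the intended routing described above.
SEVERITIES = {
    "P1": Severity("P1", timedelta(minutes=15), True),
    "P2": Severity("P2", timedelta(hours=1), True),
    "P3": Severity("P3", timedelta(hours=4), False),
    "P4": Severity("P4", None, False),
}


def route(level):
    """Return 'notify' for P1/P2 and 'log' for P3/P4."""
    severity = SEVERITIES[level]  # raises KeyError for unknown levels
    return "notify" if severity.notify_operator else "log"


for level in SEVERITIES:
    print(level, route(level))
```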

Response process¶

Detect — GET /healthcheck/, GET /metrics, GET /monitoring/celery/queues
Diagnose — Prometheus metrics, kubectl logs, queue depth, model_info gauge
Mitigate — restart pod, rollback model alias, clear queue
Resolve — confirm metrics return to baseline
Document — record what happened, what changed

Metrics Reference — full metric definitions and labels
Dashboard Definitions — planned Grafana panel specs
Evidently Integration — drift detection design
Coverage Matrix — what is and isn't covered
Troubleshooting

Coverage Matrix¶

This is the authoritative source of truth for what observability is actually in place today.

Last updated: May 27, 2026

Layer	Tool	Status	What it covers
API request rate	Prometheus middleware	✅ Implemented	`http_requests_total` counter
API latency	Prometheus histogram	✅ Implemented	`http_request_duration_seconds`
Prediction volume	Prometheus counter	✅ Implemented	`prediction_requests_total`
Prediction latency	Prometheus histogram	✅ Implemented	`prediction_duration_seconds`
Inference latency (worker)	Prometheus histogram	✅ Implemented	`inference_duration_seconds`
Prediction confidence distribution	Prometheus histogram	✅ Implemented	`prediction_confidence`
Model metadata	Prometheus gauge	✅ Implemented	`model_info` (name, version, stage labels)
Model load timestamp	Prometheus gauge	✅ Implemented	`model_registered_at_seconds`
Feature drift score	Prometheus gauge	✅ Implemented	`model_feature_drift_score` — refreshed by `GET /monitoring/drift`
Celery queue depth	`GET /monitoring/celery/queues`	✅ Implemented	Per-queue message count (REST only, not Prometheus-scraped)
Celery active workers	`GET /monitoring/celery/workers`	✅ Implemented	Worker ping status (REST only)
Task status polling	`GET /monitoring/task_status/{id}`	✅ Implemented	Celery result backend
Service liveness	`GET /healthcheck/`	✅ Implemented	Memory + status
Feature drift (Evidently)	DVC stage + Airflow DAG + REST	✅ Implemented	`src/pipelines/monitor_drift.py`; `GET /monitoring/drift`; `model_feature_drift_score` gauge
ML quality metrics (rolling)	Airflow DAG + node-exporter	✅ Implemented	`soccer_ml_monitor_quality_01` DAG; log-loss, ECE, hit-rate
Grafana dashboards	Grafana	✅ Implemented	Two dashboards in `SoccerPredictAI` folder deployed
AlertManager alert rules	AlertManager	🚧 Partial	6 rule files deployed as ConfigMaps via Helm; Alertmanager not configured (no routing or receivers)
PostgreSQL metrics	pg_exporter	📋 Planned	Not yet configured
Prediction drift (Evidently)	Evidently	📋 Planned	Requires ground truth feedback loop
Model performance monitoring	MLflow + Evidently	📋 Planned	Ground truth lag ~90 min after match
Log aggregation	ELK / Loki	📋 Planned	stdout today

Gaps and planned work¶

Priority 1 (next sprint):

- Configure Alertmanager routing and receivers (rules are already deployed as ConfigMaps).

Priority 2:

- Integrate Evidently for feature drift detection (offline batch report first).
- Add pg_exporter sidecar for PostgreSQL query latency.

Priority 3:

- Real-time ground truth feedback loop (match result arrives ~90 min after KO).
- Automated retraining trigger on confirmed drift signal.