
Monitoring Overview

Status: ✅ Operational — Prometheus metrics, Evidently drift detection, ML quality monitor, and Grafana dashboards deployed. AlertManager alert rules pending.


Current Observability Architecture

flowchart LR
    subgraph Services ["Runtime Services"]
        API[FastAPI\nInference Service]
        CW[Celery Workers]
        RMQ[RabbitMQ]
    end

    subgraph Collection ["Metrics Collection ✅"]
        PROM[Prometheus\nGET /metrics]
        KSM[kube-state-metrics]
        NE[node-exporter]
    end

    subgraph Dashboards ["Visualization ✅ Operational"]
        GRAF[Grafana]
    end

    subgraph MLQuality ["ML Quality ✅ Implemented"]
        EV[Evidently\nDrift Detection]
    end

    API -->|/metrics endpoint| PROM
    CW -->|queue/worker stats| PROM
    KSM --> PROM
    NE --> PROM
    PROM -->|queried by| GRAF
    API -.->|prediction logs| EV

Implemented: Prometheus Metrics

The inference service exports 9 metrics from GET /metrics:

| Metric | Type | What it measures |
|---|---|---|
| http_requests_total | Counter | HTTP request volume (by method, path, status code) |
| http_request_duration_seconds | Histogram | End-to-end HTTP request latency |
| prediction_requests_total | Counter | Prediction tasks dispatched to the Celery ml queue (source="sync") |
| prediction_duration_seconds | Histogram | End-to-end prediction latency, including the Celery queue round trip |
| inference_duration_seconds | Histogram | Pure ML inference time inside the Celery worker |
| prediction_confidence | Histogram | Model predicted probability per outcome class |
| model_info | Gauge | Loaded model metadata (name, version, stage labels); value=1 when loaded |
| model_registered_at_seconds | Gauge | Unix timestamp when the model version was last loaded by the worker |
| model_feature_drift_score | Gauge | Evidently dataset drift score; refreshed by GET /monitoring/drift |

Operational access today (no Grafana required):

# Raw Prometheus metrics
curl http://api.soccer.dmitryivanov.dev/metrics

# Service health
curl http://api.soccer.dmitryivanov.dev/healthcheck/

# Celery queue depth
curl http://api.soccer.dmitryivanov.dev/monitoring/celery/queues

# Active workers
curl http://api.soccer.dmitryivanov.dev/monitoring/celery/workers

Full metrics reference: Metrics reference · Coverage matrix


Operational monitoring today

Two Grafana dashboards are deployed in the SoccerPredictAI folder:

  • Soccer — ML Quality & Betting — rolling log-loss, ECE, hit-rate, prediction drift (fed by soccer_ml_monitor_quality_01 DAG).
  • SoccerPredictAI — request rate, latency, error rate, queue depth, model version, drift score.
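For readers unfamiliar with the quality metrics named on the first dashboard, the sketch below shows textbook definitions of log-loss, hit-rate, and expected calibration error (ECE) in plain Python. It is illustrative only: the function names, the 10-bin default, and the class-index encoding of outcomes are assumptions, and the soccer_ml_monitor_quality_01 DAG may compute these differently.

```python
import math
from typing import Sequence


def _argmax(p: Sequence[float]) -> int:
    """Index of the most probable class."""
    return max(range(len(p)), key=p.__getitem__)


def log_loss(probs: Sequence[Sequence[float]], labels: Sequence[int],
             eps: float = 1e-15) -> float:
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for p, y in zip(probs, labels):
        total -= math.log(min(max(p[y], eps), 1.0))  # clip to avoid log(0)
    return total / len(labels)


def hit_rate(probs: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Fraction of matches where the most probable class was the true one."""
    hits = sum(1 for p, y in zip(probs, labels) if _argmax(p) == y)
    return hits / len(labels)


def expected_calibration_error(probs: Sequence[Sequence[float]],
                               labels: Sequence[int],
                               n_bins: int = 10) -> float:
    """Weighted mean |accuracy - confidence| over equal-width confidence bins."""
    bins = [[0, 0.0, 0] for _ in range(n_bins)]  # [count, confidence sum, hits]
    for p, y in zip(probs, labels):
        pred = _argmax(p)
        conf = p[pred]
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += int(pred == y)
    n = len(labels)
    ece = 0.0
    for count, conf_sum, hits in bins:
        if count:
            ece += (count / n) * abs(hits / count - conf_sum / count)
    return ece
```

A rolling variant would apply the same functions to the most recent N settled matches only.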

Manual triage endpoints remain useful for direct inspection:

  1. Health check: GET /healthcheck/ — service up and model loaded.
  2. Metrics scrape: GET /metrics — point-in-time counters and gauges.
  3. Queue inspection: GET /monitoring/celery/queues — inference queue depth.
  4. Model load state: model_info == 0 means workers are running but no model is loaded.
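The model-load check in step 4 can be scripted against the text returned by GET /metrics. The following is a minimal sketch using only the standard library; the parser handles the common `name{labels} value` line shape of the Prometheus text format and is not a full implementation of it. The function names and the sample label values in the usage note are hypothetical.

```python
import re
from typing import Dict, List, Tuple
from urllib.request import urlopen

Sample = Tuple[str, Dict[str, str], float]

# metric name, optional {labels}, value, optional trailing timestamp
_SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> List[Sample]:
    """Parse Prometheus text exposition lines into (name, labels, value)."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def model_loaded(samples: List[Sample]) -> bool:
    """True if any model_info series reports value 1 (model loaded)."""
    return any(name == "model_info" and value == 1.0
               for name, _labels, value in samples)


def fetch_metrics(url: str, timeout: float = 5.0) -> str:
    """Download the raw metrics page; requires network access."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, `model_loaded(parse_metrics(fetch_metrics("http://api.soccer.dmitryivanov.dev/metrics")))` returns False both when model_info is 0 and when the series is absent, so either case should prompt a look at the workers.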

See Runbooks: Alerts for incident response procedures.


Implemented: Evidently Drift Detection

Feature drift detection is operational via three paths:

  • src/pipelines/monitor_drift.py: DVC stage (monitor_drift) that writes reports/drift/latest.json
  • airflow/dags/ml_monitor_drift_01.py: scheduled Airflow DAG
  • GET /monitoring/drift: REST endpoint that reads reports/drift/latest.json and refreshes the model_feature_drift_score Prometheus gauge

See Evidently Integration for design details.
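As a rough illustration of the read side of this flow, the sketch below loads a drift report from disk and pulls out a single numeric score, which is the kind of value a gauge such as model_feature_drift_score would be set to. The schema of reports/drift/latest.json is not described on this page, so the key name `drift_score` and both function names are hypothetical placeholders.

```python
import json
from pathlib import Path
from typing import Any, Optional

# Hypothetical key; the real report schema is not documented on this page.
SCORE_KEY = "drift_score"


def extract_drift_score(report: Any, key: str = SCORE_KEY) -> Optional[float]:
    """Return the numeric score stored under `key`, or None if unusable."""
    if not isinstance(report, dict):
        return None
    value = report.get(key)
    # bool is a subclass of int, so reject it explicitly
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    return float(value)


def read_drift_score(path: str = "reports/drift/latest.json") -> Optional[float]:
    """Read the latest report from disk; None if the file does not exist."""
    report_file = Path(path)
    if not report_file.exists():
        return None
    report = json.loads(report_file.read_text(encoding="utf-8"))
    return extract_drift_score(report)
```

Returning None rather than 0.0 for a missing or malformed report keeps "no data" distinguishable from "no drift".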


Planned: Phase 2 (Current Priority)

  • [ ] AlertManager rule: model_info == 0 — fires if workers lose the model

Planned: Phase 3

  • [ ] Drift metrics exported to Prometheus
  • [ ] PostgreSQL query latency via pg_exporter
  • [ ] Log aggregation (Loki or ELK)
  • [ ] On-call escalation policy

Alert Rules (Planned)

Status: 📋 Planned — AlertManager is not deployed. No active alert rules or notification channels exist today.

Service health

| Condition | Intended severity |
|---|---|
| model_info == 0 | P1: model not available |
| Sustained http_request_duration_seconds p99 > 500 ms | P2: latency SLO breach |
| Elevated http_requests_total{status_code=~"5.."} rate | P2: error spike |
| Service unavailable (/healthcheck/ failing) | P1 |
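The p99 condition depends on estimating a quantile from histogram buckets. The sketch below shows the usual approach: linear interpolation inside the bucket that contains the target rank, which is the same idea as PromQL's histogram_quantile() in simplified form. The bucket boundaries and counts in the usage note are invented for illustration, and a real rule would evaluate bucket rates over a time window rather than raw cumulative counts.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, ending with the +Inf bucket, as exposed by `le` labels.
    """
    if not buckets or not 0.0 <= q <= 1.0:
        raise ValueError("need at least one bucket and 0 <= q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # first bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into +Inf
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # linear interpolation within the bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


def breaches_latency_slo(buckets: List[Tuple[float, float]],
                         threshold_seconds: float = 0.5) -> bool:
    """True if the estimated p99 exceeds the 500 ms threshold from the table."""
    p99 = histogram_quantile(0.99, buckets)
    return not math.isnan(p99) and p99 > threshold_seconds
```

With buckets `[(0.1, 50), (0.25, 90), (0.5, 98), (1.0, 100), (math.inf, 100)]`, the 99th-ranked observation falls halfway through the 0.5 s to 1.0 s bucket, so the estimate is 0.75 s and the check reports a breach. The estimate is only as precise as the bucket layout allows.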

Async pipeline

| Condition | Intended severity |
|---|---|
| celery_queue_depth growing without bound | P2 |
| No workers detected via GET /monitoring/celery/workers | P1: no workers processing |
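"Growing without bound" needs a concrete test before it can become a rule. One possible reading, sketched below, is that the depth rose on every one of the last few polls and ended above a floor. Because queue depth is currently exposed only through GET /monitoring/celery/queues, the samples are assumed to come from polling that endpoint; the function name and both thresholds are hypothetical.

```python
from typing import Sequence


def queue_is_growing(depths: Sequence[int],
                     min_samples: int = 5,
                     min_depth: int = 100) -> bool:
    """True if the last `min_samples` depths strictly increase and the
    most recent depth is at least `min_depth`.

    `depths` is ordered oldest to newest, one value per poll.
    """
    if len(depths) < min_samples:
        return False  # not enough history to judge a trend
    window = depths[-min_samples:]
    strictly_increasing = all(b > a for a, b in zip(window, window[1:]))
    return strictly_increasing and window[-1] >= min_depth
```

The floor avoids flagging a nearly empty queue that happens to tick upward, and the strict-increase requirement means a single drained batch resets the signal.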

Alert routing (intended): P1/P2 → notify operator with runbook link. P3/P4 → log only.


Incident Response

Detection is manual today — via GET /metrics, GET /healthcheck/, and GET /monitoring/celery/*.

Severity levels

| Level | Definition | Response time |
|---|---|---|
| P1 | API down, no predictions served | Immediate (< 15 min) |
| P2 | Degraded latency or partial failure | < 1 hour |
| P3 | Data pipeline failure, no model impact yet | < 4 hours |
| P4 | Non-critical warning, cosmetic issues | Next business day |
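The severity table and the intended routing rule (P1/P2 notify the operator, P3/P4 log only) can be captured in a small lookup, which may be handy once alerting is wired up. This is a sketch of one possible encoding, not existing project code; the names `Severity`, `RESPONSE_TARGET`, and `route` are hypothetical, and "next business day" is represented as None because it is not a fixed duration.

```python
from datetime import timedelta
from enum import Enum
from typing import Dict, Optional


class Severity(Enum):
    P1 = "P1"  # API down, no predictions served
    P2 = "P2"  # degraded latency or partial failure
    P3 = "P3"  # data pipeline failure, no model impact yet
    P4 = "P4"  # non-critical warning, cosmetic issues


# Response targets from the table; None means "next business day".
RESPONSE_TARGET: Dict[Severity, Optional[timedelta]] = {
    Severity.P1: timedelta(minutes=15),
    Severity.P2: timedelta(hours=1),
    Severity.P3: timedelta(hours=4),
    Severity.P4: None,
}


def route(severity: Severity) -> str:
    """Intended routing: P1/P2 notify the operator, P3/P4 are logged only."""
    if severity in (Severity.P1, Severity.P2):
        return "notify_operator"
    return "log_only"
```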

Response process

  1. Detect: GET /healthcheck/, GET /metrics, GET /monitoring/celery/queues
  2. Diagnose: Prometheus metrics, kubectl logs, queue depth, model_info gauge
  3. Mitigate: restart pod, roll back model alias, clear queue
  4. Resolve: confirm metrics return to baseline
  5. Document: record what happened and what changed


Coverage Matrix

This is the authoritative source of truth for what observability is actually in place today.

Last updated: May 27, 2026

| Layer | Tool | Status | What it covers |
|---|---|---|---|
| API request rate | Prometheus middleware | ✅ Implemented | http_requests_total counter |
| API latency | Prometheus histogram | ✅ Implemented | http_request_duration_seconds |
| Prediction volume | Prometheus counter | ✅ Implemented | prediction_requests_total |
| Prediction latency | Prometheus histogram | ✅ Implemented | prediction_duration_seconds |
| Inference latency (worker) | Prometheus histogram | ✅ Implemented | inference_duration_seconds |
| Prediction confidence distribution | Prometheus histogram | ✅ Implemented | prediction_confidence |
| Model metadata | Prometheus gauge | ✅ Implemented | model_info (name, version, stage labels) |
| Model load timestamp | Prometheus gauge | ✅ Implemented | model_registered_at_seconds |
| Feature drift score | Prometheus gauge | ✅ Implemented | model_feature_drift_score; refreshed by GET /monitoring/drift |
| Celery queue depth | GET /monitoring/celery/queues | ✅ Implemented | Per-queue message count (REST only, not Prometheus-scraped) |
| Celery active workers | GET /monitoring/celery/workers | ✅ Implemented | Worker ping status (REST only) |
| Task status polling | GET /monitoring/task_status/{id} | ✅ Implemented | Celery result backend |
| Service liveness | GET /healthcheck/ | ✅ Implemented | Memory + status |
| Feature drift (Evidently) | DVC stage + Airflow DAG + REST | ✅ Implemented | src/pipelines/monitor_drift.py; GET /monitoring/drift; model_feature_drift_score gauge |
| ML quality metrics (rolling) | Airflow DAG + node-exporter | ✅ Implemented | soccer_ml_monitor_quality_01 DAG; log-loss, ECE, hit-rate |
| Grafana dashboards | Grafana | ✅ Implemented | Two dashboards deployed in the SoccerPredictAI folder |
| AlertManager alert rules | AlertManager | 📋 Planned | Runbooks written; no active rules or notification channels yet |
| PostgreSQL metrics | pg_exporter | 📋 Planned | Not yet configured |
| Prediction drift (Evidently) | Evidently | 📋 Planned | Requires ground truth feedback loop |
| Model performance monitoring | MLflow + Evidently | 📋 Planned | Ground truth lag ~90 min after match |
| Log aggregation | ELK / Loki | 📋 Planned | stdout today |

Gaps and planned work

Priority 1 (next sprint):

  • Configure the AlertManager rule for model_info == 0 (model not loaded).

Priority 2:

  • Add a pg_exporter sidecar for PostgreSQL query latency.

Priority 3:

  • Real-time ground truth feedback loop (match results arrive ~90 min after kickoff).
  • Automated retraining trigger on a confirmed drift signal.

The Grafana dashboard and Evidently feature drift items previously listed under Priority 1 and Priority 2 are now implemented (see the matrix above).