Service & Infrastructure Metrics (Prometheus)

Status: ✅ Operational — GET /metrics endpoint live, 10 metrics exported

Metrics are collected from the FastAPI inference service via src/app/metrics.py and a _PrometheusMiddleware applied to all requests.

Available at: GET /metrics (Prometheus exposition format)
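To make the exposition format concrete, the sketch below parses a small payload into (name, labels, value) tuples using only the standard library. The payload is illustrative: the label names (`method`, `path`, `status`) and all values are assumptions, not output captured from this service. `parse_exposition` is a hypothetical helper that ignores timestamps and does not unescape label values.

```python
import re

# Illustrative payload only: label names and values are assumptions,
# not real output from the service.
SAMPLE = '''\
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/health",status="200"} 12
model_feature_drift_score 0.25
'''

# metric_name, optional {labels}, then the sample value
_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Return a list of (name, labels, value) for every sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comments.
        if not line or line.startswith("#"):
            continue
        match = _LINE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    for sample in parse_exposition(SAMPLE):
        print(sample)
```

In practice a Prometheus server does this parsing itself; a helper like this is mainly useful for smoke tests that check a metric name appears in the response.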

Exported metrics

HTTP layer (✅ live)

Metric	Type	Description
`http_requests_total`	Counter	Total HTTP requests by method, path, and status code
`http_request_duration_seconds`	Histogram	End-to-end HTTP request latency by method and path
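The two HTTP metrics above follow a common pattern: wrap every request, time it, and record the outcome once the response status is known. The following is a dependency-free sketch of that pattern as a pure ASGI middleware. It is not the project's `_PrometheusMiddleware`: `TimingMiddleware`, `demo_app`, and `run_once` are hypothetical names, and plain dictionaries stand in for the Counter and Histogram objects a Prometheus client library would provide.

```python
import asyncio
import time
from collections import defaultdict


class TimingMiddleware:
    """Pure-ASGI middleware that counts requests and sums their latency.

    Hypothetical stand-in: dicts replace real Counter/Histogram objects.
    """

    def __init__(self, app):
        self.app = app
        self.requests_total = defaultdict(int)          # (method, path, status) -> count
        self.duration_seconds_sum = defaultdict(float)  # (method, path) -> seconds

    async def __call__(self, scope, receive, send):
        # Only HTTP requests are measured; other scopes pass through.
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        method, path = scope["method"], scope["path"]
        status = {"code": 500}  # assumed until a response actually starts

        async def send_wrapper(message):
            # The status code is only known when the response starts.
            if message["type"] == "http.response.start":
                status["code"] = message["status"]
            await send(message)

        start = time.perf_counter()
        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            elapsed = time.perf_counter() - start
            self.requests_total[(method, path, status["code"])] += 1
            self.duration_seconds_sum[(method, path)] += elapsed


async def demo_app(scope, receive, send):
    """Tiny ASGI app that always answers 200."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def run_once(middleware, method="GET", path="/health"):
    """Drive one fake request through the middleware; return sent messages."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    await middleware({"type": "http", "method": method, "path": path}, receive, send)
    return sent


if __name__ == "__main__":
    mw = TimingMiddleware(demo_app)
    asyncio.run(run_once(mw))
    print(dict(mw.requests_total))
```

Recording in a `finally` block means requests that raise are still counted, here under an assumed status of 500.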

Prediction API (✅ live)

Metric	Type	Description
`prediction_requests_total`	Counter	On-demand prediction tasks dispatched to the Celery `ml` queue (`source="sync"`)
`prediction_duration_seconds`	Histogram	End-to-end prediction latency including Celery queue roundtrip (sync path)
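Histogram metrics such as `prediction_duration_seconds` are exported as cumulative `_bucket` series keyed by an `le` upper bound, so a percentile has to be estimated rather than read directly. The sketch below shows the linear interpolation that PromQL's `histogram_quantile` applies within a bucket. The bucket bounds and counts are made up for illustration, and `estimate_quantile` is a hypothetical helper.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound,
    ending with the +Inf bucket. Requires 0 < q <= 1.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet

    rank = q * total
    prev_bound, prev_count = 0.0, 0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest
                # finite bound, since nothing more precise is known.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Made-up cumulative counts for bounds of 0.1 s, 0.5 s, 1.0 s and +Inf.
    example = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
    print(estimate_quantile(0.95, example))
```

Because the estimate assumes an even spread inside each bucket, its accuracy depends on how closely the bucket bounds bracket the latencies of interest.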

ML worker / model (✅ live)

Metric	Type	Description
`inference_duration_seconds`	Histogram	Pure ML inference time inside the Celery worker (excluding queue wait)
`prediction_confidence`	Histogram	Model-predicted probability per outcome class (`outcome="home_win|draw|away_win"`)
`model_info`	Gauge	Metadata of the currently loaded model; value=1 when loaded (`model_name`, `version`, `stage` labels)
`model_registered_at_seconds`	Gauge	Unix timestamp when the currently loaded model version was last loaded by the worker
`model_feature_drift_score`	Gauge	Evidently dataset drift score (share of drifted features); updated by `GET /monitoring/drift`

Error counters (✅ live)

Metric	Type	Description
`prediction_timeouts_total`	Counter	Total sync prediction requests that timed out waiting for the ML worker (HTTP 504 responses)

Celery runtime status is available via REST (not scraped by Prometheus):
- GET /monitoring/celery/queues: per-queue message count
- GET /monitoring/celery/workers: active worker ping status
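Since these two endpoints are not scraped, they have to be polled by a client. A minimal standard-library sketch of such a poller follows. The base URL is an assumption, `fetch_json` and `celery_status` are hypothetical helpers, and the response bodies are returned as parsed JSON without assuming any particular schema.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # assumption: replace with the real host


def fetch_json(path, base_url=BASE_URL, opener=urlopen, timeout=5.0):
    """GET base_url + path and decode the JSON body."""
    with opener(f"{base_url}{path}", timeout=timeout) as response:
        return json.load(response)


def celery_status(fetch=fetch_json):
    """Collect both Celery status endpoints into one dict.

    `fetch` is injectable so the function can be exercised without a
    running service.
    """
    return {
        "queues": fetch("/monitoring/celery/queues"),
        "workers": fetch("/monitoring/celery/workers"),
    }


if __name__ == "__main__":
    # Requires the service to be reachable at BASE_URL.
    print(json.dumps(celery_status(), indent=2))
```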

Not yet implemented

RabbitMQ queue metrics via dedicated exporter
Kubernetes CPU / memory / pod restarts
PostgreSQL query latency via pg_exporter
Log aggregation (stdout only today)

Dashboards

Two Grafana dashboards are deployed in the SoccerPredictAI folder:
- Soccer — ML Quality & Betting: rolling log-loss, ECE, hit-rate, prediction drift.
- SoccerPredictAI: request rate, latency, error rate, queue depth, model version, drift score.

See Dashboards and Monitoring Status.