Serving
The serving layer exposes trained models as an operational inference service. It is deployed on Kubernetes, receives prediction requests over HTTP, dispatches inference tasks to Celery workers, and returns structured probability output.
See Current Serving Status for a full readiness matrix.
Responsibilities
The serving layer is responsible for:
- exposing the canonical inference API (sync and async prediction endpoints),
- validating request schemas before any inference logic runs,
- dispatching inference tasks to the Celery ml queue,
- loading the registered model from the MLflow Registry (once per worker process),
- returning structured prediction output with model traceability metadata,
- surfacing health and operational metrics.
The serving layer is not responsible for:
- training or evaluating models — see ML,
- promoting models between registry stages — see Model Registry,
- scraping or storing raw data — see Data.
Scope of this section
| Page | Content |
|---|---|
| API Contract | Canonical endpoint set, request/response schemas, error semantics |
| Examples | Concrete cURL and Python examples for all implemented endpoints |
| Inference Modes | Sync vs async execution paths and operational trade-offs |
| Deployment | Serving-specific runtime components, configuration, model loading |
| Health & Failures | Health probes, degraded-mode behavior, failure responses |
| Performance | Latency behavior, SLO targets, interpretation guide |
| Current Status | Authoritative readiness matrix for the serving subsystem |
Boundaries
- High-level runtime design and sequence diagrams belong in Architecture: Runtime View.
- Physical topology and deployment constraints belong in Architecture: Deployment View.
- Model input/output contract belongs in ML: Model Contract.
- Model promotion lifecycle belongs in ML: Model Registry.
- Global implementation readiness belongs in Status.
This section deepens those topics for the inference API subsystem specifically.
Current Status
This is the authoritative source of truth for what the inference API does today.
Last updated: May 27, 2026
Component status
| Component | Status | Notes |
|---|---|---|
| GET /predict/predictions/ (bulk parquet read) | ✅ Implemented | All precomputed predictions, display cols only |
| GET /predict/precomputed/{match_id} (parquet lookup, no Celery) | ✅ Implemented | Direct cache read, 404 if absent |
| GET /predict/cards/ (predictions + Fonbet odds merged) | ✅ Implemented | Primary Streamlit UI data source |
| GET /predict/region-roi/ (live-betting ROI per region) | ✅ Implemented | From roi_by_region.csv produced by live-betting stage |
| GET /predict/odds/ (Fonbet 1X2 odds) | ✅ Implemented | Reads fonbet_odds.parquet |
| GET /predict/{match_id} (sync Celery dispatch, 30 s timeout) | ✅ Implemented | Features from parquet → Celery ml queue |
| GET /predict/model/info (MLflow registry metadata, sync) | ✅ Implemented | Shows loaded model version and metrics |
| GET /monitoring/task_status/{task_id} (poll Celery result) | ✅ Implemented | Status + result polling |
| GET /monitoring/celery/queues (queue stats) | ✅ Implemented | Active/scheduled/reserved counts |
| GET /monitoring/celery/workers (worker stats) | ✅ Implemented | Active queues + ping |
| GET /monitoring/drift (Evidently drift report) | ✅ Implemented | Reads reports/drift/latest.json; refreshes Prometheus gauge |
| GET /healthcheck/ (liveness: DB connectivity check) | ✅ Implemented | K8s liveness probe |
| GET /metrics (Prometheus) | ✅ Implemented | 9 counters/histograms/gauges |
| Pydantic request validation | ✅ Implemented | src/app/schemas/predict.py |
| Model lazy-loading from MLflow registry | ✅ Implemented | Once per Celery worker process |
| Streamlit UI | ✅ Implemented | src/ui/app/main.py — match list with predictions, Fonbet odds, Value bets signal, prediction accuracy (Pred), dynamic ROI panel, Min region ROI slider, filters: Region / Status / Period |
| HTTP batch endpoint (POST /predict/batch) | ❌ Out of scope | Batch inference is DVC-only by design; see ADR-0006 |
| Grafana dashboard for API metrics | 📋 Planned | Prometheus collecting; dashboards not yet deployed |
| Docker image | ✅ Built | Multi-stage build |
| Kubernetes deployment | ✅ Deployed | 1 API pod, 1 Celery worker (default values.yaml; HPA template exists but autoscaling.enabled: false) |
| Helm chart | ✅ Complete | Parameterized values |
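The table lists GET /monitoring/task_status/{task_id} for polling a Celery result. A minimal client-side sketch of such a polling loop is shown below. It assumes the response body carries a "status" field holding a Celery state name; that field name, the `fetch_status` callable, and the default interval are illustrative assumptions, not the documented schema (see API Contract for the real one).

```python
import time
from typing import Any, Callable, Dict

# Celery's terminal task states; a task in one of these will not change again.
TERMINAL_STATES = {"SUCCESS", "FAILURE", "REVOKED"}


def poll_task_status(
    fetch_status: Callable[[str], Dict[str, Any]],
    task_id: str,
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Poll until the task reaches a terminal state or the timeout expires.

    `fetch_status` is a hypothetical callable that performs
    GET /monitoring/task_status/{task_id} and returns the decoded JSON body.
    The "status" key is an assumed field name, not the documented schema.
    """
    deadline = clock() + timeout_s
    while True:
        payload = fetch_status(task_id)
        if payload.get("status") in TERMINAL_STATES:
            return payload
        if clock() >= deadline:
            raise TimeoutError(f"task {task_id} not finished after {timeout_s} s")
        sleep(interval_s)
```

Passing the HTTP call in as a function keeps the loop independent of any particular HTTP client, and the injectable `sleep` and `clock` make the timeout path easy to exercise in tests.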
Inference flow (precomputed path, Streamlit UI)
Streamlit UI → GET /predict/cards/
↓
MatchCardService: merge predictions.parquet + fonbet_odds.parquet
(in-memory cache, no MinIO call unless files changed)
↓
JSON response: probabilities + odds + outcome + Fonbet URL
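The "in-memory cache, no MinIO call unless files changed" step can be illustrated with a small sketch. It assumes change detection relies on a cheap version token (for example an object ETag or last-modified value) and that the merge is a left join on a match_id column. The class and function names, the token mechanism, and the column names are hypothetical and are not taken from MatchCardService.

```python
from typing import Any, Callable, Dict, Hashable, List


class VersionedCache:
    """Cache a loaded value and reload it only when its version token changes."""

    def __init__(self, get_version: Callable[[], Hashable], load: Callable[[], Any]) -> None:
        self._get_version = get_version  # cheap check, e.g. reading an object ETag
        self._load = load                # expensive read, e.g. downloading a parquet file
        self._version: Hashable = None
        self._value: Any = None
        self._loaded = False

    def get(self) -> Any:
        version = self._get_version()
        if not self._loaded or version != self._version:
            # First access or the source changed: reload and remember the token.
            self._value = self._load()
            self._version = version
            self._loaded = True
        return self._value


def merge_cards(predictions: List[Dict[str, Any]], odds: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Left-join odds onto predictions by match_id (assumed join key)."""
    odds_by_match = {row["match_id"]: row for row in odds}
    cards = []
    for pred in predictions:
        card = dict(pred)
        match_odds = odds_by_match.get(pred["match_id"], {})
        # Matches without odds keep their prediction and get odds=None.
        card["odds"] = {k: v for k, v in match_odds.items() if k != "match_id"} or None
        cards.append(card)
    return cards
```

In this sketch, one `VersionedCache` would wrap each source file, and `merge_cards` would run on the two cached values when a request arrives.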
Inference flow (live sync path)
Client → GET /predict/{match_id}?stage=champion
↓
FeatureLookupService: read match_features.parquet (in-memory cache)
↓
Celery task enqueued → ml queue
↓
Worker: load model from MLflow registry (lazy, cached per worker)
↓
Worker: model.predict_proba(features)
↓
Task result polled (≤ 30 s)
↓
FastAPI returns JSON response
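The "lazy, cached per worker" model load in the flow above can be sketched as follows. This is an illustration under stated assumptions, not the project's task code: `StubModel`, `load_model_from_registry`, `get_model`, and `predict_task` are hypothetical names, the stub stands in for the MLflow registry load, and the home/draw/away output keys are assumed.

```python
import functools
from typing import Dict, List, Sequence


class StubModel:
    """Stand-in for a registered classifier; returns fixed three-way probabilities."""

    def predict_proba(self, rows: Sequence[Sequence[float]]) -> List[List[float]]:
        return [[0.5, 0.3, 0.2] for _ in rows]


def load_model_from_registry(stage: str) -> StubModel:
    """Hypothetical loader; a real worker would fetch the model from MLflow here."""
    return StubModel()


@functools.lru_cache(maxsize=None)
def get_model(stage: str) -> StubModel:
    """Load the model for a stage on first use, then reuse it within this process."""
    return load_model_from_registry(stage)


def predict_task(features: Sequence[float], stage: str = "champion") -> Dict[str, object]:
    """Body of a hypothetical Celery task: lazy-load the model, then score one row."""
    home, draw, away = get_model(stage).predict_proba([features])[0]
    return {
        "stage": stage,
        "probabilities": {"home": home, "draw": draw, "away": away},
    }
```

Because the cache lives in process memory, each Celery worker process pays the load cost once per stage, and a newly promoted model is only picked up after the worker restarts or the cache is cleared.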
Known limitations
- HTTP batch endpoint is out of scope by design (see ADR-0006) — batch inference runs via DVC pipeline only.
- RabbitMQ is single-broker (no clustering); task backlog possible under sustained load.
- Model promotion from candidate to champion requires manual approval.
- Grafana dashboards are not yet deployed; Prometheus is exporting data.
- /monitoring/* endpoints are currently unauthenticated (restriction is planned).
SLO targets (informal)
| SLO | Target | Status |
|---|---|---|
| Sync p50 latency | ≤ 50 ms | Not formally measured |
| Sync p99 latency | ≤ 200 ms | Not formally measured |
| Service availability (30d) | ≥ 99% | Not formally tracked |
Load test baseline available via tests/load/locustfile.py.
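Since the latency targets are not yet formally measured, the sketch below shows one way a set of latency samples (for example, exported from a load-test run) could be compared against them. The function names and the choice of the "inclusive" interpolation method are assumptions; only the 50 ms and 200 ms thresholds come from the table above.

```python
import statistics
from typing import Dict, Sequence

# Targets from the SLO table, in milliseconds.
SLO_TARGETS_MS = {"p50": 50.0, "p99": 200.0}


def latency_percentiles(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Return p50 and p99 of the samples (needs at least two data points)."""
    # n=100 yields 99 cut points; index 49 is the 50th percentile, 98 the 99th.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p99": cuts[98]}


def check_slo(samples_ms: Sequence[float]) -> Dict[str, bool]:
    """Map each SLO name to True when the observed percentile meets its target."""
    observed = latency_percentiles(samples_ms)
    return {name: observed[name] <= target for name, target in SLO_TARGETS_MS.items()}
```

A result such as `{"p50": False, "p99": True}` would mean the median target is missed while the tail target is met. A small sample makes the p99 estimate unreliable, so such a check is only meaningful on a reasonably long run.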