Serving¶

The serving layer exposes trained models as an operational inference service. It is deployed on Kubernetes, receives prediction requests over HTTP, dispatches inference tasks to Celery workers, and returns structured probability output.

See Current Serving Status for a full readiness matrix.

Responsibilities¶

The serving layer is responsible for:

exposing the canonical inference API (sync and async prediction endpoints),
validating request schemas before any inference logic runs,
dispatching inference tasks to the Celery ml queue,
loading the registered model from the MLflow Registry (once per worker process),
returning structured prediction output with model traceability metadata,
surfacing health and operational metrics.
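The "once per worker process" loading behavior listed above can be illustrated with a small, standard-library-only sketch. It is not the project's implementation: `StubModel`, `load_from_registry`, `get_model`, and `run_inference` are hypothetical names, and the stub returns fixed probabilities where a real worker would load the registered model from the MLflow Registry.

```python
import functools
import os

# Counts loader invocations so the once-per-process behavior is observable.
LOAD_CALLS = {"count": 0}


class StubModel:
    """Stand-in for a registered model (hypothetical); returns fixed probabilities."""

    def __init__(self, stage: str) -> None:
        self.stage = stage
        self.loaded_in_pid = os.getpid()

    def predict_proba(self, features: list[float]) -> list[float]:
        # Placeholder output in the order: home win, draw, away win.
        return [0.5, 0.3, 0.2]


def load_from_registry(stage: str) -> StubModel:
    """Hypothetical loader; a real worker would query the model registry here."""
    LOAD_CALLS["count"] += 1
    return StubModel(stage)


@functools.lru_cache(maxsize=None)
def get_model(stage: str) -> StubModel:
    """Load the model on first use, then reuse it for later tasks in this process."""
    return load_from_registry(stage)


def run_inference(features: list[float], stage: str = "champion") -> dict:
    """Task body: fetch the cached model and return probabilities with its stage."""
    model = get_model(stage)
    return {"stage": model.stage, "probabilities": model.predict_proba(features)}
```

Because the cache lives in process memory, each worker process pays the load cost once per stage, and a restart is needed to pick up a newly promoted model unless the cache is cleared explicitly (`get_model.cache_clear()`).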

The serving layer is not responsible for:

training or evaluating models — see ML,
promoting models between registry stages — see Model Registry,
scraping or storing raw data — see Data.

Scope of this section¶

Page	Content
API Contract	Canonical endpoint set, request/response schemas, error semantics
Examples	Concrete cURL and Python examples for all implemented endpoints
Inference Modes	Sync vs async execution paths and operational trade-offs
Deployment	Serving-specific runtime components, configuration, model loading
Health & Failures	Health probes, degraded-mode behavior, failure responses
Performance	Latency behavior, SLO targets, interpretation guide
Current Status	Authoritative readiness matrix for the serving subsystem

Boundaries¶

High-level runtime design and sequence diagrams belong in Architecture: Runtime View.
Physical topology and deployment constraints belong in Architecture: Deployment View.
Model input/output contract belongs in ML: Model Contract.
Model promotion lifecycle belongs in ML: Model Registry.
Global implementation readiness belongs in Status.

This section deepens those topics for the inference API subsystem specifically.

Current Status¶

This is the authoritative source of truth for what the inference API does today.

Last updated: May 27, 2026

Component status¶

Component	Status	Notes
`GET /predict/predictions/` (bulk parquet read)	✅ Implemented	All precomputed predictions, display cols only
`GET /predict/precomputed/{match_id}` (parquet lookup, no Celery)	✅ Implemented	Direct cache read, 404 if absent
`GET /predict/cards/` (predictions + Fonbet odds merged)	✅ Implemented	Primary Streamlit UI data source
`GET /predict/region-roi/` (live-betting ROI per region)	✅ Implemented	From `roi_by_region.csv` produced by live-betting stage
`GET /predict/odds/` (Fonbet 1X2 odds)	✅ Implemented	Reads `fonbet_odds.parquet`
`GET /predict/{match_id}` (sync Celery dispatch, 30 s timeout)	✅ Implemented	Features from parquet → Celery `ml` queue
`GET /predict/model/info` (MLflow registry metadata, sync)	✅ Implemented	Shows loaded model version and metrics
`GET /monitoring/task_status/{task_id}` (poll Celery result)	✅ Implemented	Status + result polling
`GET /monitoring/celery/queues` (queue stats)	✅ Implemented	Active/scheduled/reserved counts
`GET /monitoring/celery/workers` (worker stats)	✅ Implemented	Active queues + ping
`GET /monitoring/drift` (Evidently drift report)	✅ Implemented	Reads `reports/drift/latest.json`; refreshes Prometheus gauge
`GET /healthcheck/` (liveness — DB connectivity check)	✅ Implemented	K8s liveness probe
`GET /metrics` (Prometheus)	✅ Implemented	10 counters/histograms/gauges
Pydantic request validation	✅ Implemented	`src/app/schemas/predict.py`
Model lazy-loading from MLflow registry	✅ Implemented	Once per Celery worker process
Streamlit UI	✅ Implemented	`src/streamlit/main.py` — match list with predictions, Fonbet odds, Value bets signal, prediction accuracy (Pred), dynamic ROI panel, Min region ROI slider, filters: Region / Status / Period
HTTP batch endpoint (`POST /predict/batch`)	❌ Out of scope	Batch inference is DVC-only by design — see ADR-0006
Grafana dashboards	✅ Deployed	Two dashboards in Helm chart: "Soccer — ML Quality & Betting", "SoccerPredictAI" (`k8s/helm/ns_soccer-api/files/dashboards/`)
Docker image	✅ Built	Multi-stage build
Kubernetes deployment	✅ Deployed	1 API pod, 1 Celery worker (default values.yaml; HPA template exists but `autoscaling.enabled: false`)
Helm chart	✅ Complete	Parameterized values

Inference flow (precomputed path — Streamlit UI)¶

Streamlit UI → GET /predict/cards/
                   ↓
    MatchCardService: merge predictions.parquet + fonbet_odds.parquet
    (in-memory cache, no MinIO call unless files changed)
                   ↓
    JSON response: probabilities + odds + outcome + Fonbet URL
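The caching step in this flow can be sketched as a cache keyed on a change token, so the source files are re-read only when the token differs. This is an illustrative sketch under stated assumptions, not the actual `MatchCardService`: the class `MergedCardsCache`, the function `merge_cards`, the reader callables, the change token, and the left join on a `match_id` field are all hypothetical.

```python
from typing import Callable, Optional


def merge_cards(predictions: list[dict], odds: list[dict]) -> list[dict]:
    """Left-join odds onto predictions by match_id (assumed join key)."""
    odds_by_match = {row["match_id"]: row for row in odds}
    cards = []
    for pred in predictions:
        card = dict(pred)
        match_odds = odds_by_match.get(pred["match_id"])
        # Matches without odds keep their prediction and get odds=None.
        card["odds"] = None if match_odds is None else {
            key: value for key, value in match_odds.items() if key != "match_id"
        }
        cards.append(card)
    return cards


class MergedCardsCache:
    """Rebuilds the merged cards only when the change token differs."""

    def __init__(
        self,
        read_predictions: Callable[[], list[dict]],
        read_odds: Callable[[], list[dict]],
        change_token: Callable[[], str],
    ) -> None:
        self._read_predictions = read_predictions
        self._read_odds = read_odds
        self._change_token = change_token
        self._token: Optional[str] = None
        self._cards: Optional[list[dict]] = None

    def get(self) -> list[dict]:
        token = self._change_token()
        if self._cards is None or token != self._token:
            # Cache miss: read both sources and rebuild the merged view.
            self._cards = merge_cards(self._read_predictions(), self._read_odds())
            self._token = token
        return self._cards
```

The change token could be anything cheap to obtain that changes when the files do, such as a modification time or an object-store ETag; which one the service actually uses is not stated here.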

Inference flow (live sync path)¶

Client → GET /predict/{match_id}?stage=champion
              ↓
    FeatureLookupService: read match_features.parquet (in-memory cache)
              ↓
    Celery task enqueued → ml queue
              ↓
    Worker: load model from MLflow registry (lazy, cached per worker)
              ↓
    Worker: model.predict_proba(features)
              ↓
    Task result polled (≤ 30 s)
              ↓
    FastAPI returns JSON response
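The dispatch-then-wait pattern in this flow can be sketched without a broker. In the sketch below a single-thread `ThreadPoolExecutor` stands in for the Celery `ml` queue, and `infer_task` and `predict_sync` are hypothetical names. The 30 s bound comes from the status table; the 504 status returned on timeout is an assumption, since the actual error semantics are defined in the API Contract page.

```python
import concurrent.futures
import time

SYNC_TIMEOUT_S = 30.0  # upper bound on waiting for a task result

# Stand-in for the Celery `ml` queue: one worker thread processes tasks in order.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)


def infer_task(features: dict, delay_s: float = 0.0) -> dict:
    """Hypothetical task body; a real worker would call model.predict_proba."""
    time.sleep(delay_s)  # simulates inference time
    return {"probabilities": [0.5, 0.3, 0.2]}


def predict_sync(
    features: dict,
    timeout_s: float = SYNC_TIMEOUT_S,
    delay_s: float = 0.0,
) -> tuple[int, dict]:
    """Enqueue the task, wait up to timeout_s, and map the outcome to a status."""
    future = _executor.submit(infer_task, features, delay_s)
    try:
        result = future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Assumed mapping: the caller sees a gateway-timeout style response.
        return 504, {"detail": "inference timed out"}
    return 200, result
```

Note that a timeout on the API side does not cancel work already running on the worker, so a slow task keeps occupying the queue after the client has received its error. This is one reason a backlog can build under sustained load.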

Known limitations¶

The HTTP batch endpoint is out of scope by design (see ADR-0006); batch inference runs via the DVC pipeline only.
RabbitMQ runs as a single broker (no clustering), so a task backlog is possible under sustained load.
Model promotion from candidate to champion requires manual approval.
Grafana: two dashboards deployed in Helm chart (soccer-api.json, soccer-ml-quality.json).
/monitoring/* endpoints are currently unauthenticated; restricting access is planned.

SLO targets (informal)¶

SLO	Target	Status
Sync p50 latency	≤ 50 ms	Not formally measured
Sync p99 latency	≤ 200 ms	Not formally measured
Service availability (30d)	≥ 99%	Not formally tracked

Load test baseline available via tests/load/locustfile.py.
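Once latency samples exist, the informal targets above could be checked with a few lines of code. The sketch below is an assumption-laden illustration: `percentile` and `check_slos` are hypothetical helpers, the nearest-rank percentile definition is one of several in common use, and the input is taken to be a plain list of per-request latencies in milliseconds, however they were collected.

```python
import math

# Informal targets from the SLO table, in milliseconds.
SLO_TARGETS_MS = {"p50": 50.0, "p99": 200.0}


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("at least one latency sample is required")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_slos(latencies_ms: list[float]) -> dict:
    """Compare observed p50/p99 latency against the informal targets."""
    observed = {
        "p50": percentile(latencies_ms, 50),
        "p99": percentile(latencies_ms, 99),
    }
    return {
        name: {
            "observed_ms": observed[name],
            "target_ms": target,
            "met": observed[name] <= target,
        }
        for name, target in SLO_TARGETS_MS.items()
    }
```

For example, `check_slos([float(i) for i in range(1, 101)])` reports an observed p50 of 50.0 ms and p99 of 99.0 ms, both within target.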