Serving

The serving layer exposes trained models as an operational inference service. It is deployed on Kubernetes, receives prediction requests over HTTP, dispatches inference tasks to Celery workers, and returns structured probability output.

See Current Serving Status for a full readiness matrix.


Responsibilities

The serving layer is responsible for:

  • exposing the canonical inference API (sync and async prediction endpoints),
  • validating request schemas before any inference logic runs,
  • dispatching inference tasks to the Celery ml queue,
  • loading the registered model from the MLflow Registry (once per worker process),
  • returning structured prediction output with model traceability metadata,
  • surfacing health and operational metrics.

The serving layer is not responsible for:

  • training or evaluating models — see ML,
  • promoting models between registry stages — see Model Registry,
  • scraping or storing raw data — see Data.

Scope of this section

Page | Content
--- | ---
API Contract | Canonical endpoint set, request/response schemas, error semantics
Examples | Concrete cURL and Python examples for all implemented endpoints
Inference Modes | Sync vs async execution paths and operational trade-offs
Deployment | Serving-specific runtime components, configuration, model loading
Health & Failures | Health probes, degraded-mode behavior, failure responses
Performance | Latency behavior, SLO targets, interpretation guide
Current Status | Authoritative readiness matrix for the serving subsystem

Boundaries

This section deepens those topics for the inference API subsystem specifically.


Current Status

This is the authoritative source of truth for what the inference API does today.

Last updated: May 27, 2026

Component status

Component | Status | Notes
--- | --- | ---
GET /predict/predictions/ (bulk parquet read) | ✅ Implemented | All precomputed predictions, display cols only
GET /predict/precomputed/{match_id} (parquet lookup, no Celery) | ✅ Implemented | Direct cache read, 404 if absent
GET /predict/cards/ (predictions + Fonbet odds merged) | ✅ Implemented | Primary Streamlit UI data source
GET /predict/region-roi/ (live-betting ROI per region) | ✅ Implemented | From roi_by_region.csv produced by live-betting stage
GET /predict/odds/ (Fonbet 1X2 odds) | ✅ Implemented | Reads fonbet_odds.parquet
GET /predict/{match_id} (sync Celery dispatch, 30 s timeout) | ✅ Implemented | Features from parquet → Celery ml queue
GET /predict/model/info (MLflow registry metadata, sync) | ✅ Implemented | Shows loaded model version and metrics
GET /monitoring/task_status/{task_id} (poll Celery result) | ✅ Implemented | Status + result polling
GET /monitoring/celery/queues (queue stats) | ✅ Implemented | Active/scheduled/reserved counts
GET /monitoring/celery/workers (worker stats) | ✅ Implemented | Active queues + ping
GET /monitoring/drift (Evidently drift report) | ✅ Implemented | Reads reports/drift/latest.json; refreshes Prometheus gauge
GET /healthcheck/ (liveness, DB connectivity check) | ✅ Implemented | K8s liveness probe
GET /metrics (Prometheus) | ✅ Implemented | 9 counters/histograms/gauges
Pydantic request validation | ✅ Implemented | src/app/schemas/predict.py
Model lazy-loading from MLflow registry | ✅ Implemented | Once per Celery worker process
Streamlit UI | ✅ Implemented | src/ui/app/main.py: match list with predictions, Fonbet odds, Value bets signal, prediction accuracy (Pred), dynamic ROI panel, Min region ROI slider, filters: Region / Status / Period
HTTP batch endpoint (POST /predict/batch) | ❌ Out of scope | Batch inference is DVC-only by design; see ADR-0006
Grafana dashboard for API metrics | 📋 Planned | Prometheus collecting; dashboards not yet deployed
Docker image | ✅ Built | Multi-stage build
Kubernetes deployment | ✅ Deployed | 1 API pod, 1 Celery worker (default values.yaml; HPA template exists but autoscaling.enabled: false)
Helm chart | ✅ Complete | Parameterized values
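
As an illustration of how a client might use GET /monitoring/task_status/{task_id}, the sketch below polls until the task reaches a terminal Celery state or a deadline passes. The response field names ("status", "result"), the `fetch_status` callable, and the polling interval are assumptions made for this example; the actual response schema is defined on the API Contract page.

```python
import time
from typing import Any, Callable, Dict

# Terminal Celery task states. The JSON field names used below are assumptions.
TERMINAL_STATES = {"SUCCESS", "FAILURE", "REVOKED"}


def poll_task_status(
    fetch_status: Callable[[str], Dict[str, Any]],
    task_id: str,
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until the task reaches a terminal state or the deadline passes.

    fetch_status is a hypothetical callable that performs
    GET /monitoring/task_status/{task_id} and returns the decoded JSON body.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        payload = fetch_status(task_id)
        if payload.get("status") in TERMINAL_STATES:
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} not finished after {timeout_s} s")
        sleep(interval_s)
```

Passing the HTTP call in as a function keeps the loop independent of any particular HTTP client and makes it easy to exercise with a fake.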

Inference flow (precomputed path — Streamlit UI)

Streamlit UI → GET /predict/cards/
    MatchCardService: merge predictions.parquet + fonbet_odds.parquet
    (in-memory cache, no MinIO call unless files changed)
    JSON response: probabilities + odds + outcome + Fonbet URL
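
The "no MinIO call unless files changed" behavior can be pictured as a cache that reloads an object only when a cheap change marker differs from the one it saw last. The sketch below is a minimal illustration under assumptions: `ChangeAwareCache`, `fingerprint`, and `loader` are hypothetical names, and how MatchCardService actually detects changes (ETag, timestamp, or something else) is not stated here.

```python
from typing import Any, Callable, Dict, Hashable, Tuple


class ChangeAwareCache:
    """Reload a value only when its fingerprint changes.

    fingerprint(key) is assumed to be cheap (for example an object's ETag
    or last-modified marker); loader(key) is the expensive full read.
    """

    def __init__(
        self,
        fingerprint: Callable[[str], Hashable],
        loader: Callable[[str], Any],
    ) -> None:
        self._fingerprint = fingerprint
        self._loader = loader
        self._entries: Dict[str, Tuple[Hashable, Any]] = {}

    def get(self, key: str) -> Any:
        tag = self._fingerprint(key)
        cached = self._entries.get(key)
        if cached is not None and cached[0] == tag:
            return cached[1]  # unchanged: serve from memory
        value = self._loader(key)  # changed or first access: full read
        self._entries[key] = (tag, value)
        return value
```

With this shape, repeated requests for predictions.parquet and fonbet_odds.parquet cost one fingerprint check each, and the full read happens only after one of the files is replaced.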

Inference flow (live sync path)

Client → GET /predict/{match_id}?stage=champion
    FeatureLookupService: read match_features.parquet (in-memory cache)
    Celery task enqueued → ml queue
    Worker: load model from MLflow registry (lazy, cached per worker)
    Worker: model.predict_proba(features)
    Task result polled (≤ 30 s)
    FastAPI returns JSON response

Known limitations

  • HTTP batch endpoint is out of scope by design (see ADR-0006) — batch inference runs via DVC pipeline only.
  • RabbitMQ is single-broker (no clustering); task backlog possible under sustained load.
  • Model promotion from candidate to champion requires manual approval.
  • Grafana dashboards are not yet deployed; Prometheus is already collecting the metrics.
  • /monitoring/* endpoints are currently unauthenticated; restricting access is planned.

SLO targets (informal)

SLO | Target | Status
--- | --- | ---
Sync p50 latency | ≤ 50 ms | Not formally measured
Sync p99 latency | ≤ 200 ms | Not formally measured
Service availability (30d) | ≥ 99% | Not formally tracked

Load test baseline available via tests/load/locustfile.py.
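
Once latency samples are available (for example from a load-test run), they could be compared against the informal latency targets with a small helper like the one below. This is a sketch under assumptions: the nearest-rank percentile method and the names `percentile` and `check_slo` are choices made for this example, not part of the project.

```python
from typing import Dict, Sequence

# Informal sync latency targets from the SLO table, in milliseconds.
SLO_TARGETS_MS = {"p50": 50.0, "p99": 200.0}


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in [0, 100]."""
    if not samples:
        raise ValueError("no latency samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, clamped to at least rank 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def check_slo(latencies_ms: Sequence[float]) -> Dict[str, bool]:
    """Return, per target, whether the observed percentile meets it."""
    observed = {
        "p50": percentile(latencies_ms, 50),
        "p99": percentile(latencies_ms, 99),
    }
    return {name: observed[name] <= limit for name, limit in SLO_TARGETS_MS.items()}
```

A run where 2 of 100 requests take 300 ms would pass the p50 target and fail the p99 target, which is the kind of tail behavior the p99 line is meant to expose.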