Performance & SLOs¶

📋 Measurement status: SLO targets below are informal operational goals, not contractual SLOs backed by continuous measurement. Prometheus exports latency histograms; two Grafana dashboards are deployed but do not yet include SLO tracking panels. See Monitoring Status and Roadmap.

This page explains how to interpret serving performance figures, what is currently measured, and what informal SLO targets exist.

Measurement status¶

SLO	Target	Measurement status
Sync `GET /predict/{match_id}` p50 latency	≤ 50 ms	Not formally measured; estimate from manual testing
Sync `GET /predict/{match_id}` p99 latency	≤ 200 ms	Not formally measured; estimate from manual testing
Service availability (30-day)	≥ 99%	Not formally tracked
Async job completion p95	≤ 10 s (typical)	Not formally measured
Sync hard timeout	30 s	Enforced in code — not a target, a bound

As stated at the top of this page, none of these targets is backed by continuous measurement. Prometheus exports the latency histograms needed to track them, but the two deployed Grafana dashboards have no SLO tracking panels yet (see Status).
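To make the targets concrete, the following minimal Python sketch computes nearest-rank percentiles from a list of latency samples, compares them with the informal p50 and p99 targets in the table, and converts the 99% availability goal into a downtime budget. The function names and any sample data fed to them are illustrative assumptions, not part of the project.

```python
import math

# Informal latency targets from the table above, in milliseconds.
TARGETS_MS = {50: 50.0, 99: 200.0}


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_targets(samples):
    """Return {percentile: (observed_ms, target_ms, within_target)}."""
    report = {}
    for pct, target in TARGETS_MS.items():
        observed = percentile(samples, pct)
        report[pct] = (observed, target, observed <= target)
    return report


def allowed_downtime_hours(availability, window_days=30):
    """Downtime budget implied by an availability target over a window."""
    return (1 - availability) * window_days * 24
```

A 99% target over 30 days leaves a budget of about 7.2 hours of downtime.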

A Locust-based load test baseline is available at tests/load/locustfile.py.

Locust load test baseline (local dev)¶

The following figures come from a Locust run against a local dev stack (3 concurrent users). They represent a measured baseline, not a production target.

Endpoint	p50	p95	p99	RPS (tested)
`GET /predict/precomputed/{id}`	—	442 ms	460 ms	~18 RPS

The production ceiling on single-node K8s is estimated at ~5 RPS because of resource constraints. Local dev figures are higher because the local stack has no network overhead and no container resource limits.
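As a sanity check on the baseline, Little's law for a closed system relates concurrency, throughput, and the time per request cycle (response time plus any wait time between requests). The sketch below applies it to the figures from the local run. Whether the Locust users were configured with a wait time is not stated here, so the result should be read as the full cycle time, not the response time alone.

```python
def cycle_time_ms(concurrent_users, throughput_rps):
    """Little's law for a closed system: N = X * (R + Z), so R + Z = N / X.

    Returns the mean time per request cycle (response time R plus
    wait time Z) in milliseconds.
    """
    if throughput_rps <= 0:
        raise ValueError("throughput must be positive")
    return concurrent_users / throughput_rps * 1000


# 3 concurrent users at ~18 RPS, as in the local dev run.
print(round(cycle_time_ms(3, 18), 1))  # 166.7
```

The same relation can be used to estimate how many concurrent clients a given RPS ceiling can sustain at an acceptable cycle time.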

Sync inference latency¶

The sync path (GET /predict/{match_id}) involves:

Pydantic request validation (< 1 ms)
Redis cache lookup (< 5 ms when Redis is healthy)
Cache hit: return immediately — sub-10 ms total
Cache miss: Celery task enqueued → worker picks up → inference → result stored in Redis

On a cache miss, the main cost components are:

Worker queue wait time (near-zero at low load; grows under backlog)
Model predict_proba() call (< 10 ms for current model size)
Feature assembly from pre-computed batch data (< 5 ms)
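These components add up to a rough cache-miss budget. The sketch below uses the upper bounds quoted in the list and a simple queue-wait estimate (jobs ahead × service time ÷ workers). That estimate is a simplifying assumption for illustration, not a measured model of the real queue, and it ignores network and broker overhead.

```python
# Upper bounds quoted in the list above, in milliseconds.
PREDICT_MS = 10.0
FEATURES_MS = 5.0


def queue_wait_ms(jobs_ahead, workers, service_ms=PREDICT_MS + FEATURES_MS):
    """Rough wait estimate: work ahead of this job, spread evenly across workers."""
    if workers < 1:
        raise ValueError("need at least one worker")
    return jobs_ahead * service_ms / workers


def cache_miss_budget_ms(jobs_ahead, workers):
    """Estimated cache-miss latency, excluding network and broker overhead."""
    return queue_wait_ms(jobs_ahead, workers) + FEATURES_MS + PREDICT_MS


print(cache_miss_budget_ms(jobs_ahead=0, workers=2))    # 15.0
print(cache_miss_budget_ms(jobs_ahead=100, workers=2))  # 765.0
```

With an empty queue the compute steps alone stay well under the p50 target; with a backlog, queue wait quickly becomes the largest term.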

The 30 s hard timeout (_SYNC_TIMEOUT) is a ceiling, not a performance target. A timeout response (504 Gateway Timeout) indicates a worker backlog or worker failure, not normal operation.

Async inference¶

📋 Planned — POST /predict/async/ is not yet implemented.

When implemented, the async path would return immediately (< 20 ms). Task completion time depends on:

Worker availability (how many celery-worker-ml pods are running)
Queue depth at submission time
Model inference time

Under normal conditions with 2 workers and low load, completion is expected to take less than 5 s. Under sustained load or during worker restarts, tasks would queue in RabbitMQ until a worker becomes available.

Worker cold start / lazy model load¶

The first inference request handled by a freshly started celery-worker-ml process triggers a model load from the MLflow Registry. This load takes a few seconds. All subsequent requests in that process reuse the loaded model — no per-request load.

Under HPA scale-up, new worker pods incur this one-time cold start cost before contributing to throughput.

Payload and feature lookup characteristics¶

Feature vectors are fixed-width flat dicts; payload size is small (< 2 KB).
Batch lookup (GET /predict/precomputed/{id}) reads from a pre-computed Parquet file; no inference occurs.
Prediction caching (Redis) eliminates repeated inference for the same input within the TTL window.
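For caching by input to work, identical feature vectors must map to the same key. The sketch below shows one way to derive a deterministic key from a flat feature dict, measure its encoded size, and expire entries after a TTL. The key prefix, the hashing scheme, and the `TTLCache` class (an in-memory stand-in for Redis) are assumptions, not the project's implementation.

```python
import hashlib
import json


def cache_key(features):
    """Deterministic key: equal feature dicts give equal keys regardless of key order."""
    canonical = json.dumps(features, sort_keys=True, separators=(",", ":"))
    return "pred:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def payload_bytes(features):
    """Size of the JSON-encoded feature vector in bytes."""
    return len(json.dumps(features).encode("utf-8"))


class TTLCache:
    """Tiny in-memory cache with per-entry expiry."""

    def __init__(self, ttl_s):
        self.ttl_s = ttl_s
        self._store = {}  # key -> (expires_at, value)

    def set(self, key, value, now):
        self._store[key] = (now + self.ttl_s, value)

    def get(self, key, now):
        entry = self._store.get(key)
        if entry is None or now >= entry[0]:
            self._store.pop(key, None)  # drop expired entries lazily
            return None
        return entry[1]
```

Passing `now` explicitly keeps the expiry logic easy to test; in a real deployment Redis handles expiry itself.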

How to read performance claims¶

When interpreting any latency figure in this project's documentation:

"Measured" means a load test or profiling run produced the number.
"Target" means an informal operational goal — not yet backed by continuous monitoring.
"Bound" means a hard limit enforced in code (e.g., 30 s timeout).

The values on this page are targets and bounds, not measured facts, unless explicitly stated otherwise.

Inference Modes — sync vs async path details
Status — SLO table with measurement status
Architecture: Runtime View