
Performance & SLOs

📋 Measurement status: SLO targets below are informal operational goals, not contractual SLOs backed by continuous measurement. Prometheus exports latency histograms; Grafana dashboards are not yet deployed. See Monitoring Status and Roadmap.

This page explains how to interpret serving performance figures, what is currently measured, and what informal SLO targets exist.


Measurement status

| SLO | Target | Measurement status |
|---|---|---|
| Sync GET /predict/{match_id} p50 latency | ≤ 50 ms | Not formally measured; from manual testing |
| Sync GET /predict/{match_id} p99 latency | ≤ 200 ms | Not formally measured; from manual testing |
| Service availability (30-day) | ≥ 99% | Not formally tracked |
| Async job completion p95 | ≤ 10 s (typical) | Not formally measured |
| Sync hard timeout | 30 s | Enforced in code; a bound, not a target |

These targets are informal operational goals, not contractual SLOs backed by continuous measurement infrastructure. Prometheus exports latency histograms; Grafana dashboards are not yet deployed. See Status.

A Locust-based load test baseline is available at tests/load/locustfile.py.
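
Because latency histograms are already exported, a percentile can be estimated from bucket counts by hand while dashboards are missing. The following is a minimal sketch, assuming cumulative buckets in the Prometheus style (upper bound in seconds, cumulative count). The function name `estimate_quantile` and the bucket values are illustrative assumptions, not measurements from this service.

```python
from typing import Sequence, Tuple

# Informal targets from the table above, in seconds.
P50_TARGET_S = 0.050
P99_TARGET_S = 0.200


def estimate_quantile(q: float, buckets: Sequence[Tuple[float, int]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (upper_bound_seconds, cumulative_count) pairs sorted by bound.
    Interpolates linearly inside the bucket that contains the target rank,
    which is the same general idea as PromQL's histogram_quantile().
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    total = buckets[-1][1]
    if total == 0:
        raise ValueError("histogram is empty")

    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            in_bucket = cumulative - lower_count
            if in_bucket == 0:
                return upper_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    return buckets[-1][0]


# Illustrative bucket counts only (assumption, not real data).
example_buckets = [(0.01, 400), (0.05, 900), (0.2, 995), (1.0, 1000)]
p50 = estimate_quantile(0.50, example_buckets)
p99 = estimate_quantile(0.99, example_buckets)
print(f"p50 ~ {p50 * 1000:.1f} ms, within target: {p50 <= P50_TARGET_S}")
print(f"p99 ~ {p99 * 1000:.1f} ms, within target: {p99 <= P99_TARGET_S}")
```

The estimate is only as fine-grained as the bucket boundaries, so a bucket edge close to each target (50 ms and 200 ms here) makes the comparison more meaningful.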


Sync inference latency

The sync path (GET /predict/{match_id}) involves:

  1. Pydantic request validation (< 1 ms)
  2. Redis cache lookup (< 5 ms when Redis is healthy)
  3. Cache hit: return immediately — sub-10 ms total
  4. Cache miss: Celery task enqueued → worker picks up → inference → result stored in Redis

On a cache miss, the dominant costs are:

  • Worker queue wait time (near-zero at low load; grows under backlog)
  • Model predict_proba() call (< 10 ms for current model size)
  • Feature assembly from pre-computed batch data (< 5 ms)

The 30 s hard timeout (_SYNC_TIMEOUT) is a ceiling, not a performance target. A timeout response (504 Gateway Timeout) indicates a worker backlog or worker failure, not normal operation.
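
The control flow described in this section can be sketched as follows. This is not the service's actual code: an in-memory dict stands in for Redis, a thread pool stands in for the Celery workers, and `predict_sync`, `_worker_task`, `_cache`, and `_workers` are hypothetical names. Only `_SYNC_TIMEOUT` and its 30 s value come from the text above.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Dict, Tuple

_SYNC_TIMEOUT = 30.0  # seconds; a hard bound, not a performance target

_cache: Dict[str, dict] = {}                   # stand-in for Redis (assumption)
_workers = ThreadPoolExecutor(max_workers=2)   # stand-in for Celery workers (assumption)


def _worker_task(match_id: str, infer: Callable[[str], dict]) -> dict:
    """What a worker would do: run inference, then store the result."""
    result = infer(match_id)
    _cache[match_id] = result
    return result


def predict_sync(
    match_id: str,
    infer: Callable[[str], dict],
    timeout: float = _SYNC_TIMEOUT,
) -> Tuple[int, dict]:
    """Return (http_status, body) for a sync prediction request."""
    # Cache hit: return immediately without touching the workers.
    cached = _cache.get(match_id)
    if cached is not None:
        return 200, cached

    # Cache miss: enqueue the task and block until it finishes or times out.
    future = _workers.submit(_worker_task, match_id, infer)
    try:
        return 200, future.result(timeout=timeout)
    except FutureTimeout:
        # Signals a worker backlog or failure, not normal operation.
        return 504, {"detail": "prediction timed out"}
```

Note that a timed-out task keeps running in this sketch and may still populate the cache later, so a retry shortly afterwards can turn into a cache hit.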


Async inference

📋 Planned: POST /predict/async/ is not yet implemented.

When implemented, the async path would return immediately (< 20 ms). Task completion time depends on:

  • Worker availability (how many celery-worker-ml pods are running)
  • Queue depth at submission time
  • Model inference time

Under normal conditions with 2 workers and low load, completion is expected to be < 5 s. Under sustained load or worker restarts, tasks would queue in RabbitMQ until a worker becomes available.
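
How the three factors above interact can be illustrated with a back-of-envelope estimate. The sketch below assumes a deliberately simple model: a FIFO queue, identical workers that are all idle at submission time, and a fixed per-task inference time. The function name and the example numbers are hypothetical, and real completion times would also include broker overhead and any cold start.

```python
def estimate_completion_s(queue_depth: int, workers: int, inference_s: float) -> float:
    """Rough completion time for a newly submitted task.

    queue_depth: tasks already waiting ahead of the new one.
    workers:     number of worker processes pulling from the queue.
    inference_s: assumed fixed time per task, in seconds.
    """
    if workers < 1:
        raise ValueError("at least one worker is required")
    if queue_depth < 0:
        raise ValueError("queue_depth cannot be negative")
    # Full "rounds" of work the workers must finish before ours starts.
    rounds_ahead = queue_depth // workers
    return (rounds_ahead + 1) * inference_s


# Illustrative values only: 2 workers, 0.5 s per task.
print(estimate_completion_s(queue_depth=0, workers=2, inference_s=0.5))
print(estimate_completion_s(queue_depth=10, workers=2, inference_s=0.5))
```

Under this model the wait grows linearly with queue depth and shrinks in proportion to the number of workers, which is why backlog rather than inference time tends to dominate under sustained load.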


Worker cold start / lazy model load

The first inference request handled by a freshly started celery-worker-ml process triggers a model load from the MLflow Registry. This load takes a few seconds. All subsequent requests in that process reuse the loaded model — no per-request load.

Under HPA scale-up, new worker pods incur this one-time cold start cost before contributing to throughput.
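
The load-once-per-process behaviour can be sketched with a cached accessor. This is an illustration under assumptions, not the worker's actual code: `_load_model_from_registry` is a hypothetical stand-in for the MLflow Registry load (simulated here with a short sleep), and `_FakeModel`, `get_model`, and `handle_task` are made-up names.

```python
import functools
import time

LOAD_COUNT = 0  # counts registry loads so the one-time cost is visible


class _FakeModel:
    """Placeholder model exposing the predict_proba() call named in the text."""

    def predict_proba(self, rows):
        return [[0.5, 0.5] for _ in rows]


def _load_model_from_registry() -> _FakeModel:
    """Hypothetical stand-in for the slow registry load."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    time.sleep(0.05)  # simulated load time; the real load takes a few seconds
    return _FakeModel()


@functools.lru_cache(maxsize=1)
def get_model() -> _FakeModel:
    """Load the model on first use, then reuse it for the process lifetime."""
    return _load_model_from_registry()


def handle_task(features: dict):
    """Per-request work: no model load after the first call."""
    return get_model().predict_proba([features])[0]


handle_task({"feature_a": 1.0})  # first call pays the cold start
handle_task({"feature_a": 2.0})  # later calls reuse the loaded model
print(f"registry loads: {LOAD_COUNT}")
```

If the cold start matters for scale-up, one option is to call the accessor once at worker startup so the first real request does not pay for the load.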


Payload and feature lookup characteristics

  • Feature vectors are fixed-width flat dicts; payload size is small (< 2 KB).
  • Batch-lookup (GET /predict/{match_id}) reads from a pre-computed parquet file; no inference occurs.
  • Prediction caching (Redis) eliminates repeated inference for the same input within the TTL window.
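
One way to get "same input within the TTL window" semantics is to derive the cache key from a canonical serialization of the feature dict. The sketch below uses an in-memory class in place of Redis. The key prefix, the `TTLCache` class, and the 60 s TTL are assumptions for illustration, since the actual key scheme and TTL are not stated here.

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict, Optional, Tuple


def cache_key(features: dict) -> str:
    """Deterministic key: equal feature dicts map to the same key."""
    canonical = json.dumps(features, sort_keys=True, separators=(",", ":"))
    return "pred:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class TTLCache:
    """Minimal in-memory stand-in for a Redis cache with per-key expiry."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (self._clock() + self._ttl_s, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: the next request triggers inference
            return None
        return value


cache = TTLCache(ttl_s=60.0)  # assumed TTL, for illustration only
features = {"feature_b": 2, "feature_a": 1}
key = cache_key(features)
if cache.get(key) is None:
    cache.set(key, {"proba": [0.5, 0.5]})  # placeholder for a real prediction
```

Sorting keys before hashing means the order in which features are assembled does not affect whether a request hits the cache.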

How to read performance claims

When interpreting any latency figure in this project's documentation:

  • "Measured" means a load test or profiling run produced the number.
  • "Target" means an informal operational goal — not yet backed by continuous monitoring.
  • "Bound" means a hard limit enforced in code (e.g., 30 s timeout).

The values on this page are targets and bounds, not measured facts, unless explicitly stated otherwise.