Runtime View
This page describes how the system behaves at runtime for its two core prediction paths: synchronous prediction and asynchronous prediction. It also covers cache interaction, model loading, and runtime invariants.
Runtime Invariants
These are architectural invariants — properties that must hold regardless of how implementation details evolve:
- The model serving any request is always a registered MLflow model with the `champion` alias. No local model file is ever loaded directly.
- For on-demand inference (`GET /predict/{match_id}`), features are read server-side from `match_features.parquet`; no caller-supplied feature dict is accepted.
- Feature assembly at inference time uses the same code path (`src/features/`) as the offline pipeline. There is no duplicate feature implementation for serving.
- Model promotion is the explicit, logged handoff point between the offline pipeline and the serving layer. Models enter serving only via the MLflow Registry.
Sync Prediction Path
Endpoint: GET /predict/{match_id}
Latency bound: Celery ml queue timeout (see Operational Targets)
sequenceDiagram
participant Client
participant FastAPI
participant RabbitMQ
participant WorkerML
Client->>FastAPI: GET /predict/{match_id}
FastAPI->>FastAPI: Validate X-API-Key header
FastAPI->>RabbitMQ: Enqueue inference task → ml queue
RabbitMQ->>WorkerML: Pick up inference task
WorkerML->>WorkerML: Assemble features (src/features/)
WorkerML->>WorkerML: Run inference (lazily loaded champion model)
WorkerML-->>FastAPI: Task result (30s timeout)
FastAPI-->>Client: 200 OK {probabilities, model_version}
Step-by-step:
1. Client sends GET /predict/{match_id} with X-API-Key header.
2. FastAPI validates the API key; returns 401 on failure.
3. FastAPI enqueues an inference task on the Celery ml queue via RabbitMQ.
4. celery-worker-ml picks up the task.
5. Worker assembles feature vectors using src/features/ (same logic as offline pipeline).
6. Worker runs inference using the lazily loaded champion model (loaded once per worker process on first request).
7. FastAPI receives the task result (sync wait, 30s timeout) and returns it to the client.
8. On timeout: FastAPI returns 504 Gateway Timeout.
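The control flow of steps 2 through 8 can be sketched with the standard library alone. This is a minimal illustration, not the service's actual handler: a `ThreadPoolExecutor` stands in for the RabbitMQ/Celery round-trip, and `handle_predict`, `VALID_API_KEYS`, and `run_inference` are hypothetical names introduced here. In a Celery-based handler, the blocking wait would presumably be a call such as `AsyncResult.get(timeout=30)`.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Hypothetical key store; the real validation mechanism is not described here.
VALID_API_KEYS = {"demo-key"}


def handle_predict(match_id, api_key, run_inference, executor, timeout_s=30.0):
    """Return (status_code, body) for one sync prediction request."""
    # Step 2: reject unauthenticated callers before any work is queued.
    if api_key not in VALID_API_KEYS:
        return 401, {"detail": "invalid API key"}

    # Steps 3-6: hand the task to a worker (stand-in for the ml queue).
    future = executor.submit(run_inference, match_id)

    # Steps 7-8: block for the result and map a timeout to 504.
    try:
        return 200, future.result(timeout=timeout_s)
    except FutureTimeout:
        return 504, {"detail": "inference timed out"}
```

Note that a timeout only ends the HTTP wait in this sketch; the worker task itself keeps running unless it is cancelled separately.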
Async Prediction Path
Status: Partial. The Celery task, Pydantic schemas, and the `GET /monitoring/task_status/{task_id}` polling endpoint are operational. The `POST /predict/async/` HTTP route is not yet registered in `src/app/routers/predict.py`. See Implementation Status and Roadmap.
The intended design:
- POST /predict/async/ → submit job, return task_id
- GET /monitoring/task_status/{task_id} → poll for result
This path follows the same Celery dispatch pattern as the sync path but returns 202 Accepted immediately and allows the client to poll for completion.
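The submit-then-poll contract can be sketched as follows. This is an illustration of the intended design only, under stated assumptions: a thread pool stands in for the Celery ml queue, the `AsyncJobs` class is hypothetical, and the response fields (`task_id`, `status`, `result`) are assumed shapes rather than the documented API contract. The status strings mirror Celery's built-in state names.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class AsyncJobs:
    """In-memory stand-in for submit + poll over a task queue."""

    def __init__(self, executor):
        self._executor = executor
        self._futures = {}  # task_id -> Future

    def submit(self, run_inference, match_id):
        # Enqueue and return immediately; the caller polls with task_id.
        task_id = str(uuid.uuid4())
        self._futures[task_id] = self._executor.submit(run_inference, match_id)
        return 202, {"task_id": task_id}

    def status(self, task_id):
        future = self._futures.get(task_id)
        if future is None:
            return 404, {"detail": "unknown task_id"}
        if not future.done():
            return 200, {"task_id": task_id, "status": "PENDING"}
        if future.exception() is not None:
            return 200, {"task_id": task_id, "status": "FAILURE"}
        return 200, {
            "task_id": task_id,
            "status": "SUCCESS",
            "result": future.result(),
        }
```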
Cache Interaction Rules
| Scenario | Behavior |
|---|---|
| Sync GET /predict/{match_id}, cache hit | PredictionService._cache_get() returns cached dict; cached=true in response; Celery task skipped |
| Sync GET /predict/{match_id}, cache miss | FastAPI dispatches predict_match task to Celery ml queue; result cached after completion |
| Precomputed GET /predict/precomputed/{match_id} | Read from in-memory predictions.parquet; no Celery, no Redis |
| Cache key | predict:{match_id}:{run_id}; auto-invalidates when model run_id changes |
| Cache TTL | PREDICTION_CACHE_TTL env var (default 3600 s) |
| Redis unavailable | Cache silently bypassed; inference proceeds normally via Celery |
| Model promotion to champion | Cache entries from the previous model auto-invalidate (different run_id in key). Entries under the old key are not actively flushed, but will never be found under the new key |
Cache key structure (planned extension): keyed on a hash of the input feature vector; cache key to include model version or invalidate on promotion event.
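The `predict:{match_id}:{run_id}` key format and the "Redis unavailable" bypass rule can be sketched as below. This is a minimal sketch under stated assumptions: `cache` is a hypothetical object exposing `get` and `set` (a stand-in for the Redis client), `compute` stands in for the Celery dispatch, and `cached_predict` is an illustrative name, not `PredictionService` itself.

```python
def cache_key(match_id, run_id):
    """Build the key; a new run_id yields a new key for the same match."""
    return f"predict:{match_id}:{run_id}"


def cached_predict(cache, match_id, run_id, compute):
    """Return a prediction dict with a `cached` flag, bypassing a broken cache."""
    key = cache_key(match_id, run_id)

    try:
        hit = cache.get(key)
    except Exception:
        hit = None  # cache unavailable: treat as a miss and carry on

    if hit is not None:
        return {**hit, "cached": True}

    result = compute(match_id)  # hypothetical stand-in for the Celery task

    try:
        cache.set(key, result)
    except Exception:
        pass  # a failed write must not fail the request

    return {**result, "cached": False}
```

Because the `run_id` is part of the key, entries written under a previous model are never read after promotion; they simply age out.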
Cache Strategy
Redis serves as an optimization layer — not a source of truth, and not an architectural invariant. The inference path must function correctly without it (see Degraded Modes).
Role of Redis in this system:
- Reduces redundant inference for repeated queries about the same match input.
- Stores assembled prediction results, not raw feature vectors.
- Operates on a TTL-based expiry: no entry persists indefinitely.
Cache key structure (conceptual):
Cache keys are derived from a deterministic hash of the input feature vector. This ensures that:
- Structurally identical inputs (same match context) produce the same key.
- Structurally different inputs (different match context or feature values) always produce different keys.
- The key is not tied to the model version today (a known limitation; see Consistency Model below).
Keys for async task results are keyed by task_id, not by input hash.
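One way to derive a deterministic key from a feature vector is to hash a canonical serialization of it. The sketch below assumes the features are a JSON-serializable dict; the function name `feature_cache_key` and the choice of SHA-256 are illustrative assumptions, not the project's implementation.

```python
import hashlib
import json


def feature_cache_key(features):
    """Hash a feature dict so that key order does not affect the result."""
    # sort_keys and fixed separators make the serialization canonical.
    canonical = json.dumps(features, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

With a cryptographic hash, distinct inputs map to distinct keys for all practical purposes, although collisions are not strictly impossible. Float features would need consistent rounding upstream for equal inputs to serialize identically.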
TTL-based expiration:
All cache entries expire automatically after a configured TTL. There is no manual expiration logic in the normal path. The TTL is a trade-off parameter:
- Too short → poor cache hit rate; redundant inference.
- Too long → stale predictions served after match conditions change or a new model is promoted.
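The expiry behavior can be illustrated with a small in-process TTL cache. Redis handles expiry itself (for example via the `EX` option of `SET`), so this `TTLCache` class is purely a hypothetical model of the semantics, with an injectable clock so the behavior can be checked without waiting.

```python
import time


class TTLCache:
    """Dict-backed cache whose entries expire ttl_s seconds after being set."""

    def __init__(self, ttl_s=3600.0, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def set(self, key, value):
        self._entries[key] = (self._clock() + self._ttl_s, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry and report a miss.
            del self._entries[key]
            return None
        return value
```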
Consistency Model
Redis provides eventual consistency with respect to model version. This is an accepted trade-off, not a defect.
What this means in practice:
- After a new model is promoted to `champion`, existing cache entries produced by the previous model remain valid until TTL expires.
- Stale predictions from the previous model will be served to clients during the TTL window after a promotion event.
- There is no strict synchronization between the MLflow Registry (model promotion) and the Redis cache (cached predictions).
Why this is acceptable:
- Predictions are probabilistic and advisory. A small window of stale results does not constitute a system failure.
- The TTL window is bounded. Stale entries are not permanent.
- The operational complexity of tight cache-model synchronization is not justified at current scale or SLA.
Current status and limitation:
Cache invalidation on model promotion is not implemented. It is documented as a known limitation in Known Architectural Limitations and as a planned improvement in Roadmap.
Model Loading Rules
- The model is loaded lazily: on first inference request in each worker process, not at startup.
- The `model_uri` is resolved from MLflow Registry using the `champion` alias.
- The same `PredictionService` singleton is reused for all subsequent requests in that process.
- Model artifact includes preprocessing steps packaged as an MLflow `pyfunc` wrapper.
Consequence: the first request to a freshly started worker is slower (model load time). Subsequent requests serve from the in-process loaded model.
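The load-once-per-process rule can be sketched as a lazy holder around a loader callable. The `LazyModel` class is a hypothetical illustration, not `PredictionService`; in the real worker the loader would presumably wrap something like `mlflow.pyfunc.load_model` with a `models:/<name>@champion` URI, where the registered model name is not given on this page.

```python
import threading


class LazyModel:
    """Load the model on first use and reuse it for every later call."""

    def __init__(self, loader):
        self._loader = loader  # zero-argument callable returning the model
        self._model = None
        self._lock = threading.Lock()

    def get(self):
        # Double-checked locking: only the first caller pays the load cost.
        if self._model is None:
            with self._lock:
                if self._model is None:
                    self._model = self._loader()
        return self._model
```

Nothing is loaded when the holder is constructed, which matches the "not at startup" rule; the cost moves to the first `get()` call in each process.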
Precomputed (Batch) Lookup Path
Endpoint: GET /predict/precomputed/{match_id}
This is a separate, lighter path that serves pre-computed predictions without triggering Celery:
- The `batch_inference` DVC stage assembles feature vectors for all upcoming matches and writes predictions to `data/predictions/predictions.parquet`.
- The FastAPI `GET /predict/precomputed/{match_id}` handler reads from this parquet file at request time.
- No live model inference or Celery dispatch is triggered; the response is a fast parquet lookup.
- The parquet file is refreshed by running the DVC pipeline (typically scheduled via Airflow).
Performance Notes
Concrete latency targets are documented in System Requirements — Operational Targets. This section describes the qualitative profile of each path.
| Path | Latency profile | Bottleneck |
|---|---|---|
| Precomputed lookup (GET /predict/precomputed/{match_id}) | Sub-second | Parquet file read; no model inference |
| Sync Celery dispatch (GET /predict/{match_id}) | Bounded by Celery ml queue timeout | Feature assembly + model inference + RabbitMQ round-trip |
| Async submission (📋 Planned) | Near-instant | FastAPI enqueue only; no blocking on result |
| Model load (first request per worker) | One-time cost on the first request, not at startup | MLflow artifact download from MinIO |
Degraded Modes
The system is designed to continue serving predictions when non-critical components are unavailable.
| Degraded component | Behavior | Severity |
|---|---|---|
| Redis unavailable | No impact today (cache not implemented at router level); future: cache bypassed, all requests fall through to Celery | P3: no current impact |
| MLflow unavailable (registry read) | Running workers continue to serve from already-loaded model; new worker processes fail to start | P2: no impact until worker restart |
| RabbitMQ unavailable | All live inference fails (GET /predict/{match_id}); precomputed lookups unaffected | P1: live inference unavailable |
| Celery worker-ml down | Sync requests time out; precomputed lookups unaffected | P1: live inference unavailable |
See Failure Modes for recovery procedures.
Current Limitations
| Limitation | Impact | Status |
|---|---|---|
| No Redis HA | Redis pod failure = cache miss on all requests; sync path still works (slower) | 🚧 Known |
| No streaming inference | Batch-only feature input | 📋 Planned |
| Cache not invalidated on model promotion | Stale results served until TTL | 📋 Planned fix |
| Single RabbitMQ broker | Queue unavailable = all inference fails | 🚧 Known |
Latency Trade-offs
The sync inference path (GET /predict/{match_id}) is bounded by the Celery ml queue timeout (p95 < 30 s,
per Operational Targets). This is intentionally generous. Several design decisions
contribute to this latency profile:
Why sync inference can take up to 30 s:
- Feature computation is expensive. Feature assembly at inference time reuses the same logic from `src/features/` as the offline pipeline. This means time-windowed statistical aggregations, Elo calculations, and other stateful computations run in the request path.
- Offline logic is reused by design, not by accident. The decision to share feature code between training and serving is an explicit architectural invariant (see Runtime Invariants). This eliminates training/serving skew at the cost of inference latency. The trade-off is accepted.
- No precompute layer exists yet. There is no online feature store or precomputed feature cache. Every cache miss triggers full feature assembly from scratch. This is a known limitation.
- Model load on first request. A freshly started worker must download and load the model artifact from MinIO/MLflow before the first inference completes. Subsequent requests are faster (in-process model).
- RabbitMQ round-trip. Every cache miss dispatches through RabbitMQ, adding message broker round-trip latency to the total.
The 30 s bound is a p95 operational target, not a hard SLA. It reflects the current implementation profile, not an architectural ceiling.
Why Sync Path Uses Celery
The synchronous GET /predict/{match_id} endpoint dispatches to a Celery worker and waits for the result,
rather than executing inference inline in the FastAPI process. This appears to add latency — and it does.
The trade-off is deliberate.
Reasons for this design:
- Uniform execution path. Sync and async inference both run in `celery-worker-ml`. There is a single inference implementation, not two divergent paths. This eliminates the risk of sync and async paths producing different results from the same input.
- Process isolation. FastAPI HTTP workers remain responsive under model load, feature assembly, or I/O delays. A slow inference job in a Celery worker cannot block the HTTP process.
- Reuse of async infrastructure. The same queue, the same workers, and the same feature/model code are used for both sync and async jobs. There is no separate serving stack to maintain.
- Predictable failure modes. Task timeout, queue unavailability, and worker crash all have well-defined outcomes and recovery paths. Inline inference in FastAPI would scatter these failure modes across the HTTP layer.
The trade-off: sync latency includes Celery dispatch overhead (RabbitMQ round-trip + task pickup). This is acceptable given the 30 s budget and the benefit of a single unified inference path.
Future Optimization Paths
These are potential improvements to the inference latency profile. They are listed here as architectural options, not commitments. None are currently implemented.
- Precomputed features (offline feature assembly): Run feature assembly as a batch job before match time; store feature vectors in a lookup store. Inference at request time becomes a vector lookup + model forward pass only.
- Lighter model for sync path: Replace or supplement the current model with a faster variant (e.g., smaller gradient boosting tree, logistic regression) for the latency-sensitive sync path. The current model would remain available for async batch scoring.
- Dedicated online feature store (optional): A purpose-built feature store (e.g., Feast, Redis-backed) could decouple feature freshness from request-time computation. Not warranted at current scale.
- Cache warming on model promotion: When a new `champion` model is promoted, pre-populate the Redis cache for known upcoming matches. This avoids the first-request cold-start penalty after a model switch.
- Worker pre-warming: Start inference workers and trigger a dummy prediction at deploy time to force model load before the first real request.
Related
- Container View — Celery worker types, Redis, RabbitMQ
- Component View — inference components and feature assembly
- Failure Modes — what happens when cache, queue, or model is unavailable
- Serving — API Contract — full endpoint specification