Runtime View

This page describes how the system behaves at runtime for its two core prediction paths: synchronous prediction and asynchronous prediction. It also covers the precomputed lookup path, cache interaction, model loading, degraded modes, and runtime invariants.

Runtime Invariants

These are architectural invariants — properties that must hold regardless of how implementation details evolve:

The model serving any request is always a registered MLflow model with the champion alias. No local model file is ever loaded directly.
For on-demand inference (GET /predict/{match_id}), features are read server-side from match_features.parquet; no caller-supplied feature dict is accepted.
Feature assembly at inference time uses the same code path (src/features/) as the offline pipeline. There is no duplicate feature implementation for serving.
Model promotion is the explicit, logged handoff point between the offline pipeline and the serving layer. Models enter serving only via the MLflow Registry.

Sync Prediction Path

Endpoint: GET /predict/{match_id}

Latency bound: Celery ml queue timeout (see Operational Targets)

sequenceDiagram
    participant Client
    participant FastAPI
    participant RabbitMQ
    participant WorkerML

    Client->>FastAPI: GET /predict/{match_id}
    FastAPI->>FastAPI: Validate X-API-Key header
    FastAPI->>RabbitMQ: Enqueue inference task → ml queue
    RabbitMQ->>WorkerML: Pick up inference task
    WorkerML->>WorkerML: Assemble features (src/features/)
    WorkerML->>WorkerML: Run inference (lazily loaded champion model)
    WorkerML-->>FastAPI: Task result (30s timeout)
    FastAPI-->>Client: 200 OK {probabilities, model_version}

Step-by-step:

1. Client sends GET /predict/{match_id} with X-API-Key header.
2. FastAPI validates the API key; returns 401 on failure.
3. FastAPI enqueues an inference task on the Celery ml queue via RabbitMQ.
4. celery-worker-ml picks up the task.
5. Worker assembles feature vectors using src/features/ (same logic as offline pipeline).
6. Worker runs inference using the lazily loaded champion model (loaded once per worker process on first request).
7. FastAPI receives the task result (sync wait, 30s timeout) and returns it to the client.
8. On timeout: FastAPI returns 504 Gateway Timeout.
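The dispatch-and-wait behavior in steps 3 to 8 can be sketched with the standard library alone. This is a minimal illustration, not the project's code: a thread pool stands in for RabbitMQ and celery-worker-ml, and `run_inference`, `predict_sync`, and the response fields are hypothetical names. With Celery itself, the blocking wait would typically be `AsyncResult.get(timeout=30)`, which raises `celery.exceptions.TimeoutError` when the deadline passes.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Stand-in for the Celery ml queue (assumption: a thread pool replaces
# RabbitMQ + celery-worker-ml for the purpose of this sketch).
_executor = ThreadPoolExecutor(max_workers=2)


def run_inference(match_id: str, delay_s: float = 0.0) -> dict:
    """Hypothetical worker task: pretend to assemble features and score them."""
    time.sleep(delay_s)  # simulates feature assembly + model inference time
    return {
        "match_id": match_id,
        "probabilities": {"home": 0.5, "draw": 0.25, "away": 0.25},  # dummy values
        "model_version": "example",
    }


def predict_sync(match_id: str, timeout_s: float = 30.0, delay_s: float = 0.0):
    """Dispatch the task, block for the result, and map a timeout to HTTP 504."""
    future = _executor.submit(run_inference, match_id, delay_s)
    try:
        return 200, future.result(timeout=timeout_s)
    except FutureTimeout:
        return 504, {"detail": "Gateway Timeout"}
```

Calling `predict_sync("m1")` returns a `(200, {...})` pair, while a task that outlives `timeout_s` yields `(504, {...})`. As with a Celery task, the timed-out work is not cancelled here; the caller simply stops waiting for it.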

Async Prediction Path

Status: Partial. The Celery task, Pydantic schemas, and the GET /monitoring/task_status/{task_id} polling endpoint are operational. The POST /predict/async/ HTTP route is not yet registered in src/app/routers/predict.py. See Implementation Status and Roadmap.

The intended design:

- POST /predict/async/ → submit job, return task_id
- GET /monitoring/task_status/{task_id} → poll for result

This path follows the same Celery dispatch pattern as the sync path but returns 202 Accepted immediately and allows the client to poll for completion.

Cache Interaction Rules

Scenario	Behavior
Sync `GET /predict/{match_id}` — cache hit	`PredictionService._cache_get()` returns cached dict; `cached=true` in response; Celery task skipped
Sync `GET /predict/{match_id}` — cache miss	FastAPI dispatches `predict_match` task to Celery `ml` queue; result cached after completion
Precomputed `GET /predict/precomputed/{match_id}`	Read from in-memory `predictions.parquet`; no Celery, no Redis
Cache key	`predict:{match_id}:{run_id}` — auto-invalidates when model `run_id` changes
Cache TTL	`PREDICTION_CACHE_TTL` env var (default 3600 s)
Redis unavailable	Cache silently bypassed; inference proceeds normally via Celery
Model promotion to `champion`	Cache entries from previous model auto-invalidate (different `run_id` in key) — entries with old key are not actively flushed, but will never be found under the new key
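The key scheme and the Redis-unavailable rule in the table above can be sketched as follows. This is a minimal illustration under stated assumptions: `DictCache` and `UnavailableCache` are hypothetical stand-ins for a Redis client, `predict_with_cache` and `infer` are illustrative names, and the real `PredictionService` may be structured differently.

```python
from typing import Callable


def cache_key(match_id: str, run_id: str) -> str:
    """Key format from the table: a new run_id yields a new key, so old entries are never hit."""
    return f"predict:{match_id}:{run_id}"


class DictCache:
    """Hypothetical in-memory stand-in for a reachable Redis."""

    def __init__(self) -> None:
        self.data: dict = {}

    def get(self, key: str):
        return self.data.get(key)

    def set(self, key: str, value: dict) -> None:
        self.data[key] = value


class UnavailableCache:
    """Hypothetical stand-in for an unreachable Redis: every call raises."""

    def get(self, key: str):
        raise ConnectionError("redis unavailable")

    def set(self, key: str, value: dict) -> None:
        raise ConnectionError("redis unavailable")


def predict_with_cache(cache, match_id: str, run_id: str, infer: Callable[[str], dict]) -> dict:
    """Read-through lookup; cache errors are swallowed so inference still runs."""
    key = cache_key(match_id, run_id)
    try:
        hit = cache.get(key)
    except ConnectionError:
        hit = None  # cache silently bypassed
    if hit is not None:
        return {**hit, "cached": True}
    result = infer(match_id)  # in the real system: dispatch to the Celery ml queue
    try:
        cache.set(key, result)
    except ConnectionError:
        pass  # failing to store must not fail the request
    return {**result, "cached": False}
```

Because `run_id` is part of the key, promoting a new model changes every key that will be looked up; entries written under the previous `run_id` simply age out through their TTL.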

Cache key structure (planned extension): keyed on a hash of the input feature vector; cache key to include model version or invalidate on promotion event.

Cache Strategy

Redis serves as an optimization layer — not a source of truth, and not an architectural invariant. The inference path must function correctly without it (see Degraded Modes).

Role of Redis in this system:

Reduces redundant inference for repeated queries about the same match input.
Stores assembled prediction results, not raw feature vectors.
Operates on a TTL-based expiry: no entry persists indefinitely.

Cache key structure (conceptual):

Cache keys are derived from a deterministic hash of the input feature vector. This ensures that:

- Structurally identical inputs (same match context) produce the same key.
- Structurally different inputs (different match context or feature values) always produce different keys.
- The key is not tied to the model version today (a known limitation; see Consistency Model below).

Async task results are keyed by task_id, not by input hash.
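One way to derive a deterministic key from a feature vector is to hash a canonical serialization of it. The sketch below assumes the features fit in a JSON-serializable dict; the function name, the `predict:features:` prefix, and the choice of SHA-256 are illustrative assumptions, not the project's actual scheme.

```python
import hashlib
import json


def feature_hash_key(features: dict) -> str:
    """Hash a canonical JSON form so that dict ordering does not affect the key."""
    canonical = json.dumps(features, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"predict:features:{digest}"  # hypothetical key prefix
```

Sorting keys before hashing is what makes structurally identical inputs map to the same key. Different inputs map to different keys up to hash collisions, which are negligible in practice for SHA-256.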

TTL-based expiration:

All cache entries expire automatically after a configured TTL. There is no manual expiration logic in the normal path. The TTL is a trade-off parameter:

- Too short → poor cache hit rate; redundant inference.
- Too long → stale predictions served after match conditions change or a new model is promoted.
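TTL expiry can be illustrated with a small in-process cache. Redis enforces expiry server-side (for example via `SET key value EX seconds`), so the class below is only a sketch of the semantics; `TTLCache` and its injectable `clock` are hypothetical and exist to make the behavior easy to observe.

```python
import time
from typing import Any, Callable, Optional


class TTLCache:
    """Minimal TTL cache: each entry stores its own expiry timestamp."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s
        self._clock = clock  # injectable so expiry can be simulated without sleeping
        self._entries: dict[str, tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl_s, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired entries are dropped lazily on read
            return None
        return value
```

With `ttl_s=3600` (the documented default of `PREDICTION_CACHE_TTL`), an entry written at time 0 is still served at 3599 s and is gone at 3600 s.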

Consistency Model

Redis provides eventual consistency with respect to model version. This is an accepted trade-off, not a defect.

What this means in practice:

After a new model is promoted to champion, existing cache entries produced by the previous model remain valid until TTL expires.
Stale predictions from the previous model will be served to clients during the TTL window after a promotion event.
There is no strict synchronization between the MLflow Registry (model promotion) and the Redis cache (cached predictions).

Why this is acceptable:

Predictions are probabilistic and advisory. A small window of stale results does not constitute a system failure.
The TTL window is bounded. Stale entries are not permanent.
The operational complexity of tight cache-model synchronization is not justified at current scale or SLA.

Current status and limitation:

Cache invalidation on model promotion is not implemented. It is documented as a known limitation in Known Architectural Limitations and as a planned improvement in Roadmap.

Model Loading Rules

The model is loaded lazily: on first inference request in each worker process, not at startup.
The model_uri is resolved from MLflow Registry using the champion alias.
The same PredictionService singleton is reused for all subsequent requests in that process.
Model artifact includes preprocessing steps packaged as an MLflow pyfunc wrapper.

Consequence: the first request to a freshly started worker is slower (model load time). Subsequent requests serve from the in-process loaded model.
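The lazy, once-per-process loading rule can be sketched as below. `LazyModel` is a hypothetical helper, not the project's `PredictionService`. In a real worker the loader might be something like `lambda: mlflow.pyfunc.load_model("models:/<model-name>@champion")`, where the model name is a placeholder that the source does not specify.

```python
from typing import Any, Callable, Optional


class LazyModel:
    """Loads the model on first use and reuses it for the life of the process."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader  # assumption: resolves the champion alias and loads the artifact
        self._model: Optional[Any] = None
        self.load_count = 0  # exposed only to make the behavior observable

    def get(self) -> Any:
        if self._model is None:
            # First request in this process pays the load cost.
            self._model = self._loader()
            self.load_count += 1
        return self._model
```

The sketch assumes one task at a time per worker process; a threaded worker would need a lock around the first load. Because the alias is resolved only when the loader runs, a process that has already loaded a model keeps serving it until it restarts.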

Precomputed (Batch) Lookup Path

Endpoint: GET /predict/precomputed/{match_id}

This is a separate, lighter path that serves pre-computed predictions without triggering Celery:

The batch_inference DVC stage assembles feature vectors for all upcoming matches and writes predictions to data/predictions/predictions.parquet.
The FastAPI GET /predict/precomputed/{match_id} handler reads from this parquet file at request time.
No live model inference or Celery dispatch is triggered; the response is a fast parquet lookup.
The parquet file is refreshed by running the DVC pipeline (typically scheduled via Airflow).

Performance Notes

Concrete latency targets are documented in System Requirements — Operational Targets. This section describes the qualitative profile of each path.

Path	Latency profile	Bottleneck
Precomputed lookup (`GET /predict/precomputed/{match_id}`)	Sub-second	Parquet file read; no model inference
Sync Celery dispatch (`GET /predict/{match_id}`)	Bounded by Celery `ml` queue timeout	Feature assembly + model inference + RabbitMQ round-trip
Async submission (📋 Planned)	Near-instant	FastAPI enqueue only; no blocking on result
Model load (first request per worker)	One-time cost on the first request in each worker process	MLflow artifact download from MinIO

Degraded Modes

The system is designed to continue serving predictions when non-critical components are unavailable.

Degraded component	Behavior	Severity
Redis unavailable	No impact today (cache not implemented at router level); future: cache bypassed, all requests fall through to Celery	P3 — no current impact
MLflow unavailable (registry read)	Running workers continue to serve from already-loaded model; new worker processes fail to start	P2 — no impact until worker restart
RabbitMQ unavailable	All live inference fails (`GET /predict/{match_id}`); precomputed lookups unaffected	P1 — live inference unavailable
Celery worker-ml down	Sync requests time out; precomputed lookups unaffected	P1 — live inference unavailable

See Failure Modes for recovery procedures.

Current Limitations

Limitation	Impact	Status
No Redis HA	Redis pod failure = cache miss on all requests; sync path still works (slower)	🚧 Known
No streaming inference	Batch-only feature input	📋 Planned
Cache not invalidated on model promotion	Stale results served until TTL	📋 Planned fix
Single RabbitMQ broker	Queue unavailable = all inference fails	🚧 Known

Latency Trade-offs

The sync inference path (GET /predict/{match_id}) is bounded by the Celery ml queue timeout (p95 < 30 s, per Operational Targets). This is intentionally generous. Several design decisions contribute to this latency profile:

Why sync inference can take up to 30 s:

Feature computation is expensive. Feature assembly at inference time reuses the same logic from src/features/ as the offline pipeline. This means time-windowed statistical aggregations, ELO calculations, and other stateful computations run in the request path.
Offline logic is reused by design, not by accident. The decision to share feature code between training and serving is an explicit architectural invariant (see Runtime Invariants). This eliminates training/serving skew at the cost of inference latency. The trade-off is accepted.
No precompute layer exists yet. There is no online feature store or precomputed feature cache. Every cache miss triggers full feature assembly from scratch. This is a known limitation.
Model load on first request. A freshly started worker must download and load the model artifact from MinIO/MLflow before the first inference completes. Subsequent requests are faster (in-process model).
RabbitMQ round-trip. Every cache miss dispatches through RabbitMQ, adding message broker round-trip latency to the total.

The 30 s bound is a p95 operational target, not a hard SLA. It reflects the current implementation profile, not an architectural ceiling.

Why Sync Path Uses Celery

The synchronous GET /predict/{match_id} endpoint dispatches to a Celery worker and waits for the result, rather than executing inference inline in the FastAPI process. This adds latency, and the trade-off is deliberate.

Reasons for this design:

Uniform execution path. Sync and async inference both run in celery-worker-ml. There is a single inference implementation, not two divergent paths. This eliminates the risk of sync and async paths producing different results from the same input.
Process isolation. FastAPI HTTP workers remain responsive under model load, feature assembly, or I/O delays. A slow inference job in a Celery worker cannot block the HTTP process.
Reuse of async infrastructure. The same queue, the same workers, and the same feature/model code are used for both sync and async jobs. There is no separate serving stack to maintain.
Predictable failure modes. Task timeout, queue unavailability, and worker crash all have well-defined outcomes and recovery paths. Inline inference in FastAPI would scatter these failure modes across the HTTP layer.

The trade-off: sync latency includes Celery dispatch overhead (RabbitMQ round-trip + task pickup). This is acceptable given the 30 s budget and the benefit of a single unified inference path.

Future Optimization Paths

These are potential improvements to the inference latency profile. They are listed here as architectural options, not commitments. None are currently implemented.

Precomputed features (offline feature assembly): Run feature assembly as a batch job before match time; store feature vectors in a lookup store. Inference at request time becomes a vector lookup + model forward pass only.
Lighter model for sync path: Replace or supplement the current model with a faster variant (e.g., smaller gradient boosting tree, logistic regression) for the latency-sensitive sync path. The current model would remain available for async batch scoring.
Dedicated online feature store (optional): A purpose-built feature store (e.g., Feast, Redis-backed) could decouple feature freshness from request-time computation. Not warranted at current scale.
Cache warming on model promotion: When a new champion model is promoted, pre-populate the Redis cache for known upcoming matches. This avoids the first-request cold-start penalty after a model switch.
Worker pre-warming: Start inference workers and trigger a dummy prediction at deploy time to force model load before the first real request.

Container View — Celery worker types, Redis, RabbitMQ
Component View — inference components and feature assembly
Failure Modes — what happens when cache, queue, or model is unavailable
Serving — API Contract — full endpoint specification