Sync vs Async Inference Modes¶

Motivation¶

Different consumers have different requirements: - interactive users require low latency, - batch or heavy workloads tolerate higher latency.

Precomputed path (no Celery)¶

Served entirely from in-memory caches; no task queue involved.

Endpoints:

GET /predict/predictions/           # all precomputed predictions (display cols only)
GET /predict/precomputed/{match_id} # single match from batch_inference output
GET /predict/cards/                 # predictions merged with Fonbet 1X2 odds
GET /predict/odds/                  # Fonbet 1X2 odds only
GET /predict/region-roi/            # ROI by region from live-betting simulation

Characteristics: low latency, no worker dependency, served from in-memory parquet cache.

On-demand sync path (Celery dispatch)¶

Endpoints:

GET /predict/{match_id}  # run prediction for one match; features from match_features.parquet
GET /predict/model/info  # retrieve MLflow model metadata

Implementation:

FastAPI handler dispatches a Celery task to the ml queue.
Blocks up to _SYNC_TIMEOUT = 30s, polling task.state every 50 ms.
Returns result directly on success; 504 Gateway Timeout on timeout.
PredictionService is initialised once per worker process via worker_process_init signal, avoiding repeated MLflow model loads.
Use ?stage=candidate (or any loaded alias) to target a non-default model. Loaded aliases are configured via MLFLOW_MODEL_STAGES; defaults are champion, candidate, smoke.

Characteristics:

Strict 30 s SLO.
Features read server-side from match_features.parquet — no caller-supplied feature dict.
Immediate failure feedback (4xx/5xx).

When to use:

UI-driven on-demand predictions.
Real-time decision support.

Asynchronous inference¶

� Partially implemented — the supporting infrastructure is in place; the HTTP endpoint is not yet registered.

What exists: - Request/response schemas: AsyncPredictRequest / AsyncPredictResponse (src/app/schemas/predict.py) - Celery task: predict_match in src/app/tasks/predict.py (also used by the sync path) - 202-accepted response helper: _task_accepted() in src/app/routers/predict.py - Result polling: GET /monitoring/task_status/{task_id} (✅ operational)

What is missing: the @router.post("/async/") route binding in src/app/routers/predict.py.

Operational trade-offs¶

Aspect	Sync	Async
Latency	Low (≤30 s SLO)	Higher (queue wait)
Throughput	Limited	High
Complexity	Lower	Higher
Failure mode	Immediate (504)	Deferred
UX	Direct response	Poll `status_url`

Safety considerations¶

Async jobs are idempotent — same match_id re-submission is safe.
Failed tasks are retried up to max_retries=2 times with a 10 s delay before the error is propagated to the result backend.
Prometheus metric prediction_requests_total is incremented on every on-demand request dispatched to the Celery queue. The label is always source="async" — this reflects the label name used in the implementation for all Celery dispatches, regardless of whether the HTTP caller is waiting synchronously or not.
prediction_duration_seconds is defined in the metrics registry but is not yet instrumented in the router; it will always read zero.
On-demand prediction results are optionally cached in Redis. PredictionService connects to REDIS_CACHE_URL (defaults to CELERY_RESULT_BACKEND) on first use; if Redis is unreachable, caching is silently skipped. Cache key is predict:{match_id}:{run_id}, TTL is PREDICTION_CACHE_TTL seconds (default 3600). The cached: bool field in the response reflects whether the result was a cache hit.