Sync vs Async Inference Modes
Motivation
Different consumers have different requirements:
- interactive users require low latency,
- batch or heavy workloads tolerate higher latency.
Precomputed path (no Celery)
Served entirely from in-memory caches; no task queue involved.
Endpoints:
GET /predict/predictions/ # all precomputed predictions (display cols only)
GET /predict/precomputed/{match_id} # single match from batch_inference output
GET /predict/cards/ # predictions merged with Fonbet 1X2 odds
GET /predict/odds/ # Fonbet 1X2 odds only
GET /predict/region-roi/ # ROI by region from live-betting simulation
Characteristics: low latency, no worker dependency, served from in-memory parquet cache.
On-demand sync path (Celery dispatch)
Endpoints:
GET /predict/{match_id} # run prediction for one match; features from match_features.parquet
GET /predict/model/info # retrieve MLflow model metadata
Implementation:
- FastAPI handler dispatches a Celery task to the ml queue.
- Blocks up to _SYNC_TIMEOUT = 30s, polling task.state every 50 ms.
- Returns result directly on success; 504 Gateway Timeout on timeout.
- PredictionService is initialised once per worker process via the worker_process_init signal, avoiding repeated MLflow model loads.
- Use ?stage=candidate (or any loaded alias) to target a non-default model. Loaded aliases are configured via MLFLOW_MODEL_STAGES; defaults are champion, candidate, smoke.
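The dispatch-then-block behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the project's handler: `wait_for_task`, `SyncTimeout`, and `FakeTask` are hypothetical names, and the sketch deliberately avoids importing Celery by accepting any object that exposes `.state` and `.result` (as Celery's `AsyncResult` does).

```python
import time

SYNC_TIMEOUT_S = 30.0    # mirrors _SYNC_TIMEOUT from the text
POLL_INTERVAL_S = 0.05   # 50 ms polling interval from the text


class SyncTimeout(Exception):
    """Hypothetical marker the HTTP layer would translate into 504 Gateway Timeout."""


def wait_for_task(task, timeout=SYNC_TIMEOUT_S, poll_interval=POLL_INTERVAL_S):
    """Poll task.state until the task finishes or the deadline passes.

    `task` is any object exposing `.state` and `.result`; this sketch
    does not depend on Celery itself.
    """
    deadline = time.monotonic() + timeout
    while True:
        state = task.state
        if state == "SUCCESS":
            return task.result
        if state == "FAILURE":
            # For a failed Celery task, .result holds the raised exception.
            raise RuntimeError(f"task failed: {task.result!r}")
        if time.monotonic() >= deadline:
            raise SyncTimeout(f"no result within {timeout} s")
        time.sleep(poll_interval)


class FakeTask:
    """Stand-in for a dispatched task: reports PENDING a few times, then SUCCESS."""

    def __init__(self, polls_until_done, result):
        self._remaining = polls_until_done
        self.result = result

    @property
    def state(self):
        if self._remaining > 0:
            self._remaining -= 1
            return "PENDING"
        return "SUCCESS"
```

Using `time.monotonic()` for the deadline keeps the timeout immune to wall-clock adjustments; a handler built on this would catch `SyncTimeout` and return the 504 response.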
Characteristics:
- Strict 30 s SLO.
- Features read server-side from match_features.parquet; no caller-supplied feature dict.
- Immediate failure feedback (4xx/5xx).
When to use:
- UI-driven on-demand predictions.
- Real-time decision support.
Asynchronous inference
Partially implemented: the supporting infrastructure is in place; the HTTP endpoint is not yet registered.
What exists:
- Request/response schemas: AsyncPredictRequest / AsyncPredictResponse (src/app/schemas/predict.py)
- Celery task: predict_match in src/app/tasks/predict.py (also used by the sync path)
- 202-accepted response helper: _task_accepted() in src/app/routers/predict.py
- Result polling: GET /monitoring/task_status/{task_id} (✅ operational)
What is missing: the @router.post("/async/") route binding in src/app/routers/predict.py.
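For orientation, the accept-then-poll contract that the missing route would expose might look like the sketch below. It only builds the response payload and is not the project's `_task_accepted()` helper: the function name `task_accepted` and the body keys are assumptions, since the actual `AsyncPredictResponse` schema is not shown here. The polling path is the one listed above.

```python
from http import HTTPStatus

# Path of the operational polling endpoint, taken from the text above.
STATUS_PATH = "/monitoring/task_status/{task_id}"


def task_accepted(task_id: str) -> tuple[int, dict]:
    """Build a (status_code, body) pair for an accepted async prediction.

    The body keys are illustrative; the real AsyncPredictResponse
    schema may differ.
    """
    if not task_id:
        raise ValueError("task_id must be non-empty")
    body = {
        "task_id": task_id,
        "status_url": STATUS_PATH.format(task_id=task_id),
    }
    return int(HTTPStatus.ACCEPTED), body
```

A caller would receive the 202 response immediately and then poll `status_url` until the task reports a terminal state.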
Operational trade-offs
| Aspect | Sync | Async |
|---|---|---|
| Latency | Low (≤30 s SLO) | Higher (queue wait) |
| Throughput | Limited | High |
| Complexity | Lower | Higher |
| Failure mode | Immediate (504) | Deferred |
| UX | Direct response | Poll status_url |
Safety considerations
- Async jobs are idempotent: re-submitting the same match_id is safe.
- Failed tasks are retried up to max_retries=2 times with a 10 s delay before the error is propagated to the result backend.
- Prometheus metric prediction_requests_total is incremented on every on-demand request dispatched to the Celery queue. The label is always source="async"; this reflects the label name used in the implementation for all Celery dispatches, regardless of whether the HTTP caller is waiting synchronously or not.
- prediction_duration_seconds is defined in the metrics registry but is not yet instrumented in the router; it will always read zero.
- On-demand prediction results are optionally cached in Redis. PredictionService connects to REDIS_CACHE_URL (defaults to CELERY_RESULT_BACKEND) on first use; if Redis is unreachable, caching is silently skipped. Cache key is predict:{match_id}:{run_id}, TTL is PREDICTION_CACHE_TTL seconds (default 3600). The cached: bool field in the response reflects whether the result was a cache hit.
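The caching behaviour in the last item can be illustrated with a small sketch. It assumes a cache object with `get(key)` and `setex(key, ttl, value)` methods (the shape of the redis-py client) and JSON-serialisable results; `predict_with_cache`, `compute`, `DictCache`, and `BrokenCache` are hypothetical names, and the real `PredictionService` may structure this differently.

```python
import json

PREDICTION_CACHE_TTL = 3600  # seconds; default stated in the text


def cache_key(match_id, run_id):
    """Key format from the text: predict:{match_id}:{run_id}."""
    return f"predict:{match_id}:{run_id}"


def predict_with_cache(match_id, run_id, compute, cache=None, ttl=PREDICTION_CACHE_TTL):
    """Return a prediction dict with a `cached` flag.

    `compute` is a hypothetical callable that runs the model. Cache
    errors are swallowed so an unreachable Redis only disables caching.
    """
    key = cache_key(match_id, run_id)
    if cache is not None:
        try:
            hit = cache.get(key)
            if hit is not None:
                return {**json.loads(hit), "cached": True}
        except Exception:
            cache = None  # treat the cache as unavailable for this call
    result = compute(match_id)
    if cache is not None:
        try:
            cache.setex(key, ttl, json.dumps(result))
        except Exception:
            pass  # a failed write must not fail the prediction
    return {**result, "cached": False}


class DictCache:
    """In-memory stand-in for Redis; ignores TTL expiry."""

    def __init__(self):
        self.store = {}

    def get(self, key):
        return self.store.get(key)

    def setex(self, key, ttl, value):
        self.store[key] = value


class BrokenCache:
    """Simulates an unreachable Redis."""

    def get(self, key):
        raise ConnectionError("redis unreachable")

    def setex(self, key, ttl, value):
        raise ConnectionError("redis unreachable")
```

Including `run_id` in the key means a newly promoted model run never serves predictions cached by an older run; stale entries simply expire after the TTL.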