Skip to content

Sync vs Async Inference Modes


Motivation

Different consumers have different requirements: - interactive users require low latency, - batch or heavy workloads tolerate higher latency.


Precomputed path (no Celery)

Served entirely from in-memory caches; no task queue involved.

Endpoints:

GET /predict/predictions/           # all precomputed predictions (display cols only)
GET /predict/precomputed/{match_id} # single match from batch_inference output
GET /predict/cards/                 # predictions merged with Fonbet 1X2 odds
GET /predict/odds/                  # Fonbet 1X2 odds only
GET /predict/region-roi/            # ROI by region from live-betting simulation

Characteristics: low latency, no worker dependency, served from in-memory parquet cache.


On-demand sync path (Celery dispatch)

Endpoints:

GET /predict/{match_id}  # run prediction for one match; features from match_features.parquet
GET /predict/model/info  # retrieve MLflow model metadata

Implementation:

  • FastAPI handler dispatches a Celery task to the ml queue.
  • Blocks up to _SYNC_TIMEOUT = 30s, polling task.state every 50 ms.
  • Returns result directly on success; 504 Gateway Timeout on timeout.
  • PredictionService is initialised once per worker process via worker_process_init signal, avoiding repeated MLflow model loads.
  • Use ?stage=candidate (or any loaded alias) to target a non-default model. Loaded aliases are configured via MLFLOW_MODEL_STAGES; defaults are champion, candidate, smoke.

Characteristics:

  • Strict 30 s SLO.
  • Features read server-side from match_features.parquet — no caller-supplied feature dict.
  • Immediate failure feedback (4xx/5xx).

When to use:

  • UI-driven on-demand predictions.
  • Real-time decision support.

Asynchronous inference

Partially implemented — the supporting infrastructure is in place; the HTTP endpoint is not yet registered.

What exists: - Request/response schemas: AsyncPredictRequest / AsyncPredictResponse (src/app/schemas/predict.py) - Celery task: predict_match in src/app/tasks/predict.py (also used by the sync path) - 202-accepted response helper: _task_accepted() in src/app/routers/predict.py - Result polling: GET /monitoring/task_status/{task_id} (✅ operational)

What is missing: the @router.post("/async/") route binding in src/app/routers/predict.py.


Operational trade-offs

Aspect Sync Async
Latency Low (≤30 s SLO) Higher (queue wait)
Throughput Limited High
Complexity Lower Higher
Failure mode Immediate (504) Deferred
UX Direct response Poll status_url

Safety considerations

  • Async jobs are idempotent — same match_id re-submission is safe.
  • Failed tasks are retried up to max_retries=2 times with a 10 s delay before the error is propagated to the result backend.
  • Prometheus metric prediction_requests_total is incremented on every on-demand request dispatched to the Celery queue. The label is always source="async" — this reflects the label name used in the implementation for all Celery dispatches, regardless of whether the HTTP caller is waiting synchronously or not.
  • prediction_duration_seconds is defined in the metrics registry but is not yet instrumented in the router; it will always read zero.
  • On-demand prediction results are optionally cached in Redis. PredictionService connects to REDIS_CACHE_URL (defaults to CELERY_RESULT_BACKEND) on first use; if Redis is unreachable, caching is silently skipped. Cache key is predict:{match_id}:{run_id}, TTL is PREDICTION_CACHE_TTL seconds (default 3600). The cached: bool field in the response reflects whether the result was a cache hit.