Key Decisions
Status: ✅ Implemented — full ADRs with context, alternatives, and consequences are in the ADR section.
This page summarises the five most consequential architectural decisions in SoccerPredictAI.
ADR-0001: Pipeline Orchestration
Decision: Use two orchestrators with explicitly separated responsibilities: Airflow for external scheduled ETL and DVC for offline ML reproducibility. A single orchestrator cannot serve both purposes well, because Airflow gives weak guarantees for experiment reproducibility and DVC has no native scheduler.
Key trade-off: Operators must understand two tools. The payoff is that DVC stages
are reproducible by design: for the same code and data, dvc repro yields the same
artifacts in any environment, provided the stage code itself is deterministic
(fixed seeds, pinned dependencies). Airflow DAGs invoke DVC only as a shell
command, keeping the boundary clean.
Rejected alternatives: Airflow-only (weak ML reproducibility), Kubeflow Pipelines (too much operational overhead for a single-node deployment).
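The "shell command only" boundary can be illustrated with a minimal sketch. It assumes the scheduler side needs nothing more than a way to build and run a dvc repro invocation; the helper names build_dvc_command and run_ml_pipeline and the stage name "train" are hypothetical and not taken from the project.

```python
import subprocess
from typing import Callable, List, Optional


def build_dvc_command(stage: Optional[str] = None) -> List[str]:
    """Build the shell command the scheduler hands to DVC (hypothetical helper)."""
    cmd = ["dvc", "repro"]
    if stage:
        # `dvc repro <target>` limits reproduction to one stage and its upstream stages.
        cmd.append(stage)
    return cmd


def run_ml_pipeline(stage: Optional[str] = None,
                    runner: Callable = subprocess.run) -> int:
    """Run the DVC pipeline as an opaque shell command.

    The caller knows nothing about stages, dependencies, or artifacts;
    a non-zero exit code raises CalledProcessError because check=True.
    """
    completed = runner(build_dvc_command(stage), check=True)
    return completed.returncode
```

An Airflow task would wrap a call like run_ml_pipeline("train") or the equivalent command string; injecting runner keeps the sketch testable without DVC installed.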
ADR-0002: Data Versioning Strategy
Decision: Version all datasets with DVC backed by MinIO (S3-compatible).
Every dvc repro run is tied to a specific dataset version via the DVC lock file (dvc.lock),
so experiments remain reproducible even after the raw data changes.
Key trade-off: DVC adds tooling complexity and requires a running MinIO instance.
The payoff is complete data lineage: after checking out any Git commit, dvc pull && dvc repro
reproduces the exact training dataset and model.
Rejected alternatives: Git LFS (no pipeline integration), pure S3 versioning (no experiment-level traceability), LakeFS (operational overhead).
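The idea behind tying a run to a dataset version through a lock file can be sketched with content hashes. This is an illustration of the principle only, not DVC's actual lock format or hashing scheme: the functions fingerprint and needs_rerun, the file names, and the use of SHA-256 over in-memory bytes are all assumptions made for the example.

```python
import hashlib
from typing import Dict


def fingerprint(deps: Dict[str, bytes]) -> Dict[str, str]:
    """Map each dependency name to a hash of its content (a stand-in for a lock file)."""
    return {name: hashlib.sha256(content).hexdigest() for name, content in deps.items()}


def needs_rerun(deps: Dict[str, bytes], lock: Dict[str, str]) -> bool:
    """A stage is stale when any dependency hash differs from the recorded lock."""
    return fingerprint(deps) != lock


# Hypothetical code and data contents, recorded once as the "lock".
code = b"def train(): ..."
data_v1 = b"match_id,home,away\n1,A,B\n"
lock = fingerprint({"train.py": code, "matches.csv": data_v1})
```

Because the lock is committed alongside the code, a Git commit pins both the code and the exact data content it was run against; a changed dataset yields a different hash and marks the stage as stale.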
ADR-0003: Model Registry & Promotion
Decision: Use MLflow Tracking + Model Registry as the single source of truth
for trained models. The serving layer loads models exclusively via stable aliases
(champion, challenger) — never hardcoded version numbers. Promotion requires
passing documented metric gates (log-loss, Brier score, ECE thresholds) and a manual
review checklist.
Key trade-off: Requires MLflow infrastructure and registry governance discipline. The payoff is auditable lineage, safe rollback (alias swap + pod restart), and complete decoupling between the training pipeline and the serving layer.
Rejected alternatives: Filesystem-based model storage (no traceability), direct CI artifact deployment (weak rollback semantics).
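A promotion gate of the kind described above might look like the following sketch. The threshold values, the metric keys, the registered model name, and the helpers failed_gates and model_uri are hypothetical; only the metric names (log-loss, Brier score, ECE) and the alias names come from this page. All three metrics are lower-is-better, so each gate is an upper bound. The models:/<name>@<alias> form is MLflow's URI syntax for alias-based loading.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MetricGates:
    """Upper bounds a candidate must stay under (values are illustrative)."""
    max_log_loss: float
    max_brier: float
    max_ece: float


def failed_gates(metrics: Dict[str, float], gates: MetricGates) -> List[str]:
    """Return the names of gates the candidate fails; an empty list means it passes.

    A missing metric counts as a failure rather than a silent pass.
    """
    limits = {
        "log_loss": gates.max_log_loss,
        "brier": gates.max_brier,
        "ece": gates.max_ece,
    }
    return [name for name, limit in limits.items()
            if metrics.get(name, float("inf")) > limit]


def model_uri(name: str, alias: str) -> str:
    """Build an alias-based model URI so no version number is hardcoded."""
    return f"models:/{name}@{alias}"
```

With this shape, rollback amounts to pointing the champion alias at an earlier version; the serving code keeps requesting the same URI.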
ADR-0004: Secrets Management
Decision: Encrypt all secrets at rest with SOPS + age and commit the encrypted files to git. Decryption happens only at deploy time via Kubernetes Secrets injection. CI variables are used only for the age private key, never for application secrets directly.
Key trade-off: Requires key management discipline and slightly more complex onboarding. The payoff is that secrets are versioned, auditable, and diff-able — the encrypted file changes when the secret changes, which is visible in git history without exposing the value.
Rejected alternatives: Plain Kubernetes Secrets (values are only base64-encoded, so they are effectively plaintext in manifests and, by default, in etcd), HashiCorp Vault (operational overhead), GitLab CI variables only (limited auditability).
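The deploy-time step can be sketched as turning already-decrypted values into a Kubernetes Secret manifest. The sketch assumes decryption has happened just beforehand (for example with sops --decrypt) and that the result is a flat key/value mapping; the function to_k8s_secret and the secret name, namespace, and keys are hypothetical. It also shows why a plain Secret manifest is not safe to commit: base64 is an encoding, not encryption.

```python
import base64
from typing import Dict


def to_k8s_secret(name: str, namespace: str, values: Dict[str, str]) -> dict:
    """Render decrypted key/value pairs as a Kubernetes Secret manifest (as a dict)."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        # The `data` field requires base64-encoded values; anyone can decode them.
        "data": {
            key: base64.b64encode(value.encode("utf-8")).decode("ascii")
            for key, value in values.items()
        },
    }


# Hypothetical decrypted content; in practice this never touches git in this form.
manifest = to_k8s_secret("app-secrets", "default", {"DB_PASSWORD": "s3cr3t"})
```

Only the SOPS-encrypted file lives in the repository; a manifest like this one exists transiently at deploy time.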
ADR-0005: Serving Modes (Sync vs Async)
Decision: Expose two inference modes from the same service: synchronous
(POST /predict/) for interactive low-latency requests and asynchronous
(POST /predict/async/) for heavier workloads. Both routes share the same
PredictionService and model loading path; the difference is only in how
the Celery task result is returned to the caller.
Key trade-off: Increased system complexity — operators must monitor both the HTTP layer and the Celery queue. The payoff is clear performance isolation: sync requests have an explicit 30 s SLO with a timeout; async requests get backpressure and retry semantics from RabbitMQ without blocking HTTP workers.
Rejected alternatives: Sync-only (no backpressure, single failure domain), async-only (poor UX for the interactive Streamlit dashboard).
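The difference between the two routes can be sketched without Celery or RabbitMQ. In the example below a thread pool stands in for the task queue, and the PredictionService body, the TaskQueue class, and the response shapes are hypothetical; only the shared service, the shared task path, and the 30 s synchronous timeout come from this page.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Dict, Optional

SYNC_TIMEOUT_S = 30.0  # the synchronous SLO stated above


class PredictionService:
    """Hypothetical stand-in for the shared prediction path."""

    def predict(self, features: dict) -> dict:
        # Placeholder output; a real service would call the loaded model.
        return {"home_win": 0.5, "draw": 0.25, "away_win": 0.25}


class TaskQueue:
    """In-process stand-in for the Celery/RabbitMQ queue (an assumption of this sketch)."""

    def __init__(self, service: PredictionService) -> None:
        self._service = service
        self._pool = ThreadPoolExecutor(max_workers=2)
        self._tasks: Dict[str, Future] = {}

    def submit(self, features: dict) -> str:
        task_id = str(uuid.uuid4())
        self._tasks[task_id] = self._pool.submit(self._service.predict, features)
        return task_id

    def result(self, task_id: str, timeout: Optional[float] = None) -> dict:
        return self._tasks[task_id].result(timeout=timeout)


def predict_sync(queue: TaskQueue, features: dict,
                 timeout: float = SYNC_TIMEOUT_S) -> dict:
    """Submit the task, then block until it finishes or the timeout expires."""
    task_id = queue.submit(features)
    try:
        return {"status": "done", "result": queue.result(task_id, timeout=timeout)}
    except FutureTimeout:
        return {"status": "timeout", "task_id": task_id}


def predict_async(queue: TaskQueue, features: dict) -> dict:
    """Submit the task and return immediately; the caller fetches the result later."""
    return {"status": "queued", "task_id": queue.submit(features)}
```

Both functions go through the same submit call, which mirrors the point that the routes differ only in how the task result reaches the caller.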
See ADR Overview for the complete decision log, including ADR-0006 (no HTTP batch endpoint).