Architecture Tour
Status: ✅ Implemented — see Implementation Status for per-component detail.
System diagram
flowchart LR
A[WhoScored.com] -->|"Airflow ETL\n(Selenoid + Celery)"| B[(PostgreSQL)]
B -->|"DVC export\nstage"| C[MinIO / S3]
C -->|"dvc pull"| D[Versioned Datasets]
D -->|dvc repro| E[DVC ML Pipeline]
E -->|register_model| F[MLflow Registry\nchampion alias]
F -->|"model URI\nat startup"| G[FastAPI + Celery\nworker-ml]
G -->|"1×2 proba\n< 50 ms p95"| H[Streamlit UI\n/ API clients]
G -->|"/metrics"| I[Prometheus]
I --> J[Grafana\n15 panels]
D -->|"monitor_drift\nDVC stage"| K[Evidently\nDrift Report]
K -->|"drift_score\ngauge"| I
Component overview
1. Data ingestion (Airflow)
Five Airflow DAGs (etl_export_01, etl_livescores_01–04) run on a daily schedule.
Each DAG triggers browser automation via Selenoid, a Selenium-compatible hub that launches
browsers in Docker containers, and routes the scraping work through a Celery task chain so
retries and concurrency are handled outside Airflow's own executor.
Scraped match records (fixtures, results, odds) land in PostgreSQL as the canonical
source of truth. A DVC export stage dumps a point-in-time snapshot to MinIO,
creating an immutable, versioned parquet file that downstream pipeline stages depend on.
2. Storage layer (PostgreSQL + MinIO)
PostgreSQL holds all structured match data and is the only store that accepts live writes. MinIO provides S3-compatible object storage for the DVC remote: datasets, model binaries, and Evidently reports. Secrets (DB credentials, MinIO keys, API tokens) are encrypted with SOPS + age and committed to the repository; decryption happens only at deploy time via K8s Secrets (ADR-0004).
3. ML pipeline (DVC)
The full offline ML workflow is expressed as a DVC pipeline (dvc.yaml): export →
preprocess → split → features → tune → final_train → register_model → error_analysis →
monitor_drift. Each stage has explicit deps and outs, so dvc repro only re-runs
stages whose inputs have changed.
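The skip-if-unchanged behaviour described above can be illustrated with a small content-hash check. This is a simplified sketch of the idea only, not DVC's implementation: the helper names (`content_hash`, `run_if_changed`) and the JSON lock file are hypothetical, and it ignores outputs and parameters, which DVC also tracks.

```python
import hashlib
import json
from pathlib import Path


def content_hash(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def run_if_changed(deps, lock_file, command):
    """Run `command` only if a dependency hash differs from the recorded one.

    Returns True when the stage ran and False when it was skipped.
    """
    current = {str(dep): content_hash(dep) for dep in deps}
    lock = Path(lock_file)
    recorded = json.loads(lock.read_text()) if lock.exists() else None
    if current == recorded:
        return False  # inputs unchanged: skip the stage
    command()
    # Record the hashes this run was executed against (hypothetical lock format).
    lock.write_text(json.dumps(current, indent=2))
    return True
```

Calling `run_if_changed` twice with the same inputs runs the command once; editing any dependency makes the next call run it again.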
Key properties:
- Deterministic: every source of randomness takes an explicit seed from params.yaml.
- Configurable: experiment variants are Hydra overlays (conf/experiment/prod.yaml, test.yaml);
switching profiles requires no code change, only dvc exp run -S 'experiment=prod'.
- No data leakage: temporal train/validation/test split ensures no future data touches
any training stage.
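As an illustration of the temporal split in the last bullet, the sketch below partitions records by date cutoffs so that every validation row is later than every training row, and every test row is later still. The function name, the `match_date` field, and the cutoff parameters are assumptions made for this example; the project's split stage may be organised differently.

```python
from datetime import date


def temporal_split(rows, train_end, valid_end, key="match_date"):
    """Split rows chronologically into train / validation / test.

    train: date <= train_end
    valid: train_end < date <= valid_end
    test:  date > valid_end
    """
    ordered = sorted(rows, key=lambda row: row[key])
    train = [r for r in ordered if r[key] <= train_end]
    valid = [r for r in ordered if train_end < r[key] <= valid_end]
    test = [r for r in ordered if r[key] > valid_end]
    return train, valid, test


# Toy records, purely illustrative.
rows = [{"id": d, "match_date": date(2023, 1, d)} for d in (5, 1, 9, 3, 7)]
train, valid, test = temporal_split(rows, date(2023, 1, 3), date(2023, 1, 7))
```

Because the cutoffs are dates rather than row counts, no shuffling can move a later match into an earlier partition.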
4. Experiment tracking & registry (MLflow)
Every dvc repro training run opens an MLflow run and logs parameters, metrics
(log-loss, Brier, ECE, ROC-AUC, per-class precision/recall), calibration method choice,
feature importances, calibration curves, and a reference_features.parquet snapshot
used by drift monitoring.
Models are registered via mlflow.register_model and promoted through stable aliases
(champion, challenger) rather than deprecated stage strings — making rollback a
one-line alias swap. The serving layer always loads by alias, never by a hardcoded
version, so promotions take effect at the next pod restart without a new image build.
Promotion gates are documented in docs/ml/model-registry.md#promotion-policy.
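A minimal sketch of alias-based loading and promotion, assuming an MLflow version that supports registry aliases and a registered model called `soccer-1x2` (a hypothetical name; the real one is not given here). The helper functions are illustrative and are not the project's PredictionService.

```python
from functools import lru_cache

MODEL_NAME = "soccer-1x2"  # hypothetical registered-model name


def model_uri(name=MODEL_NAME, alias="champion"):
    """Build an alias-based registry URI; no version number is hardcoded."""
    return f"models:/{name}@{alias}"


@lru_cache(maxsize=1)
def load_champion():
    """Load the champion model once per process and reuse it afterwards."""
    import mlflow.pyfunc  # imported lazily so model_uri() works without MLflow

    return mlflow.pyfunc.load_model(model_uri())


def promote(version, name=MODEL_NAME, alias="champion"):
    """Point the alias at `version`; rollback is the same call with an older version."""
    from mlflow import MlflowClient

    MlflowClient().set_registered_model_alias(name, alias, version)
```

Since `load_champion` caches its result for the life of the process, a newly promoted version is picked up only after a restart, which matches the behaviour described above.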
5. Serving (FastAPI + Celery + RabbitMQ)
The API layer exposes two inference modes and a batch-result lookup (ADR-0005):
| Mode | Endpoint | Latency target | Mechanism |
|---|---|---|---|
| Sync | POST /predict/ | p95 < 500 ms | FastAPI → Celery task → 30 s timeout |
| Async | POST /predict/async/ | best-effort | Celery job; poll GET /monitoring/task_status/{id} |
| Lookup | GET /predict/{match_id} | p95 < 50 ms | Direct parquet lookup from batch_inference |
The model is lazy-loaded once per worker process via PredictionService, so after that
first load no model I/O happens on the hot path. All /predict/* endpoints require an
X-API-Key header, verified with hmac.compare_digest to prevent timing attacks.
Kubernetes liveness (/healthcheck/) and Prometheus scrape (/metrics) endpoints are
exempt from authentication.
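The API-key check and its exemptions might look roughly like the sketch below. `hmac.compare_digest` is the standard-library constant-time comparison; the `is_authorized` function, the plain-dict header lookup, and the choice to require a key on every non-exempt path are assumptions made for illustration.

```python
import hmac

# Paths named above as exempt from authentication.
EXEMPT_PATHS = {"/healthcheck/", "/metrics"}


def is_authorized(path, headers, expected_key):
    """Return True if the request may proceed.

    Assumption: every path outside EXEMPT_PATHS requires the key.
    """
    if path in EXEMPT_PATHS:
        return True
    supplied = headers.get("X-API-Key", "")
    # compare_digest takes time independent of where the inputs differ,
    # so an attacker cannot recover the key byte by byte from response timing.
    return hmac.compare_digest(
        supplied.encode("utf-8"), expected_key.encode("utf-8")
    )
```

A plain `==` comparison can return as soon as the first mismatching byte is found, which is the timing side channel this check avoids.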
6. Infrastructure (Kubernetes + Helm)
All services run on a single-node self-hosted Kubernetes cluster managed through
Helm charts with parameterised values files. The ingress layer (nginx-ingress) enforces
rate-limiting (configurable rps/burst/connections via ingress.rateLimit).
Multi-stage Docker builds produce separate images for the API and the ML worker,
keeping the serving image free of training dependencies.
GitLab CI builds, tests, and deploys on every push to develop; secrets are injected
from SOPS-decrypted files, never from plaintext CI variables.
7. Observability (Prometheus, Grafana, Evidently)
Prometheus scrapes 8 custom metrics from FastAPI and the Celery worker, covering HTTP
traffic (RPS, latency, error rate), prediction lifecycle (sync/async counts, timeouts,
inference latency, confidence distribution), model metadata (loaded version, registration
timestamp), and drift score. Five alerting rule files (monitoring/prometheus/rules/)
cover API error rate, latency, Celery backlog, feature drift, and model staleness.
Grafana (docker/grafana/provisioning/dashboards/soccer.json) ships with 15 panels
organised into HTTP Traffic, Celery, Model, and ML Quality rows. The ML Quality row
includes a Feature Drift Score gauge and a Model Age stat that turn red at the thresholds
defined in the alerting rules.
Evidently drift detection runs as a DVC stage (monitor_drift) and a daily Airflow
DAG. It compares recent production feature distributions against a reference snapshot
saved during training, writes a scored JSON report and an HTML report, and exposes
the drift score as a Prometheus gauge via GET /monitoring/drift.
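To make the last step concrete, the sketch below turns a scored JSON report into a gauge in the Prometheus text exposition format. The report layout (a top-level `drift_score` key) and the function name are assumptions; the actual Evidently report and the project's exporter may be shaped differently.

```python
import json


def render_drift_gauge(report_json, metric="drift_score"):
    """Render the drift score from a JSON report as a Prometheus gauge."""
    report = json.loads(report_json)
    score = float(report["drift_score"])  # hypothetical report key
    return (
        f"# HELP {metric} Feature drift score from the latest drift report\n"
        f"# TYPE {metric} gauge\n"
        f"{metric} {score}\n"
    )


exposition = render_drift_gauge('{"drift_score": 0.25}')
```

The final line of `exposition` is the sample that Prometheus stores; the `# HELP` and `# TYPE` lines are metadata describing the gauge.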
See Architecture Overview for the full C4-level documentation.