Architecture Tour

Status: ✅ Implemented — see Implementation Status for per-component detail.

System diagram

flowchart TB
    A[WhoScored.com] -->|"Airflow ETL\n(Selenoid + Celery)"| B[(PostgreSQL)]
    B -->|"DVC export\nstage"| C[MinIO / S3]
    C -->|"dvc pull"| D[Versioned Datasets]
    D -->|dvc repro| E[DVC ML Pipeline]
    E -->|register_model| F[MLflow Registry\nchampion alias]
    F -->|"model URI\nat startup"| G[FastAPI + Celery\nworker-ml]
    G -->|"1×2 proba\n< 50 ms p95"| H[Streamlit UI\n/ API clients]
    G -->|"/metrics"| I[Prometheus]
    I --> J[Grafana\n15 panels]
    D -->|"monitor_drift\nDVC stage"| K[Evidently\nDrift Report]
    K -->|"drift_score\ngauge"| I

Component overview

1. Data ingestion (Airflow)

Five Airflow DAGs (etl_export_01, etl_livescores_01–04) run on a daily schedule. Each DAG drives browser automation through Selenoid, a Selenium-compatible hub that launches browsers in Docker containers, and routes the scraping work through a Celery task chain so that retries and concurrency are handled outside Airflow's own executor. Scraped match records (fixtures, results, odds) land in PostgreSQL, the canonical source of truth. A DVC export stage then writes a point-in-time snapshot to MinIO as an immutable, versioned parquet file that downstream pipeline stages depend on.

2. Storage layer (PostgreSQL + MinIO)

PostgreSQL holds all structured match data and is the only store that receives live writes. MinIO provides S3-compatible object storage for DVC remote artifacts: datasets, model binaries, and Evidently reports. Secrets (DB credentials, MinIO keys, API tokens) are encrypted with SOPS + age and committed to the repository; decryption happens only at deploy time via K8s Secrets (ADR-0004).

3. ML pipeline (DVC)

The full offline ML workflow is expressed as a DVC pipeline (dvc.yaml): export → preprocess → split → features → tune → final_train → register_model → error_analysis → monitor_drift. Each stage has explicit deps and outs, so dvc repro only re-runs stages whose inputs have changed.
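
The selective re-run behaviour can be illustrated with a toy model. This is not DVC's implementation: it assumes the stages form a strict linear chain, that each stage's code and params collapse into a single digest, and that a changed input always changes the output. The `plan` and `fingerprint` helpers and the `lock` dictionary (a stand-in for dvc.lock) are hypothetical names.

```python
import hashlib

# Stage order copied from the pipeline description; treating it as a strict
# linear chain (each stage depends only on the previous one) is an assumption.
STAGES = [
    "export", "preprocess", "split", "features", "tune",
    "final_train", "register_model", "error_analysis", "monitor_drift",
]


def fingerprint(*parts: str) -> str:
    """Hash a stage's inputs into one short digest."""
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:12]


def plan(own_inputs: dict, lock: dict) -> tuple:
    """Return (stages to re-run, updated lock).

    own_inputs maps a stage to a digest of its code and params; lock holds the
    fingerprints recorded by the previous run (a toy stand-in for dvc.lock).
    """
    to_run, new_lock, upstream = [], {}, "raw"
    for stage in STAGES:
        fp = fingerprint(upstream, own_inputs.get(stage, ""))
        if lock.get(stage) != fp:
            to_run.append(stage)
        new_lock[stage] = fp
        upstream = fp  # a changed fingerprint invalidates everything downstream
    return to_run, new_lock


inputs = {stage: "v1" for stage in STAGES}
first_run, lock = plan(inputs, {})      # nothing cached: every stage runs
inputs["features"] = "v2"               # edit only the feature code
second_run, lock = plan(inputs, lock)   # features and every later stage
```

In this sketch, editing only the features stage leaves export, preprocess, and split untouched on the second run, which is the property the pipeline relies on to keep iteration cheap.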

Key properties:

- Deterministic: every source of randomness is seeded explicitly through params.yaml.
- Experiment variants are Hydra overlays (conf/experiment/prod.yaml, test.yaml); switching profiles requires no code change, only dvc exp run -S 'experiment=prod'.
- No data leakage: a temporal train/validation/test split ensures that no future data reaches any training stage.
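
A temporal split of the kind described in the last property might look like the sketch below. The cut-off dates, the record layout, and the `temporal_split` helper are illustrative assumptions; the project's actual split stage is not shown in this document.

```python
from datetime import date


def temporal_split(matches, val_start, test_start):
    """Split match records by date so that training data is strictly the oldest."""
    if not val_start < test_start:
        raise ValueError("val_start must be earlier than test_start")
    ordered = sorted(matches, key=lambda m: m["date"])
    train = [m for m in ordered if m["date"] < val_start]
    val = [m for m in ordered if val_start <= m["date"] < test_start]
    test = [m for m in ordered if m["date"] >= test_start]
    return train, val, test


# Hypothetical records; the field names are illustrative only.
matches = [
    {"match_id": 3, "date": date(2024, 3, 9)},
    {"match_id": 1, "date": date(2023, 8, 12)},
    {"match_id": 2, "date": date(2023, 12, 2)},
    {"match_id": 4, "date": date(2024, 5, 19)},
]
train, val, test = temporal_split(matches, date(2024, 1, 1), date(2024, 4, 1))
```

Because the boundaries are dates rather than random indices, every training record predates every validation record, and every validation record predates every test record.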

4. Experiment tracking & registry (MLflow)

Every dvc repro training run opens an MLflow run and logs parameters, metrics (log-loss, Brier, ECE, ROC-AUC, per-class precision/recall), calibration method choice, feature importances, calibration curves, and a reference_features.parquet snapshot used by drift monitoring.

Models are registered via mlflow.register_model and promoted through stable aliases (champion, challenger) rather than the deprecated stage strings, which makes rollback a one-line alias swap. The serving layer always loads by alias, never by a hardcoded version, so promotions take effect at the next pod restart without a new image build. Promotion gates are documented in docs/ml/model-registry.md#promotion-policy.
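
Loading by alias can be sketched as follows, assuming an MLflow version that supports the models:/name@alias URI form. The registered model name soccer-1x2 and both helper functions are hypothetical; the serving code itself is not shown in this document.

```python
def alias_uri(model_name: str, alias: str = "champion") -> str:
    """Build an MLflow registry URI that resolves through an alias."""
    if not model_name or "@" in model_name or "/" in model_name:
        raise ValueError(f"invalid registered model name: {model_name!r}")
    return f"models:/{model_name}@{alias}"


def load_by_alias(model_name: str, alias: str = "champion"):
    """Load whichever version the alias currently points at."""
    # Imported lazily so the URI helper works without MLflow installed.
    import mlflow.pyfunc

    return mlflow.pyfunc.load_model(alias_uri(model_name, alias))


# "soccer-1x2" is a hypothetical registered model name.
uri = alias_uri("soccer-1x2")
```

Under this scheme a rollback would amount to repointing the alias, for example with MlflowClient().set_registered_model_alias(name, "champion", previous_version), after which the next worker start resolves the older version.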

5. Serving (FastAPI + Celery + RabbitMQ)

The API layer exposes two inference modes (ADR-0005) and a lookup endpoint for precomputed predictions:

| Mode | Endpoint | Latency target | Mechanism |
|---|---|---|---|
| Sync | `POST /predict/` | p95 < 500 ms | FastAPI → Celery task → 30 s timeout |
| Async | `POST /predict/async/` | best-effort | Celery job; poll `GET /monitoring/task_status/{id}` |
| Lookup | `GET /predict/{match_id}` | p95 < 50 ms | Direct parquet lookup from `batch_inference` |
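
The difference between the sync and async modes can be sketched with the standard library alone. Here a thread pool stands in for the Celery worker and a dictionary for the result backend; neither is how the project is actually wired, and the function names and the fixed probabilities are placeholders.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor, TimeoutError

executor = ThreadPoolExecutor(max_workers=2)  # stand-in for the Celery worker pool
jobs = {}                                     # stand-in for the Celery result backend


def predict(features: dict, delay: float = 0.0) -> dict:
    """Placeholder model call that returns fixed 1x2 probabilities."""
    time.sleep(delay)
    return {"home": 0.5, "draw": 0.3, "away": 0.2}


def predict_sync(features: dict, timeout_s: float = 30.0, delay: float = 0.0) -> dict:
    """Sync mode: dispatch the job, then block until it finishes or times out."""
    future = executor.submit(predict, features, delay)
    try:
        return {"status": "ok", "proba": future.result(timeout=timeout_s)}
    except TimeoutError:
        return {"status": "timeout"}


def predict_async(features: dict) -> str:
    """Async mode: dispatch the job and return an id the client can poll."""
    task_id = uuid.uuid4().hex
    jobs[task_id] = executor.submit(predict, features)
    return task_id


def task_status(task_id: str) -> dict:
    """Polling endpoint: report whether the job has finished."""
    future = jobs[task_id]
    if not future.done():
        return {"status": "pending"}
    return {"status": "done", "proba": future.result()}
```

The sync path trades a held connection for a single round trip, while the async path returns immediately and pushes the waiting onto the client.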

The model is lazy-loaded once per worker process via PredictionService, so after the first request no model I/O happens on the hot path. All /predict/* endpoints require an X-API-Key header verified with hmac.compare_digest to prevent timing attacks. Kubernetes liveness (/healthcheck/) and Prometheus scrape (/metrics) endpoints are exempt from authentication.
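
A minimal sketch of the two mechanisms in this paragraph, using only the standard library. The helper names are hypothetical, the real PredictionService is not shown here, and the placeholder dictionary stands in for an actual model object.

```python
import hmac
from functools import lru_cache
from typing import Optional


def requires_api_key(path: str) -> bool:
    """Only /predict/* is stated to need a key; other routes are outside this sketch."""
    return path.startswith("/predict/")


def api_key_is_valid(provided: Optional[str], expected: str) -> bool:
    """Compare keys in constant time so response timing reveals nothing about the secret."""
    if provided is None:
        return False
    # Encoding to bytes avoids the TypeError compare_digest raises on non-ASCII str.
    return hmac.compare_digest(provided.encode(), expected.encode())


@lru_cache(maxsize=1)
def get_model():
    """Load the model on first use, then reuse it for the life of the process."""
    # Placeholder for the real load performed by the serving code.
    return {"name": "placeholder-model"}
```

An ordinary `==` comparison can return as soon as the first differing character is found, which is the timing signal that compare_digest is designed to remove.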

6. Infrastructure (Kubernetes + Helm)

All services run on a single-node self-hosted Kubernetes cluster managed through Helm charts with parameterised values files. The ingress layer (nginx-ingress) enforces rate limiting; the rps, burst, and connection limits are configurable via ingress.rateLimit. Multi-stage Docker builds produce separate images for the API and the ML worker, keeping the serving image free of training dependencies. GitLab CI builds, tests, and deploys on every push to master; secrets are injected from SOPS-decrypted files, never from plaintext CI variables.

7. Observability (Prometheus, Grafana, Evidently)

Prometheus scrapes nine custom metrics from FastAPI and the Celery worker, covering HTTP traffic (RPS, latency, error rate), prediction lifecycle (sync/async counts, timeouts, inference latency, confidence distribution), model metadata (loaded version, registration timestamp), and drift score. Five alerting rule files (monitoring/prometheus/rules/) cover API error rate, latency, Celery backlog, feature drift, and model staleness.

Grafana serves two dashboards (k8s/helm/ns_soccer-api/files/dashboards/soccer-api.json and soccer-ml-quality.json), covering HTTP Traffic, Celery, Model, and ML Quality panels.

Evidently drift detection runs as a DVC stage (monitor_drift) and a daily Airflow DAG. It compares recent production feature distributions against a reference snapshot saved during training, writes a scored JSON report and an HTML report, and exposes the drift score as a Prometheus gauge via GET /monitoring/drift.
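
To make the idea of a single drift score concrete, the sketch below compares reference and current feature samples and reports the share of features that drifted. It does not reproduce Evidently's statistical tests: the Population Stability Index, the 0.2 threshold, the bin count, and the feature names are all substituted assumptions.

```python
import math


def psi(reference, current, bins=5):
    """Population Stability Index between two numeric samples."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant reference

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def drift_score(reference, current, threshold=0.2):
    """Share of features whose PSI exceeds the threshold (0.0 to 1.0)."""
    drifted = [name for name in reference
               if psi(reference[name], current[name]) > threshold]
    return len(drifted) / len(reference)


# Hypothetical feature samples: one feature shifts, the other does not.
reference = {"home_form": list(range(10)), "odds_home": [1.5, 2.0, 2.5, 3.0]}
current = {"home_form": list(range(8, 18)), "odds_home": [1.5, 2.0, 2.5, 3.0]}
score = drift_score(reference, current)
```

A scalar in the range 0 to 1 like this is convenient to publish as a gauge and to alert on with a fixed threshold.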

See Architecture Overview for the full C4-level documentation.