System Overview
SoccerPredictAI is an end-to-end MLOps system that ingests raw football match statistics, trains and versions a 1×2 outcome classifier, serves predictions through a live REST API, and monitors model quality in production — all on a self-hosted single-node Kubernetes cluster.
For full readiness status see Implementation Status. For detailed layer docs see Further reading.
Architecture at a glance
flowchart TB
subgraph Ingestion["Ingestion ✅"]
SRC[WhoScored.com]
AIR[Airflow DAGs\n14 scheduled]
SEL[Selenoid\nbrowser grid]
PG[(PostgreSQL)]
SRC -->|scrape via Selenoid| AIR
AIR -->|triggers| SEL
SEL -->|normalized rows| PG
end
subgraph Storage["Storage & Versioning ✅"]
MINIO[(MinIO / S3)]
DVC_CACHE[DVC artifact cache]
PG -->|raw export| MINIO
MINIO --> DVC_CACHE
end
subgraph Pipeline["ML Pipeline ✅ — 22 DVC stages"]
GE[Great Expectations\n4 validation gates]
FE[Feature Engineering\nELO · rolling stats]
TUNE[Optuna tuning\nXGBoost · HGB]
MLFLOW[MLflow Registry\nsmoke → candidate → champion]
DVC_CACHE --> GE --> FE --> TUNE --> MLFLOW
end
subgraph Serving["Serving ✅"]
API[FastAPI\nInference Service]
MQ[RabbitMQ]
WML[Celery worker-ml]
REDIS[(Redis\nprediction cache)]
MLFLOW -->|champion alias| API
API -->|enqueue| MQ
MQ --> WML
WML -->|cache result| REDIS
REDIS -->|cache hit| API
end
subgraph Observability["Observability ✅"]
PROM[Prometheus\n9 metrics]
GRAF[Grafana\n2 dashboards]
EV[Evidently\ndrift detection]
API --> PROM
WML --> PROM
PROM --> GRAF
FE -.->|daily DAG| EV
end
Not shown (external): Streamlit UI (soccer.dmitryivanov.dev) calls GET /predict/* over public HTTPS.
Partial: POST /predict/async/ HTTP route is not yet registered; Celery task and schemas exist.
Component responsibilities
| Component | Role | Technology | Key boundary |
|---|---|---|---|
| Airflow | Schedules and orchestrates all ETL, quality monitoring, and model quality DAGs | Apache Airflow 2, Selenoid CDP | Does not implement domain logic; triggers workers or calls pipelines |
| Selenoid | Provides headless Chrome sessions for WhoScored scraping | Selenoid browser grid (external host) | Outside K8s cluster; single operational dependency |
| PostgreSQL | Canonical data store for match statistics | PostgreSQL (K8s, namespace ds) | Only writable by ingestion workers; read by DVC export stage |
| MinIO | S3-compatible object storage for all DVC artifacts and serving files | MinIO (K8s, namespace ds) | DVC-managed; serving layer reads parquet files at startup and polls for changes |
| DVC | Reproducible ML pipeline orchestration; artifact versioning | DVC 3.x | Runs offline (CI or local); does not share runtime with FastAPI |
| Great Expectations | Data contract enforcement at 4 pipeline gates | Great Expectations | Blocks the pipeline on schema or distribution violations; output = validation JSON |
| Feature Engineering | Pure, stateless transformation of raw stats into model-ready features | Python, pandas, src/features/ | No IO, no side effects; same code runs offline (DVC) and online (Celery worker) |
| Optuna + MLflow | Hyperparameter search + experiment tracking and model registry | Optuna, MLflow 3.x (self-hosted) | All models enter serving only through the registry; no local file loading |
| FastAPI | Inference REST API; request routing, auth, metrics middleware | FastAPI, Uvicorn | Does not run inference directly; delegates to Celery ml queue |
| RabbitMQ | Message broker between FastAPI and Celery workers | RabbitMQ (K8s, namespace soccer-api) | Single point of failure for all on-demand inference (no clustering) |
| Celery worker-ml | Executes inference tasks; loads MLflow model once per process | Celery, Python | Model loaded on worker_process_init; Redis cache writes happen here |
| Redis | Prediction result cache + Celery result backend | Redis (K8s, namespace soccer-api) | Cache key: predict:{match_id}:{run_id}, which auto-invalidates on model change; TTL = PREDICTION_CACHE_TTL (default 3600 s) |
| Prometheus | Metrics collection from FastAPI and Celery worker | Prometheus (K8s, namespace ds) | Scrapes /metrics; 9 metrics defined; prediction_duration_seconds defined but not yet instrumented |
| Grafana | Metrics dashboards | Grafana (K8s, namespace ds) | 2 deployed dashboards: "Soccer — ML Quality & Betting", "SoccerPredictAI" |
| Evidently | Feature drift detection | Evidently (DVC stage + Airflow DAG) | Writes reports/drift/latest.json; refreshes model_feature_drift_score gauge via GET /monitoring/drift |
| SOPS + age | Secret encryption | SOPS, age | All secrets encrypted at rest; decrypted by GitLab CI at deploy time |
| Helm + GitLab CI | Infrastructure deployment | Helm 3, GitLab CI/CD | CI pushes chart changes via SSH; no manual kubectl apply in production |
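The Feature Engineering row describes pure, stateless transforms, and the architecture diagram lists ELO among the engineered features. As an illustration only, the sketch below shows what a pure Elo update could look like. The function names, the K-factor of 20, and the absence of a home-advantage term are assumptions made for this example; they are not taken from src/features/.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of side A against side B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    home: float,
    away: float,
    home_result: float,
    k: float = 20.0,  # assumed K-factor, for illustration only
) -> Tuple[float, float]:
    """Return new (home, away) ratings.

    home_result is 1.0 for a home win, 0.5 for a draw, 0.0 for an away win.
    The function reads no files and mutates nothing, so the same inputs
    always give the same outputs.
    """
    delta = k * (home_result - expected_score(home, away))
    return home + delta, away - delta
```

Because ratings go in and ratings come out, a function of this shape can be called from an offline DVC stage and from an online worker and produce identical numbers, which is the property the table row relies on.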
Flows
Training flow
WhoScored.com
→ Airflow DAG (scheduled scrape)
→ Selenoid browser automation
→ PostgreSQL (normalized rows)
→ DVC: load_data_from_sources
→ Great Expectations: validate_raw gate
→ DVC: preprocessing → validate_finished / validate_future
→ DVC: feature_engineering
→ Great Expectations: validate_features gate
→ DVC: split_data
→ DVC: classification_models (baseline, frac=0.01)
→ DVC: tune_xgb + tune_hgb (Optuna, frac=0.1, n_trials=20)
→ DVC: select_model (CV log-loss comparison)
→ DVC: final_train (isotonic calibration, calib_frac=0.15)
→ DVC: register_model → MLflow Registry (alias: smoke)
→ DVC: promote_model (auto → candidate; manual gate → champion)
→ DVC: batch_inference → predictions.parquet → MinIO
→ DVC: monitor_drift (Evidently drift report)
LogReg tuning is in the DAG (tune_logreg) but disabled in production (tuning_logreg.enabled: false). Full stage reference: Training Pipeline.
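The select_model step compares tuned candidates by cross-validated log-loss. A minimal sketch of that comparison follows, assuming three classes encoded as 0 = home win, 1 = draw, 2 = away win and a plain mean over folds. The function names and the encoding are hypothetical; the real stage may aggregate folds differently.

```python
import math
from typing import Dict, List, Sequence


def multiclass_log_loss(
    y_true: Sequence[int],
    probs: Sequence[Sequence[float]],
    eps: float = 1e-15,
) -> float:
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip to avoid log(0) when a model is overconfident and wrong.
        p = min(max(row[label], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(y_true)


def select_model(cv_scores: Dict[str, List[float]]) -> str:
    """Return the candidate name with the lowest mean per-fold log-loss."""
    return min(
        cv_scores,
        key=lambda name: sum(cv_scores[name]) / len(cv_scores[name]),
    )
```

Lower log-loss rewards well-calibrated probabilities rather than just correct top-1 picks, which matters for a 1×2 classifier whose output is consumed as probabilities.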
Inference request flow
sequenceDiagram
participant Client
participant FastAPI
participant Redis
participant RabbitMQ
participant WorkerML
Client->>FastAPI: GET /predict/{match_id} [X-API-Key]
FastAPI->>FastAPI: Auth check · feature lookup (match_features.parquet)
FastAPI->>Redis: Cache lookup predict:{match_id}:{run_id}
alt cache hit
Redis-->>FastAPI: cached result [cached=true]
FastAPI-->>Client: 200 OK
else cache miss
FastAPI->>RabbitMQ: enqueue predict_match task → ml queue
RabbitMQ->>WorkerML: dispatch task
WorkerML->>WorkerML: run inference (champion model, loaded on worker init)
WorkerML->>Redis: cache result (TTL=PREDICTION_CACHE_TTL)
WorkerML-->>FastAPI: result dict (30 s poll)
FastAPI-->>Client: 200 OK / 504 on timeout
end
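The cache branch in the diagram above can be sketched as follows. This is an illustration under stated assumptions: TTLCache is an in-memory stand-in for Redis, predict_cached collapses the FastAPI lookup and the worker-side write into one function, and match_id is assumed to be an integer. None of these names come from the codebase; only the key format predict:{match_id}:{run_id} and the 3600 s default TTL are taken from this page.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory stand-in for a Redis GET / SET-with-expiry pair."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._store: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> Optional[dict]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: behave like a miss
            return None
        return value

    def set(self, key: str, value: dict, ttl: float) -> None:
        self._store[key] = (self._clock() + ttl, value)


def cache_key(match_id: int, run_id: str) -> str:
    """Key format documented on this page."""
    return f"predict:{match_id}:{run_id}"


def predict_cached(
    cache: TTLCache,
    match_id: int,
    run_id: str,
    infer: Callable[[int], dict],
    ttl: float = 3600.0,
) -> dict:
    """Return a cached result if present, otherwise infer and store it."""
    key = cache_key(match_id, run_id)
    hit = cache.get(key)
    if hit is not None:
        return {**hit, "cached": True}
    result = infer(match_id)
    cache.set(key, result, ttl)
    return {**result, "cached": False}
```

Since run_id is part of the key, a newly loaded model produces different keys and never reads entries written by the previous one; the old entries simply age out through the TTL.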
Precomputed path (no Celery, no cache miss possible):
GET /predict/predictions/ · GET /predict/precomputed/{match_id} · GET /predict/cards/
→ read directly from in-memory predictions.parquet cache (refreshed from MinIO).
Production reality
| Area | Status | Notes |
|---|---|---|
| End-to-end scrape → inference | ✅ Operational | Full path running at soccer.dmitryivanov.dev |
| Sync inference (GET /predict/{match_id}) | ✅ Operational | 30 s SLO; Celery ml queue |
| Precomputed predictions (5 endpoints) | ✅ Operational | Served from in-memory parquet, no Celery |
| Redis prediction cache | ✅ Operational | Implemented in PredictionService; gracefully degrades if Redis unreachable |
| MLflow model registry | ✅ Operational | smoke → candidate automated; champion requires manual sign-off |
| Prometheus metrics + Evidently drift | ✅ Operational | 9 metrics; daily drift DAG |
| Grafana dashboards | ✅ Operational | 2 dashboards deployed |
| POST /predict/async/ HTTP route | 🚧 Scaffold | Celery task, Pydantic schemas, _task_accepted() helper, and polling endpoint (GET /monitoring/task_status/{task_id}) exist; @router.post("/async/") not registered |
| prediction_duration_seconds metric | 🚧 Defined only | Prometheus histogram defined in metrics.py; not yet instrumented in router |
| Alertmanager rules | 📋 Planned | Runbooks exist; rules not deployed |
| Automated champion promotion | 📋 Planned | Current gate: promote_model auto-promotes to candidate; human sign-off required for champion |
| Automated retraining trigger | 📋 Planned | Manual dvc repro or CI job today |
| Cache invalidation on model promotion | 📋 Planned | Redis TTL-based expiry only; no flush on champion swap |
| H2H features | ⚙️ Disabled | include_h2h: false in production params |
| Rest-days features | ⚙️ Disabled | include_rest_days: false in production params |
| LogReg tuning | ⚙️ Disabled | tuning_logreg.enabled: false; stage runs as no-op |
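The 30 s bound on synchronous inference in the table, together with the "200 OK / 504 on timeout" branch of the request flow, amounts to polling for a task result until a deadline. The sketch below shows that pattern in isolation. wait_for_result, its parameters, and the polling interval are hypothetical; how the API actually waits on the Celery result is not shown on this page.

```python
import time
from typing import Callable, Optional, Tuple


def wait_for_result(
    fetch: Callable[[], Optional[dict]],
    timeout: float = 30.0,
    interval: float = 0.1,  # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, Optional[dict]]:
    """Poll fetch() until it returns a result or the deadline passes.

    Returns (200, result) on success and (504, None) on timeout.
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        result = fetch()
        if result is not None:
            return 200, result
        if clock() >= deadline:
            return 504, None
        sleep(interval)
```

Using a monotonic clock for the deadline keeps the bound stable even if the wall clock is adjusted while a request is in flight.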
Operational characteristics
Reproducibility
- All pipeline stages are DVC-tracked; dvc repro reproduces from any checkpoint.
- Params are managed by Hydra (conf/config.yaml, overlay via conf/experiment/); params.yaml is the generated runtime snapshot.
- All seeds are explicit; no hidden randomness.
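To make the "explicit seeds" point concrete, here is a small sketch of seeded fractional subsampling of the kind implied by parameters such as frac=0.01. The function name and the use of Python's random module are assumptions for illustration; the pipeline's own sampling code is not shown here.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_frac(rows: Sequence[T], frac: float, seed: int) -> List[T]:
    """Return a reproducible random subset containing about frac of rows."""
    if not 0.0 < frac <= 1.0:
        raise ValueError("frac must be in (0, 1]")
    items = list(rows)
    if not items:
        return []
    # A local generator avoids any dependence on global random state.
    rng = random.Random(seed)
    k = max(1, round(len(items) * frac))
    return rng.sample(items, k)
```

Passing the seed as an argument, rather than seeding a module-level generator, means two stages cannot silently influence each other's draws.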
Experiment tracking
- Every training run is logged to MLflow: params, metrics, artifact URIs, data lineage.
- Models are promoted through ci-smoke → smoke → candidate → champion aliases with explicit gates.
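The alias ladder with its gates can be pictured as a small state machine. The sketch below covers only the smoke → candidate → champion steps described in the training flow; the ci-smoke step is left out because its gate is not described on this page. The promote function and its gate_passed and signed_off flags are hypothetical, and whether champion promotion also re-checks an automated gate is an assumption.

```python
from typing import Tuple

ALIASES: Tuple[str, ...] = ("smoke", "candidate", "champion")


def promote(current: str, gate_passed: bool, signed_off: bool = False) -> str:
    """Return the alias a model holds after one promotion attempt.

    - smoke -> candidate happens automatically when the gate passes.
    - candidate -> champion additionally requires a human sign-off.
    - champion is terminal.
    Raises ValueError for an alias that is not in ALIASES.
    """
    idx = ALIASES.index(current)
    if idx == len(ALIASES) - 1:
        return current
    target = ALIASES[idx + 1]
    if not gate_passed:
        return current
    if target == "champion" and not signed_off:
        return current
    return target
```

For example, promote("candidate", gate_passed=True) leaves the model at candidate until someone passes signed_off=True, mirroring the manual gate described above.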
CI/CD
- GitLab CI runs: lint (ruff, mypy), Dockerfile lint (hadolint), K8s manifest validation (kubeconform), Airflow DAG check, DVC DAG check, pytest, SAST (semgrep), container scan, smoke train, Helm deploy.
- Image tagging uses $CI_COMMIT_SHA + latest. No per-branch tags.
- train:smoke CI job runs the full 22-stage pipeline with experiment=smoke (frac=0.001) as a wiring check.
Security
- All secrets encrypted with SOPS + age; age private key lives outside the repo.
- /predict/* endpoints require X-API-Key header; /monitoring/* is unauthenticated (planned restriction).
- Container images scanned by GitLab container_scanning in CI.
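The endpoint rule above (key required on /predict/*, none on /monitoring/* for now) can be sketched as a plain function. This is not the service's actual FastAPI dependency; is_authorized and its parameters are hypothetical, and the constant-time comparison via hmac.compare_digest is a suggested practice rather than something this page confirms.

```python
import hmac
from typing import Mapping


def is_authorized(path: str, headers: Mapping[str, str], expected_key: str) -> bool:
    """Decide whether a request may proceed under the documented rule."""
    if not path.startswith("/predict/"):
        # Other routes, such as /monitoring/*, are currently unauthenticated.
        return True
    if not expected_key:
        # Refuse everything if no key is configured, instead of matching "".
        return False
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    supplied = lowered.get("x-api-key", "")
    # Constant-time comparison avoids leaking key prefixes through timing.
    return hmac.compare_digest(supplied.encode(), expected_key.encode())
```

Once the planned restriction on /monitoring/* lands, the early return for non-predict paths is the part that would need to change.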
Infrastructure
- Single-node Kubernetes on a self-hosted VPS (healserver).
- A node failure is a full-service outage (no HA, no autoscaling, explicit tradeoff — see Trade-offs).
- Monthly infra cost: <€30.
Repository map
| Path | What lives here |
|---|---|
| src/data/ | Data access layer: schemas, splits, storage abstractions, DB session |
| src/features/ | Feature engineering: pure, deterministic transforms; no IO |
| src/models/ | Model wrappers, losses, metrics; no IO |
| src/pipelines/ | DVC stage entrypoints (CLI); orchestration only, no domain logic |
| src/app/ | FastAPI application: routers, services, Celery tasks, Pydantic schemas, metrics |
| src/app/routers/ | Route handlers: predict, monitoring, healthcheck, livescores |
| src/app/tasks/ | Celery tasks: predict_match, get_model_info |
| src/app/services/ | Business services: PredictionService, FeatureLookupService, FonbetOddsService, … |
| airflow/dags/ | Scheduled Airflow DAGs: ETL (scraping, Fonbet odds), ML monitoring, inference |
| conf/ | Hydra configs: config.yaml (base), experiment/smoke.yaml, experiment/test.yaml |
| params.yaml | DVC/Hydra-generated runtime parameter snapshot (source of truth for stage params) |
| dvc.yaml | DVC pipeline: 22 stages, dependencies, outputs, metrics |
| k8s/ | Kubernetes manifests (raw) and Helm chart (k8s/helm/) |
| docker/ | Per-service Dockerfiles: api, celery-worker-ml, dvc-runner, airflow-custom, env-ml |
| tests/ | Test suite: unit/, property/, service/, contract/, load/ |
| docs/ | MkDocs engineering documentation (this site) |
| reports/ | Quarto analysis reports; not production code |
| conf/config.yaml | Single config entrypoint; all pipeline params derive from here |
| .github/ (or .gitlab-ci.yml) | CI/CD pipeline definition |
Further reading
| Topic | Document |
|---|---|
| Implementation status (canonical) | status.md |
| ML pipeline stages (all 22) | training-pipeline.md |
| Inference modes and API contract | inference-modes.md |
| Model registry and promotion gates | model-registry.md |
| Prometheus metrics reference | monitoring/metrics.md |
| Deployment topology and namespaces | deployment-view.md |
| Container responsibilities (C4 L2) | c4-containers.md |
| CI/CD pipeline stages | cicd/index.md |
| Architecture principles | principles.md |
| Known limitations and trade-offs | tradeoffs.md |
| Quickstart (reproduce the pipeline) | quickstart.md |