System Overview

SoccerPredictAI is an end-to-end MLOps system that ingests raw football match statistics, trains and versions a 1×2 outcome classifier, serves predictions through a live REST API, and monitors model quality in production — all on a self-hosted single-node Kubernetes cluster.

For full readiness status, see Implementation Status. For detailed layer docs, see Further reading.

Architecture at a glance

```mermaid
flowchart TB
    subgraph Ingestion["Ingestion  ✅"]
        SRC[WhoScored.com]
        AIR[Airflow DAGs\n14 scheduled]
        SEL[Selenoid\nbrowser grid]
        PG[(PostgreSQL)]
        SRC -->|scrape via Selenoid| AIR
        AIR -->|triggers| SEL
        SEL -->|normalized rows| PG
    end

    subgraph Storage["Storage & Versioning  ✅"]
        MINIO[(MinIO / S3)]
        DVC_CACHE[DVC artifact cache]
        PG -->|raw export| MINIO
        MINIO --> DVC_CACHE
    end

    subgraph Pipeline["ML Pipeline  ✅  — 22 DVC stages"]
        GE[Great Expectations\n4 validation gates]
        FE[Feature Engineering\nELO · rolling stats]
        TUNE[Optuna tuning\nXGBoost · HGB]
        MLFLOW[MLflow Registry\nsmoke → candidate → champion]
        DVC_CACHE --> GE --> FE --> TUNE --> MLFLOW
    end

    subgraph Serving["Serving  ✅"]
        API[FastAPI\nInference Service]
        MQ[RabbitMQ]
        WML[Celery worker-ml]
        REDIS[(Redis\nprediction cache)]
        MLFLOW -->|champion alias| API
        API -->|enqueue| MQ
        MQ --> WML
        WML -->|cache result| REDIS
        REDIS -->|cache hit| API
    end

    subgraph Observability["Observability  ✅"]
        PROM[Prometheus\n10 metrics]
        GRAF[Grafana\n2 dashboards]
        EV[Evidently\ndrift detection]
        API --> PROM
        WML --> PROM
        PROM --> GRAF
        FE -.->|daily DAG| EV
    end
```

Not shown (external): the Streamlit UI (soccer.dmitryivanov.dev) calls `GET /predict/*` over public HTTPS. Partial: the `POST /predict/async/` HTTP route is not yet registered, although the Celery task and schemas exist.

Component responsibilities

Component	Role	Technology	Key boundary
Airflow	Schedules and orchestrates all ETL, quality monitoring, and model quality DAGs	Apache Airflow 2, Selenoid CDP	Does not implement domain logic; triggers workers or calls pipelines
Selenoid	Provides headless Chrome sessions for WhoScored scraping	Selenoid browser grid (external host)	Outside K8s cluster; single operational dependency
PostgreSQL	Canonical data store for match statistics	PostgreSQL (K8s, namespace `ds`)	Only writable by ingestion workers; read by DVC export stage
MinIO	S3-compatible object storage for all DVC artifacts and serving files	MinIO (K8s, namespace `ds`)	DVC-managed; serving layer reads parquet files at startup and polls for changes
DVC	Reproducible ML pipeline orchestration; artifact versioning	DVC 3.x	Runs offline (CI or local); does not share runtime with FastAPI
Great Expectations	Data contract enforcement at 4 pipeline gates	Great Expectations	Blocks the pipeline on schema or distribution violations; output = validation JSON
Feature Engineering	Pure, stateless transformation of raw stats into model-ready features	Python, pandas, `src/features/`	No IO, no side effects; same code runs offline (DVC) and online (Celery worker)
Optuna + MLflow	Hyperparameter search + experiment tracking and model registry	Optuna, MLflow 3.x (self-hosted)	All models enter serving only through the registry; no local file loading
FastAPI	Inference REST API; request routing, auth, metrics middleware	FastAPI, Uvicorn	Does not run inference directly; delegates to Celery `ml` queue
RabbitMQ	Message broker between FastAPI and Celery workers	RabbitMQ (K8s, namespace `soccer-api`)	Single point of failure for all on-demand inference (no clustering)
Celery worker-ml	Executes inference tasks; loads MLflow model once per process	Celery, Python	Model loaded on `worker_process_init`; Redis cache writes happen here
Redis	Prediction result cache + Celery result backend	Redis (K8s, namespace `soccer-api`)	Cache key: `predict:{match_id}:{run_id}` — auto-invalidates on model change; TTL = `PREDICTION_CACHE_TTL` (default 3600 s)
Prometheus	Metrics collection from FastAPI and Celery worker	Prometheus (K8s, namespace `ds`)	Scrapes `/metrics`; 10 metrics defined
Grafana	Metrics dashboards	Grafana (K8s, namespace `ds`)	2 deployed dashboards: "Soccer — ML Quality & Betting", "SoccerPredictAI"
Evidently	Feature drift detection	Evidently (DVC stage + Airflow DAG)	Writes `reports/drift/latest.json`; refreshes `model_feature_drift_score` gauge via `GET /monitoring/drift`
SOPS + age	Secret encryption	SOPS, age	All secrets encrypted at rest; decrypted by GitLab CI at deploy time
Helm + GitLab CI	Infrastructure deployment	Helm 3, GitLab CI/CD	CI pushes chart changes via SSH; no manual `kubectl apply` in production
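
The Redis row above gives the cache key format `predict:{match_id}:{run_id}` and a default TTL of 3600 s. The sketch below illustrates why embedding the run id makes a model change behave like an invalidation: a new `run_id` never matches keys written under the old one, and the old entries simply age out. The `InMemoryTTLCache` class and the `prediction_cache_key` helper are hypothetical stand-ins written for this page, not the project's code, and `match_id` is assumed to be an integer.

```python
import time
from typing import Any, Dict, Optional, Tuple

PREDICTION_CACHE_TTL = 3600  # seconds; the default quoted in the table above


def prediction_cache_key(match_id: int, run_id: str) -> str:
    """Build a key in the documented `predict:{match_id}:{run_id}` format."""
    return f"predict:{match_id}:{run_id}"


class InMemoryTTLCache:
    """Hypothetical stand-in for Redis set-with-expiry / get (illustration only)."""

    def __init__(self) -> None:
        self._store: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any, ttl: float, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._store[key] = (now + ttl, value)

    def get(self, key: str, now: Optional[float] = None) -> Optional[Any]:
        now = time.monotonic() if now is None else now
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if now >= expires_at:
            del self._store[key]  # expired: drop the entry and report a miss
            return None
        return value


cache = InMemoryTTLCache()
cache.set(prediction_cache_key(42, "run_a"), {"home": 0.5}, PREDICTION_CACHE_TTL, now=0.0)

hit = cache.get(prediction_cache_key(42, "run_a"), now=10.0)             # same model: hit
miss_new_model = cache.get(prediction_cache_key(42, "run_b"), now=10.0)  # new run id: miss
expired = cache.get(prediction_cache_key(42, "run_a"), now=3600.0)       # TTL elapsed: miss
```

Passing `now` explicitly keeps the example deterministic; a real Redis client would handle expiry server-side.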

Flows

Training flow

```text
WhoScored.com
  → Airflow DAG (scheduled scrape)
    → Selenoid browser automation
      → PostgreSQL (normalized rows)
        → DVC: load_data_from_sources
          → Great Expectations: validate_raw gate
            → DVC: preprocessing → validate_finished / validate_future
              → DVC: feature_engineering
                → Great Expectations: validate_features gate
                  → DVC: split_data
                    → DVC: classification_models (baseline, frac=0.01)
                      → DVC: tune_xgb + tune_hgb (Optuna, frac=0.1, n_trials=20)
                        → DVC: select_model (CV log-loss comparison)
                          → DVC: final_train (isotonic calibration, calib_frac=0.15)
                            → DVC: register_model → MLflow Registry (alias: smoke)
                              → DVC: promote_model (auto → candidate; manual gate → champion)
                                → DVC: batch_inference → predictions.parquet → MinIO
                                  → DVC: monitor_drift (Evidently drift report)
```

LogReg tuning (`tune_logreg`) is part of the DAG but disabled in production (`tuning_logreg.enabled: false`), so the stage runs as a no-op. Full stage reference: Training Pipeline.
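
The `select_model` step in the flow above compares candidates by cross-validated log-loss. As a rough illustration of that comparison (not the project's implementation), the sketch below computes multiclass log-loss for 1×2 probabilities and picks the candidate with the lowest mean fold score. The class encoding (0 = home, 1 = draw, 2 = away), the function names, and all numbers are assumptions made up for the example.

```python
import math
from statistics import mean
from typing import Dict, List, Sequence


def multiclass_log_loss(
    y_true: Sequence[int], proba: Sequence[Sequence[float]], eps: float = 1e-15
) -> float:
    """Mean negative log-probability assigned to the true class.

    Assumed encoding for illustration: 0 = home win, 1 = draw, 2 = away win.
    """
    losses = []
    for label, row in zip(y_true, proba):
        p = min(max(row[label], eps), 1.0 - eps)  # clip to avoid log(0)
        losses.append(-math.log(p))
    return mean(losses)


def select_best(cv_scores: Dict[str, List[float]]) -> str:
    """Return the candidate whose mean per-fold log-loss is lowest."""
    return min(cv_scores, key=lambda name: mean(cv_scores[name]))


# Illustrative values only; these are not results from the pipeline.
y = [0, 2, 1]
proba = [[0.6, 0.25, 0.15], [0.2, 0.3, 0.5], [0.3, 0.4, 0.3]]
loss = multiclass_log_loss(y, proba)

cv = {"xgb": [0.98, 1.01, 0.99], "hgb": [1.00, 1.02, 1.01]}
best = select_best(cv)
```

Lower log-loss is better, which is why the selection uses `min`; a uniform 1×2 guess scores about 1.0986 (that is, ln 3), a useful sanity baseline.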

Inference request flow

```mermaid
sequenceDiagram
    participant Client
    participant FastAPI
    participant Redis
    participant RabbitMQ
    participant WorkerML

    Client->>FastAPI: GET /predict/{match_id}  [X-API-Key]
    FastAPI->>FastAPI: Auth check · feature lookup (match_features.parquet)
    FastAPI->>Redis: Cache lookup  predict:{match_id}:{run_id}
    alt cache hit
        Redis-->>FastAPI: cached result  [cached=true]
        FastAPI-->>Client: 200 OK
    else cache miss
        FastAPI->>RabbitMQ: enqueue predict_match task → ml queue
        RabbitMQ->>WorkerML: dispatch task
        WorkerML->>WorkerML: run inference (champion model, loaded on worker init)
        WorkerML->>Redis: cache result (TTL=PREDICTION_CACHE_TTL)
        WorkerML-->>FastAPI: result dict  (30 s poll)
        FastAPI-->>Client: 200 OK  /  504 on timeout
    end
```

Precomputed path (no Celery, no cache miss possible): `GET /predict/predictions/`, `GET /predict/precomputed/{match_id}`, and `GET /predict/cards/` read directly from the in-memory `predictions.parquet` cache, which is refreshed from MinIO.
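
The cache-hit and cache-miss branches of the synchronous request sequence can be sketched with standard-library pieces. This is a simplified model under stated assumptions: a plain `dict` stands in for Redis, a `ThreadPoolExecutor` stands in for RabbitMQ plus the Celery worker, and `handle_predict` and `worker_task` are hypothetical names. Auth, feature lookup, and TTL handling are omitted.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Dict, Tuple

Prediction = Dict[str, float]


def worker_task(
    key: str, match_id: int, cache: dict, run_inference: Callable[[int], Prediction]
) -> Prediction:
    """Stand-in for the worker: run inference, then write the result to the cache."""
    result = run_inference(match_id)
    cache[key] = result
    return result


def handle_predict(
    match_id: int,
    run_id: str,
    cache: dict,
    run_inference: Callable[[int], Prediction],
    executor: Executor,
    timeout_s: float = 30.0,
) -> Tuple[int, dict]:
    """Return (HTTP status, body) following the hit / miss / timeout branches."""
    key = f"predict:{match_id}:{run_id}"
    cached = cache.get(key)
    if cached is not None:
        return 200, {**cached, "cached": True}

    # Cache miss: hand the work off and wait for the result up to the timeout.
    future = executor.submit(worker_task, key, match_id, cache, run_inference)
    try:
        result = future.result(timeout=timeout_s)
    except FutureTimeout:
        return 504, {"detail": "prediction timed out"}
    return 200, {**result, "cached": False}


def fake_model(match_id: int) -> Prediction:
    """Placeholder for the champion model; returns fixed illustrative values."""
    return {"home": 0.5, "draw": 0.3, "away": 0.2}


with ThreadPoolExecutor(max_workers=1) as pool:
    demo_cache: dict = {}
    first = handle_predict(1, "run_a", demo_cache, fake_model, pool)   # miss, then computed
    second = handle_predict(1, "run_a", demo_cache, fake_model, pool)  # served from cache
```

As in the diagram, the cache write happens on the worker side, so a request that times out can still warm the cache for the next caller once the task finishes.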

Production reality

Area	Status	Notes
End-to-end scrape → inference	✅ Operational	Full path running at soccer.dmitryivanov.dev
Sync inference (`GET /predict/{match_id}`)	✅ Operational	30 s timeout (504 on expiry); Celery `ml` queue
Precomputed predictions (5 endpoints)	✅ Operational	Served from in-memory parquet, no Celery
Redis prediction cache	✅ Operational	Implemented in `PredictionService`; gracefully degrades if Redis unreachable
MLflow model registry	✅ Operational	`smoke → candidate` automated; `champion` requires manual sign-off
Prometheus metrics + Evidently drift	✅ Operational	10 metrics; daily drift DAG
Grafana dashboards	✅ Operational	2 dashboards deployed
`POST /predict/async/` HTTP route	🚧 Scaffold	Celery task, Pydantic schemas, `_task_accepted()` helper, and polling endpoint (`GET /monitoring/task_status/{task_id}`) exist; `@router.post("/async/")` not registered
`prediction_duration_seconds` metric	✅ Operational	Instrumented in `src/app/routers/predict.py` (`PREDICTION_LATENCY.observe()` on every sync prediction)
Alertmanager rules	📋 Planned	Runbooks exist; rules not deployed
Automated `champion` promotion	📋 Planned	Current gate: `promote_model` auto-promotes to `candidate`; human sign-off required for `champion`
Automated retraining trigger	📋 Planned	Manual `dvc repro` or CI job today
Cache invalidation on model promotion	📋 Planned	Redis TTL-based expiry only; no flush on `champion` swap
H2H features	⚙️ Disabled	`include_h2h: false` in production params
Rest-days features	⚙️ Disabled	`include_rest_days: false` in production params
LogReg tuning	⚙️ Disabled	`tuning_logreg.enabled: false`; stage runs as no-op

Operational characteristics

Reproducibility

- All pipeline stages are DVC-tracked; `dvc repro` reproduces from any checkpoint.
- Params are managed by Hydra (`conf/config.yaml`, overlay via `conf/experiment/`); `params.yaml` is the generated runtime snapshot.
- All seeds are explicit; no hidden randomness.

Experiment tracking

- Every training run is logged to MLflow: params, metrics, artifact URIs, data lineage.
- Models are promoted through `ci-smoke → smoke → candidate → champion` aliases with explicit gates.
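
The alias chain and its gates can be modelled as a small state machine. The sketch below is only an illustration of the promotion rules described on this page: `champion` needs a manual sign-off, while earlier steps advance automatically once their gate passes. Treating the `ci-smoke → smoke` step as automated is an assumption, and `promote` is a hypothetical helper, not the `promote_model` stage itself.

```python
from typing import Optional

# Order as listed in the text; which gates are manual beyond `champion` is an assumption.
ALIAS_ORDER = ["ci-smoke", "smoke", "candidate", "champion"]
MANUAL_GATES = {"champion"}


def next_alias(current: str) -> Optional[str]:
    """Return the alias that follows `current`, or None at the end of the chain."""
    idx = ALIAS_ORDER.index(current)  # raises ValueError for unknown aliases
    return ALIAS_ORDER[idx + 1] if idx + 1 < len(ALIAS_ORDER) else None


def promote(current: str, gate_passed: bool, manual_signoff: bool = False) -> str:
    """Attempt one promotion step and return the resulting alias."""
    target = next_alias(current)
    if target is None or not gate_passed:
        return current  # already champion, or the automated gate failed
    if target in MANUAL_GATES and not manual_signoff:
        return current  # waits for a human decision
    return target


after_auto = promote("smoke", gate_passed=True)                             # 'candidate'
blocked = promote("candidate", gate_passed=True)                            # stays 'candidate'
signed_off = promote("candidate", gate_passed=True, manual_signoff=True)    # 'champion'
```

Keeping the manual gates in a separate set makes the planned automated `champion` promotion a one-line change in this model.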

CI/CD

- GitLab CI runs: lint (ruff, mypy), Dockerfile lint (hadolint), K8s manifest validation (kubeconform), Airflow DAG check, DVC DAG check, pytest, SAST (semgrep), container scan, smoke train, Helm deploy.
- Image tagging uses `$CI_COMMIT_SHA` + `latest`. No per-branch tags.
- The `train:smoke` CI job runs the full 22-stage pipeline with `experiment=smoke` (`frac=0.001`) as a wiring check.

Security

- All secrets are encrypted with SOPS + age; the age private key lives outside the repo.
- `/predict/*` endpoints require an `X-API-Key` header; `/monitoring/*` is unauthenticated (planned restriction).
- Container images are scanned by GitLab `container_scanning` in CI.
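
A header-based key check of the kind described above might look like the following. This is a framework-free sketch, not the service's actual middleware: `is_authorized` and `PROTECTED_PREFIXES` are hypothetical names, only the `/predict/` prefix is treated as protected, and how the expected key is loaded (for example from a SOPS-decrypted secret) is left out.

```python
import hmac
from typing import Mapping

# Assumption for illustration: only the prefix named in the text is protected.
PROTECTED_PREFIXES = ("/predict/",)


def is_authorized(path: str, headers: Mapping[str, str], expected_key: str) -> bool:
    """Return True if the request may proceed under the rules sketched above."""
    if not path.startswith(PROTECTED_PREFIXES):
        return True  # e.g. /monitoring/* is unauthenticated today

    # HTTP header names are case-insensitive, so normalise before the lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    supplied = normalised.get("x-api-key", "")

    # Constant-time comparison avoids leaking key contents via response timing.
    return hmac.compare_digest(supplied.encode(), expected_key.encode())


ok = is_authorized("/predict/123", {"X-API-Key": "s3cret"}, "s3cret")
missing = is_authorized("/predict/123", {}, "s3cret")
open_route = is_authorized("/monitoring/drift", {}, "s3cret")
```

Using `hmac.compare_digest` instead of `==` is a common hardening step for secret comparison; whether the project does this is not stated in the source.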

Infrastructure

- Single-node Kubernetes on a self-hosted VPS (healserver).
- A node failure is a full-service outage: no HA and no autoscaling, an explicit tradeoff (see Trade-offs).
- Monthly infra cost: <€30.

Repository map

Path	What lives here
`src/data/`	Data access layer: schemas, splits, storage abstractions, DB session
`src/features/`	Feature engineering: pure, deterministic transforms; no IO
`src/models/`	Model wrappers, losses, metrics; no IO
`src/pipelines/`	DVC stage entrypoints (CLI); orchestration only — no domain logic
`src/app/`	FastAPI application: routers, services, Celery tasks, Pydantic schemas, metrics
`src/app/routers/`	Route handlers: `predict`, `monitoring`, `healthcheck`, `livescores`
`src/app/tasks/`	Celery tasks: `predict_match`, `get_model_info`
`src/app/services/`	Business services: `PredictionService`, `FeatureLookupService`, `FonbetOddsService`, …
`airflow/dags/`	Scheduled Airflow DAGs: ETL (scraping, Fonbet odds), ML monitoring, inference
`conf/`	Hydra configs: `config.yaml` (base), `experiment/smoke.yaml`, `experiment/test.yaml`
`params.yaml`	DVC/Hydra-generated runtime parameter snapshot (source of truth for stage params)
`dvc.yaml`	DVC pipeline: 22 stages, dependencies, outputs, metrics
`k8s/`	Kubernetes manifests (raw) and Helm chart (`k8s/helm/`)
`docker/`	Per-service Dockerfiles: `api`, `celery-worker-ml`, `dvc-runner`, `airflow-custom`, `env-ml`
`tests/`	Test suite: `unit/`, `property/`, `service/`, `contract/`, `load/`
`docs/`	MkDocs engineering documentation (this site)
`reports/`	Quarto analysis reports; not production code
`conf/config.yaml`	Single config entrypoint; all pipeline params derive from here
`.github/` (or `.gitlab-ci.yml`)	CI/CD pipeline definition

Further reading

Topic	Document
Implementation status (canonical)	status.md
ML pipeline stages (all 22)	training-pipeline.md
Inference modes and API contract	inference-modes.md
Model registry and promotion gates	model-registry.md
Prometheus metrics reference	monitoring/metrics.md
Deployment topology and namespaces	deployment-view.md
Container responsibilities (C4 L2)	c4-containers.md
CI/CD pipeline stages	cicd/index.md
Architecture principles	principles.md
Known limitations and trade-offs	tradeoffs.md
Quickstart (reproduce the pipeline)	quickstart.md