System Overview

SoccerPredictAI is an end-to-end MLOps system that ingests raw football match statistics, trains and versions a 1×2 outcome classifier, serves predictions through a live REST API, and monitors model quality in production — all on a self-hosted single-node Kubernetes cluster.

For full readiness status see Implementation Status. For detailed layer docs see Further reading.


Architecture at a glance

```mermaid
flowchart TB
    subgraph Ingestion["Ingestion  ✅"]
        SRC[WhoScored.com]
        AIR[Airflow DAGs\n14 scheduled]
        SEL[Selenoid\nbrowser grid]
        PG[(PostgreSQL)]
        SRC -->|scrape via Selenoid| AIR
        AIR -->|triggers| SEL
        SEL -->|normalized rows| PG
    end

    subgraph Storage["Storage & Versioning  ✅"]
        MINIO[(MinIO / S3)]
        DVC_CACHE[DVC artifact cache]
        PG -->|raw export| MINIO
        MINIO --> DVC_CACHE
    end

    subgraph Pipeline["ML Pipeline  ✅  · 22 DVC stages"]
        GE[Great Expectations\n4 validation gates]
        FE[Feature Engineering\nELO · rolling stats]
        TUNE[Optuna tuning\nXGBoost · HGB]
        MLFLOW[MLflow Registry\nsmoke → candidate → champion]
        DVC_CACHE --> GE --> FE --> TUNE --> MLFLOW
    end

    subgraph Serving["Serving  ✅"]
        API[FastAPI\nInference Service]
        MQ[RabbitMQ]
        WML[Celery worker-ml]
        REDIS[(Redis\nprediction cache)]
        MLFLOW -->|champion alias| API
        API -->|enqueue| MQ
        MQ --> WML
        WML -->|cache result| REDIS
        REDIS -->|cache hit| API
    end

    subgraph Observability["Observability  ✅"]
        PROM[Prometheus\n9 metrics]
        GRAF[Grafana\n2 dashboards]
        EV[Evidently\ndrift detection]
        API --> PROM
        WML --> PROM
        PROM --> GRAF
        FE -.->|daily DAG| EV
    end
```

Not shown (external): the Streamlit UI (soccer.dmitryivanov.dev) calls GET /predict/* over public HTTPS.

Partial: the POST /predict/async/ HTTP route is not yet registered, although the Celery task and schemas already exist.


Component responsibilities

| Component | Role | Technology | Key boundary |
|---|---|---|---|
| Airflow | Schedules and orchestrates all ETL, quality monitoring, and model quality DAGs | Apache Airflow 2, Selenoid CDP | Does not implement domain logic; triggers workers or calls pipelines |
| Selenoid | Provides headless Chrome sessions for WhoScored scraping | Selenoid browser grid (external host) | Outside K8s cluster; single operational dependency |
| PostgreSQL | Canonical data store for match statistics | PostgreSQL (K8s, namespace ds) | Only writable by ingestion workers; read by DVC export stage |
| MinIO | S3-compatible object storage for all DVC artifacts and serving files | MinIO (K8s, namespace ds) | DVC-managed; serving layer reads parquet files at startup and polls for changes |
| DVC | Reproducible ML pipeline orchestration; artifact versioning | DVC 3.x | Runs offline (CI or local); does not share runtime with FastAPI |
| Great Expectations | Data contract enforcement at 4 pipeline gates | Great Expectations | Blocks the pipeline on schema or distribution violations; output = validation JSON |
| Feature Engineering | Pure, stateless transformation of raw stats into model-ready features | Python, pandas, src/features/ | No IO, no side effects; same code runs offline (DVC) and online (Celery worker) |
| Optuna + MLflow | Hyperparameter search + experiment tracking and model registry | Optuna, MLflow 3.x (self-hosted) | All models enter serving only through the registry; no local file loading |
| FastAPI | Inference REST API; request routing, auth, metrics middleware | FastAPI, Uvicorn | Does not run inference directly; delegates to Celery ml queue |
| RabbitMQ | Message broker between FastAPI and Celery workers | RabbitMQ (K8s, namespace soccer-api) | Single point of failure for all on-demand inference (no clustering) |
| Celery worker-ml | Executes inference tasks; loads MLflow model once per process | Celery, Python | Model loaded on worker_process_init; Redis cache writes happen here |
| Redis | Prediction result cache + Celery result backend | Redis (K8s, namespace soccer-api) | Cache key: predict:{match_id}:{run_id} (auto-invalidates on model change); TTL = PREDICTION_CACHE_TTL (default 3600 s) |
| Prometheus | Metrics collection from FastAPI and Celery worker | Prometheus (K8s, namespace ds) | Scrapes /metrics; 9 metrics defined; prediction_duration_seconds defined but not yet instrumented |
| Grafana | Metrics dashboards | Grafana (K8s, namespace ds) | 2 deployed dashboards: "Soccer — ML Quality & Betting", "SoccerPredictAI" |
| Evidently | Feature drift detection | Evidently (DVC stage + Airflow DAG) | Writes reports/drift/latest.json; refreshes model_feature_drift_score gauge via GET /monitoring/drift |
| SOPS + age | Secret encryption | SOPS, age | All secrets encrypted at rest; decrypted by GitLab CI at deploy time |
| Helm + GitLab CI | Infrastructure deployment | Helm 3, GitLab CI/CD | CI pushes chart changes via SSH; no manual kubectl apply in production |
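
The Redis row describes a cache key that embeds the model's run_id. A minimal sketch of that key layout follows; the function name, the integer match_id, and the example run IDs are assumptions made for illustration, not the project's actual code.

```python
# Documented default for PREDICTION_CACHE_TTL, in seconds.
DEFAULT_TTL_SECONDS = 3600


def prediction_cache_key(match_id: int, run_id: str) -> str:
    """Build a key in the documented layout predict:{match_id}:{run_id}.

    The function name and argument types are hypothetical.
    """
    return f"predict:{match_id}:{run_id}"


# Same match, two different model runs (run IDs are made up).
old_key = prediction_cache_key(1234, "run_a")
new_key = prediction_cache_key(1234, "run_b")
```

Because run_id is part of the key, a newly promoted model reads and writes under different keys than its predecessor. Entries from the old run are not flushed; they are just no longer looked up and age out when their TTL expires.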

Flows

Training flow

WhoScored.com
  → Airflow DAG (scheduled scrape)
    → Selenoid browser automation
      → PostgreSQL (normalized rows)
        → DVC: load_data_from_sources
          → Great Expectations: validate_raw gate
            → DVC: preprocessing → validate_finished / validate_future
              → DVC: feature_engineering
                → Great Expectations: validate_features gate
                  → DVC: split_data
                    → DVC: classification_models (baseline, frac=0.01)
                      → DVC: tune_xgb + tune_hgb (Optuna, frac=0.1, n_trials=20)
                        → DVC: select_model (CV log-loss comparison)
                          → DVC: final_train (isotonic calibration, calib_frac=0.15)
                            → DVC: register_model → MLflow Registry (alias: smoke)
                              → DVC: promote_model (auto → candidate; manual gate → champion)
                                → DVC: batch_inference → predictions.parquet → MinIO
                                  → DVC: monitor_drift (Evidently drift report)

LogReg tuning is in the DAG (tune_logreg) but disabled in production (tuning_logreg.enabled: false). Full stage reference: Training Pipeline.
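
The select_model stage is described as a comparison of cross-validated log-loss. As a rough sketch of that idea only (the real stage lives in src/pipelines/ and is not shown here), the code below computes multiclass log-loss with the standard library and picks the candidate with the lowest mean fold score. The function names and the per-fold numbers are hypothetical.

```python
import math
from statistics import mean


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # Clip to avoid log(0) on overconfident predictions.
        p = min(max(probs[label], eps), 1.0 - eps)
        total += -math.log(p)
    return total / len(y_true)


def select_model(cv_scores):
    """Return the candidate whose mean CV log-loss is lowest."""
    return min(cv_scores, key=lambda name: mean(cv_scores[name]))


# Hypothetical per-fold log-loss values, for illustration only.
cv_scores = {
    "xgb": [0.98, 1.01, 0.99],
    "hgb": [1.00, 1.02, 1.01],
}
best = select_model(cv_scores)
```

Lower log-loss is better, so the selection is a plain minimum over mean fold scores. Tie-breaking and any minimum-improvement threshold are left out of this sketch.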


Inference request flow

```mermaid
sequenceDiagram
    participant Client
    participant FastAPI
    participant Redis
    participant RabbitMQ
    participant WorkerML

    Client->>FastAPI: GET /predict/{match_id}  [X-API-Key]
    FastAPI->>FastAPI: Auth check · feature lookup (match_features.parquet)
    FastAPI->>Redis: Cache lookup  predict:{match_id}:{run_id}
    alt cache hit
        Redis-->>FastAPI: cached result  [cached=true]
        FastAPI-->>Client: 200 OK
    else cache miss
        FastAPI->>RabbitMQ: enqueue predict_match task → ml queue
        RabbitMQ->>WorkerML: dispatch task
        WorkerML->>WorkerML: run inference (champion model, loaded on worker init)
        WorkerML->>Redis: cache result (TTL=PREDICTION_CACHE_TTL)
        WorkerML-->>FastAPI: result dict  (30 s poll)
        FastAPI-->>Client: 200 OK  /  504 on timeout
    end
```

Precomputed path (no Celery, no cache miss possible): GET /predict/predictions/ · GET /predict/precomputed/{match_id} · GET /predict/cards/ → read directly from in-memory predictions.parquet cache (refreshed from MinIO).
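
The synchronous path above is a cache-aside lookup with a timeout. The sketch below mirrors its control flow under simplifying assumptions: a plain dict stands in for Redis, a callable stands in for the RabbitMQ plus worker-ml round trip, and TTL handling is omitted. The names handle_predict and run_task, and the response fields other than cached, are hypothetical.

```python
def handle_predict(match_id, run_id, cache, run_task):
    """Return (status_code, body) following the documented hit/miss flow."""
    key = f"predict:{match_id}:{run_id}"

    cached = cache.get(key)
    if cached is not None:
        # Cache hit: no task is enqueued.
        return 200, {**cached, "cached": True}

    try:
        # Stands in for "enqueue to the ml queue, then poll for up to 30 s".
        result = run_task(match_id)
    except TimeoutError:
        return 504, {"detail": "prediction timed out"}

    # In the documented flow the worker writes this entry; done inline here.
    cache[key] = result
    return 200, {**result, "cached": False}


# Usage with a fake worker (the probabilities are made up).
calls = []

def fake_worker(match_id):
    calls.append(match_id)
    return {"home": 0.5, "draw": 0.3, "away": 0.2}

def timed_out_worker(match_id):
    raise TimeoutError

cache = {}
first = handle_predict(1, "run_a", cache, fake_worker)
second = handle_predict(1, "run_a", cache, fake_worker)
late = handle_predict(2, "run_a", cache, timed_out_worker)
```

The second call is served from the cache, so the fake worker runs only once. A timed-out task produces a 504 and leaves the cache untouched.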


Production reality

| Area | Status | Notes |
|---|---|---|
| End-to-end scrape → inference | ✅ Operational | Full path running at soccer.dmitryivanov.dev |
| Sync inference (GET /predict/{match_id}) | ✅ Operational | 30 s SLO; Celery ml queue |
| Precomputed predictions (5 endpoints) | ✅ Operational | Served from in-memory parquet, no Celery |
| Redis prediction cache | ✅ Operational | Implemented in PredictionService; gracefully degrades if Redis unreachable |
| MLflow model registry | ✅ Operational | smoke → candidate automated; champion requires manual sign-off |
| Prometheus metrics + Evidently drift | ✅ Operational | 9 metrics; daily drift DAG |
| Grafana dashboards | ✅ Operational | 2 dashboards deployed |
| POST /predict/async/ HTTP route | 🚧 Scaffold | Celery task, Pydantic schemas, _task_accepted() helper, and polling endpoint (GET /monitoring/task_status/{task_id}) exist; @router.post("/async/") not registered |
| prediction_duration_seconds metric | 🚧 Defined only | Prometheus histogram defined in metrics.py; not yet instrumented in router |
| Alertmanager rules | 📋 Planned | Runbooks exist; rules not deployed |
| Automated champion promotion | 📋 Planned | Current gate: promote_model auto-promotes to candidate; human sign-off required for champion |
| Automated retraining trigger | 📋 Planned | Manual dvc repro or CI job today |
| Cache invalidation on model promotion | 📋 Planned | Redis TTL-based expiry only; no flush on champion swap |
| H2H features | ⚙️ Disabled | include_h2h: false in production params |
| Rest-days features | ⚙️ Disabled | include_rest_days: false in production params |
| LogReg tuning | ⚙️ Disabled | tuning_logreg.enabled: false; stage runs as no-op |

Operational characteristics

Reproducibility

- All pipeline stages are DVC-tracked; dvc repro reproduces from any checkpoint.
- Params are managed by Hydra (conf/config.yaml, overlay via conf/experiment/); params.yaml is the generated runtime snapshot.
- All seeds are explicit; no hidden randomness.
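
To illustrate what "base config plus experiment overlay" means for the generated snapshot, here is a standard-library sketch of a recursive dictionary merge. It is not how Hydra composes configs internally (Hydra uses OmegaConf and defaults lists), and the key names and the seed value are hypothetical. Only frac=0.1, n_trials=20, and the smoke frac=0.001 come from this page.

```python
import copy


def deep_merge(base: dict, overlay: dict) -> dict:
    """Return a new dict where overlay values win, merging nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical stand-ins for conf/config.yaml and conf/experiment/smoke.yaml.
base = {"seed": 42, "tuning": {"frac": 0.1, "n_trials": 20}}
smoke = {"tuning": {"frac": 0.001}}

# Plays the role of the generated params.yaml snapshot.
params = deep_merge(base, smoke)
```

The overlay changes only the keys it names, so n_trials and seed carry over from the base config and the base dict itself is left unmodified.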

Experiment tracking

- Every training run is logged to MLflow: params, metrics, artifact URIs, data lineage.
- Models are promoted through ci-smoke → smoke → candidate → champion aliases with explicit gates.
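
The promotion gates can be pictured as a small state machine. The sketch below covers only the smoke → candidate → champion part of the ladder, where this page states the rule explicitly: automatic up to candidate, manual sign-off for champion. The ci-smoke alias is left out because its gate is not described here. The function name and the manual_signoff flag are hypothetical, and no MLflow calls are made.

```python
LADDER = ["smoke", "candidate", "champion"]


def next_alias(current: str, manual_signoff: bool = False) -> str:
    """Return the alias a model version would hold after one promotion attempt."""
    idx = LADDER.index(current)
    if idx == len(LADDER) - 1:
        return current  # already champion, nothing further to promote to
    target = LADDER[idx + 1]
    if target == "champion" and not manual_signoff:
        return current  # blocked: champion needs a human sign-off
    return target


auto_step = next_alias("smoke")
blocked_step = next_alias("candidate")
approved_step = next_alias("candidate", manual_signoff=True)
```

In a real registry, each transition would correspond to moving an alias onto a model version. The quality checks that decide whether the automatic step succeeds are not modeled in this sketch.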

CI/CD

- GitLab CI runs: lint (ruff, mypy), Dockerfile lint (hadolint), K8s manifest validation (kubeconform), Airflow DAG check, DVC DAG check, pytest, SAST (semgrep), container scan, smoke train, Helm deploy.
- Image tagging uses $CI_COMMIT_SHA + latest. No per-branch tags.
- The train:smoke CI job runs the full 22-stage pipeline with experiment=smoke (frac=0.001) as a wiring check.

Security

- All secrets encrypted with SOPS + age; age private key lives outside the repo.
- /predict/* endpoints require X-API-Key header; /monitoring/* is unauthenticated (planned restriction).
- Container images scanned by GitLab container_scanning in CI.
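
As an illustration of the header rule, the following framework-free sketch checks X-API-Key for /predict/* paths and lets other paths through, which matches the current unauthenticated state of /monitoring/*. It is not the project's FastAPI dependency; the function name, the protected-prefix tuple, and the example key are assumptions.

```python
import hmac

# Only the prefix this page documents as protected.
PROTECTED_PREFIXES = ("/predict/",)


def is_authorized(path: str, headers: dict, expected_key: str) -> bool:
    """Return True if the request may proceed under the documented rule."""
    if not path.startswith(PROTECTED_PREFIXES):
        return True  # e.g. /monitoring/* is currently unauthenticated

    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    supplied = normalized.get("x-api-key", "")

    # Constant-time comparison avoids leaking key contents through timing.
    return hmac.compare_digest(supplied.encode(), expected_key.encode())


ok = is_authorized("/predict/42", {"X-API-Key": "s3cret"}, "s3cret")
wrong = is_authorized("/predict/42", {"X-API-Key": "nope"}, "s3cret")
missing = is_authorized("/predict/42", {}, "s3cret")
open_path = is_authorized("/monitoring/drift", {}, "s3cret")
```

Using hmac.compare_digest rather than == is a common precaution for secret comparison. Whether the service does this is not stated on this page.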

Infrastructure

- Single-node Kubernetes on a self-hosted VPS (healserver).
- A node failure is a full-service outage (no HA, no autoscaling; an explicit tradeoff, see Trade-offs).
- Monthly infra cost: <€30.


Repository map

| Path | What lives here |
|---|---|
| src/data/ | Data access layer: schemas, splits, storage abstractions, DB session |
| src/features/ | Feature engineering: pure, deterministic transforms; no IO |
| src/models/ | Model wrappers, losses, metrics; no IO |
| src/pipelines/ | DVC stage entrypoints (CLI); orchestration only, no domain logic |
| src/app/ | FastAPI application: routers, services, Celery tasks, Pydantic schemas, metrics |
| src/app/routers/ | Route handlers: predict, monitoring, healthcheck, livescores |
| src/app/tasks/ | Celery tasks: predict_match, get_model_info |
| src/app/services/ | Business services: PredictionService, FeatureLookupService, FonbetOddsService, … |
| airflow/dags/ | Scheduled Airflow DAGs: ETL (scraping, Fonbet odds), ML monitoring, inference |
| conf/ | Hydra configs: config.yaml (base), experiment/smoke.yaml, experiment/test.yaml |
| params.yaml | DVC/Hydra-generated runtime parameter snapshot (source of truth for stage params) |
| dvc.yaml | DVC pipeline: 22 stages, dependencies, outputs, metrics |
| k8s/ | Kubernetes manifests (raw) and Helm chart (k8s/helm/) |
| docker/ | Per-service Dockerfiles: api, celery-worker-ml, dvc-runner, airflow-custom, env-ml |
| tests/ | Test suite: unit/, property/, service/, contract/, load/ |
| docs/ | MkDocs engineering documentation (this site) |
| reports/ | Quarto analysis reports; not production code |
| conf/config.yaml | Single config entrypoint; all pipeline params derive from here |
| .github/ (or .gitlab-ci.yml) | CI/CD pipeline definition |

Further reading

| Topic | Document |
|---|---|
| Implementation status (canonical) | status.md |
| ML pipeline stages (all 22) | training-pipeline.md |
| Inference modes and API contract | inference-modes.md |
| Model registry and promotion gates | model-registry.md |
| Prometheus metrics reference | monitoring/metrics.md |
| Deployment topology and namespaces | deployment-view.md |
| Container responsibilities (C4 L2) | c4-containers.md |
| CI/CD pipeline stages | cicd/index.md |
| Architecture principles | principles.md |
| Known limitations and trade-offs | tradeoffs.md |
| Quickstart (reproduce the pipeline) | quickstart.md |