
Deployment View

This page describes where the system runs, how components are physically distributed, and how traffic flows from the internet to individual services.


Physical Topology

flowchart TB
    subgraph Internet[Public Internet]
        User[End User]
    end

    subgraph ExtVPS[External VPS — soccer.dmitryivanov.dev]
        StreamlitUI[Streamlit Web UI]
    end

    subgraph ExtSelenoid[External Host — Selenoid]
        SelenoidGrid[Selenoid Browser Grid]
    end

    subgraph HealServer[VPS — healserver — single node]
        HostNginx[Host-level Nginx\nTLS termination — port 443]

        subgraph K8s[Kubernetes — single-node cluster]
            subgraph NS_Ingress[namespace: ingress-nginx]
                Ingress[Nginx Ingress Controller\nNodePort 31390]
            end

            subgraph NS_DS[namespace: ds]
                Airflow[Airflow\nScheduler + Workers]
                PG[PostgreSQL]
                MinIO[MinIO S3]
                MLflow[MLflow\nTracking + Registry]
                Prom[Prometheus]
                Graf[Grafana\n📋 dashboards planned]
            end

            subgraph NS_Soccer[namespace: soccer-api]
                API[FastAPI\nInference Service]
                MQ[RabbitMQ]
                WorkerAPI[Celery worker-api]
                WorkerML[Celery worker-ml]
                Redis[Redis Cache]
            end

            subgraph NS_Mon[namespace: monitoring]
                KSM[kube-state-metrics]
                NE[node-exporter]
            end
        end
    end

    User -->|HTTPS| ExtVPS
    ExtVPS -->|HTTPS /predict| HostNginx
    User -->|HTTPS /predict direct| HostNginx
    HostNginx -->|NodePort 31390| Ingress
    Ingress -->|/predict, /healthcheck, /metrics| API
    API -->|enqueue task| MQ
    MQ --> WorkerAPI
    MQ --> WorkerML
    WorkerAPI -->|browser session| SelenoidGrid
    WorkerAPI --> PG
    WorkerAPI --> Redis
    WorkerML --> Redis
    PG --> MinIO
    MinIO -.->|dvc pull| MLpipeline[Offline ML Pipeline\nCI / local]
    MLpipeline --> MLflow
    API -->|model_uri| MLflow
    KSM --> Prom
    NE --> Prom
    API --> Prom
    WorkerAPI --> Prom
    WorkerML --> Prom
    Prom --> Graf

Namespace Layout

| Namespace | Services | Purpose |
| --- | --- | --- |
| ingress-nginx | Nginx Ingress Controller | Routes inbound traffic to cluster services by hostname/path |
| ds | Airflow, PostgreSQL, MinIO, MLflow, Prometheus, Grafana | Data platform and ML infrastructure |
| soccer-api | FastAPI, RabbitMQ, Celery worker-api, Celery worker-ml, Redis | Inference service and async task infrastructure |
| monitoring | kube-state-metrics, node-exporter | K8s cluster and host-level metrics |

Ingress Path

Traffic from the public internet follows this path:

Internet
  → host-level Nginx (port 443, TLS termination, VPS)
    → K8s NodePort 31390
      → Nginx Ingress Controller (namespace: ingress-nginx)
        → FastAPI service (namespace: soccer-api)

Key notes:

- TLS is terminated at the host-level Nginx, which acts as a reverse proxy to the K8s NodePort.
- The Ingress Controller routes requests to services by hostname and path prefix.
- No service in ds or monitoring is publicly exposed; internal-cluster access only.
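The path-prefix routing step can be illustrated with a minimal sketch. The three prefixes come from the topology diagram above; the backend label `fastapi`, the function name, and the segment-wise matching rule (as in the Kubernetes `Prefix` path type) are assumptions for illustration, not the project's actual Ingress manifest. Hostname matching is left out because the API hostname is not stated on this page.

```python
from typing import Optional

# Path prefixes taken from the topology diagram; the backend label is illustrative.
ROUTES = {
    "/predict": "fastapi",
    "/healthcheck": "fastapi",
    "/metrics": "fastapi",
}


def resolve_backend(path: str) -> Optional[str]:
    """Return the backend for a request path, or None if no rule matches."""
    for prefix, backend in ROUTES.items():
        # Match the prefix itself or any sub-path, but not "/predictions".
        if path == prefix or path.startswith(prefix + "/"):
            return backend
    return None
```

Under these assumptions, `resolve_backend("/predict")` resolves to the FastAPI service, while a path with no matching rule returns `None`, which corresponds to the controller answering with its default backend rather than forwarding the request.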


External Services

| Service | Host | Role | K8s integration |
| --- | --- | --- | --- |
| Selenoid Browser Grid | Dedicated external host | Headless Chrome sessions for WhoScored scraping | Called by celery-worker-api over HTTP; not inside K8s cluster |
| Streamlit Web UI | External VPS (soccer.dmitryivanov.dev) | User-facing prediction interface | Calls FastAPI over public HTTPS; no direct cluster access |
| GitLab CI/CD | GitLab.com SaaS | Build, test, and deploy pipeline | Pushes Helm charts and secrets to healserver via SSH |

Helm Chart Structure

All Kubernetes resources are managed via Helm charts in k8s/helm/.

k8s/helm/
  soccer-api/        — FastAPI + Celery + RabbitMQ + Redis
  airflow/           — Airflow deployment (custom values)
  monitoring/        — Prometheus + Grafana + exporters
  ...

Secrets are provided as SOPS-encrypted Helm values files (values-*.enc.yaml). CI decrypts them at deploy time using the age private key from a protected CI variable.
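A rough sketch of the decrypt step a deploy job might run is shown below. It only builds the `sops --decrypt` command and checks that an age key is visible to sops through `SOPS_AGE_KEY` or `SOPS_AGE_KEY_FILE`. The `.dec.yaml` output naming and the helper names are hypothetical, and the page does not say which CI variable actually carries the key.

```python
import os
from pathlib import Path


def decrypted_name(encrypted: Path) -> Path:
    """Map values-<env>.enc.yaml to values-<env>.dec.yaml (hypothetical naming)."""
    name = encrypted.name
    if not name.endswith(".enc.yaml"):
        raise ValueError(f"not an encrypted values file: {name}")
    return encrypted.with_name(name[: -len(".enc.yaml")] + ".dec.yaml")


def sops_decrypt_command(encrypted: Path) -> list[str]:
    """Build the sops call; the caller redirects stdout to the output file."""
    return ["sops", "--decrypt", str(encrypted)]


def age_key_available(env=os.environ) -> bool:
    """sops reads the age private key from SOPS_AGE_KEY or SOPS_AGE_KEY_FILE."""
    return bool(env.get("SOPS_AGE_KEY") or env.get("SOPS_AGE_KEY_FILE"))
```

Checking for the key before calling sops lets the job fail with a clear message when the protected variable is not exposed to the current branch, instead of surfacing a decryption error halfway through a Helm upgrade.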


Deployment Constraints

These constraints are architectural facts, not temporary limitations:

| Constraint | Architectural consequence |
| --- | --- |
| Single-node Kubernetes | No pod rescheduling across nodes; node failure is a full-service outage. Designed for portfolio/demo scale. |
| No High Availability | No replicated control plane; no multi-node worker pool. Accepted tradeoff against infrastructure cost. |
| Self-hosted VPS | Full operational responsibility: K8s upgrades, disk management, TLS renewal, backup. |
| External Selenoid host | Browser automation is outside the cluster network boundary; an independent failure domain not covered by K8s health probes. |
| Single RabbitMQ broker | Message queue is a single point of failure for the inference path. Acceptable at current throughput. |

These constraints are documented explicitly because they affect reasoning about failure modes, scaling, and future migration.


Known Limitations

| Limitation | Impact | Mitigation |
| --- | --- | --- |
| Single-node K8s cluster | No HA; node failure = full outage | Manual recovery via runbook; acceptable for portfolio scope |
| No cluster autoscaling | Cannot scale under load | Workload is light; manual scaling if needed |
| Selenoid runs outside K8s | Separate ops boundary; no K8s health probes | Monitored externally; scraping failures surface via Airflow |
| Single RabbitMQ broker | No message queue HA | Acceptable at current throughput; documented as known limit |
| No automated certificate renewal (if Let's Encrypt is not configured) | TLS certificate expiry | Operator runbook, or Let's Encrypt with certbot/cert-manager |

Portability Note

The Helm charts are parameterized with no hardcoded values specific to healserver. Migration to a managed Kubernetes cluster (GKE, EKS, AKS) requires:

  1. Update DNS and TLS entries in chart values.
  2. Replace MinIO with cloud object storage (update DVC remote config).
  3. Replace self-managed PostgreSQL with a managed instance if desired.
  4. Re-encrypt SOPS secrets with an updated age key.

No code changes are required.



Environments & Dependency Strategy

SoccerPredictAI uses a layered dependency approach to maximize reproducibility and eliminate "works on my machine" issues across dev, CI, training, and production.

Dependency Layering Strategy

Layer 1 — System + Python runtime
    conda / mamba (environment.yml)
    → exported to requirements-mamba-base.txt

Layer 2 — Python application dependencies
    PDM groups: api / ml / dev / prod
    → exported per group to requirements-pdm-*.txt

Layer 3 — Final pinned artifacts
    Merged into requirements-*.txt
    → used for deterministic Docker builds

Why this design?

- conda handles system-level and compiled library dependencies reliably.
- PDM provides modern dependency resolution and group-based separation (api/ml/dev).
- Exporting to pinned requirements-*.txt ensures Docker images are reproducible and auditable without requiring conda in the container build chain.
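The Layer 3 merge can be sketched as follows. This is a simplified illustration, not the project's actual merge script: it assumes each input line has the plain form `name==version`, ignores environment markers and hashes that real exports may carry, and the function names are hypothetical.

```python
def parse_pins(lines):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        name, sep, version = line.partition("==")
        if not sep:
            raise ValueError(f"unpinned requirement: {line!r}")
        # Package names are case-insensitive, so normalize to lower case.
        pins[name.strip().lower()] = version.strip()
    return pins


def merge_pins(base, extra):
    """Merge two pin sets; fail loudly if one package has two versions."""
    merged = dict(base)
    for name, version in extra.items():
        if name in merged and merged[name] != version:
            raise ValueError(f"conflict for {name}: {merged[name]} vs {version}")
        merged[name] = version
    return merged
```

Raising on a version conflict, rather than letting one layer silently win, keeps the merged file honest: a disagreement between the conda base and a PDM group shows up when the requirements are rebuilt, not as a surprise inside the Docker image.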

Environment Matrix

| Environment | Purpose | Dependency anchor | Python version | Activation |
| --- | --- | --- | --- | --- |
| Local development | Code authoring, debugging, test runs | conda env + pdm install --dev | 3.13 (from environment.yml) | conda activate soccer |
| CI (GitLab) | Lint, test, build, deploy | pdm install from pdm.lock | 3.13 (pinned in CI image) | CI runner environment |
| Offline ML training | dvc repro, experiment runs | requirements-ml.txt (pinned) | 3.13 | conda or Docker container |
| Deployed runtime (API) | FastAPI + Celery workers serving predictions | requirements-prod.txt (pinned) | 3.13 | K8s pod from Docker image |
| Docs / reporting | MkDocs build, Quarto reports | requirements-dev.txt subset | 3.13 | Local dev env |

Reproducibility Anchors

Every deployed model and dataset is traceable to four anchors:

| Anchor | What it pins |
| --- | --- |
| git commit | Code version |
| pdm.lock | All Python dependency versions |
| DVC content hash | Exact dataset version used for training |
| MLflow run ID | All training parameters, metrics, and artifacts |

A deployment is fully reproducible when all four anchors are recorded. Deployment manifests in k8s/ reference the Docker image tag, which maps to a specific git commit and pdm.lock.
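One way to make the "all four anchors recorded" rule checkable is a small record type, sketched below. The class and field names are hypothetical, and representing pdm.lock by a checksum string is an assumption; the page only states which four things must be pinned.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class ReproducibilityAnchors:
    git_commit: Optional[str] = None
    pdm_lock_hash: Optional[str] = None  # assumed: a checksum of pdm.lock
    dvc_hash: Optional[str] = None
    mlflow_run_id: Optional[str] = None

    def missing(self) -> list[str]:
        """Names of anchors that are unset or empty, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def is_reproducible(self) -> bool:
        """True only when every anchor has been recorded."""
        return not self.missing()
```

A deploy script could build this record from the image tag and the training metadata and refuse to continue while `missing()` is non-empty, which turns the rule into an explicit gate rather than a convention.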

Dependency Groups (PDM)

| Group | Contents | Used by |
| --- | --- | --- |
| api | FastAPI, Pydantic, Celery, Redis client | API Docker image |
| ml | scikit-learn, XGBoost, Optuna, MLflow, DVC | Training pipeline Docker image / local |
| dev | pytest, hypothesis, ruff, pre-commit, mypy | CI + local development |
| prod | Combined api + ml for production deployment | Production Docker image |

How to Rebuild Pinned Requirements

make requirements

This regenerates:

- PDM exports per group (requirements-pdm-*.txt)
- Base pip freeze from conda env (requirements-mamba-base.txt)
- Merged final requirements-*.txt for Docker builds

Run this whenever pdm.lock or environment.yml changes and before building new Docker images.
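A pre-build guard for this rule could look like the sketch below, which flags the pinned files as stale when any of them is missing or older than the newest source file. The function name is hypothetical and the check relies on file modification times, which is an assumption that holds on a developer machine but not necessarily after a fresh CI checkout, where all files may share the checkout time.

```python
from pathlib import Path


def requirements_stale(sources, outputs) -> bool:
    """True if any output is missing or older than the newest source."""
    sources = [Path(p) for p in sources]
    outputs = [Path(p) for p in outputs]
    # No outputs, or an output that does not exist yet, always counts as stale.
    if not outputs or any(not p.exists() for p in outputs):
        return True
    newest_source = max(p.stat().st_mtime for p in sources)
    oldest_output = min(p.stat().st_mtime for p in outputs)
    return oldest_output < newest_source
```

For example, calling it with `["pdm.lock", "environment.yml"]` as sources and the merged requirements files as outputs gives a quick signal that `make requirements` should be run before the next image build.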

Operational Note

The system treats pdm.lock and DVC content hashes as the primary reproducibility anchors. All production deployments should be traceable to:

  • git commit
  • dataset version (DVC hash)
  • model version (MLflow run ID + registered version)
  • dependency lock (pdm.lock)

No deployment should be performed from an environment where any of these anchors is unresolved.