Deployment View

This page describes where the system runs, how components are physically distributed, and how traffic flows from the internet to individual services.

Physical Topology

flowchart TB
    subgraph Internet[Public Internet]
        User[End User]
    end

    subgraph ExtVPS[External VPS — soccer.dmitryivanov.dev]
        StreamlitUI[Streamlit Web UI]
    end

    subgraph ExtSelenoid[External Host — Selenoid]
        SelenoidGrid[Selenoid Browser Grid]
    end

    subgraph HealServer[VPS — healserver — single node]
        HostNginx[Host-level Nginx\nTLS termination — port 443]

        subgraph K8s[Kubernetes — single-node cluster]
            subgraph NS_Ingress[namespace: ingress-nginx]
                Ingress[Nginx Ingress Controller\nNodePort 31390]
            end

            subgraph NS_DS[namespace: ds]
                Airflow[Airflow\nScheduler + Workers]
                PG[PostgreSQL]
                MinIO[MinIO S3]
                MLflow[MLflow\nTracking + Registry]
                Prom[Prometheus]
                Graf[Grafana\n✅ deployed]
            end

            subgraph NS_Soccer[namespace: soccer-api]
                API[FastAPI\nInference Service]
                MQ[RabbitMQ]
                WorkerAPI[Celery worker-api]
                WorkerML[Celery worker-ml]
                Redis[Redis Cache]
            end

            subgraph NS_Mon[namespace: monitoring]
                KSM[kube-state-metrics]
                NE[node-exporter]
            end
        end
    end

    User -->|HTTPS| ExtVPS
    ExtVPS -->|HTTPS /predict| HostNginx
    User -->|HTTPS /predict direct| HostNginx
    HostNginx -->|NodePort 31390| Ingress
    Ingress -->|/predict, /healthcheck, /metrics| API
    API -->|enqueue task| MQ
    MQ --> WorkerAPI
    MQ --> WorkerML
    WorkerAPI -->|browser session| SelenoidGrid
    WorkerAPI --> PG
    WorkerAPI --> Redis
    WorkerML --> Redis
    PG --> MinIO
    MinIO -.->|dvc pull| MLpipeline[Offline ML Pipeline\nCI / local]
    MLpipeline --> MLflow
    API -->|model_uri| MLflow
    KSM --> Prom
    NE --> Prom
    API --> Prom
    WorkerAPI --> Prom
    WorkerML --> Prom
    Prom --> Graf

Namespace Layout

Namespace	Services	Purpose
`ingress-nginx`	Nginx Ingress Controller	Routes inbound traffic to cluster services by hostname/path
`ds`	Airflow, PostgreSQL, MinIO, MLflow, Prometheus, Grafana	Data platform and ML infrastructure
`soccer-api`	FastAPI, RabbitMQ, Celery worker-api, Celery worker-ml, Redis	Inference service and async task infrastructure
`monitoring`	kube-state-metrics, node-exporter	K8s cluster and host-level metrics

Ingress Path

Traffic from the public internet follows this path:

Internet
  → host-level Nginx (port 443, TLS termination, VPS)
    → K8s NodePort 31390
      → Nginx Ingress Controller (namespace: ingress-nginx)
        → FastAPI service (namespace: soccer-api)

Key notes:

- TLS is terminated at the host-level Nginx, which acts as a reverse proxy to the K8s NodePort.
- The Ingress Controller routes requests to services by hostname and path prefix.
- No service in ds or monitoring is publicly exposed; internal-cluster access only.
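
The path-based routing step can be sketched as a small lookup. This is a minimal illustration, not the actual Ingress resource: the three prefixes come from the topology diagram, the backend label is hypothetical, hostname matching is omitted because the API hostname is not stated here, and the matching rule assumes Kubernetes `Prefix` path semantics (whole path segments only).

```python
from typing import Optional

# Path prefixes taken from the topology diagram.
PUBLIC_PREFIXES = ("/predict", "/healthcheck", "/metrics")
# Hypothetical "namespace/service" label used only for this sketch.
BACKEND = "soccer-api/fastapi"


def route(path: str) -> Optional[str]:
    """Return the backend for a request path, or None if nothing is exposed."""
    for prefix in PUBLIC_PREFIXES:
        # Match the prefix itself or a sub-path, but not e.g. "/predictions".
        if path == prefix or path.startswith(prefix + "/"):
            return BACKEND
    return None


for example in ("/predict", "/healthcheck", "/grafana", "/predictions"):
    print(example, "->", route(example))
```

Anything that does not match a listed prefix resolves to `None`, which mirrors the rule that services in ds and monitoring are not reachable from outside the cluster.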

External Services

Service	Host	Role	K8s integration
Selenoid Browser Grid	Dedicated external host	Headless Chrome sessions for WhoScored scraping	Called by `celery-worker-api` over HTTP; not inside K8s cluster
Streamlit Web UI	External VPS (`soccer.dmitryivanov.dev`)	User-facing prediction interface	Calls FastAPI over public HTTPS; no direct cluster access
GitLab CI/CD	GitLab.com SaaS	Build, test, and deploy pipeline	Pushes Helm charts and secrets to healserver via SSH

Helm Chart Structure

All Kubernetes resources are managed via Helm charts in k8s/helm/.

k8s/helm/
  soccer-api/        — FastAPI + Celery + RabbitMQ + Redis
  airflow/           — Airflow deployment (custom values)
  monitoring/        — Prometheus + Grafana + exporters
  ...

Secrets are provided as SOPS-encrypted Helm values files (values-*.enc.yaml). CI decrypts them at deploy time using the age private key from a protected CI variable.
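
The decrypt-then-deploy step might look roughly like the sketch below. It is not the project's CI script: the file name, release name, chart path, and the way the age key reaches the job are assumptions. It relies only on `sops --decrypt --output`, the `SOPS_AGE_KEY` environment variable, and `helm upgrade --install`.

```python
import os
import subprocess
from typing import List


def sops_decrypt_cmd(enc_path: str, out_path: str) -> List[str]:
    """Build the sops command that writes decrypted values to out_path."""
    return ["sops", "--decrypt", "--output", out_path, enc_path]


def helm_upgrade_cmd(release: str, chart_dir: str, namespace: str,
                     values_path: str) -> List[str]:
    """Build the helm command that installs or upgrades one release."""
    return ["helm", "upgrade", "--install", release, chart_dir,
            "--namespace", namespace, "--values", values_path]


def deploy(enc_path: str, release: str, chart_dir: str, namespace: str,
           age_key: str) -> None:
    """Decrypt values, run helm, and always remove the plaintext file."""
    out_path = enc_path.replace(".enc.yaml", ".dec.yaml")
    # sops reads the age private key from SOPS_AGE_KEY.
    env = dict(os.environ, SOPS_AGE_KEY=age_key)
    try:
        subprocess.run(sops_decrypt_cmd(enc_path, out_path), env=env, check=True)
        subprocess.run(helm_upgrade_cmd(release, chart_dir, namespace, out_path),
                       check=True)
    finally:
        if os.path.exists(out_path):
            os.remove(out_path)  # never leave decrypted secrets on the runner


# Hypothetical invocation; none of these names are confirmed by the charts:
# deploy("values-prod.enc.yaml", "soccer-api", "k8s/helm/soccer-api",
#        "soccer-api", os.environ["AGE_PRIVATE_KEY"])
```

Removing the decrypted file in a `finally` block keeps plaintext secrets off the runner even when the Helm step fails.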

Deployment Constraints

These constraints are architectural facts, not temporary limitations:

Constraint	Architectural consequence
Single-node Kubernetes	No pod rescheduling across nodes; node failure is a full-service outage. Designed for portfolio/demo scale.
No High Availability	No replicated control plane; no multi-node worker pool. Accepted tradeoff against infrastructure cost.
Self-hosted VPS	Full operational responsibility: K8s upgrades, disk management, TLS renewal, backup.
External Selenoid host	Browser automation is outside the cluster network boundary; an independent failure domain not covered by K8s health probes.
Single RabbitMQ broker	Message queue is a single point of failure for the inference path. Acceptable at current throughput.

These constraints are documented explicitly because they affect reasoning about failure modes, scaling, and future migration.
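
One way to reason about these single points of failure is the serial-availability rule: when every component on a request path must be up, the path's availability is the product of the component availabilities. The sketch below shows only the arithmetic. The numbers are hypothetical, not measurements of this system, and the rule assumes independent failures, which a single-node cluster only approximates.

```python
from math import prod
from typing import Mapping


def serial_availability(components: Mapping[str, float]) -> float:
    """Availability of a request path in which every component must be up."""
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"availability for {name!r} must be in [0, 1]")
    return prod(components.values())


# Hypothetical availabilities, chosen only to illustrate the calculation.
inference_path = {
    "node": 0.995,
    "rabbitmq": 0.999,
    "selenoid": 0.99,
}
print(f"path availability: {serial_availability(inference_path):.4f}")
```

Because the product can never exceed its smallest factor, the least reliable component bounds the availability of the whole path.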

Known Limitations

Limitation	Impact	Mitigation
Single-node K8s cluster	No HA; node failure = full outage	Manual recovery via runbook; acceptable for portfolio scope
No cluster autoscaling	Cannot scale under load	Workload is light; manual scaling if needed
Selenoid runs outside K8s	Separate ops boundary; no K8s health probes	Monitored externally; scraping failures surface via Airflow
Single RabbitMQ broker	No message queue HA	Acceptable at current throughput; documented as known limit
No automated certificate renewal (if Let's Encrypt is not configured)	TLS certificate expiry	Operator runbook, or Let's Encrypt with certbot or cert-manager
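
Where renewal is manual, an expiry check can act as an early warning. The following is a sketch using only the Python standard library; the host passed to it and any alert threshold are left to the operator and are not part of the existing runbook.

```python
import socket
import ssl


def days_until_expiry(not_after: str, now: float) -> float:
    """Days between `now` (epoch seconds) and a certificate's notAfter string."""
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400.0


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Read the notAfter field of the certificate a server presents."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


# Example use (performs a network call, so it is left commented out):
# import time
# print(days_until_expiry(fetch_not_after("example.com"), time.time()))
```

Note that `fetch_not_after` verifies the certificate during the handshake, so an already expired certificate raises `ssl.SSLCertVerificationError` instead of returning a date.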

Portability Note

The Helm charts are parameterized with no hardcoded values specific to healserver. Migration to a managed Kubernetes cluster (GKE, EKS, AKS) requires:

- Update DNS and TLS entries in chart values.
- Replace MinIO with cloud object storage (update DVC remote config).
- Replace self-managed PostgreSQL with a managed instance if desired.
- Re-encrypt SOPS secrets with an updated age key.

No code changes are required.

Related pages:

- System Boundary: what is inside vs outside the cluster
- Container View: logical container responsibilities
- Security: TLS, namespace isolation, and secret injection
- Failure Modes: deployment-level failure scenarios

Environments & Dependency Strategy

SoccerPredictAI uses a layered dependency approach to maximize reproducibility and eliminate "works on my machine" issues across dev, CI, training, and production.

Dependency Layering Strategy

Layer 1 — System + Python runtime
    conda / mamba (environment.yml)
    → exported to requirements-mamba-base.txt

Layer 2 — Python application dependencies
    PDM groups: api / ml / dev / prod
    → exported per group to requirements-pdm-*.txt

Layer 3 — Final pinned artifacts
    Merged into requirements-*.txt
    → used for deterministic Docker builds

Why this design?

- conda handles system-level and compiled library dependencies reliably.
- PDM provides modern dependency resolution and group-based separation (api/ml/dev).
- Exporting to pinned requirements-*.txt ensures Docker images are reproducible and auditable without requiring conda in the container build chain.
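
The Layer 3 merge can be pictured as combining two sets of `name==version` pins. The sketch below is an assumption about how such a merge could work, not the project's actual script. In particular, letting the PDM group pin win over the conda base pin on conflict is a policy chosen here for illustration, and the versions in the usage lines are made up.

```python
from typing import Dict, Iterable, List


def parse_pins(lines: Iterable[str]) -> Dict[str, str]:
    """Map normalized package name -> version from `name==version` lines."""
    pins: Dict[str, str] = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blanks, pip options, and unpinned entries
        name, version = line.split("==", 1)
        # Drop environment markers ("; python_version ...") and trailing tokens.
        spec = version.split(";", 1)[0].split()
        if spec:
            pins[name.strip().lower().replace("_", "-")] = spec[0]
    return pins


def merge_pins(base: Iterable[str], group: Iterable[str]) -> List[str]:
    """Merge base and group pins; group pins override base (assumed policy)."""
    merged = parse_pins(base)
    merged.update(parse_pins(group))
    return [f"{name}=={version}" for name, version in sorted(merged.items())]


base = ["numpy==2.1.0", "# exported from the conda env", "pip==24.0"]
api = ["fastapi==0.115.0", "NumPy==2.1.3"]
print("\n".join(merge_pins(base, api)))
```

Sorting the output keeps the merged file stable across runs, which makes diffs between rebuilds easier to review.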

Environment Matrix

Environment	Purpose	Dependency anchor	Python version	Activation
Local development	Code authoring, debugging, test runs	`conda env` + `pdm install --dev`	3.13 (from `environment.yml`)	`conda activate soccer`
CI (GitLab)	Lint, test, build, deploy	`pdm install` from `pdm.lock`	3.13 (pinned in CI image)	CI runner environment
Offline ML training	`dvc repro`, experiment runs	`requirements-ml.txt` (pinned)	3.13	conda or Docker container
Deployed runtime (API)	FastAPI + Celery workers serving predictions	`requirements-prod.txt` (pinned)	3.13	K8s pod from Docker image
Docs / reporting	MkDocs build, Quarto reports	`requirements-dev.txt` subset	3.13	Local dev env

Reproducibility Anchors

Every deployed model and dataset is traceable to four anchors:

Anchor	What it pins
`git commit`	Code version
`pdm.lock`	All Python dependency versions
DVC content hash	Exact dataset version used for training
MLflow run ID	All training parameters, metrics, and artifacts

A deployment is fully reproducible when all four anchors are recorded. Deployment manifests in k8s/ reference the Docker image tag, which maps to a specific git commit and pdm.lock.
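
As an illustration, the four anchors could be captured in a small record written alongside each release. The field names and the choice to store a SHA-256 digest of pdm.lock are assumptions made for this sketch; the project may record the anchors differently.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReproAnchors:
    """The four reproducibility anchors for one deployment (hypothetical schema)."""
    git_commit: str
    pdm_lock_sha256: str
    dvc_hash: str
    mlflow_run_id: str


def lock_digest(lock_bytes: bytes) -> str:
    """SHA-256 digest of the raw pdm.lock contents."""
    return hashlib.sha256(lock_bytes).hexdigest()


def to_manifest(anchors: ReproAnchors) -> str:
    """Serialize the anchors as stable, human-readable JSON."""
    return json.dumps(asdict(anchors), indent=2, sort_keys=True)


# Placeholder values; real ones would come from git, DVC, and MLflow.
anchors = ReproAnchors(
    git_commit="<commit sha>",
    pdm_lock_sha256=lock_digest(b"example lock file contents"),
    dvc_hash="<dvc content hash>",
    mlflow_run_id="<run id>",
)
print(to_manifest(anchors))
```

Storing the lock file digest rather than the file itself keeps the manifest small while still detecting any change to the pinned dependencies.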

Dependency Groups (PDM)

Group	Contents	Used by
`api`	FastAPI, Pydantic, Celery, Redis client	API Docker image
`ml`	scikit-learn, XGBoost, Optuna, MLflow, DVC	Training pipeline Docker image / local
`dev`	pytest, hypothesis, ruff, pre-commit, mypy	CI + local development
`prod`	Combined api + ml for production deployment	Production Docker image

How to Rebuild Pinned Requirements

make requirements

This regenerates:

- PDM exports per group (requirements-pdm-*.txt)
- Base pip freeze from conda env (requirements-mamba-base.txt)
- Merged final requirements-*.txt for Docker builds

Run this whenever pdm.lock or environment.yml changes and before building new Docker images.
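
A staleness check along the following lines could flag when the pinned files lag behind their sources. It is a sketch, not part of the Makefile, and it compares file modification times, which are not preserved by a fresh git checkout; in CI a content hash would be a more dependable signal.

```python
from pathlib import Path
from typing import Iterable


def needs_rebuild(sources: Iterable[Path], outputs: Iterable[Path]) -> bool:
    """True if any output is missing or older than the newest source file."""
    outputs = list(outputs)
    if not outputs or any(not p.exists() for p in outputs):
        return True
    # Raises FileNotFoundError if a source such as pdm.lock is missing.
    newest_source = max(p.stat().st_mtime for p in sources)
    oldest_output = min(p.stat().st_mtime for p in outputs)
    return newest_source > oldest_output


# Example use from the repository root (output file names are illustrative):
# stale = needs_rebuild(
#     [Path("pdm.lock"), Path("environment.yml")],
#     [Path("requirements-prod.txt"), Path("requirements-ml.txt")],
# )
```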

Operational Note

The system treats pdm.lock and DVC content hashes as the primary reproducibility anchors. All production deployments should be traceable to:

- git commit
- dataset version (DVC hash)
- model version (MLflow run ID + registered version)
- dependency lock (pdm.lock)

No deployment should be performed from an environment where any of these anchors is unresolved.
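
This rule lends itself to an automated pre-deploy gate. The sketch below assumes the anchors arrive as a flat mapping with hypothetical key names; where those values actually come from (CI variables, a manifest file, MLflow) is outside its scope.

```python
from typing import List, Mapping, Optional

# Hypothetical key names covering the anchors listed above.
REQUIRED_ANCHORS = (
    "git_commit",
    "dvc_hash",
    "mlflow_run_id",
    "mlflow_model_version",
    "pdm_lock_sha256",
)


def unresolved_anchors(record: Mapping[str, Optional[str]]) -> List[str]:
    """Names of required anchors that are missing, None, or blank."""
    return [key for key in REQUIRED_ANCHORS
            if not (record.get(key) or "").strip()]


def assert_deployable(record: Mapping[str, Optional[str]]) -> None:
    """Raise if any anchor is unresolved; otherwise return silently."""
    missing = unresolved_anchors(record)
    if missing:
        raise RuntimeError("unresolved anchors: " + ", ".join(missing))
```

Running `assert_deployable` as the first step of a deploy job would make the pipeline fail before any Helm command is executed.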