Deployment View¶
This page describes where the system runs, how components are physically distributed, and how traffic flows from the internet to individual services.
Physical Topology¶
flowchart TB
subgraph Internet[Public Internet]
User[End User]
end
subgraph ExtVPS[External VPS — soccer.dmitryivanov.dev]
StreamlitUI[Streamlit Web UI]
end
subgraph ExtSelenoid[External Host — Selenoid]
SelenoidGrid[Selenoid Browser Grid]
end
subgraph HealServer[VPS — healserver — single node]
HostNginx[Host-level Nginx\nTLS termination — port 443]
subgraph K8s[Kubernetes — single-node cluster]
subgraph NS_Ingress[namespace: ingress-nginx]
Ingress[Nginx Ingress Controller\nNodePort 31390]
end
subgraph NS_DS[namespace: ds]
Airflow[Airflow\nScheduler + Workers]
PG[PostgreSQL]
MinIO[MinIO S3]
MLflow[MLflow\nTracking + Registry]
Prom[Prometheus]
Graf[Grafana\n📋 dashboards planned]
end
subgraph NS_Soccer[namespace: soccer-api]
API[FastAPI\nInference Service]
MQ[RabbitMQ]
WorkerAPI[Celery worker-api]
WorkerML[Celery worker-ml]
Redis[Redis Cache]
end
subgraph NS_Mon[namespace: monitoring]
KSM[kube-state-metrics]
NE[node-exporter]
end
end
end
User -->|HTTPS| ExtVPS
ExtVPS -->|HTTPS /predict| HostNginx
User -->|HTTPS /predict direct| HostNginx
HostNginx -->|NodePort 31390| Ingress
Ingress -->|/predict, /healthcheck, /metrics| API
API -->|enqueue task| MQ
MQ --> WorkerAPI
MQ --> WorkerML
WorkerAPI -->|browser session| SelenoidGrid
WorkerAPI --> PG
WorkerAPI --> Redis
WorkerML --> Redis
PG --> MinIO
MinIO -.->|dvc pull| MLpipeline[Offline ML Pipeline\nCI / local]
MLpipeline --> MLflow
API -->|model_uri| MLflow
KSM --> Prom
NE --> Prom
API --> Prom
WorkerAPI --> Prom
WorkerML --> Prom
Prom --> Graf
Namespace Layout¶
| Namespace | Services | Purpose |
|---|---|---|
| ingress-nginx | Nginx Ingress Controller | Routes inbound traffic to cluster services by hostname/path |
| ds | Airflow, PostgreSQL, MinIO, MLflow, Prometheus, Grafana | Data platform and ML infrastructure |
| soccer-api | FastAPI, RabbitMQ, Celery worker-api, Celery worker-ml, Redis | Inference service and async task infrastructure |
| monitoring | kube-state-metrics, node-exporter | K8s cluster and host-level metrics |
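Services in different namespaces (for example FastAPI in soccer-api calling MLflow in ds) normally reach each other through cluster-internal DNS. The sketch below shows how such names are formed. It assumes the default cluster domain `cluster.local`; the Service name `mlflow` and port `5000` are hypothetical, since the actual chart values are not shown on this page.

```python
# Sketch: build cluster-internal addresses for cross-namespace calls.
# The namespace comes from the table above; Service names and ports are hypothetical.

def service_dns(service: str, namespace: str, cluster_domain: str = "cluster.local") -> str:
    """Return the fully qualified in-cluster DNS name of a Kubernetes Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def service_url(service: str, namespace: str, port: int, scheme: str = "http") -> str:
    """Return a URL that is resolvable only from inside the cluster."""
    return f"{scheme}://{service_dns(service, namespace)}:{port}"


if __name__ == "__main__":
    # Hypothetical example: the API pod addressing MLflow in the ds namespace.
    print(service_url("mlflow", "ds", 5000))
```

Names of this form resolve only inside the cluster, which matches the statement that the ds and monitoring namespaces are not publicly exposed.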
Ingress Path¶
Traffic from the public internet follows this path:
Internet
→ host-level Nginx (port 443, TLS termination, VPS)
→ K8s NodePort 31390
→ Nginx Ingress Controller (namespace: ingress-nginx)
→ FastAPI service (namespace: soccer-api)
Key notes:
- TLS is terminated at the host-level Nginx, which acts as a reverse proxy to the K8s NodePort.
- The Ingress Controller routes requests to services by hostname and path prefix.
- No service in ds or monitoring is publicly exposed; both namespaces are reachable only from inside the cluster.
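The hostname and path-prefix routing described in the notes above can be modelled in a few lines. This is an illustrative sketch, not the controller's implementation: the hostname `api.example.com` and the backend label `fastapi` are hypothetical, while the three paths are the ones shown in the topology diagram. It mimics Kubernetes `Prefix` matching, which compares whole path segments.

```python
# Sketch: host + longest-prefix routing, similar in spirit to Ingress "Prefix" rules.
from typing import Optional

RULES = {
    "api.example.com": {  # hypothetical public hostname
        "/predict": "fastapi",
        "/healthcheck": "fastapi",
        "/metrics": "fastapi",
    },
}


def _matches(prefix: str, path: str) -> bool:
    # Match on whole segments: /predict matches /predict/x but not /predictions.
    return path == prefix or path.startswith(prefix.rstrip("/") + "/")


def route(host: str, path: str) -> Optional[str]:
    """Return the backend for a request, or None when no rule matches."""
    rules = RULES.get(host, {})
    candidates = [prefix for prefix in rules if _matches(prefix, path)]
    if not candidates:
        return None
    # The most specific (longest) matching prefix wins.
    return rules[max(candidates, key=len)]
```

A request for an unknown host or an unlisted path returns `None`, which corresponds to the controller's default backend (typically a 404).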
External Services¶
| Service | Host | Role | K8s integration |
|---|---|---|---|
| Selenoid Browser Grid | Dedicated external host | Headless Chrome sessions for WhoScored scraping | Called by Celery worker-api over HTTP; not inside the K8s cluster |
| Streamlit Web UI | External VPS (soccer.dmitryivanov.dev) | User-facing prediction interface | Calls FastAPI over public HTTPS; no direct cluster access |
| GitLab CI/CD | GitLab.com SaaS | Build, test, and deploy pipeline | Pushes Helm charts and secrets to healserver via SSH |
Helm Chart Structure¶
All Kubernetes resources are managed via Helm charts in k8s/helm/.
k8s/helm/
soccer-api/ — FastAPI + Celery + RabbitMQ + Redis
airflow/ — Airflow deployment (custom values)
monitoring/ — Prometheus + Grafana + exporters
...
Secrets are provided as SOPS-encrypted Helm values files (values-*.enc.yaml).
CI decrypts them at deploy time using the age private key from a protected CI variable.
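A deploy step along these lines could be sketched as follows. This is an assumption-laden illustration, not the project's CI script: the release name, the `.dec.yaml` naming, and the use of plain `sops` rather than a Helm plugin are all hypothetical. It relies on `sops` reading the age private key from the `SOPS_AGE_KEY` (or `SOPS_AGE_KEY_FILE`) environment variable, which the CI job is assumed to populate from its protected variable.

```python
# Sketch: decrypt SOPS-encrypted values, run helm, then remove the plaintext file.
import subprocess
from pathlib import Path


def decrypt_cmd(enc_file: str) -> list[str]:
    # `sops --decrypt <file>` writes the decrypted document to stdout.
    return ["sops", "--decrypt", enc_file]


def decrypted_name(enc_file: str) -> str:
    # values-prod.enc.yaml -> values-prod.dec.yaml (hypothetical naming convention)
    return enc_file.replace(".enc.yaml", ".dec.yaml")


def helm_cmd(release: str, chart_dir: str, namespace: str, values_file: str) -> list[str]:
    return ["helm", "upgrade", "--install", release, chart_dir,
            "--namespace", namespace, "--values", values_file]


def deploy(release: str, chart_dir: str, namespace: str, enc_file: str) -> None:
    """Decrypt, deploy, and always delete the plaintext values afterwards."""
    plaintext = subprocess.run(
        decrypt_cmd(enc_file), check=True, capture_output=True, text=True
    ).stdout
    out = Path(decrypted_name(enc_file))
    out.write_text(plaintext)
    try:
        subprocess.run(helm_cmd(release, chart_dir, namespace, str(out)), check=True)
    finally:
        out.unlink(missing_ok=True)  # never leave decrypted secrets on the runner
```

Building the commands as lists keeps them free of shell quoting issues and makes the step easy to unit-test without touching a cluster.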
Deployment Constraints¶
These constraints are architectural facts, not temporary limitations:
| Constraint | Architectural consequence |
|---|---|
| Single-node Kubernetes | No pod rescheduling across nodes; node failure is a full-service outage. Designed for portfolio/demo scale. |
| No High Availability | No replicated control plane; no multi-node worker pool. Accepted tradeoff against infrastructure cost. |
| Self-hosted VPS | Full operational responsibility: K8s upgrades, disk management, TLS renewal, backup. |
| External Selenoid host | Browser automation is outside the cluster network boundary; an independent failure domain not covered by K8s health probes. |
| Single RabbitMQ broker | Message queue is a single point of failure for the inference path. Acceptable at current throughput. |
These constraints are documented explicitly because they affect reasoning about failure modes, scaling, and future migration.
Known Limitations¶
| Limitation | Impact | Mitigation |
|---|---|---|
| Single-node K8s cluster | No HA; node failure = full outage | Manual recovery via runbook; acceptable for portfolio scope |
| No cluster autoscaling | Cannot scale under load | Workload is light; manual scaling if needed |
| Selenoid runs outside K8s | Separate ops boundary; no K8s health probes | Monitored externally; scraping failures surface via Airflow |
| Single RabbitMQ broker | No message queue HA | Acceptable at current throughput; documented as known limit |
| No automated certificate renewal unless Let's Encrypt is configured | Risk of TLS certificate expiry | Manual renewal via operator runbook, or automated renewal with Let's Encrypt (certbot or cert-manager) |
Portability Note¶
The Helm charts are parameterized and contain no values hardcoded for healserver.
Migration to a managed Kubernetes cluster (GKE, EKS, AKS) requires the following steps:
- Update DNS and TLS entries in chart values.
- Replace MinIO with cloud object storage (update DVC remote config).
- Replace self-managed PostgreSQL with a managed instance if desired.
- Re-encrypt SOPS secrets with an updated age key.
No code changes are required.
Related¶
- System Boundary — what is inside vs outside the cluster
- Container View — logical container responsibilities
- Security — TLS, namespace isolation, and secret injection
- Failure Modes — deployment-level failure scenarios
Environments & Dependency Strategy¶
SoccerPredictAI uses a layered dependency approach to maximize reproducibility and eliminate "works on my machine" issues across dev, CI, training, and production.
Dependency Layering Strategy¶
Layer 1 — System + Python runtime
conda / mamba (environment.yml)
→ exported to requirements-mamba-base.txt
Layer 2 — Python application dependencies
PDM groups: api / ml / dev / prod
→ exported per group to requirements-pdm-*.txt
Layer 3 — Final pinned artifacts
Merged into requirements-*.txt
→ used for deterministic Docker builds
Why this design?
- conda handles system-level and compiled library dependencies reliably.
- PDM provides modern dependency resolution and group-based separation (api/ml/dev/prod).
- Exporting to pinned requirements-*.txt ensures Docker images are reproducible and auditable without requiring conda in the container build chain.
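The Layer 3 merge can be pictured as combining two lists of pinned requirements into one. The sketch below is one possible approach, not the project's actual merge script. It assumes that when both layers pin the same package the PDM pin wins; that precedence rule, and the version numbers in the usage example, are illustrative assumptions.

```python
# Sketch: merge the conda-derived base pins with a PDM group export.
# Assumption: on conflict the PDM pin overrides the base pin.
import re


def _name(line: str) -> str:
    # Normalize the package name (case-insensitive; '-', '_' and '.' are equivalent).
    raw = re.split(r"[=<>!~\[;@ ]", line.strip(), maxsplit=1)[0]
    return re.sub(r"[-_.]+", "-", raw).lower()


def merge_requirements(base: list[str], pdm: list[str]) -> list[str]:
    """Return one sorted list of requirement lines with a single pin per package."""
    merged: dict[str, str] = {}
    for line in [*base, *pdm]:  # later entries override earlier ones
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        merged[_name(line)] = line
    return [merged[key] for key in sorted(merged)]


if __name__ == "__main__":
    base = ["numpy==1.26.0", "Scikit_Learn==1.4.0"]       # illustrative pins
    pdm = ["scikit-learn==1.5.0", "fastapi==0.110.0"]      # illustrative pins
    print("\n".join(merge_requirements(base, pdm)))
```

Sorting the output keeps the merged file stable between runs, so diffs in version control show only real pin changes.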
Environment Matrix¶
| Environment | Purpose | Dependency anchor | Python version | Activation |
|---|---|---|---|---|
| Local development | Code authoring, debugging, test runs | conda env + pdm install --dev | 3.13 (from environment.yml) | conda activate soccer |
| CI (GitLab) | Lint, test, build, deploy | pdm install from pdm.lock | 3.13 (pinned in CI image) | CI runner environment |
| Offline ML training | dvc repro, experiment runs | requirements-ml.txt (pinned) | 3.13 | conda or Docker container |
| Deployed runtime (API) | FastAPI + Celery workers serving predictions | requirements-prod.txt (pinned) | 3.13 | K8s pod from Docker image |
| Docs / reporting | MkDocs build, Quarto reports | requirements-dev.txt subset | 3.13 | Local dev env |
Reproducibility Anchors¶
Every deployed model and dataset is traceable to four anchors:
| Anchor | What it pins |
|---|---|
| git commit | Code version |
| pdm.lock | All Python dependency versions |
| DVC content hash | Exact dataset version used for training |
| MLflow run ID | All training parameters, metrics, and artifacts |
A deployment is fully reproducible when all four anchors are recorded.
Deployment manifests in k8s/ reference the Docker image tag, which maps to a specific git commit and pdm.lock.
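The "all four anchors recorded" rule can be expressed as a small record type. This is a hypothetical sketch: the class and field names are invented for illustration, and recording `pdm.lock` as a SHA-256 digest is an assumption rather than the project's stated practice.

```python
# Sketch: a record of the four reproducibility anchors with a completeness check.
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class ReproAnchors:
    git_commit: Optional[str] = None       # code version
    pdm_lock_sha256: Optional[str] = None  # dependency lock fingerprint (assumed form)
    dvc_hash: Optional[str] = None         # dataset content hash
    mlflow_run_id: Optional[str] = None    # training run

    def missing(self) -> list[str]:
        """Names of anchors that are unset or empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def is_reproducible(self) -> bool:
        return not self.missing()
```

A deploy script could refuse to continue when `missing()` is non-empty and print the returned names so the operator knows which anchor to resolve.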
Dependency Groups (PDM)¶
| Group | Contents | Used by |
|---|---|---|
| api | FastAPI, Pydantic, Celery, Redis client | API Docker image |
| ml | scikit-learn, XGBoost, Optuna, MLflow, DVC | Training pipeline Docker image / local |
| dev | pytest, hypothesis, ruff, pre-commit, mypy | CI + local development |
| prod | Combined api + ml for production deployment | Production Docker image |
How to Rebuild Pinned Requirements¶
The rebuild regenerates:
- PDM exports per group (requirements-pdm-*.txt)
- Base pip freeze from conda env (requirements-mamba-base.txt)
- Merged final requirements-*.txt for Docker builds
Run this whenever pdm.lock or environment.yml changes and before building new Docker images.
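The export commands behind the first two outputs might look like the following. This is a sketch under stated assumptions, not the project's rebuild script: it assumes the groups are exportable with `pdm export --group`, and that the conda environment is named `soccer` as in the environment matrix. The command lists are built without being executed, so they can be inspected or tested first.

```python
# Sketch: assemble the export commands for a rebuild of the pinned requirements.

def pdm_export_cmd(group: str) -> list[str]:
    # Export one PDM group from pdm.lock to requirements-pdm-<group>.txt.
    return ["pdm", "export", "--format", "requirements",
            "--group", group, "--output", f"requirements-pdm-{group}.txt"]


def conda_freeze_cmd(env_name: str = "soccer") -> list[str]:
    # Run `pip freeze` inside the conda env; the caller is expected to redirect
    # stdout to requirements-mamba-base.txt.
    return ["conda", "run", "-n", env_name, "python", "-m", "pip", "freeze"]


def rebuild_plan(groups: tuple[str, ...] = ("api", "ml", "dev", "prod")) -> list[list[str]]:
    """Return every command in execution order: base freeze first, then each group."""
    return [conda_freeze_cmd(), *(pdm_export_cmd(g) for g in groups)]


if __name__ == "__main__":
    for cmd in rebuild_plan():
        print(" ".join(cmd))
```

Depending on how the groups are declared in pyproject.toml, development groups may need PDM's dev-group flags instead; treat the flags here as a starting point to verify against the installed PDM version.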
Operational Note¶
The system treats pdm.lock and DVC content hashes as the primary reproducibility anchors.
All production deployments should be traceable to:
- git commit
- dataset version (DVC hash)
- model version (MLflow run ID + registered version)
- dependency lock (pdm.lock)
No deployment should be performed from an environment where any of these anchors is unresolved.