System Context (C4 — Level 1)

This diagram defines the system boundary, external actors, and integrations. For the physical deployment topology, see Deployment View.


Context Diagram

flowchart TB
    subgraph External[External — outside system boundary]
        User[End User / Viewer]
        Source[WhoScored.com\nexternal data provider]
        Selenoid["Selenoid Server\nexternal host — browser automation"]
        StreamlitUI["Time2Bet Web UI\nStreamlit — external VPS"]
        CI["GitLab CI/CD\nbuild · test · deploy"]
    end

    subgraph Offline[Offline execution context]
        DVC[DVC Pipeline\ntraining · validation · registration]
    end

    subgraph System[SoccerPredictAI — K8s cluster — healserver]
        API[FastAPI Inference API]
        Airflow[Airflow ETL Scheduler]
        Workers[Celery Workers]
        DB[PostgreSQL]
        S3[MinIO S3]
        MLflow[MLflow Registry]
        Prom[Prometheus]
    end

    User -->|views predictions| StreamlitUI
    StreamlitUI -->|HTTPS /predict| API

    Airflow -->|HTTP trigger scraping| API
    API -->|enqueue Celery task| Workers
    Workers -->|browser session| Selenoid
    Source -->|scraped via browser| Selenoid
    Workers -->|normalized data| DB
    DB -->|raw parquet export| S3

    S3 -->|versioned data| DVC
    DVC -->|model artifacts + metrics| MLflow
    API -->|load model_uri| MLflow

    API --> Prom
    Workers --> Prom

    CI -->|Helm deploy| System

Actors and Roles

End User / Viewer

Consumes match outcome predictions via the Streamlit web interface hosted on an independent external VPS (soccer.dmitryivanov.dev). Has no direct access to internal cluster services.

System Operator

Deploys, monitors, and maintains the system. Has SSH access to healserver, access to GitLab CI protected variables, and the age private key. The only human actor with direct cluster access.

WhoScored.com

Third-party source of football match statistics. Treated as an untrusted external input — all data is validated via Great Expectations before use. Subject to layout changes, rate limiting, and availability issues outside operator control.
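
The "validate before use" rule for WhoScored data can be illustrated with a minimal sketch. It uses plain Python checks as a stand-in for the Great Expectations suite the system relies on; the field names (`home_team`, `away_team`, `home_goals`, `away_goals`) and the `validate_match_row` helper are hypothetical and are not taken from the project's schema.

```python
from typing import Any

# Hypothetical required fields; the real schema lives in the project's expectation suite.
REQUIRED_FIELDS = ("home_team", "away_team", "home_goals", "away_goals")


def validate_match_row(row: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the row passed."""
    problems = []
    for field in REQUIRED_FIELDS:
        if row.get(field) in (None, ""):
            problems.append(f"missing field: {field}")
    for field in ("home_goals", "away_goals"):
        value = row.get(field)
        if value in (None, ""):
            continue  # already reported as missing above
        # bool is a subclass of int, so it is excluded explicitly
        if not isinstance(value, int) or isinstance(value, bool) or value < 0:
            problems.append(f"invalid goal count in {field}: {value!r}")
    if row.get("home_team") and row.get("home_team") == row.get("away_team"):
        problems.append("home_team and away_team are identical")
    return problems
```

Rows that return a non-empty list would be rejected or quarantined before they reach PostgreSQL; which of the two the project does is not stated here.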

Selenoid Server

Dedicated external host running a Selenoid browser grid. Invoked by celery-worker-api to perform headless browser scraping against WhoScored. Operator-managed, but runs outside the Kubernetes cluster — a separate operational boundary.

Time2Bet Web UI (Streamlit)

User-facing prediction frontend hosted on an external VPS. Calls the inference API over public HTTPS. Outside the system boundary; dependent on API availability.

DVC Pipeline (Offline Execution Context)

The ML training pipeline. Runs outside the K8s cluster — locally or in CI — against MinIO (data) and MLflow (model artifacts). Not a runtime component; produces the versioned model artifacts that the serving layer consumes.

GitLab CI/CD

Manages build, test, and deployment pipelines. Part of the delivery boundary: pushes Helm deployments to the cluster and handles secret decryption during the deploy phase. Does not participate in the runtime execution path.


System Responsibilities

SoccerPredictAI is responsible for:

  • scraping and ingesting match data from WhoScored via Selenoid,
  • normalizing and storing structured data in PostgreSQL,
  • exporting versioned datasets to MinIO via DVC,
  • training match outcome prediction models reproducibly,
  • tracking experiments and managing model lifecycle in MLflow,
  • serving predictions synchronously and asynchronously via FastAPI + Celery,
  • exposing service health and Prometheus metrics for observability.

Non-Goals

  • The system does not guarantee betting profitability.
  • It is not a general sports analytics platform.
  • It does not support multiple data providers or sports.
  • It does not provide user authentication or multi-tenant access control.


System Boundary

This section defines what is inside the SoccerPredictAI system, what is outside it, and how the two interact. Understanding the boundary is essential for reasoning about ownership, trust, and failure modes.

What Is Inside the System

The runtime system boundary includes all services responsible for prediction serving, data ingestion, model lifecycle, and observability. The offline ML pipeline is part of the system when it produces artifacts consumed at runtime (models in MLflow, data in DVC/MinIO).

Runtime services (Kubernetes — healserver)

| Component | Namespace | Responsibility |
|---|---|---|
| Nginx Ingress Controller | ingress-nginx | Routes inbound HTTPS traffic to internal services |
| Airflow Scheduler + Workers | ds | Schedules ETL and scraping triggers |
| PostgreSQL | ds | Authoritative store for normalized scraped data |
| MinIO (S3-compatible) | ds | DVC remote: raw parquet exports, ML artifacts |
| MLflow Tracking + Registry | ds | Experiment records, model versions, promotion lifecycle |
| Prometheus | ds | Metrics collection |
| Grafana | ds | Dashboards (📋 Planned: dashboards defined) |
| kube-state-metrics | monitoring | K8s cluster metrics |
| node-exporter | monitoring | Host-level metrics |
| FastAPI Inference Service | soccer-api | REST API, sync + async predictions, health + metrics endpoints |
| RabbitMQ | soccer-api | Message broker for Celery task queues |
| Celery worker-api | soccer-api | Short tasks: scraping trigger, cache operations, request pre-processing |
| Celery worker-ml | soccer-api | Heavy tasks: feature assembly at inference, batch scoring |
| Redis | soccer-api | Prediction and feature vector cache (caching optimization layer) |
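
The split between `worker-api` and `worker-ml` is the kind of separation that Celery's `task_routes` setting expresses. The sketch below is illustrative only: the queue names (`api`, `ml`) and the task name patterns are hypothetical, and `resolve_queue` merely mimics glob-style route matching with the standard library so that the mapping can be checked without a broker.

```python
from fnmatch import fnmatch

# Hypothetical route table in the shape Celery's `task_routes` setting accepts:
# glob pattern on the task name -> routing options.
TASK_ROUTES = {
    "tasks.scraping.*": {"queue": "api"},   # short tasks -> worker-api
    "tasks.cache.*": {"queue": "api"},
    "tasks.features.*": {"queue": "ml"},    # heavy tasks -> worker-ml
    "tasks.scoring.*": {"queue": "ml"},
}
DEFAULT_QUEUE = "api"


def resolve_queue(task_name: str) -> str:
    """Return the queue for a task name, using the first matching glob pattern."""
    for pattern, options in TASK_ROUTES.items():
        if fnmatch(task_name, pattern):
            return options["queue"]
    return DEFAULT_QUEUE
```

With Celery, a table like this would be assigned to `app.conf.task_routes`, and each worker deployment would be started with `-Q api` or `-Q ml` so that heavy scoring jobs cannot starve the short-lived tasks.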

Offline execution context

| Component | Boundary | Responsibility |
|---|---|---|
| DVC pipeline | Local / CI execution | Reproducible ML pipeline: preprocessing through model registration |

Offline Pipeline Boundary

The DVC pipeline occupies a deliberate position in the system boundary: it executes outside the runtime cluster (locally or in CI), but it is part of the system as the authoritative producer of all ML artifacts consumed at runtime.

This is not an omission — it is an explicit architectural decision.

Why DVC is outside the runtime boundary:

  • The pipeline is artifact-driven and reproducible, not service-based. It does not run continuously.
  • Executing training inside Kubernetes would add operational complexity (GPU scheduling, ephemeral storage, long-running job management) without benefit at the current scale.
  • CI execution provides a clean, reproducible environment without cluster-side state entanglement.

Why DVC is still part of the system:

  • Every model in the runtime registry was produced by a tracked, versioned DVC run.
  • Every dataset consumed by training is content-addressed and reproducible via dvc checkout.
  • The DVC pipeline is the explicit handoff point from data to models: it reads from MinIO and writes registered artifacts into MLflow, which the runtime cluster reads.
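
Content addressing means that a dataset version is identified by a hash of its bytes rather than by its file name. The following is a minimal sketch of the idea, assuming MD5 (DVC's default file hash); the two-character directory split is illustrative and should not be read as the exact cache layout of any particular DVC version.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Return an illustrative cache key: two hex chars as a directory, the rest as the file name."""
    digest = hashlib.md5(data).hexdigest()
    return f"{digest[:2]}/{digest[2:]}"
```

Identical bytes always map to the same key, and any change to the data yields a different key, which is what makes a checked-out dataset reproducible.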

Architectural consequence:

The boundary crossing happens at the MLflow Registry: DVC pushes a model artifact and assigns a champion alias; the serving layer loads it. This is the only coupling point between the offline pipeline and the runtime system. They share no runtime infrastructure, only contracts (model signature, feature schema, MLflow alias convention).

[DVC pipeline — local/CI]
        │  writes model artifact + champion alias
[MLflow Registry — runtime cluster]
        │  model_uri resolved by champion alias
[FastAPI + Celery workers — runtime serving]

Limitation: there is no automated handoff — model promotion is a manual operation today. See Known Architectural Limitations and Roadmap.
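
The alias-based handoff can be sketched as follows. The `models:/<name>@<alias>` URI form and the `set_registered_model_alias` client method belong to MLflow's model registry API in recent 2.x releases; the function names are hypothetical, and the project's actual promotion and loading code may differ.

```python
def champion_uri(model_name: str, alias: str = "champion") -> str:
    """Build a registry URI that resolves to whichever version currently holds the alias."""
    return f"models:/{model_name}@{alias}"


def promote(model_name: str, version: str, alias: str = "champion") -> None:
    """Offline side: point the alias at a registered version (the promotion step)."""
    from mlflow import MlflowClient  # lazy import; needs mlflow and a configured tracking URI

    MlflowClient().set_registered_model_alias(model_name, alias, version)


def load_champion(model_name: str):
    """Serving side: load the model version the alias currently points to."""
    import mlflow.pyfunc  # lazy import; needs mlflow and a configured tracking URI

    return mlflow.pyfunc.load_model(champion_uri(model_name))
```

Because the serving layer refers only to the alias, promoting a new version requires no change to the API's configuration; the next model load picks up the version the alias points to.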

What Is Outside the System

External runtime dependencies

| External component | Owner | Role | Trust level |
|---|---|---|---|
| WhoScored.com | Third party | Source of football match statistics | Untrusted; validated after ingestion |
| Selenoid Server | Operator (external host) | Headless browser grid for scraping; called by celery-worker-api | Trusted operator; separate ops boundary |
| Streamlit Web UI (time2bet.ru) | Operator (external VPS) | User-facing prediction frontend | Trusted; calls the inference API over HTTPS |
| Host-level Nginx (VPS) | Operator | Reverse proxy in front of K8s NodePort; handles TLS termination | Trusted operator |

Delivery and tooling boundary

| External component | Owner | Role | Trust level |
|---|---|---|---|
| GitLab CI/CD | SaaS (GitLab.com) | Build, test, and Helm deployment pipeline | Trusted for delivery; accesses encrypted secrets via protected variables |

GitLab CI/CD is outside the runtime system boundary: it does not participate in normal system operation. It crosses the boundary only during deployment events — at which point it decrypts SOPS-encrypted secrets and pushes Helm releases to the cluster.

External Dependency Trust Model

flowchart TB
    subgraph PublicInternet[Public Internet — Untrusted]
        WhoScored[WhoScored.com]
        User[End User]
    end

    subgraph ExternalOps[Operator-controlled — Trusted]
        Selenoid[Selenoid Server\nexternal host]
        StreamlitUI[Streamlit Web UI\next. VPS — time2bet.ru]
        HostNginx[Host-level Nginx\nVPS reverse proxy]
    end

    subgraph CI[GitLab CI/CD — Trusted]
        CICD[GitLab CI Runner\nbuild / test / deploy]
    end

    subgraph K8sCluster[Kubernetes Cluster — healserver — Internal]
        Ingress[Ingress Controller]
        API[FastAPI]
        Workers[Celery Workers]
        DB[PostgreSQL]
        S3[MinIO]
        MLflow[MLflow]
        Queue[RabbitMQ]
        Cache[Redis]
        Prom[Prometheus]
    end

    subgraph Offline[Offline / CI — ML Pipeline]
        DVC[DVC Pipeline]
    end

    WhoScored -->|scraped via browser| Selenoid
    Selenoid -->|raw scraped data| Workers
    User -->|HTTPS| StreamlitUI
    StreamlitUI -->|HTTPS /predict| HostNginx
    HostNginx -->|NodePort 31390| Ingress
    Ingress --> API
    API --> Workers
    Workers --> DB
    DB --> S3
    S3 --> DVC
    DVC --> MLflow
    API -->|model_uri| MLflow
    CICD -->|Helm deploy| K8sCluster

Trust Boundaries

Public Internet — all requests from the public internet are untrusted by default.

  • WhoScored.com data is treated as untrusted input; Great Expectations validates it before use.
  • User requests to the API pass through Nginx TLS termination and Pydantic schema validation.

K8s Cluster Internal — services within the same namespace communicate freely via cluster DNS. Cross-namespace communication is restricted via Kubernetes NetworkPolicy (where defined). No service inside the cluster exposes a plaintext secret to application code — all secrets are injected via Kubernetes Secrets from SOPS-decrypted manifests.

External Scraping Host (Selenoid) — operator-controlled but outside the K8s network boundary. Traffic from celery-worker-api to Selenoid crosses the network boundary. This is an accepted operational dependency; Selenoid unavailability is a known failure mode.
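
Because Selenoid unavailability is an expected failure mode, a worker would typically probe the grid and back off before giving up. The following is a minimal sketch under assumptions: `wait_for_grid` and its parameters are hypothetical, the probe is injected as a callable so that no particular Selenoid endpoint is assumed, and the project's actual retry policy is not documented here.

```python
import time
from typing import Callable


def wait_for_grid(
    probe: Callable[[], bool],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` up to `attempts` times with exponential backoff; True once it succeeds."""
    for attempt in range(attempts):
        try:
            if probe():
                return True
        except OSError:
            pass  # connection refused or timed out: treat as "not available yet"
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ... between attempts
    return False
```

A scraping task that receives `False` can fail fast and leave rescheduling to the orchestrator instead of holding a worker slot on a grid that is down.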

CI/CD Boundary — GitLab CI has access to: the source code repository, encrypted SOPS secret files (committed to git), and the age private key (stored as a protected CI variable). CI decrypts secrets only in scoped deployment steps. No secret appears in CI logs (masked variables enforced).
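
The scoped decryption step can be pictured as a small wrapper around the `sops` CLI. This is a sketch, not the project's deploy script: the helper names are hypothetical, and it assumes that the age private key reaches `sops` through the `SOPS_AGE_KEY` environment variable, populated from a protected CI variable.

```python
import os
import subprocess


def sops_decrypt_command(encrypted_path: str) -> list[str]:
    """Build the argv for decrypting one SOPS-encrypted file to stdout."""
    return ["sops", "--decrypt", encrypted_path]


def decrypt(encrypted_path: str, age_key: str) -> str:
    """Run sops with the age key passed only through the child process environment."""
    env = {**os.environ, "SOPS_AGE_KEY": age_key}
    result = subprocess.run(
        sops_decrypt_command(encrypted_path),
        env=env,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if decryption fails
    )
    return result.stdout  # the caller must not print this to CI logs
```

Passing the key through the child environment, rather than writing it to disk or onto the command line, keeps it out of process listings and job artifacts.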