CI/CD Overview¶

The CI/CD layer automates the build and validation stages of the SoccerPredictAI MLOps system, and supports structured deployment via Helm.

Its primary goals are:

- enforce quality gates before changes reach production,
- provide deterministic and reproducible builds,
- support structured deployment with controlled promotion,
- reduce manual operational risk.

CI/CD is treated as a core part of the ML system, not as an auxiliary tool.

Current state¶

Capability	Status
GitLab CI pipeline	Implemented
Docker image build and push	Implemented
Helm-based deployment	Implemented (semi-automated)
Production deployment	Manual approval required
Rollback (service)	Manual (`helm rollback`)
Rollback (model)	Manual (MLflow alias)
Rollback (data)	Manual (`dvc checkout`)

Production deployments require manual approval. Rollbacks across all layers are performed manually.

Pipeline Architecture¶

Stages¶

Stage	Purpose
`base`	Prepare base images and shared artifacts
`linting`	Code style and static analysis
`build`	Build Docker images for services
`deploy-images`	Push images to the container registry
`deploy`	Deploy services via Helm to Kubernetes
`release`	Tag and promote releases
`pages`	Build and publish documentation

Pipeline philosophy¶

fail fast on quality issues,
separate build from deploy,
promote artifacts, not source code,
keep production deploys explicit and auditable.

Triggering rules:

- merge requests trigger validation (lint, test, build);
- pushes to `main` trigger build and image push;
- staging deploys require a manual trigger after CI passes;
- production deploys require manual approval and passing quality gates.
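The triggering rules above can be read as a small decision table. The following is a minimal sketch of that table in Python, not the project's actual GitLab CI configuration; the `Event` members, the action names, and the `allowed_actions` function are hypothetical names introduced for illustration.

```python
from enum import Enum


class Event(Enum):
    # Hypothetical event names; the real pipeline expresses these as CI rules.
    MERGE_REQUEST = "merge_request"
    PUSH_MAIN = "push_main"
    STAGING_DEPLOY = "staging_deploy"
    PRODUCTION_DEPLOY = "production_deploy"


def allowed_actions(event, ci_passed=False, manual_trigger=False,
                    approved=False, gates_passed=False):
    """Return the actions a pipeline event is allowed to run."""
    if event is Event.MERGE_REQUEST:
        # Merge requests only validate; nothing is pushed or deployed.
        return ["lint", "test", "build"]
    if event is Event.PUSH_MAIN:
        return ["build", "push-image"]
    if event is Event.STAGING_DEPLOY:
        # Staging needs green CI plus an explicit manual trigger.
        return ["deploy-staging"] if (ci_passed and manual_trigger) else []
    if event is Event.PRODUCTION_DEPLOY:
        # Production needs manual approval plus passing quality gates.
        return ["deploy-production"] if (approved and gates_passed) else []
    return []
```

For example, `allowed_actions(Event.STAGING_DEPLOY, ci_passed=True)` returns an empty list because the manual trigger is missing.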

Container Image Strategy¶

Each service is packaged as a separate immutable Docker image (API service, Celery worker, Airflow components). No secrets are baked into images; dependencies come from pinned `requirements-*.txt` files exported from `pdm.lock`.

Image tagging scheme:

Context	Tag format	Example
Branch build (CI)	`<branch>-<short-sha>`	`main-a1b2c3d`
Release	`v<major>.<minor>.<patch>`	`v1.2.0`
Latest stable	`latest` (staging/prod only)	`latest`

The same image artifact is promoted across environments; images are not rebuilt between staging and production.
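As an illustration of the tagging scheme, the sketch below derives a branch build tag and checks the release tag format. The function names are hypothetical, and two details are assumptions not stated above: the short SHA is taken to be 7 characters (as in the `main-a1b2c3d` example), and characters that Docker does not allow in tags, such as `/` in a branch name, are replaced with `-`.

```python
import re

# Release tags follow v<major>.<minor>.<patch>, e.g. v1.2.0.
RELEASE_TAG = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def branch_tag(branch: str, commit_sha: str, sha_len: int = 7) -> str:
    """Build a <branch>-<short-sha> tag for a CI branch build."""
    # Docker tags may contain letters, digits, '_', '.' and '-';
    # anything else (for example '/') is replaced. This is an assumption.
    safe_branch = re.sub(r"[^A-Za-z0-9_.-]", "-", branch)
    return f"{safe_branch}-{commit_sha[:sha_len]}"


def is_release_tag(tag: str) -> bool:
    """Return True if the tag matches the release format."""
    return RELEASE_TAG.match(tag) is not None
```

With these definitions, `branch_tag("main", "a1b2c3d4e5f6")` yields `main-a1b2c3d`, and `is_release_tag("latest")` is `False`.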

Deployment (Helm)¶

Deployments use Helm charts for reproducible configuration, environment-specific overrides, and safe rollbacks.

Deployment flow:

1. CI decrypts secrets (SOPS)
2. Helm renders manifests with environment values
3. Kubernetes applies the release
4. Readiness probes gate traffic

Failed deployments do not receive traffic. Rollbacks do not require rebuilding images.
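The deployment flow can be sketched as two commands that a CI job would run in order. This is a hedged illustration only: the file paths, the chart location, the use of the environment name as the namespace, and the `image.tag` chart value are all assumptions, and the function only builds the argument lists without executing anything.

```python
def deploy_commands(release, chart, env, image_tag):
    """Return argv lists for the deploy steps; nothing is executed here."""
    encrypted = f"secrets/{env}.enc.yaml"  # hypothetical path
    decrypted = f"secrets/{env}.dec.yaml"  # hypothetical path

    # Step 1: decrypt the SOPS-encrypted secrets into a values file.
    decrypt = ["sops", "--decrypt", "--output", decrypted, encrypted]

    # Steps 2-3: Helm renders the manifests and applies the release.
    # Step 4: --wait blocks until resources report ready, which depends
    # on the readiness probes defined in the chart.
    upgrade = [
        "helm", "upgrade", "--install", release, chart,
        "--namespace", env,                    # assumption: one namespace per env
        "--values", f"values/{env}.yaml",      # hypothetical path
        "--values", decrypted,
        "--set", f"image.tag={image_tag}",     # assumption: chart exposes image.tag
        "--wait",
    ]
    return [decrypt, upgrade]
```

A job could pass each list to `subprocess.run(argv, check=True)`. The sketch deliberately omits Helm's automatic rollback on failure, in keeping with the policy that rollbacks are manual.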

Release & Rollback Policy¶

Release cadence:

Environment	Trigger	Approvals Required
`dev`	Every push to `main`	None
`staging`	Manual trigger after CI passes	1 reviewer
`production`	Manual trigger + quality gates pass	2 reviewers

Quality gates before release:

- all tests green,
- `ruff` passes,
- `dvc repro --dry` succeeds,
- model metrics meet the champion baseline,
- no HIGH severity container scan findings.
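The release policy combines an approval count per environment with the quality gates. A minimal sketch of that check follows; `can_release` and the gate names are hypothetical, and applying the quality gates only to `production` is an assumption taken from the release cadence table.

```python
# Approvals required per environment, taken from the release cadence table.
REQUIRED_APPROVALS = {"dev": 0, "staging": 1, "production": 2}


def can_release(environment, approvals, gate_results):
    """Return (allowed, reasons) for a release request.

    gate_results maps a gate name to True (passed) or False (failed).
    """
    if environment not in REQUIRED_APPROVALS:
        raise ValueError(f"unknown environment: {environment}")

    reasons = []
    needed = REQUIRED_APPROVALS[environment]
    if approvals < needed:
        reasons.append(f"needs {needed} approval(s), has {approvals}")

    # Assumption: quality gates are enforced for production only.
    failed = sorted(name for name, ok in gate_results.items() if not ok)
    if environment == "production" and failed:
        reasons.append("failed gates: " + ", ".join(failed))

    return (not reasons, reasons)
```

Returning the reasons alongside the decision keeps a refused release auditable: the caller can log exactly which approval or gate was missing.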

Rollback process (manual across three independent layers):

# Service rollback (Helm)
helm rollback soccer-api

# Model rollback (MLflow): reassign the champion alias to the prior version
# Via the MLflow UI or the MLflow client API (MlflowClient.set_registered_model_alias)

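# Data rollback (DVC): restore the DVC metadata from the target commit,
# then sync the workspace to it (include any *.dvc files that track the data)
git checkout <commit> -- dvc.lock
dvc checkout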

Rollbacks are never automated. All rollback decisions require human review. Re-run CI after any rollback to confirm system state.
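For the model layer, reassigning the alias can also be scripted once a human has decided to roll back. The sketch below assumes an MLflow version that supports registered model aliases and takes the client as a parameter; the function name, the default alias `champion`, and the model name in the usage note are illustrative assumptions.

```python
def rollback_model_alias(client, model_name, target_version, alias="champion"):
    """Point `alias` at `target_version` and return (old, new) versions.

    `client` is expected to behave like mlflow.tracking.MlflowClient:
    it must provide get_model_version_by_alias and
    set_registered_model_alias.
    """
    # Record the version currently behind the alias for the audit trail.
    current = client.get_model_version_by_alias(model_name, alias).version
    client.set_registered_model_alias(model_name, alias, str(target_version))
    return str(current), str(target_version)
```

A run by an operator might look like `rollback_model_alias(MlflowClient(), "soccer-predict", 7)`, where the model name and version are placeholders. The returned pair can be written to the incident log before CI is re-run.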

Quality Gates¶

ML systems fail not only due to bugs, but due to data issues, silent regressions, and configuration drift. Quality gates prevent unsafe changes from reaching production.

Implemented gates¶

Gate	Category	Blocks deploy?
Linting and formatting (ruff)	Code quality	✅ Yes
Unit + property-based tests (pytest + Hypothesis)	Testing	✅ Yes
Critical Great Expectations checks	Data contracts	✅ Yes
Pipeline smoke run (reduced dataset)	ML sanity	✅ Yes
API contract test (happy path + invalid schema)	Serving	✅ Yes

Non-blocking (signal-only)¶

Drift warnings (Evidently)
Non-critical GE checks (distribution or advisory expectations)
Performance regression checks (initially informational)
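The split between blocking and signal-only gates can be sketched as a small aggregation step at the end of a pipeline run. The `GateResult` type and `evaluate` function are hypothetical; the gate names in the usage note mirror the lists above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    name: str
    passed: bool
    blocking: bool  # True for implemented gates, False for signal-only checks


def evaluate(results):
    """Summarise gate results into a deploy decision plus warnings."""
    blockers = [r.name for r in results if r.blocking and not r.passed]
    warnings = [r.name for r in results if not r.blocking and not r.passed]
    return {
        # Only failed blocking gates stop the deploy.
        "deploy_allowed": not blockers,
        "blockers": blockers,
        # Failed signal-only checks are reported but do not block.
        "warnings": warnings,
    }
```

For instance, a passing `ruff` gate together with a failing Evidently drift check gives `deploy_allowed` as `True` with the drift check listed under `warnings`.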

Artifact traceability¶

Every production deployment can be traced to:

- git commit,
- Docker image digest,
- dataset version,
- model version.
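One way to make this traceability concrete is to store a small record alongside each deployment. The sketch below is an assumption about how such a record could look, not a description of the existing system; the class name, field names, and the `sha256:` digest check are illustrative.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DeploymentRecord:
    git_commit: str       # commit the image was built from
    image_digest: str     # immutable image reference, e.g. "sha256:..."
    dataset_version: str  # e.g. a DVC-tracked revision
    model_version: str    # e.g. an MLflow registered model version

    def __post_init__(self):
        # Assumption: digests are recorded in the "sha256:<hex>" form.
        if not self.image_digest.startswith("sha256:"):
            raise ValueError("image_digest must be a sha256 digest")

    def to_json(self) -> str:
        """Serialise the record with stable key order for storage or logs."""
        return json.dumps(asdict(self), sort_keys=True)
```

Recording the digest rather than a tag matters because tags such as `latest` can move, while a digest always identifies one image.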

Testing Strategy