CI/CD Overview

The CI/CD layer automates the build and validation stages of the SoccerPredictAI MLOps system, and supports structured deployment via Helm.

Its primary goals are:

  • enforce quality gates before changes reach production,
  • provide deterministic and reproducible builds,
  • support structured deployment with controlled promotion,
  • reduce manual operational risk.

CI/CD is treated as a core part of the ML system, not as an auxiliary tool.


Current state

Capability                    Status
GitLab CI pipeline            Implemented
Docker image build and push   Implemented
Helm-based deployment         Implemented (semi-automated)
Production deployment         Manual approval required
Rollback (service)            Manual (helm rollback)
Rollback (model)              Manual (MLflow alias)
Rollback (data)               Manual (dvc checkout)

Production deployments require manual approval. Rollbacks across all layers are performed manually.


Pipeline Architecture

Stages

Stage           Purpose
base            Prepare base images and shared artifacts
linting         Code style and static analysis
build           Build Docker images for services
deploy-images   Push images to the container registry
deploy          Deploy services via Helm to Kubernetes
release         Tag and promote releases
pages           Build and publish documentation

Pipeline philosophy

  • fail fast on quality issues,
  • separate build from deploy,
  • promote artifacts, not source code,
  • keep production deploys explicit and auditable.

Triggering rules:

  • merge requests trigger validation (lint, test, build),
  • pushes to main trigger build and image push,
  • staging deploys require a manual trigger after CI passes,
  • production deploys require manual approval and passing quality gates.
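The triggering rules can be read as a small lookup table. The sketch below is illustrative only: the event names, job names, and the `runnable_jobs` helper are assumptions made for this example and do not come from the project's GitLab CI configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    jobs: tuple[str, ...]    # jobs started by this event
    manual: bool             # True if a human must start the jobs
    needs: tuple[str, ...]   # preconditions that must already hold


# Hypothetical event and job names that mirror the triggering rules.
RULES = {
    "merge_request": Rule(("lint", "test", "build"), manual=False, needs=()),
    "push_to_main": Rule(("build", "push_images"), manual=False, needs=()),
    "staging_deploy": Rule(("deploy_staging",), manual=True, needs=("ci_passed",)),
    "production_deploy": Rule(
        ("deploy_production",), manual=True, needs=("approval", "quality_gates")
    ),
}


def runnable_jobs(event, satisfied=(), manually_triggered=False):
    """Return the jobs that may run for `event`, or () if a rule blocks them."""
    rule = RULES[event]
    if rule.manual and not manually_triggered:
        return ()
    if not set(rule.needs) <= set(satisfied):
        return ()
    return rule.jobs
```

For example, `runnable_jobs("production_deploy", satisfied=("approval",), manually_triggered=True)` returns an empty tuple because the quality gates have not passed.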


Container Image Strategy

Each service is packaged as a separate immutable Docker image (API service, Celery worker, Airflow components). No secrets are baked into images; dependencies come from pinned requirements-*.txt files exported from pdm.lock.

Image tagging scheme:

Context             Tag format                   Example
Branch build (CI)   <branch>-<short-sha>         main-a1b2c3d
Release             v<major>.<minor>.<patch>     v1.2.0
Latest stable       latest (staging/prod only)   latest

The same image artifact is promoted across environments — no rebuilds between staging and production.
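A minimal sketch of how the tagging scheme could be computed and checked. The helper names, the 7-character short SHA (matching the `main-a1b2c3d` example), and the branch-name sanitization are assumptions for this example, not the project's actual CI logic.

```python
import re

# Release tags follow v<major>.<minor>.<patch>.
RELEASE_TAG = re.compile(r"v\d+\.\d+\.\d+")


def branch_tag(branch: str, sha: str, short_len: int = 7) -> str:
    """Build a <branch>-<short-sha> tag for a CI branch build."""
    # Docker tags only allow letters, digits, "_", "." and "-",
    # so other characters (such as "/" in feature/x) are replaced.
    safe_branch = re.sub(r"[^A-Za-z0-9_.-]", "-", branch)
    return f"{safe_branch}-{sha[:short_len]}"


def is_release_tag(tag: str) -> bool:
    """Return True if the whole tag matches v<major>.<minor>.<patch>."""
    return RELEASE_TAG.fullmatch(tag) is not None
```

Because the same artifact is promoted, a release tag would be attached to an already-built image rather than produced by a new build.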


Deployment (Helm)

Deployments use Helm charts for reproducible configuration, environment-specific overrides, and safe rollbacks.

Deployment flow:

  1. CI decrypts secrets (SOPS)
  2. Helm renders manifests with environment values
  3. Kubernetes applies the release
  4. Readiness probes gate traffic
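The deployment flow might translate into commands along the following lines. This is a sketch under stated assumptions: the chart path, namespace, values file names, and the `image.tag` value key are hypothetical, and the helpers only build argument lists so they can be inspected before anything runs.

```python
def sops_decrypt_cmd(encrypted_file):
    """Step 1: decrypt a SOPS-encrypted file (plaintext goes to stdout)."""
    return ["sops", "--decrypt", encrypted_file]


def helm_deploy_cmd(release, chart, namespace, values_files, image_tag, timeout="5m"):
    """Steps 2-4: render with environment values, apply, and wait for readiness."""
    cmd = ["helm", "upgrade", "--install", release, chart, "--namespace", namespace]
    for path in values_files:
        cmd += ["--values", path]
    # `image.tag` is an assumed chart value; the real key depends on the chart.
    cmd += ["--set", f"image.tag={image_tag}"]
    # --wait makes Helm block until the release's resources report ready.
    cmd += ["--wait", "--timeout", timeout]
    return cmd


# Example with hypothetical paths:
# import subprocess
# subprocess.run(
#     helm_deploy_cmd("soccer-api", "./charts/soccer-api", "staging",
#                     ["values.yaml", "values-staging.yaml"], "main-a1b2c3d"),
#     check=True,
# )
```

Keeping command construction separate from execution makes the exact invocation easy to log, which fits the goal of explicit and auditable deploys.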

Failed deployments do not receive traffic. Rollbacks do not require rebuilding images.


Release & Rollback Policy

Release cadence:

Environment   Trigger                               Approvals required
dev           Every push to main                    None
staging       Manual trigger after CI passes        1 reviewer
production    Manual trigger + quality gates pass   2 reviewers

Quality gates before release:

  • all tests green,
  • ruff passes,
  • dvc repro --dry succeeds,
  • model metrics meet the champion baseline,
  • no HIGH severity container scan findings.
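The release cadence and approval counts can be expressed as a single policy check. The sketch below is an illustration: the `can_deploy` function and its parameter names are assumptions for this example, and `gates_passed` stands for the outcome of the quality gates evaluated elsewhere.

```python
# Reviewer approvals required per environment, taken from the release cadence table.
REQUIRED_APPROVALS = {"dev": 0, "staging": 1, "production": 2}


def can_deploy(environment, *, approvals=0, ci_passed=False,
               manual_trigger=False, gates_passed=False):
    """Return True if a deploy to `environment` satisfies the release policy."""
    if environment not in REQUIRED_APPROVALS:
        raise ValueError(f"unknown environment: {environment!r}")
    if approvals < REQUIRED_APPROVALS[environment]:
        return False
    if environment == "dev":
        return True  # dev is deployed on every push to main
    if not manual_trigger:
        return False  # staging and production are never started automatically
    if environment == "staging":
        return ci_passed
    return gates_passed  # production additionally requires the quality gates
```

A production deploy with only one approval is rejected even if every gate passed: `can_deploy("production", approvals=1, manual_trigger=True, gates_passed=True)` returns `False`.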

Rollback process (manual across three independent layers):

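# Service rollback (Helm): reverts to the previous revision when none is given
helm rollback soccer-api

# Model rollback (MLflow): reassign the champion alias to a prior version
# Via the MLflow UI or the Python client (MlflowClient.set_registered_model_alias)

# Data rollback (DVC): restore the DVC metadata from Git, then sync the workspace
git checkout <commit> -- dvc.lock   # plus any tracked .dvc files
dvc checkout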

Rollbacks are never automated. All rollback decisions require human review. Re-run CI after any rollback to confirm system state.
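For the model layer, reassigning the alias can be done with the MLflow Python client. The helper below is a minimal sketch meant to be run by a person after review, not by the pipeline. The function name and the example model name are hypothetical, while `get_model_version_by_alias` and `set_registered_model_alias` are methods of `mlflow.MlflowClient`.

```python
def rollback_champion(client, model_name, target_version, alias="champion"):
    """Point `alias` at `target_version` and return the version it replaced.

    `client` is expected to be an mlflow.MlflowClient (or a compatible object).
    """
    previous = client.get_model_version_by_alias(model_name, alias).version
    client.set_registered_model_alias(model_name, alias, target_version)
    return previous


# Example usage (the model name and version are hypothetical):
# from mlflow import MlflowClient
# replaced = rollback_champion(MlflowClient(), "soccer-predictor", "6")
# print(f"champion moved from version {replaced} to version 6")
```

Returning the replaced version gives the operator a record of what to restore if the rollback itself needs to be undone.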


Quality Gates

ML systems fail not only due to bugs, but due to data issues, silent regressions, and configuration drift. Quality gates prevent unsafe changes from reaching production.

Implemented gates

Gate                                                Category         Blocks deploy?
Linting and formatting (ruff)                       Code quality     ✅ Yes
Unit + property-based tests (pytest + Hypothesis)   Testing          ✅ Yes
Critical Great Expectations checks                  Data contracts   ✅ Yes
Pipeline smoke run (reduced dataset)                ML sanity        ✅ Yes
API contract test (happy path + invalid schema)     Serving          ✅ Yes

Non-blocking (signal-only)

  • Drift warnings (Evidently)
  • Non-critical GE checks (distribution or advisory expectations)
  • Performance regression checks (initially informational)
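The distinction between blocking and signal-only gates can be sketched as follows. The `GateResult` type and the `evaluate` function are assumptions made for this example. In the real pipeline each gate is presumably a separate CI job.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    name: str
    passed: bool
    blocking: bool  # False for signal-only gates such as drift warnings


def evaluate(results):
    """Return (deploy_allowed, blocking_failures, warnings)."""
    failures = [r.name for r in results if r.blocking and not r.passed]
    warnings = [r.name for r in results if not r.blocking and not r.passed]
    return (not failures, failures, warnings)
```

A failed drift check therefore shows up in `warnings` without changing `deploy_allowed`, while a single failed blocking gate stops the deploy.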

Artifact traceability

Every production deployment can be traced to:

  • git commit,
  • Docker image digest,
  • dataset version,
  • model version.
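One way to make this traceability concrete is to store a small record with each deployment. The sketch below assumes a hypothetical `DeploymentRecord` structure. The field names follow the four items above, but how the project actually stores them is not specified here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DeploymentRecord:
    git_commit: str
    image_digest: str
    dataset_version: str
    model_version: str

    def to_json(self) -> str:
        """Serialize with sorted keys so records are easy to diff."""
        return json.dumps(asdict(self), sort_keys=True)


def missing_fields(record):
    """Return the names of fields that are empty, i.e. not traceable."""
    return [name for name, value in asdict(record).items() if not value]
```

A deploy step could refuse to proceed when `missing_fields(record)` is non-empty, so that no release ships without all four identifiers.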