SoccerPredictAI — End-to-End MLOps System

SoccerPredictAI is a production-style end-to-end MLOps platform for football match prediction. Raw web data is scraped, versioned, transformed into features, used to train a validated model, exposed behind a live REST API, and monitored — all with reproducible pipelines and explicit contracts.


System at a glance

flowchart LR
    A["Scraping\nAirflow · Selenoid"] --> B["Storage\nPostgreSQL · MinIO"]
    B --> C["ML Pipeline\n22 DVC stages"]
    C --> D["Registry\nMLflow aliases"]
    D --> E["Serving\nFastAPI · Celery"]
    E --> F["Observability\nPrometheus · Grafana"]

Operational today: scraping → preprocessing → feature engineering → training → MLflow registry → FastAPI sync inference → Prometheus metrics → Evidently drift detection → Grafana dashboards. Not yet complete: the async HTTP inference route (POST /predict/async/, partially implemented) and the automated retraining trigger (planned).

See Implementation Status for the full readiness matrix, and System Overview for the full component map, flows, and production reality.


ML lifecycle summary

| Stage | Tool | Status |
| --- | --- | --- |
| Data ingestion | Airflow + Selenoid → PostgreSQL | ✅ Operational |
| Data versioning | DVC + MinIO (S3-compatible) | ✅ Operational |
| Data validation | Great Expectations (4 gates) | ✅ Operational |
| Feature engineering | Rolling stats, ELO, H2H, rest days | ✅ Operational |
| Hyperparameter tuning | Optuna (walk-forward CV) | ✅ Operational |
| Experiment tracking | MLflow (self-hosted) | ✅ Operational |
| Model registry | MLflow aliases: smoke → candidate → champion | ✅ Operational |
| Sync inference | FastAPI → Celery ml queue → Redis | ✅ Operational |
| Async inference | POST /predict/async/ + polling | Partial |
| Metrics export | Prometheus /metrics (9 metrics) | ✅ Operational |
| Drift detection | Evidently (daily DAG + /monitoring/drift) | ✅ Operational |
| ML quality monitor | Airflow DAG → log-loss, ECE, hit-rate + Evidently HTML | ✅ Operational |
| Dashboards | Grafana (2 deployed: ML Quality & Betting, SoccerPredictAI) | ✅ Operational |
| Automated retraining | Airflow-triggered DVC repro | 📋 Planned |
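
The walk-forward CV used for hyperparameter tuning can be illustrated with a minimal sketch. It assumes an expanding training window over chronologically sorted matches; the function name `walk_forward_splits` and its parameters are hypothetical and not taken from the project code, which may split differently (for example by season or with a gap between train and validation).

```python
from typing import Iterator, Tuple


def walk_forward_splits(
    n_samples: int, n_folds: int, min_train: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train_idx, valid_idx) index ranges over time-sorted rows.

    Each fold trains on every row before its validation block (expanding
    window), so no future match leaks into training.
    """
    if n_folds < 1 or min_train < 1 or min_train >= n_samples:
        raise ValueError("need n_folds >= 1 and 1 <= min_train < n_samples")
    fold_size = (n_samples - min_train) // n_folds
    if fold_size < 1:
        raise ValueError("not enough rows after min_train for n_folds folds")
    for k in range(n_folds):
        start = min_train + k * fold_size
        # The last fold absorbs any remainder rows.
        stop = n_samples if k == n_folds - 1 else start + fold_size
        yield range(0, start), range(start, stop)


if __name__ == "__main__":
    for train_idx, valid_idx in walk_forward_splits(10, n_folds=3, min_train=4):
        print(list(train_idx), "->", list(valid_idx))
```

Inside an Optuna objective, each trial would typically fit on every `train_idx` block, score on the matching `valid_idx` block, and return the mean validation loss, so that tuning never sees matches from the future of a fold.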

Key numbers

| Metric | Value |
| --- | --- |
| Holdout log-loss | 1.006 (bookmaker benchmark ~0.97) |
| Holdout ROC AUC | 0.643 (random baseline 0.500) |
| Calibration (ECE) | 0.004 |
| Holdout set size | 135 970 matches (2024+) |
| DVC pipeline stages | 22 |
| Test suite | 560 passing tests |
| Infrastructure cost | <€30/month (self-hosted single VPS) |
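
To make the log-loss and ECE figures easier to interpret, here is a minimal sketch of how such metrics are commonly computed. It assumes a three-way outcome (home win, draw, away win) and a top-label ECE with equal-width confidence bins; the function names and the bin count are illustrative assumptions, and the project's own evaluation code may differ. Under the three-class assumption, a model that always predicts uniform probabilities scores ln 3 ≈ 1.0986 log-loss.

```python
import math
from typing import Sequence


def multiclass_log_loss(
    probs: Sequence[Sequence[float]], labels: Sequence[int], eps: float = 1e-15
) -> float:
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for p, y in zip(probs, labels):
        # Clip to avoid log(0) on overconfident wrong predictions.
        total -= math.log(min(max(p[y], eps), 1.0))
    return total / len(labels)


def expected_calibration_error(
    probs: Sequence[Sequence[float]], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Top-label ECE: weighted mean of |accuracy - confidence| per bin."""
    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    count = [0] * n_bins
    for p, y in zip(probs, labels):
        conf = max(p)
        pred = max(range(len(p)), key=p.__getitem__)
        b = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        conf_sum[b] += conf
        hit_sum[b] += 1.0 if pred == y else 0.0
        count[b] += 1
    n = len(labels)
    ece = 0.0
    for c, h, k in zip(conf_sum, hit_sum, count):
        if k:
            ece += (k / n) * abs(h / k - c / k)
    return ece


if __name__ == "__main__":
    uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3
    print(multiclass_log_loss(uniform, [0, 1, 2]))  # equals math.log(3)
```

Log-loss rewards both ranking and calibration, while ECE isolates calibration: a model can have a low ECE and still a mediocre log-loss if its probabilities are honest but not very sharp.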

Where to go next

| Goal | Page |
| --- | --- |
| What is built and what is not | Implementation Status |
| Navigate by role or time budget | Portfolio Overview |
| Reproduce the training pipeline locally | Quickstart |
| Prepare for an interview demo | Demo Walkthrough |
| Understand the system design | System Overview |
| Understand key design decisions | Architecture Trade-offs |
| Model performance evidence | Portfolio Results |
| Analysis reports (evidence) | Analysis Reports |
| Verify the system works | Evidence |