SoccerPredictAI — End-to-End MLOps System
SoccerPredictAI is a production-style, end-to-end MLOps platform for football match prediction. Raw web data is scraped, versioned, and transformed into features; those features train a validated model, which is served behind a live REST API and monitored. Every step runs through reproducible pipelines with explicit contracts.
System at a glance
flowchart LR
A["Scraping\nAirflow · Selenoid"] --> B["Storage\nPostgreSQL · MinIO"]
B --> C["ML Pipeline\n22 DVC stages"]
C --> D["Registry\nMLflow aliases"]
D --> E["Serving\nFastAPI · Celery"]
E --> F["Observability\nPrometheus · Grafana"]
Operational today: scraping → preprocessing → feature engineering → training → MLflow registry → FastAPI sync inference → Prometheus metrics → Evidently drift detection → Grafana dashboards.
In progress or planned: async HTTP inference route (POST /predict/async/, partially implemented) and an automated retraining trigger (planned).
See Implementation Status for the full readiness matrix, and System Overview for the full component map, flows, and production reality.
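The registry step in the flow above relies on MLflow aliases, which the lifecycle table lists as smoke → candidate → champion. The following is a minimal sketch of a one-step promotion along that order, not the project's actual code. It assumes the client object exposes `get_model_version_by_alias` and `set_registered_model_alias` as `mlflow.MlflowClient` does. The names `PROMOTION_ORDER`, `promote`, `InMemoryRegistry`, and the model name `match_outcome` are hypothetical and exist only so the example runs without an MLflow server.

```python
from types import SimpleNamespace

# Promotion order taken from the lifecycle table; the one-step rule is an assumption.
PROMOTION_ORDER = ("smoke", "candidate", "champion")


def promote(client, model_name, from_alias):
    """Give the next alias in PROMOTION_ORDER to the version tagged `from_alias`.

    `client` is expected to behave like mlflow.MlflowClient.
    Returns (new_alias, version).
    """
    step = PROMOTION_ORDER.index(from_alias)  # raises ValueError for unknown aliases
    if step == len(PROMOTION_ORDER) - 1:
        raise ValueError(f"{from_alias!r} is already the final alias")
    to_alias = PROMOTION_ORDER[step + 1]
    version = client.get_model_version_by_alias(model_name, from_alias).version
    client.set_registered_model_alias(model_name, to_alias, version)
    return to_alias, version


class InMemoryRegistry:
    """Hypothetical stand-in for MlflowClient, used only to make the sketch runnable."""

    def __init__(self):
        self._aliases = {}

    def set_registered_model_alias(self, name, alias, version):
        # MLflow reports model versions as strings, so the stand-in does too.
        self._aliases[(name, alias)] = str(version)

    def get_model_version_by_alias(self, name, alias):
        return SimpleNamespace(name=name, version=self._aliases[(name, alias)])


registry = InMemoryRegistry()
registry.set_registered_model_alias("match_outcome", "smoke", 7)
print(promote(registry, "match_outcome", "smoke"))      # ('candidate', '7')
print(promote(registry, "match_outcome", "candidate"))  # ('champion', '7')
```

In this sketch the earlier alias stays attached to the version after promotion. Whether the real pipeline removes it, or gates promotion on validation results, is not stated on this page.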
ML lifecycle summary
| Stage | Tool | Status |
|---|---|---|
| Data ingestion | Airflow + Selenoid → PostgreSQL | ✅ Operational |
| Data versioning | DVC + MinIO (S3-compatible) | ✅ Operational |
| Data validation | Great Expectations (4 gates) | ✅ Operational |
| Feature engineering | Rolling stats, ELO, H2H, rest days | ✅ Operational |
| Hyperparameter tuning | Optuna (walk-forward CV) | ✅ Operational |
| Experiment tracking | MLflow (self-hosted) | ✅ Operational |
| Model registry | MLflow aliases: smoke → candidate → champion | ✅ Operational |
| Sync inference | FastAPI → Celery ml queue → Redis | ✅ Operational |
| Async inference | POST /predict/async/ + polling | 🟡 Partial |
| Metrics export | Prometheus /metrics (9 metrics) | ✅ Operational |
| Drift detection | Evidently (daily DAG + /monitoring/drift) | ✅ Operational |
| ML quality monitor | Airflow DAG → log-loss, ECE, hit-rate + Evidently HTML | ✅ Operational |
| Dashboards | Grafana (2 deployed: ML Quality & Betting, SoccerPredictAI) | ✅ Operational |
| Automated retraining | Airflow-triggered DVC repro | 📋 Planned |
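Of the engineered features listed in the table, ELO is the one defined by an update rule applied after every match. Below is a minimal sketch of the standard Elo update. The K-factor of 20, the absence of a home-advantage term, and the function names are assumptions for illustration; the project's actual parameters are not given on this page.

```python
def expected_score(rating_a, rating_b):
    """Probability-like expected score of side A against side B (standard Elo formula)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(home, away, result, k=20.0):
    """Return the updated (home, away) ratings after one match.

    result: 1.0 for a home win, 0.5 for a draw, 0.0 for an away win.
    k: assumed K-factor controlling how fast ratings move.
    """
    delta = k * (result - expected_score(home, away))
    # Zero-sum update: whatever the home side gains, the away side loses.
    return home + delta, away - delta


print(update_elo(1500.0, 1500.0, 1.0))  # (1510.0, 1490.0)
print(update_elo(1500.0, 1500.0, 0.5))  # (1500.0, 1500.0)
```

To avoid leakage when used as a feature, the pre-match ratings would be the model inputs, and the update would be applied only after the result is known.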
Key numbers
| Metric | Value |
|---|---|
| Holdout log-loss | 1.006 (bookmaker benchmark ~0.97) |
| Holdout ROC AUC | 0.643 (random baseline 0.500) |
| Calibration (ECE) | 0.004 |
| Holdout set size | 135 970 matches (2024+) |
| DVC pipeline stages | 22 |
| Test suite | 560 passing tests |
| Infrastructure cost | <€30/month (self-hosted single VPS) |
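Log-loss and ECE in the table above are standard probabilistic metrics. The sketch below shows one common way to compute them. It assumes three outcome classes (home, draw, away), top-label ECE, and 10 equal-width confidence bins. The page does not state the project's binning scheme or implementation, so this is illustrative only.

```python
import math


def multiclass_log_loss(probs, labels, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for p, y in zip(probs, labels):
        # Clip to avoid log(0) when a model assigns zero probability to the truth.
        total -= math.log(min(max(p[y], eps), 1.0))
    return total / len(labels)


def expected_calibration_error(probs, labels, n_bins=10):
    """Top-label ECE: weighted mean |accuracy - confidence| over confidence bins."""
    bins = [[0, 0.0, 0] for _ in range(n_bins)]  # [count, sum of confidence, hits]
    for p, y in zip(probs, labels):
        pred = max(range(len(p)), key=p.__getitem__)
        conf = p[pred]
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += int(pred == y)
    n = len(labels)
    ece = 0.0
    for count, sum_conf, hits in bins:
        if count:
            ece += (count / n) * abs(hits / count - sum_conf / count)
    return ece


uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3
print(round(multiclass_log_loss(uniform, [0, 1, 2]), 4))  # 1.0986
```

Under the three-class assumption, a uniform forecast scores ln 3 ≈ 1.0986, which gives a reference point for reading the holdout log-loss of 1.006.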
Where to go next
| Goal | Page |
|---|---|
| What is built and what is not | Implementation Status |
| Navigate by role or time budget | Portfolio Overview |
| Reproduce the training pipeline locally | Quickstart |
| Prepare for an interview demo | Demo Walkthrough |
| Understand the system design | System Overview |
| Understand key design decisions | Architecture Trade-offs |
| Model performance evidence | Portfolio Results |
| Analysis reports (evidence) | Analysis Reports |
| Verify the system works | Evidence |