SoccerPredictAI: End-to-End MLOps System

SoccerPredictAI is a production-style, end-to-end MLOps platform for football match prediction. Raw web data is scraped, versioned, and transformed into features; a validated model is trained on those features, served behind a live REST API, and monitored. Every step runs through reproducible pipelines with explicit contracts.

System at a glance

flowchart LR
    A["Scraping\nAirflow · Selenoid"] --> B["Storage\nPostgreSQL · MinIO"]
    B --> C["ML Pipeline\n22 DVC stages"]
    C --> D["Registry\nMLflow aliases"]
    D --> E["Serving\nFastAPI · Celery"]
    E --> F["Observability\nPrometheus · Grafana"]

Operational today: scraping → preprocessing → feature engineering → training → MLflow registry → FastAPI sync inference → Prometheus metrics → Evidently drift detection → Grafana dashboards.

Planned: an async HTTP inference route (`POST /predict/async/`) and an automated retraining trigger.

See Implementation Status for the full readiness matrix.

See System Overview for the full component map, flows, and production reality.

ML lifecycle summary

Stage	Tool	Status
Data ingestion	Airflow + Selenoid → PostgreSQL	✅ Operational
Data versioning	DVC + MinIO (S3-compatible)	✅ Operational
Data validation	Great Expectations (4 gates)	✅ Operational
Feature engineering	Rolling stats, ELO, H2H, rest days	✅ Operational
Hyperparameter tuning	Optuna (walk-forward CV)	✅ Operational
Experiment tracking	MLflow (self-hosted)	✅ Operational
Model registry	MLflow aliases: smoke → candidate → champion	✅ Operational
Sync inference	FastAPI → Celery `ml` queue → Redis	✅ Operational
Async inference	`POST /predict/async/`	📋 Planned
Metrics export	Prometheus `/metrics` (10 metrics)	✅ Operational
Drift detection	Evidently (daily DAG + `/monitoring/drift`)	✅ Operational
ML quality monitor	Airflow DAG → log-loss, ECE, hit-rate + Evidently HTML	✅ Operational
Dashboards	Grafana (2 deployed: ML Quality & Betting, SoccerPredictAI)	✅ Operational
Automated retraining	Airflow-triggered DVC repro	📋 Planned
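
The hyperparameter tuning row mentions walk-forward CV. As a rough illustration of that idea (not the project's actual splitter), the sketch below generates expanding-window folds over rows that are assumed to be sorted by match date. The function name `walk_forward_splits` and its parameters `n_folds` and `min_train` are hypothetical and chosen only for this example.

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(
    n_samples: int, n_folds: int, min_train: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, valid_idx) pairs over chronologically sorted rows.

    Each fold trains on every row before its validation window, so no
    later match can leak into the training data of an earlier fold.
    """
    if n_folds < 1 or min_train < 1 or min_train >= n_samples:
        raise ValueError("need n_folds >= 1 and 0 < min_train < n_samples")
    fold_size = (n_samples - min_train) // n_folds
    if fold_size == 0:
        raise ValueError("not enough samples for the requested folds")
    for k in range(n_folds):
        start = min_train + k * fold_size
        # The last fold absorbs any remainder rows.
        stop = n_samples if k == n_folds - 1 else start + fold_size
        yield list(range(start)), list(range(start, stop))


# Hypothetical toy sizes: 10 matches, 3 folds, at least 4 training rows.
splits = list(walk_forward_splits(n_samples=10, n_folds=3, min_train=4))
for train_idx, valid_idx in splits:
    print(len(train_idx), valid_idx)
```

The training window grows with each fold while the validation window always lies strictly after it, which is the property that distinguishes walk-forward validation from shuffled k-fold on time-ordered match data.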

Key numbers

Metric	Value
Holdout log-loss	1.006 (bookmaker benchmark ~0.97)
Holdout ROC AUC	0.643 (random baseline 0.500)
Calibration (ECE)	0.004
Holdout set size	135 970 matches (2024+)
DVC pipeline stages	22
Test suite	560 passing tests
Infrastructure cost	<€30/month (self-hosted single VPS)
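
For readers who want to sanity-check the log-loss and ECE figures, here is a minimal sketch of both metrics in plain Python. It assumes a three-outcome target (home win, draw, away win) and a top-label, equal-width-bin definition of ECE; the source does not state which ECE variant or bin count was used, so treat `top_label_ece` and `n_bins=10` as assumptions rather than the project's implementation. Under the three-outcome assumption, a uniform prediction scores ln 3 ≈ 1.0986, which gives a reference point for the holdout log-loss.

```python
import math
from typing import List, Sequence


def log_loss(probs: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean negative log-probability assigned to the true class."""
    eps = 1e-15  # clip to avoid log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        total -= math.log(min(max(p[y], eps), 1.0))
    return total / len(labels)


def top_label_ece(
    probs: Sequence[Sequence[float]], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Expected calibration error on the predicted (top) class.

    Predictions are grouped into equal-width confidence bins; ECE is the
    sample-weighted mean gap between confidence and accuracy per bin.
    """
    # Per bin: [count, sum of confidences, sum of correct predictions].
    bins: List[List[float]] = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)
        pred = list(p).index(conf)
        b = min(int(conf * n_bins), n_bins - 1)
        bins[b][0] += 1
        bins[b][1] += conf
        bins[b][2] += 1.0 if pred == y else 0.0
    n = len(labels)
    return sum(abs(s_conf - s_hit) / n for cnt, s_conf, s_hit in bins if cnt)


# Toy data (not project results): a uniform forecast over three outcomes.
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3
toy_labels = [0, 1, 2]
print(round(log_loss(uniform, toy_labels), 4))  # ln(3), about 1.0986
print(top_label_ece(uniform, toy_labels))
```

A uniform forecast is perfectly calibrated on this toy set (confidence 1/3, accuracy 1/3) yet carries no information, which is why calibration and log-loss are reported side by side.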

Where to go next

Goal	Page
What is built and what is not	Implementation Status
Navigate by role or time budget	Portfolio Overview
Reproduce the training pipeline locally	Quickstart
Prepare for an interview demo	Demo Walkthrough
Understand the system design	System Overview
Understand key design decisions	Architecture Trade-offs
Model performance evidence	Portfolio Results
Analysis reports (evidence)	Analysis Reports
Verify the system works	Evidence