flowchart TD
A[load_data_from_sources] --> B[preprocessing]
B --> C[feature_engineering]
C --> D["classification_models<br/>218 runs · 7 studies"]
D --> E["tune_xgb / tune_logreg / tune_hgb"]
E --> F[select_model]
F --> G["final_train + register_model"]
G --> H["analysis<br/>holdout + ROI"]
G --> I[batch_inference]
I --> J["Fonbet Airflow DAGs<br/>3-stage odds pipeline"]
J --> K["/predict/cards/ API"]
style A fill:#e8f4fd,stroke:#4e79a7
style B fill:#e8f4fd,stroke:#4e79a7
style C fill:#e8fde8,stroke:#59a14f
style D fill:#fef9e8,stroke:#f28e2b
style E fill:#fde8f4,stroke:#b07aa1
style F fill:#fde8f4,stroke:#b07aa1
style G fill:#fde8f4,stroke:#b07aa1
style H fill:#e8fde8,stroke:#59a14f
style I fill:#fde8e8,stroke:#e15759
style J fill:#fde8e8,stroke:#e15759
style K fill:#fde8e8,stroke:#e15759
SoccerPredictAI — Pipeline Summary
End-to-end MLOps system for football match outcome prediction
This page is the entry point for all SoccerPredictAI analysis reports. Each section below summarises one pipeline stage and links to the full detailed report. Re-run make docs-build to refresh all rendered outputs after dvc repro.
Pipeline Overview
Stage 01 · EDA & Preprocessing
DVC stages: load_data_from_sources, preprocessing
_root = project_root
_paths01 = [
    (_root / "data" / "raw" / "match_raw.parquet", "raw matches"),
    (_root / "data" / "interim" / "finished.parquet", "finished"),
    (_root / "data" / "interim" / "future.parquet", "upcoming"),
]
_items01 = []
for _path, _label in _paths01:
    try:
        _items01.append((_label, f"{len(pd.read_parquet(_path)):,}"))
    except Exception:
        pass
if _items01:
    display(HTML(_metric_row(_items01)))
else:
    display(Markdown("*Artifacts not found — run `dvc repro preprocessing` first.*"))

raw matches: 993,490 finished: 972,233 upcoming: 2,712
Explores the raw data ingested from external sources: temporal span, target class balance, league/region coverage, and all transformations applied during preprocessing (score filtering, duplicate removal, future/finished split).
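The three preprocessing transformations named above (score filtering, duplicate removal, future/finished split) can be illustrated with a minimal standard-library sketch. The field names `match_id`, `match_date`, `home_goals` and `away_goals`, and the rule that a missing score marks an upcoming match, are assumptions made for illustration and are not taken from the project's schema.

```python
from datetime import date


def split_matches(rows, today):
    """Drop duplicates and invalid scores, then split finished vs upcoming."""
    seen = set()
    finished, future = [], []
    for row in rows:
        key = row["match_id"]  # hypothetical unique key
        if key in seen:  # duplicate removal
            continue
        seen.add(key)
        home, away = row.get("home_goals"), row.get("away_goals")
        if home is None or away is None:
            # No score yet: treat as upcoming only if kickoff is not in the past.
            if row["match_date"] >= today:
                future.append(row)
            continue
        if home < 0 or away < 0:  # score filtering
            continue
        finished.append(row)
    return finished, future


rows = [
    {"match_id": 1, "match_date": date(2024, 3, 1), "home_goals": 2, "away_goals": 1},
    {"match_id": 1, "match_date": date(2024, 3, 1), "home_goals": 2, "away_goals": 1},
    {"match_id": 2, "match_date": date(2024, 3, 2), "home_goals": -1, "away_goals": 0},
    {"match_id": 3, "match_date": date(2030, 1, 1), "home_goals": None, "away_goals": None},
]
finished, future = split_matches(rows, today=date(2024, 6, 1))
```

In this toy input the duplicate of match 1 and the negative score of match 2 are dropped, match 1 lands in `finished`, and match 3 lands in `future`.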
→ Full report: EDA & Preprocessing
Stage 02 · Feature Engineering
DVC stage: feature_engineering
_feat_path = _root / "data" / "features" / "features.parquet"
try:
    _df_feat = pd.read_parquet(_feat_path)
    display(HTML(_metric_row([
        ("rows", f"{len(_df_feat):,}"),
        ("features", str(_df_feat.shape[1])),
        ("missing %", f"{_df_feat.isnull().mean().mean() * 100:.1f}%"),
    ])))
except Exception:
    display(Markdown("*Artifacts not found — run `dvc repro feature_engineering` first.*"))

rows: 972,233 features: 507 missing %: 2.3%
Builds the offline feature matrix:
- ELO ratings — per-tournament team strength (K=32, initial=1500, home advantage=50)
- Rolling statistics — win/draw/loss rate, goals for/against over windows [1, 2, 3, 5, 7, 10, 12]
- Representation variants — home, away, diff encodings per rolling stat
- Categorical — sex column
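The ELO parameters listed above (K=32, initial=1500, home advantage=50) plug into the standard Elo update as sketched below. The 400-point logistic scale and the 1 / 0.5 / 0 scoring of win / draw / loss are the usual Elo conventions and are assumed here; the project's own implementation may differ, for example in how ratings are kept per tournament.

```python
def expected_home_score(elo_home, elo_away, home_advantage=50.0):
    """Logistic Elo expectation for the home side, with a home-advantage offset."""
    diff = (elo_home + home_advantage) - elo_away
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


def update_elo(elo_home, elo_away, result, k=32.0, home_advantage=50.0):
    """result: 1.0 home win, 0.5 draw, 0.0 away win. Returns new (home, away) ratings."""
    expected = expected_home_score(elo_home, elo_away, home_advantage)
    delta = k * (result - expected)
    # The update is zero-sum: what the home side gains, the away side loses.
    return elo_home + delta, elo_away - delta


INITIAL = 1500.0
home, away = update_elo(INITIAL, INITIAL, result=1.0)
```

Because the home side is already favoured by the 50-point offset, a home win between two equally rated teams moves each rating by less than K/2 = 16 points.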
→ Full report: Feature Engineering
Stage 03 · Experiment Studies
DVC stage: classification_models · Branches: experiment/study_v1.01–v1.05
_runs_csv = _root / "_temp" / "runs.csv"
try:
    import csv
    # Read inside a context manager so the file handle is always closed.
    with open(_runs_csv, newline="") as _fh:
        _rows_csv = list(csv.DictReader(_fh))
    _finished03 = [r for r in _rows_csv if r.get("Status", "").upper() == "FINISHED"]
    _df_runs = pd.DataFrame(_finished03)
    _items03 = [("total runs", str(len(_finished03)))]
    if "holdout.logloss" in _df_runs.columns:
        _df_runs["holdout.logloss"] = pd.to_numeric(_df_runs["holdout.logloss"], errors="coerce")
        _best03 = _df_runs.loc[_df_runs["holdout.logloss"].idxmin()]
        _items03.append(("best logloss", f'{_best03["holdout.logloss"]:.4f}'))
        # _best03 only exists in this branch, so the model-name lookup is nested here.
        if "model.name" in _df_runs.columns:
            _items03.append(("best model", str(_best03.get("model.name", "—"))))
    display(HTML(_metric_row(_items03)))
except Exception:
    display(Markdown(
        "*218 FINISHED runs across 7 studies (v1.01–v1.05). "
        "CSV cache not found — connect to MLflow.*"
    ))

218 FINISHED runs across 7 studies (v1.01–v1.05). CSV cache not found — connect to MLflow.
Systematic exploration across 7 studies to identify the best training configuration:
| Study | Focus | Conclusion |
|---|---|---|
| 0.1 | Initial sweep | Establishes baseline across all models |
| 0.2 | Learning curve | SGD needs ≥10% data; tree models stable at 1% |
| 1 | Window sizes | Wider windows reduce logloss; [1,2,3,5,10] is solid |
| 2 | Class weights | Draw up-weighting (1.25×) slightly improves balance |
| 3 | Side representation | diff encoding adds signal on top of home/away |
| 4 | Feature ablation | ELO dominates; rolling stats add marginal gain |
| 5 | Window extension | Windows 7 and 12 give additional logloss reduction |
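Studies 1, 3 and 5 vary the rolling-window list and the side representation. As a rough illustration of what such features look like, the sketch below computes a rolling win rate per window with `home`, `away` and `diff` variants. The feature names and the input tuple layout are hypothetical, and only the win rate is shown, not the full set of rolling statistics.

```python
from collections import defaultdict, deque

WINDOWS = [1, 2, 3, 5, 7, 10, 12]


def rolling_win_rate_features(matches, windows=WINDOWS):
    """matches: chronologically sorted (home, away, outcome) with outcome in {"H", "D", "A"}."""
    history = defaultdict(lambda: deque(maxlen=max(windows)))  # team -> recent win flags
    out = []
    for home, away, outcome in matches:
        feats = {}
        for w in windows:
            for side, team in (("home", home), ("away", away)):
                # Only matches played before this one are visible, so no target leakage.
                recent = list(history[team])[-w:]
                feats[f"win_rate_{w}_{side}"] = sum(recent) / len(recent) if recent else None
            h, a = feats[f"win_rate_{w}_home"], feats[f"win_rate_{w}_away"]
            feats[f"win_rate_{w}_diff"] = None if h is None or a is None else h - a
        out.append(feats)
        # Update histories after the features for this match have been emitted.
        history[home].append(1 if outcome == "H" else 0)
        history[away].append(1 if outcome == "A" else 0)
    return out


matches = [("A", "B", "H"), ("B", "A", "D"), ("A", "B", "A")]
features = rolling_win_rate_features(matches)
```

Each match yields three columns per window, so seven windows give 21 columns for this single statistic; the first match of every team has no history and produces `None`.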
→ Full report: Experiment Studies
Stage 04 · Model Analysis
DVC stages: tune_xgb, tune_logreg, tune_hgb, select_model, final_train
try:
    import os
    import mlflow
    from src.pipelines._config import get_pipeline_config
    _cfg04 = get_pipeline_config()
    os.environ.setdefault("MLFLOW_S3_ENDPOINT_URL", _cfg04.minio_endpoint_url)
    os.environ.setdefault("AWS_ACCESS_KEY_ID", _cfg04.minio_access_key)
    os.environ.setdefault("AWS_SECRET_ACCESS_KEY", _cfg04.minio_secret_key)
    mlflow.set_tracking_uri(_cfg04.mlflow_tracking_uri)
    _client04 = mlflow.tracking.MlflowClient()
    _model_name = PARAMS.get("inference", {}).get("model_name", "soccer-match-outcome")
    _alias = PARAMS.get("inference", {}).get("model_stage", "champion")
    _mv04 = _client04.get_model_version_by_alias(_model_name, _alias)
    _champ04 = _client04.get_run(_mv04.run_id)
    _p04 = _champ04.data.params
    _m04 = _champ04.data.metrics
    display(HTML(_metric_row([
        ("champion", str(_p04.get("model.name", "—"))),
        ("version", str(_mv04.version)),
        ("logloss", f'{float(_m04.get("final.logloss", _m04.get("holdout.logloss", 0))):.4f}'),
        ("accuracy", f'{float(_m04.get("final.accuracy", _m04.get("holdout.accuracy", 0))):.3f}'),
        ("calibration", str(_p04.get("calibration.method", "isotonic"))),
    ])))
except Exception:
    display(Markdown(
        "*Champion model: **XGBoost** with isotonic calibration, selected by holdout logloss.*\n\n"
        "*Connect to MLflow to see live metrics.*"
    ))

champion: xgb version: 10 logloss: 1.0043 accuracy: 0.504 calibration: isotonic
Full champion model lifecycle:
- Optuna tuning — 20 trials each for XGBoost, Logistic Regression, HistGradientBoosting
- Model selection — best trial per model class compared on holdout logloss
- Final training — champion retrained on full train+val with isotonic calibration (15% calib fraction)
- Registration — soccer-match-outcome → @champion in MLflow Model Registry
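Holdout logloss, the selection criterion named in the list above, is the mean negative log-probability the model assigns to the true outcome. The sketch below is a plain-Python version for a three-class problem; the class encoding (0 = home win, 1 = draw, 2 = away win) is an assumption for illustration. A predictor that always outputs 1/3 per class scores ln 3 ≈ 1.0986, which gives a scale for reading the reported champion value of 1.0043.

```python
import math


def multiclass_logloss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, p in zip(y_true, probs):
        # Clip to avoid log(0) when a model is overconfident and wrong.
        total += -math.log(min(max(p[label], eps), 1.0))
    return total / len(y_true)


# Hypothetical encoding: 0 = home win, 1 = draw, 2 = away win.
y_true = [0, 1, 2, 0]
uniform = [[1 / 3, 1 / 3, 1 / 3]] * len(y_true)
baseline = multiclass_logloss(y_true, uniform)  # equals ln(3)

confident = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9], [0.9, 0.05, 0.05]]
better = multiclass_logloss(y_true, confident)  # equals -ln(0.9)
```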
Stage 05 · Holdout Analysis
DVC stage: analysis
_err_dir05 = _root / "data" / "analysis" / "error"
_roi_dir05 = _root / "data" / "analysis" / "roi"
try:
    _df_err05 = pd.read_csv(_err_dir05 / "roi_simulation.csv")
    _df_roi05 = pd.read_csv(_roi_dir05 / "roi_simulation.csv")
    _items05 = []
    # error metrics — uniform betting baseline
    for _col, _fmt in [("roi_pct", "{:.2f}%"), ("hit_rate", "{:.1%}"), ("bet_rate", "{:.1%}")]:
        if _col in _df_err05.columns:
            _v = pd.to_numeric(_df_err05[_col], errors="coerce").dropna()
            if len(_v):
                _items05.append((f"err/{_col}", _fmt.format(_v.iloc[0])))
    # ROI metrics — edge-filtered betting
    for _col, _fmt in [("roi_pct", "{:.2f}%"), ("n_bets", "{:.0f}"), ("hit_rate", "{:.1%}")]:
        if _col in _df_roi05.columns:
            _v = pd.to_numeric(_df_roi05[_col], errors="coerce").dropna()
            if len(_v):
                _items05.append((f"roi/{_col}", _fmt.format(_v.iloc[0])))
    if _items05:
        display(HTML(_metric_row(_items05)))
    else:
        raise ValueError("no matching columns")
except Exception:
    display(Markdown(
        "*Error analysis on holdout set + ROI simulation vs Bet365 closing odds.*\n\n"
        "*Run `dvc repro analysis` to generate artifacts.*"
    ))

err/roi_pct: -0.70% err/hit_rate: 50.4% err/bet_rate: 100.0% roi/roi_pct: -11.47% roi/n_bets: 13317 roi/hit_rate: 34.7%
Evaluates the registered champion on the temporal holdout set (test_start: 2024-01-01):
- Error analysis — confusion matrix, per-class precision/recall, season/region/ELO-gap breakdowns
- ROI simulation — flat-stake betting vs Bet365 closing odds at various edge thresholds
- Benchmark — model ROI vs. naive Bet365 market return
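A flat-stake simulation with an edge threshold, as described in the list above, might look like the sketch below. The edge definition used here (model probability times decimal odds, minus one) is a common convention and an assumption; the report does not state which definition the `analysis` stage uses.

```python
def flat_stake_roi(bets, edge_threshold=0.0, stake=1.0):
    """bets: iterable of (model_prob, decimal_odds, won). Returns (roi_pct, n_bets)."""
    staked = 0.0
    profit = 0.0
    n_bets = 0
    for model_prob, odds, won in bets:
        edge = model_prob * odds - 1.0  # expected profit per unit staked, under the model
        if edge < edge_threshold:
            continue  # skip bets the model does not consider valuable enough
        n_bets += 1
        staked += stake
        profit += stake * (odds - 1.0) if won else -stake
    roi_pct = 100.0 * profit / staked if staked else 0.0
    return roi_pct, n_bets


# Toy data: (model probability, decimal odds, did the selection win?)
bets = [(0.55, 2.10, True), (0.30, 3.00, False), (0.40, 3.00, False)]
roi_pct, n_bets = flat_stake_roi(bets, edge_threshold=0.05)
```

With this toy input the second bet is filtered out (negative edge), one of the two remaining bets wins at 2.10, and the result is a 5% ROI over 2 bets. Raising `edge_threshold` trades bet volume for selectivity, which is the axis the ROI simulation sweeps.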
→ Full report: Holdout Analysis
Stage 06 · Live Inference & Odds
DVC stage: batch_inference · Airflow: soccer_etl_odds_fonbet_01/02/03
_inf_dir = _root / "data" / "predictions"
try:
    _pq_files06 = sorted(_inf_dir.glob("*.parquet"))
    if not _pq_files06:
        raise FileNotFoundError
    _df_inf = pd.concat([pd.read_parquet(f) for f in _pq_files06], ignore_index=True)
    _items06 = [("predictions", f"{len(_df_inf):,}")]
    _dcol = next((c for c in ["match_date", "date"] if c in _df_inf.columns), None)
    if _dcol:
        _mx = pd.to_datetime(_df_inf[_dcol], errors="coerce").max()
        if pd.notna(_mx):
            _items06.append(("latest match", str(_mx.date())))
    display(HTML(_metric_row(_items06)))
except Exception:
    display(Markdown("*Run `dvc repro batch_inference` to generate prediction artifacts.*"))

predictions: 368,758
Covers the online path from trained model to live predictions:
- batch_inference — DVC stage that scores all upcoming matches using @champion
- Fonbet scraping pipeline — 3-stage Airflow DAG chain (raw scrape → fuzzy match → factor extraction), runs every 4 h since 2026-05-15
- /predict/cards/ API — FastAPI endpoint merging model probabilities with live Fonbet odds
| Component | Type | Schedule |
|---|---|---|
| batch_inference | DVC stage | on-demand / after final_train |
| soccer_etl_odds_fonbet_01_raw | Airflow DAG | every 4 h |
| soccer_etl_odds_fonbet_02_link | Airflow DAG | triggered after DAG 01 |
| soccer_etl_odds_fonbet_03_odds | Airflow DAG | triggered after DAG 02 |
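The fuzzy-match step of the odds pipeline has to link scraped bookmaker team names to the names used in the match data. The report does not say how this is done; the sketch below uses the standard library's `difflib.get_close_matches` purely as a stand-in, and the function name `link_team` and the 0.8 cutoff are hypothetical.

```python
from difflib import get_close_matches


def link_team(scraped_name, known_teams, cutoff=0.8):
    """Return the best-matching canonical team name, or None if nothing clears the cutoff."""
    lookup = {team.lower(): team for team in known_teams}  # case-insensitive comparison
    hits = get_close_matches(scraped_name.lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None


known = ["Manchester United", "Manchester City", "Real Madrid"]
linked = link_team("Manchester Utd", known)
unlinked = link_team("Zenit", known)
```

Returning `None` for unmatched names, rather than the nearest candidate, keeps a wrong link from silently attaching one team's odds to another team's prediction.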
→ Full report: Live Inference & Odds
Stage 07 · Live Betting Strategy
Source: batch_inference predictions × Fonbet live odds
_live_dir = _root / "data" / "analysis" / "live_betting"
try:
    _candidates07 = list(_live_dir.glob("*.parquet")) + list(_live_dir.glob("*.csv"))
    if not _candidates07:
        raise FileNotFoundError
    _dfs07 = []
    for _f in _candidates07[:5]:
        try:
            _dfs07.append(pd.read_parquet(_f) if _f.suffix == ".parquet" else pd.read_csv(_f))
        except Exception:
            pass
    if _dfs07:
        _df_live = pd.concat(_dfs07, ignore_index=True)
        _items07 = [("records", f"{len(_df_live):,}")]
        for _col in ["pnl", "profit", "roi", "kelly_pnl"]:
            _cands = [c for c in _df_live.columns if _col in c.lower()]
            if _cands:
                _val = pd.to_numeric(_df_live[_cands[0]], errors="coerce").sum()
                _items07.append((_col, f"{_val:.2f}"))
        display(HTML(_metric_row(_items07)))
    else:
        raise ValueError
except Exception:
    display(Markdown("*Run the live betting analysis after collecting Fonbet odds.*"))

Run the live betting analysis after collecting Fonbet odds.
Simulates betting strategies using model edge against live Fonbet pre-match odds:
- Flat-stake — fixed 1-unit bet at various edge thresholds (0%–15%)
- Fractional Kelly — bet size proportional to estimated edge (½-Kelly, ¼-Kelly)
- ROI by region — per-tournament/region breakdown reveals large variance: some regions yield positive ROI, others negative
- Strategy — filter out negative-ROI regions, concentrate bets where the model has consistent edge
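The fractional Kelly sizing mentioned above can be sketched with the textbook Kelly formula f = (p·b − (1 − p)) / b, where b is the net decimal odds, scaled by ½ or ¼. The function name and the clipping of negative stakes to zero are illustrative assumptions, not the project's code.

```python
def kelly_fraction(model_prob, decimal_odds, fraction=0.5):
    """Fractional Kelly stake as a share of bankroll; 0 when the edge is non-positive."""
    b = decimal_odds - 1.0  # net profit per unit staked on a win
    if b <= 0:
        return 0.0
    full = (model_prob * b - (1.0 - model_prob)) / b  # full Kelly fraction
    return max(0.0, fraction * full)


half = kelly_fraction(0.55, 2.0, fraction=0.5)      # half-Kelly
quarter = kelly_fraction(0.55, 2.0, fraction=0.25)  # quarter-Kelly
no_bet = kelly_fraction(0.40, 2.0)                  # negative edge, no stake
```

At even odds and a 55% model probability, full Kelly would stake 10% of the bankroll, so half-Kelly stakes 5% and quarter-Kelly 2.5%. The fractional variants reduce exposure to errors in the model's probability estimates.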
→ Full report: Live Betting Strategy
Quick Reference
| # | Report | DVC stage(s) | Status |
|---|---|---|---|
| 01 | EDA & Preprocessing | load_data_from_sources, preprocessing | ✅ Implemented |
| 02 | Feature Engineering | feature_engineering | ✅ Implemented |
| 03 | Experiment Studies v1.01–v1.05 | classification_models | ✅ Implemented |
| 04 | Model Analysis | tune_xgb, tune_logreg, tune_hgb, select_model, final_train | ✅ Implemented |
| 05 | Holdout Analysis | analysis | ✅ Implemented |
| 06 | Live Inference & Odds | batch_inference | ✅ Implemented |
| 07 | Live Betting Strategy | batch_inference (standalone) | ✅ Implemented |
How to regenerate
# Run the full pipeline then rebuild all docs
dvc repro
make docs-build

Reports use freeze: auto — cells are re-executed only when their source changes. Individual reports can be rendered with:
quarto render reports/qmd/01_eda_and_preprocessing.qmd