SoccerPredictAI — Pipeline Summary

End-to-end MLOps system for football match outcome prediction

Author

Dima Ivanov

Published

May 28, 2026

This page is the entry point for all SoccerPredictAI analysis reports. Each section below summarises one pipeline stage and links to the full report. After dvc repro, re-run make docs-build to refresh all rendered outputs.


Pipeline Overview

flowchart TD
    A[load_data_from_sources] --> B[preprocessing]
    B --> C[feature_engineering]
    C --> D["classification_models<br/>218 runs · 7 studies"]
    D --> E["tune_xgb / tune_logreg / tune_hgb"]
    E --> F[select_model]
    F --> G["final_train + register_model"]
    G --> H["analysis<br/>holdout + ROI"]
    G --> I[batch_inference]
    I --> J["Fonbet Airflow DAGs<br/>3-stage odds pipeline"]
    J --> K["/predict/cards/ API"]

    style A fill:#e8f4fd,stroke:#4e79a7
    style B fill:#e8f4fd,stroke:#4e79a7
    style C fill:#e8fde8,stroke:#59a14f
    style D fill:#fef9e8,stroke:#f28e2b
    style E fill:#fde8f4,stroke:#b07aa1
    style F fill:#fde8f4,stroke:#b07aa1
    style G fill:#fde8f4,stroke:#b07aa1
    style H fill:#e8fde8,stroke:#59a14f
    style I fill:#fde8e8,stroke:#e15759
    style J fill:#fde8e8,stroke:#e15759
    style K fill:#fde8e8,stroke:#e15759


Stage 01 · EDA & Preprocessing

DVC stages: load_data_from_sources, preprocessing

_root = project_root
_paths01 = [
    (_root / "data" / "raw"     / "match_raw.parquet",  "raw matches"),
    (_root / "data" / "interim" / "finished.parquet",   "finished"),
    (_root / "data" / "interim" / "future.parquet",     "upcoming"),
]
_items01 = []
for _path, _label in _paths01:
    try:
        _items01.append((_label, f"{len(pd.read_parquet(_path)):,}"))
    except Exception:
        pass

if _items01:
    display(HTML(_metric_row(_items01)))
else:
    display(Markdown("*Artifacts not found — run `dvc repro preprocessing` first.*"))

raw matches: 993,490 · finished: 972,233 · upcoming: 2,712

Explores the raw data ingested from external sources: temporal span, target class balance, league/region coverage, and all transformations applied during preprocessing (score filtering, duplicate removal, future/finished split).
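
As a rough illustration of the three transformations named above (score filtering, duplicate removal, future/finished split), the sketch below applies them to a few hand-made records. The field names (`match_id`, `status`, `home_score`, `away_score`) and the rule for what counts as a usable score are assumptions made for this example, not the project's actual schema or logic.

```python
# Illustrative only: field names and filtering rules are assumptions,
# not the project's actual schema.

def preprocess(rows):
    """Drop duplicate match ids, then split rows into finished / upcoming."""
    seen = set()
    finished, upcoming = [], []
    for row in rows:
        if row["match_id"] in seen:           # duplicate removal
            continue
        seen.add(row["match_id"])
        if row["status"] == "upcoming":       # future/finished split
            upcoming.append(row)
            continue
        h, a = row["home_score"], row["away_score"]
        # score filtering: keep finished matches only if both scores are valid
        if isinstance(h, int) and isinstance(a, int) and h >= 0 and a >= 0:
            finished.append(row)
    return finished, upcoming


raw = [
    {"match_id": 1, "status": "finished", "home_score": 2, "away_score": 1},
    {"match_id": 1, "status": "finished", "home_score": 2, "away_score": 1},        # duplicate
    {"match_id": 2, "status": "finished", "home_score": None, "away_score": None},  # no score
    {"match_id": 3, "status": "upcoming", "home_score": None, "away_score": None},
]
finished, upcoming = preprocess(raw)
print(len(raw), len(finished), len(upcoming))
```

Rows dropped by either filter appear in neither output, which is one reason the finished and upcoming counts need not add up to the raw count.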

Key questions: How many matches are available? What is the temporal coverage? How balanced is the 1X2 target?

→ Full report: EDA & Preprocessing


Stage 02 · Feature Engineering

DVC stage: feature_engineering

_feat_path = _root / "data" / "features" / "features.parquet"

try:
    _df_feat = pd.read_parquet(_feat_path)
    display(HTML(_metric_row([
        ("rows",      f"{len(_df_feat):,}"),
        ("features",  str(_df_feat.shape[1])),
        ("missing %", f"{_df_feat.isnull().mean().mean() * 100:.1f}%"),
    ])))
except Exception:
    display(Markdown("*Artifacts not found — run `dvc repro feature_engineering` first.*"))

rows: 972,233 · features: 507 · missing %: 2.3%

Builds the offline feature matrix:

  • ELO ratings — per-tournament team strength (K=32, initial=1500, home advantage=50)
  • Rolling statistics — win/draw/loss rate, goals for/against over windows [1, 2, 3, 5, 7, 10, 12]
  • Representation variants — home, away, diff encodings per rolling stat
  • Categorical — sex column
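
For orientation, a minimal sketch of an Elo update using the parameters listed above (K=32, initial=1500, home advantage=50). The 400-point logistic scale is the conventional Elo choice and is an assumption here, as are the function names and the `(tournament, team)` keying used to model "per-tournament" ratings; the project's feature code may differ.

```python
from collections import defaultdict

K = 32             # update step, from the list above
INITIAL = 1500.0   # starting rating, from the list above
HOME_ADV = 50.0    # home advantage in rating points, from the list above
SCALE = 400.0      # conventional Elo scale (assumption)


def expected_home(r_home, r_away):
    """Expected score of the home side, with home advantage added to its rating."""
    return 1.0 / (1.0 + 10 ** ((r_away - (r_home + HOME_ADV)) / SCALE))


def update(ratings, tournament, home, away, home_goals, away_goals):
    """Update ratings in place and return the pre-match ratings.

    The pre-match values are what a feature row should use, so that the
    feature never contains information from the match it describes.
    """
    r_h = ratings[(tournament, home)]
    r_a = ratings[(tournament, away)]
    if home_goals > away_goals:
        score = 1.0
    elif home_goals == away_goals:
        score = 0.5
    else:
        score = 0.0
    delta = K * (score - expected_home(r_h, r_a))
    ratings[(tournament, home)] = r_h + delta
    ratings[(tournament, away)] = r_a - delta
    return r_h, r_a


ratings = defaultdict(lambda: INITIAL)
pre = update(ratings, "league-x", "A", "B", 2, 0)
print(pre, ratings[("league-x", "A")], ratings[("league-x", "B")])
```

Because the same `delta` is added to one side and subtracted from the other, the total rating within a tournament stays constant in this sketch.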

Key questions: Which features carry the most signal? How complete is the feature matrix at different window sizes?
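
One way to see why completeness depends on window size: if a rolling statistic needs a full window of prior matches, the first `window` matches of every team have no value. The sketch below assumes that full-window rule and uses made-up goal counts; the helper names are hypothetical and the real feature code may handle short histories differently.

```python
WINDOWS = [1, 2, 3, 5, 7, 10, 12]  # window sizes from the feature list


def rolling_mean_before(values, window):
    """Mean of the `window` values before each position; None if history is too short."""
    out = []
    for i in range(len(values)):
        past = values[max(0, i - window):i]   # strictly earlier matches only
        out.append(sum(past) / window if len(past) == window else None)
    return out


def diff(home_value, away_value):
    """'diff' representation: home minus away, missing if either side is missing."""
    if home_value is None or away_value is None:
        return None
    return home_value - away_value


goals_for = [1, 3, 0, 2, 2, 1]  # one team's goals in chronological order (made up)
for w in WINDOWS[:3]:
    feats = rolling_mean_before(goals_for, w)
    missing = sum(v is None for v in feats) / len(feats)
    print(w, feats, f"missing={missing:.0%}")
```

Under this rule the share of missing values grows with the window, and a `diff` feature is missing whenever either side lacks history.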

→ Full report: Feature Engineering


Stage 03 · Experiment Studies

DVC stage: classification_models · Branches: experiment/study_v1.01–v1.05

_runs_csv = _root / "_temp" / "runs.csv"

try:
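    import csv
    # Use a context manager so the file handle is closed after reading
    with open(_runs_csv, newline="") as _fh03:
        _rows_csv = list(csv.DictReader(_fh03))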
    _finished03 = [r for r in _rows_csv if r.get("Status", "").upper() == "FINISHED"]
    _df_runs = pd.DataFrame(_finished03)

    _items03 = [("total runs", str(len(_finished03)))]
    if "holdout.logloss" in _df_runs.columns:
        _df_runs["holdout.logloss"] = pd.to_numeric(_df_runs["holdout.logloss"], errors="coerce")
        _best03 = _df_runs.loc[_df_runs["holdout.logloss"].idxmin()]
        _items03.append(("best logloss", f'{_best03["holdout.logloss"]:.4f}'))
        if "model.name" in _df_runs.columns:
            _items03.append(("best model", str(_best03.get("model.name", "—"))))
    display(HTML(_metric_row(_items03)))
except Exception:
    display(Markdown(
        "*218 FINISHED runs across 7 studies (v1.01–v1.05). "
        "CSV cache not found — connect to MLflow.*"
    ))

218 FINISHED runs across 7 studies (v1.01–v1.05). CSV cache not found — connect to MLflow.

Systematic exploration across 7 studies to identify the best training configuration:

| Study | Focus | Conclusion |
|---|---|---|
| 0.1 | Initial sweep | Establishes baseline across all models |
| 0.2 | Learning curve | SGD needs ≥10% data; tree models stable at 1% |
| 1 | Window sizes | Wider windows reduce logloss; [1,2,3,5,10] is solid |
| 2 | Class weights | Draw up-weighting (1.25×) slightly improves balance |
| 3 | Side representation | diff encoding adds signal on top of home/away |
| 4 | Feature ablation | ELO dominates; rolling stats add marginal gain |
| 5 | Window extension | Windows 7 and 12 give additional logloss reduction |

Key question: Which model class and config should proceed to Optuna tuning?

→ Full report: Experiment Studies


Stage 04 · Model Analysis

DVC stages: tune_xgb, tune_logreg, tune_hgb, select_model, final_train

try:
    import os
    import mlflow
    from src.pipelines._config import get_pipeline_config
    _cfg04 = get_pipeline_config()
    os.environ.setdefault("MLFLOW_S3_ENDPOINT_URL", _cfg04.minio_endpoint_url)
    os.environ.setdefault("AWS_ACCESS_KEY_ID",      _cfg04.minio_access_key)
    os.environ.setdefault("AWS_SECRET_ACCESS_KEY",  _cfg04.minio_secret_key)
    mlflow.set_tracking_uri(_cfg04.mlflow_tracking_uri)
    _client04   = mlflow.tracking.MlflowClient()
    _model_name = PARAMS.get("inference", {}).get("model_name", "soccer-match-outcome")
    _alias      = PARAMS.get("inference", {}).get("model_stage", "champion")
    _mv04       = _client04.get_model_version_by_alias(_model_name, _alias)
    _champ04    = _client04.get_run(_mv04.run_id)
    _p04 = _champ04.data.params
    _m04 = _champ04.data.metrics
    display(HTML(_metric_row([
        ("champion",    str(_p04.get("model.name", "—"))),
        ("version",     str(_mv04.version)),
        ("logloss",     f'{float(_m04.get("final.logloss", _m04.get("holdout.logloss", 0))):.4f}'),
        ("accuracy",    f'{float(_m04.get("final.accuracy", _m04.get("holdout.accuracy", 0))):.3f}'),
        ("calibration", str(_p04.get("calibration.method", "isotonic"))),
    ])))
except Exception:
    display(Markdown(
        "*Champion model: **XGBoost** with isotonic calibration, selected by holdout logloss.*\n\n"
        "*Connect to MLflow to see live metrics.*"
    ))

champion: xgb · version: 10 · logloss: 1.0043 · accuracy: 0.504 · calibration: isotonic

Full champion model lifecycle:

  • Optuna tuning — 20 trials each for XGBoost, Logistic Regression, HistGradientBoosting
  • Model selection — best trial per model class compared on holdout logloss
  • Final training — champion retrained on full train+val with isotonic calibration (15% calib fraction)
  • Registration — soccer-match-outcome → @champion in MLflow Model Registry
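
Since selection above is driven by holdout logloss, it helps to have a reference point for the metric. The sketch below computes multiclass log loss for a 1X2 target and shows that a uniform (1/3, 1/3, 1/3) forecast scores ln 3 ≈ 1.0986, so the reported 1.0043 sits below that uninformed baseline. The class labels and function name are assumptions for illustration.

```python
import math

CLASSES = ("home", "draw", "away")   # assumed label order for the 1X2 target


def multiclass_logloss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, p in zip(y_true, probs):
        # clip to avoid log(0) on overconfident wrong predictions
        p_true = min(max(p[CLASSES.index(label)], eps), 1.0)
        total -= math.log(p_true)
    return total / len(y_true)


y = ["home", "draw", "away", "home"]
uniform = [(1 / 3, 1 / 3, 1 / 3)] * len(y)
print(multiclass_logloss(y, uniform), math.log(3))
```

A forecast that always predicts the empirical class frequencies would be a stricter baseline than the uniform one.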

Key questions: How well-calibrated are the probabilities? Which features drive predictions (SHAP)?

→ Full report: Model Analysis


Stage 05 · Holdout Analysis

DVC stage: analysis

_err_dir05 = _root / "data" / "analysis" / "error"
_roi_dir05 = _root / "data" / "analysis" / "roi"

try:
    _df_err05 = pd.read_csv(_err_dir05 / "roi_simulation.csv")
    _df_roi05 = pd.read_csv(_roi_dir05 / "roi_simulation.csv")

    _items05 = []

    # error metrics — uniform betting baseline
    for _col, _fmt in [("roi_pct", "{:.2f}%"), ("hit_rate", "{:.1%}"), ("bet_rate", "{:.1%}")]:
        if _col in _df_err05.columns:
            _v = pd.to_numeric(_df_err05[_col], errors="coerce").dropna()
            if len(_v):
                _items05.append((f"err/{_col}", _fmt.format(_v.iloc[0])))

    # ROI metrics — edge-filtered betting
    for _col, _fmt in [("roi_pct", "{:.2f}%"), ("n_bets", "{:.0f}"), ("hit_rate", "{:.1%}")]:
        if _col in _df_roi05.columns:
            _v = pd.to_numeric(_df_roi05[_col], errors="coerce").dropna()
            if len(_v):
                _items05.append((f"roi/{_col}", _fmt.format(_v.iloc[0])))

    if _items05:
        display(HTML(_metric_row(_items05)))
    else:
        raise ValueError("no matching columns")
except Exception:
    display(Markdown(
        "*Error analysis on holdout set + ROI simulation vs Bet365 closing odds.*\n\n"
        "*Run `dvc repro analysis` to generate artifacts.*"
    ))

err/roi_pct: -0.70% · err/hit_rate: 50.4% · err/bet_rate: 100.0% · roi/roi_pct: -11.47% · roi/n_bets: 13317 · roi/hit_rate: 34.7%

Evaluates the registered champion on the temporal holdout set (test_start: 2024-01-01):

  • Error analysis — confusion matrix, per-class precision/recall, season/region/ELO-gap breakdowns
  • ROI simulation — flat-stake betting vs Bet365 closing odds at various edge thresholds
  • Benchmark — model ROI vs. naive Bet365 market return
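
A flat-stake simulation with an edge threshold can be sketched as below. The edge definition (model probability × decimal odds − 1), the tuple layout, and the function name are assumptions for illustration; the `analysis` stage may define edge or settle bets differently. The numbers are made up.

```python
def flat_stake_roi(bets, threshold):
    """Simulate 1-unit bets on selections whose edge clears `threshold`.

    bets: iterable of (model_prob, decimal_odds, won) tuples.
    Returns (roi_pct, n_bets, hit_rate).
    """
    staked = 0
    wins = 0
    profit = 0.0
    for p, odds, won in bets:
        edge = p * odds - 1.0        # expected profit per unit staked, under the model
        if edge < threshold:
            continue
        staked += 1
        if won:
            profit += odds - 1.0     # net winnings on a 1-unit stake
            wins += 1
        else:
            profit -= 1.0            # stake lost
    if staked == 0:
        return 0.0, 0, 0.0
    return 100.0 * profit / staked, staked, wins / staked


bets = [
    (0.50, 2.20, True),
    (0.30, 3.00, False),
    (0.40, 2.80, False),
    (0.25, 3.50, False),
]
for thr in (0.0, 0.05, 0.15):
    print(thr, flat_stake_roi(bets, thr))
```

Raising the threshold reduces the number of bets, so ROI estimates at high thresholds rest on fewer observations.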

Key questions: Does the model have positive ROI at realistic edge thresholds? Where does it fail?

→ Full report: Holdout Analysis


Stage 06 · Live Inference & Odds

DVC stage: batch_inference · Airflow: soccer_etl_odds_fonbet_01/02/03

_inf_dir = _root / "data" / "predictions"

try:
    _pq_files06 = sorted(_inf_dir.glob("*.parquet"))
    if not _pq_files06:
        raise FileNotFoundError
    _df_inf = pd.concat([pd.read_parquet(f) for f in _pq_files06], ignore_index=True)
    _items06 = [("predictions", f"{len(_df_inf):,}")]
    _dcol = next((c for c in ["match_date", "date"] if c in _df_inf.columns), None)
    if _dcol:
        _mx = pd.to_datetime(_df_inf[_dcol], errors="coerce").max()
        if pd.notna(_mx):
            _items06.append(("latest match", str(_mx.date())))
    display(HTML(_metric_row(_items06)))
except Exception:
    display(Markdown("*Run `dvc repro batch_inference` to generate prediction artifacts.*"))

predictions: 368,758

Covers the online path from trained model to live predictions:

  • batch_inference — DVC stage that scores all upcoming matches using @champion
  • Fonbet scraping pipeline — 3-stage Airflow DAG chain (raw scrape → fuzzy match → factor extraction), runs every 4 h since 2026-05-15
  • /predict/cards/ API — FastAPI endpoint merging model probabilities with live Fonbet odds

| Component | Type | Schedule |
|---|---|---|
| batch_inference | DVC stage | on-demand / after final_train |
| soccer_etl_odds_fonbet_01_raw | Airflow DAG | every 4 h |
| soccer_etl_odds_fonbet_02_link | Airflow DAG | triggered after DAG 01 |
| soccer_etl_odds_fonbet_03_odds | Airflow DAG | triggered after DAG 02 |
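
The fuzzy-match step of the scraping chain has to map scraped team names onto known ones. A minimal sketch of that idea using the standard library's `difflib` follows; the function name, the 0.8 cutoff, and the lower-casing are assumptions for illustration, and the actual DAG may use a different matcher entirely.

```python
import difflib


def link_team(name, known_teams, cutoff=0.8):
    """Return the closest known team name, or None if nothing clears the cutoff."""
    lookup = {t.lower(): t for t in known_teams}   # compare case-insensitively
    hits = difflib.get_close_matches(name.lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None


known = ["Real Madrid", "Real Sociedad", "Zenit"]
print(link_team("Real Madird", known))   # typo in the scraped name
print(link_team("Unknown FC", known))    # no sufficiently close candidate
```

Returning `None` for weak matches keeps unlinked events out of the odds table instead of attaching them to the wrong fixture.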

Key questions: What does the model predict for upcoming matches? How do model probs compare to live bookmaker odds?
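
To compare model probabilities with bookmaker odds, the odds first have to be turned into probabilities. The sketch below converts decimal 1X2 odds to implied probabilities and removes the bookmaker margin by proportional normalisation, which is one common convention and an assumption here; the odds and model probabilities are made up, and the function names are hypothetical.

```python
def implied_probs(odds_home, odds_draw, odds_away):
    """Convert decimal 1X2 odds to probabilities that sum to 1.

    Returns (probabilities, margin), where margin is the bookmaker overround.
    """
    raw = [1.0 / odds_home, 1.0 / odds_draw, 1.0 / odds_away]
    total = sum(raw)                  # exceeds 1 when the book carries a margin
    return [r / total for r in raw], total - 1.0


def edges(model_probs, odds):
    """Expected profit per unit stake for each outcome, under the model."""
    return [p * o - 1.0 for p, o in zip(model_probs, odds)]


odds = (2.10, 3.40, 3.60)     # made-up decimal odds: home, draw, away
model = (0.50, 0.26, 0.24)    # made-up model probabilities
market, margin = implied_probs(*odds)
print([round(m, 3) for m in market], round(margin, 3))
print([round(e, 3) for e in edges(model, odds)])
```

An outcome has positive edge only where the model probability exceeds 1/odds, so the margin has to be overcome before any bet is worthwhile.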

→ Full report: Live Inference & Odds


Stage 07 · Live Betting Strategy

Source: batch_inference predictions × Fonbet live odds

_live_dir = _root / "data" / "analysis" / "live_betting"

try:
    _candidates07 = list(_live_dir.glob("*.parquet")) + list(_live_dir.glob("*.csv"))
    if not _candidates07:
        raise FileNotFoundError

    _dfs07 = []
    for _f in _candidates07[:5]:
        try:
            _dfs07.append(pd.read_parquet(_f) if _f.suffix == ".parquet" else pd.read_csv(_f))
        except Exception:
            pass

    if _dfs07:
        _df_live = pd.concat(_dfs07, ignore_index=True)
        _items07 = [("records", f"{len(_df_live):,}")]
        for _col in ["pnl", "profit", "roi", "kelly_pnl"]:
            _cands = [c for c in _df_live.columns if _col in c.lower()]
            if _cands:
                _val = pd.to_numeric(_df_live[_cands[0]], errors="coerce").sum()
                _items07.append((_col, f"{_val:.2f}"))
        display(HTML(_metric_row(_items07)))
    else:
        raise ValueError
except Exception:
    display(Markdown("*Run the live betting analysis after collecting Fonbet odds.*"))

Run the live betting analysis after collecting Fonbet odds.

Simulates betting strategies using model edge against live Fonbet pre-match odds:

  • Flat-stake — fixed 1-unit bet at various edge thresholds (0%–15%)
  • Fractional Kelly — bet size proportional to estimated edge (½-Kelly, ¼-Kelly)
  • ROI by region — per-tournament/region breakdown reveals large variance: some regions yield positive ROI, others negative
  • Strategy — filter out negative-ROI regions, concentrate bets where the model has consistent edge
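
For the fractional Kelly sizing mentioned above, the full-Kelly stake for decimal odds o and win probability p is (p·o − 1) / (o − 1), scaled by ½ or ¼. A minimal sketch, with a hypothetical function name and made-up inputs:

```python
def kelly_fraction(p, odds, multiplier=0.5):
    """Fraction of bankroll to stake.

    multiplier=1.0 is full Kelly, 0.5 half-Kelly, 0.25 quarter-Kelly.
    Returns 0.0 when the model sees no positive edge.
    """
    net = odds - 1.0                     # net winnings per unit staked
    if net <= 0:
        return 0.0
    full = (p * odds - 1.0) / net        # full Kelly = edge / net odds
    return max(0.0, full) * multiplier


for mult in (1.0, 0.5, 0.25):
    print(mult, round(kelly_fraction(0.50, 2.20, mult), 4))
```

Fractional multipliers trade expected growth for lower variance, which matters when the probability estimate itself is uncertain.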

Key finding: ROI varies significantly across regions. Dropping regions with negative ROI and keeping only those with a positive edge against Fonbet odds substantially improves overall strategy P&L, which makes region-level filtering the primary lever for improving live betting returns.

→ Full report: Live Betting Strategy


Quick Reference

| # | Report | DVC stage(s) | Status |
|---|---|---|---|
| 01 | EDA & Preprocessing | load_data_from_sources, preprocessing | ✅ Implemented |
| 02 | Feature Engineering | feature_engineering | ✅ Implemented |
| 03 | Experiment Studies v1.01–v1.05 | classification_models | ✅ Implemented |
| 04 | Model Analysis | tune_xgb, tune_logreg, tune_hgb, select_model, final_train | ✅ Implemented |
| 05 | Holdout Analysis | analysis | ✅ Implemented |
| 06 | Live Inference & Odds | batch_inference | ✅ Implemented |
| 07 | Live Betting Strategy | batch_inference (standalone) | ✅ Implemented |

How to regenerate

# Run the full pipeline then rebuild all docs
dvc repro
make docs-build

Reports use freeze: auto — cells are re-executed only when their source changes. Individual reports can be rendered with:

quarto render reports/qmd/01_eda_and_preprocessing.qmd