SoccerPredictAI — Pipeline Summary

End-to-end MLOps system for football match outcome prediction

Author

Dima Ivanov

Published

May 29, 2026

This page is the entry point for all SoccerPredictAI analysis reports. Each section below summarises one pipeline stage and links to the full report. After running dvc repro, re-run make docs-build to refresh all rendered outputs.

Pipeline Overview

flowchart TD
    A[load_data_from_sources] --> B[preprocessing]
    B --> C[feature_engineering]
    C --> D["classification_models<br/>218 runs · 7 studies"]
    D --> E["tune_xgb / tune_logreg / tune_hgb"]
    E --> F[select_model]
    F --> G["final_train + register_model"]
    G --> H["analysis<br/>holdout + ROI"]
    G --> I[batch_inference]
    I --> J["Fonbet Airflow DAGs<br/>3-stage odds pipeline"]
    J --> K["/predict/cards/ API"]

    style A fill:#e8f4fd,stroke:#4e79a7
    style B fill:#e8f4fd,stroke:#4e79a7
    style C fill:#e8fde8,stroke:#59a14f
    style D fill:#fef9e8,stroke:#f28e2b
    style E fill:#fde8f4,stroke:#b07aa1
    style F fill:#fde8f4,stroke:#b07aa1
    style G fill:#fde8f4,stroke:#b07aa1
    style H fill:#e8fde8,stroke:#59a14f
    style I fill:#fde8e8,stroke:#e15759
    style J fill:#fde8e8,stroke:#e15759
    style K fill:#fde8e8,stroke:#e15759

Stage 01 · EDA & Preprocessing

DVC stages: load_data_from_sources, preprocessing

_root = project_root
_paths01 = [
    (_root / "data" / "raw"     / "match_raw.parquet",  "raw matches"),
    (_root / "data" / "interim" / "finished.parquet",   "finished"),
    (_root / "data" / "interim" / "future.parquet",     "upcoming"),
]
_items01 = []
for _path, _label in _paths01:
    try:
        _items01.append((_label, f"{len(pd.read_parquet(_path)):,}"))
    except Exception:
        # Skip artifacts that are missing or unreadable; the fallback below handles the empty case.
        pass

if _items01:
    display(HTML(_metric_row(_items01)))
else:
    display(Markdown("*Artifacts not found — run `dvc repro preprocessing` first.*"))

raw matches: 993,490 finished: 972,233 upcoming: 2,712

Explores the raw data ingested from external sources: temporal span, target class balance, league/region coverage, and all transformations applied during preprocessing (score filtering, duplicate removal, future/finished split).
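
The three preprocessing transformations named above can be sketched as follows. This is a minimal illustration, not the project's implementation: the record keys (match_id, match_date, home_goals, away_goals), the rule for what counts as an invalid score, and the use of a reference date for the split are all assumptions. The reported counts are consistent with some rows being dropped: 993,490 raw rows versus 972,233 + 2,712 = 974,945 kept.

```python
from datetime import date


def preprocess(matches, today):
    """Split raw match records into finished and upcoming sets.

    Each record is a dict with hypothetical keys: match_id, match_date,
    home_goals, away_goals (goals are None when no score is known).
    """
    seen = set()
    finished, upcoming = [], []
    for m in matches:
        # Duplicate removal: keep only the first record per match_id.
        if m["match_id"] in seen:
            continue
        seen.add(m["match_id"])

        has_score = m["home_goals"] is not None and m["away_goals"] is not None
        if has_score:
            # Score filtering (assumed rule): drop negative scores.
            if m["home_goals"] < 0 or m["away_goals"] < 0:
                continue
            finished.append(m)
        elif m["match_date"] >= today:
            # No score and not in the past: treat as an upcoming fixture.
            upcoming.append(m)
        # Past matches without a score fall through and are dropped.
    return finished, upcoming


raw = [
    {"match_id": 1, "match_date": date(2024, 5, 1), "home_goals": 2, "away_goals": 1},
    {"match_id": 1, "match_date": date(2024, 5, 1), "home_goals": 2, "away_goals": 1},
    {"match_id": 2, "match_date": date(2024, 5, 2), "home_goals": None, "away_goals": None},
    {"match_id": 3, "match_date": date(2026, 6, 1), "home_goals": None, "away_goals": None},
]
finished, upcoming = preprocess(raw, today=date(2026, 5, 29))
print(len(finished), len(upcoming))  # prints: 1 1
```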

Key questions: How many matches are available? What is the temporal coverage? How balanced is the 1X2 target?

→ Full report: EDA & Preprocessing

Stage 02 · Feature Engineering

DVC stage: feature_engineering

_feat_path = _root / "data" / "features" / "features.parquet"

try:
    _df_feat = pd.read_parquet(_feat_path)
    display(HTML(_metric_row([
        ("rows",      f"{len(_df_feat):,}"),
        ("features",  str(_df_feat.shape[1])),
        ("missing %", f"{_df_feat.isnull().mean().mean() * 100:.1f}%"),
    ])))
except Exception:
    display(Markdown("*Artifacts not found — run `dvc repro feature_engineering` first.*"))

rows: 972,233 features: 507 missing %: 2.3%

Builds the offline feature matrix:

ELO ratings — per-tournament team strength (K=32, initial=1500, home advantage=50)
Rolling statistics — win/draw/loss rate, goals for/against over windows [1, 2, 3, 5, 7, 10, 12]
Representation variants — home, away, diff encodings per rolling stat
Categorical — sex column
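
The ELO parameters listed above (K=32, initial=1500, home advantage=50) map onto the textbook Elo update roughly as sketched below. The logistic expectation with base 10 and scale 400, scoring a draw as 0.5, and adding the home advantage to the home rating before computing the expectation are assumptions; the project's exact formula is documented in the full report, not here. The function and variable names are hypothetical.

```python
K = 32.0           # update step size (from the feature list above)
INITIAL = 1500.0   # rating assigned to a team on first appearance
HOME_ADV = 50.0    # rating points added to the home side


def expected_home(r_home, r_away):
    """Expected home score in [0, 1] under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((r_away - (r_home + HOME_ADV)) / 400.0))


def update(r_home, r_away, outcome):
    """Return new (home, away) ratings; outcome is 1.0 / 0.5 / 0.0 for home win / draw / away win."""
    delta = K * (outcome - expected_home(r_home, r_away))
    return r_home + delta, r_away - delta


ratings = {}  # keyed by (tournament, team) so strength is tracked per tournament


def play(tournament, home, away, outcome):
    """Return the pre-match ratings (usable as features), then apply the update."""
    r_home = ratings.get((tournament, home), INITIAL)
    r_away = ratings.get((tournament, away), INITIAL)
    ratings[(tournament, home)], ratings[(tournament, away)] = update(r_home, r_away, outcome)
    return r_home, r_away


pre = play("league_x", "team_a", "team_b", 1.0)  # hypothetical names
print(pre, ratings)
```

Returning the pre-match ratings before applying the update keeps the feature free of information from the match it describes.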

Key questions: Which features carry the most signal? How complete is the feature matrix at different window sizes?
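
A rough sketch of how the rolling statistics and the home/away/diff variants could be derived, and why wider windows leave more gaps in the matrix. The feature names, the use of None for insufficient history, and the requirement of a full window are assumptions made for illustration.

```python
WINDOWS = [1, 2, 3, 5, 7, 10, 12]  # window sizes from the feature list above


def rolling_win_rate(past_results, window):
    """Win rate over the last `window` matches played before the current one.

    past_results is a list of 'W' / 'D' / 'L', oldest first. Returns None
    when the team has fewer than `window` past matches (assumed convention).
    """
    if len(past_results) < window:
        return None
    recent = past_results[-window:]
    return sum(r == "W" for r in recent) / window


def side_features(home_hist, away_hist):
    """Build home, away and diff encodings of the rolling win rate."""
    feats = {}
    for w in WINDOWS:
        h = rolling_win_rate(home_hist, w)
        a = rolling_win_rate(away_hist, w)
        feats[f"home_win_rate_{w}"] = h
        feats[f"away_win_rate_{w}"] = a
        # diff is only defined when both sides have enough history.
        feats[f"diff_win_rate_{w}"] = None if h is None or a is None else h - a
    return feats


feats = side_features(list("WWDLW"), list("LDW"))
missing = sum(v is None for v in feats.values())
print(f"{missing} of {len(feats)} values missing")
```

In this toy case the home side has five past matches and the away side three, so every window above 5 is empty for both and the diff column inherits the shorter history.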

→ Full report: Feature Engineering

Stage 03 · Experiment Studies

DVC stage: classification_models · Branches: experiment/study_v1.01–v1.05

_runs_csv = _root / "_temp" / "runs.csv"

try:
    import csv
    with open(_runs_csv, newline="") as _fh:  # close the file handle after reading
        _rows_csv = list(csv.DictReader(_fh))
    _finished03 = [r for r in _rows_csv if r.get("Status", "").upper() == "FINISHED"]
    _df_runs = pd.DataFrame(_finished03)

    _items03 = [("total runs", str(len(_finished03)))]
    if "holdout.logloss" in _df_runs.columns:
        _df_runs["holdout.logloss"] = pd.to_numeric(_df_runs["holdout.logloss"], errors="coerce")
        _best03 = _df_runs.loc[_df_runs["holdout.logloss"].idxmin()]
        _items03.append(("best logloss", f'{_best03["holdout.logloss"]:.4f}'))
        if "model.name" in _df_runs.columns:
            _items03.append(("best model", str(_best03.get("model.name", "—"))))
    display(HTML(_metric_row(_items03)))
except Exception:
    display(Markdown(
        "*218 FINISHED runs across 7 studies (v1.01–v1.05). "
        "CSV cache not found — connect to MLflow.*"
    ))

218 FINISHED runs across 7 studies (v1.01–v1.05). CSV cache not found — connect to MLflow.

Systematic exploration across 7 studies to identify the best training configuration:

Study	Focus	Conclusion
0.1	Initial sweep	Establishes baseline across all models
0.2	Learning curve	SGD needs ≥10% data; tree models stable at 1%
1	Window sizes	Wider windows reduce logloss; [1,2,3,5,10] is solid
2	Class weights	Draw up-weighting (1.25×) slightly improves balance
3	Side representation	`diff` encoding adds signal on top of `home`/`away`
4	Feature ablation	ELO dominates; rolling stats add marginal gain
5	Window extension	Windows 7 and 12 give additional logloss reduction

Key question: Which model class and config should proceed to Optuna tuning?

→ Full report: Experiment Studies

Stage 04 · Model Analysis

DVC stages: tune_xgb, tune_logreg, tune_hgb, select_model, final_train

try:
    import os
    import mlflow
    from src.pipelines._config import get_pipeline_config
    _cfg04 = get_pipeline_config()
    os.environ.setdefault("MLFLOW_S3_ENDPOINT_URL", _cfg04.minio_endpoint_url)
    os.environ.setdefault("AWS_ACCESS_KEY_ID",      _cfg04.minio_access_key)
    os.environ.setdefault("AWS_SECRET_ACCESS_KEY",  _cfg04.minio_secret_key)
    mlflow.set_tracking_uri(_cfg04.mlflow_tracking_uri)
    _client04   = mlflow.tracking.MlflowClient()
    _model_name = PARAMS.get("inference", {}).get("model_name", "soccer-match-outcome")
    _alias      = PARAMS.get("inference", {}).get("model_stage", "champion")
    _mv04       = _client04.get_model_version_by_alias(_model_name, _alias)
    _champ04    = _client04.get_run(_mv04.run_id)
    _p04 = _champ04.data.params
    _m04 = _champ04.data.metrics
    display(HTML(_metric_row([
        ("champion",    str(_p04.get("model.name", "—"))),
        ("version",     str(_mv04.version)),
        ("logloss",     f'{float(_m04.get("final.logloss", _m04.get("holdout.logloss", 0))):.4f}'),
        ("accuracy",    f'{float(_m04.get("final.accuracy", _m04.get("holdout.accuracy", 0))):.3f}'),
        ("calibration", str(_p04.get("calibration.method", "isotonic"))),
    ])))
except Exception:
    display(Markdown(
        "*Champion model: **XGBoost** with isotonic calibration, selected by holdout logloss.*\n\n"
        "*Connect to MLflow to see live metrics.*"
    ))

champion: xgb version: 10 logloss: 1.0043 accuracy: 0.504 calibration: isotonic

Full champion model lifecycle:

Optuna tuning — 20 trials each for XGBoost, Logistic Regression, HistGradientBoosting
Model selection — best trial per model class compared on holdout logloss
Final training — champion retrained on full train+val with isotonic calibration (15% calib fraction)
Registration — soccer-match-outcome → @champion in MLflow Model Registry
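
The selection step above compares candidates on holdout logloss. As a reference point, a forecast of 1/3 for each of the three outcomes scores ln 3 ≈ 1.0986, so the champion's reported 1.0043 sits below that uninformed baseline. The sketch below computes multiclass logloss and picks the lowest-scoring candidate; the candidate names and probabilities are toy values, not project results.

```python
import math


def multiclass_logloss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, p in zip(y_true, probs):
        total -= math.log(min(max(p[label], eps), 1.0))
    return total / len(y_true)


# Toy holdout: class indices 0 = home win, 1 = draw, 2 = away win.
y = [0, 1, 2, 0]

uniform = [[1 / 3, 1 / 3, 1 / 3] for _ in y]
baseline = multiclass_logloss(y, uniform)  # equals ln(3)

# Hypothetical candidates with made-up probabilities.
candidates = {
    "model_a": [[0.5, 0.3, 0.2], [0.3, 0.4, 0.3], [0.2, 0.3, 0.5], [0.6, 0.2, 0.2]],
    "model_b": uniform,
}
scores = {name: multiclass_logloss(y, p) for name, p in candidates.items()}
best = min(scores, key=scores.get)
print(best, round(scores[best], 4), round(baseline, 4))
```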

Key questions: How well-calibrated are the probabilities? Which features drive predictions (SHAP)?

→ Full report: Model Analysis

Stage 05 · Holdout Analysis

DVC stage: analysis

_err_dir05 = _root / "data" / "analysis" / "error"
_roi_dir05 = _root / "data" / "analysis" / "roi"

try:
    _df_err05 = pd.read_csv(_err_dir05 / "roi_simulation.csv")
    _df_roi05 = pd.read_csv(_roi_dir05 / "roi_simulation.csv")

    _items05 = []

    # error metrics — uniform betting baseline
    for _col, _fmt in [("roi_pct", "{:.2f}%"), ("hit_rate", "{:.1%}"), ("bet_rate", "{:.1%}")]:
        if _col in _df_err05.columns:
            _v = pd.to_numeric(_df_err05[_col], errors="coerce").dropna()
            if len(_v):
                _items05.append((f"err/{_col}", _fmt.format(_v.iloc[0])))

    # ROI metrics — edge-filtered betting
    for _col, _fmt in [("roi_pct", "{:.2f}%"), ("n_bets", "{:.0f}"), ("hit_rate", "{:.1%}")]:
        if _col in _df_roi05.columns:
            _v = pd.to_numeric(_df_roi05[_col], errors="coerce").dropna()
            if len(_v):
                _items05.append((f"roi/{_col}", _fmt.format(_v.iloc[0])))

    if _items05:
        display(HTML(_metric_row(_items05)))
    else:
        raise ValueError("no matching columns")
except Exception:
    display(Markdown(
        "*Error analysis on holdout set + ROI simulation vs Bet365 closing odds.*\n\n"
        "*Run `dvc repro analysis` to generate artifacts.*"
    ))

err/roi_pct: -0.70% err/hit_rate: 50.4% err/bet_rate: 100.0% roi/roi_pct: -11.47% roi/n_bets: 13317 roi/hit_rate: 34.7%

Evaluates the registered champion on the temporal holdout set (test_start: 2024-01-01):

Error analysis — confusion matrix, per-class precision/recall, season/region/ELO-gap breakdowns
ROI simulation — flat-stake betting vs Bet365 closing odds at various edge thresholds
Benchmark — model ROI vs. naive Bet365 market return
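
The flat-stake simulation described above can be sketched as follows. The edge definition used here (model probability times decimal odds, minus 1) is a common convention and an assumption about this project; the tuple layout and function name are hypothetical, and the bets are toy values.

```python
def flat_stake_roi(bets, min_edge):
    """Simulate 1-unit bets on every selection whose edge meets the threshold.

    bets is a list of (model_prob, decimal_odds, won) tuples.
    Returns (number of bets placed, ROI in percent of total stake).
    """
    staked = 0.0
    profit = 0.0
    for p, odds, won in bets:
        edge = p * odds - 1.0  # expected profit per unit if p were exact
        if edge < min_edge:
            continue
        staked += 1.0
        profit += (odds - 1.0) if won else -1.0
    roi_pct = 100.0 * profit / staked if staked else 0.0
    return int(staked), roi_pct


bets = [
    (0.55, 2.00, True),
    (0.30, 3.00, False),
    (0.40, 3.00, False),
    (0.60, 1.80, True),
]
for threshold in (-1.0, 0.0, 0.15):
    n, roi = flat_stake_roi(bets, threshold)
    print(f"min_edge={threshold:+.2f}  n_bets={n}  roi={roi:.2f}%")
```

A threshold of -1.0 places every bet, which corresponds to the uniform-betting baseline with a 100% bet rate; raising the threshold trades volume for selectivity.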

Key questions: Does the model have positive ROI at realistic edge thresholds? Where does it fail?

→ Full report: Holdout Analysis

Stage 06 · Live Inference & Odds

DVC stage: batch_inference · Airflow: soccer_etl_odds_fonbet_01/02/03

_inf_dir = _root / "data" / "predictions"

try:
    _pq_files06 = sorted(_inf_dir.glob("*.parquet"))
    if not _pq_files06:
        raise FileNotFoundError
    _df_inf = pd.concat([pd.read_parquet(f) for f in _pq_files06], ignore_index=True)
    _items06 = [("predictions", f"{len(_df_inf):,}")]
    _dcol = next((c for c in ["match_date", "date"] if c in _df_inf.columns), None)
    if _dcol:
        _mx = pd.to_datetime(_df_inf[_dcol], errors="coerce").max()
        if pd.notna(_mx):
            _items06.append(("latest match", str(_mx.date())))
    display(HTML(_metric_row(_items06)))
except Exception:
    display(Markdown("*Run `dvc repro batch_inference` to generate prediction artifacts.*"))

predictions: 368,758

Covers the online path from trained model to live predictions:

batch_inference — DVC stage that scores all upcoming matches using @champion
Fonbet scraping pipeline — 3-stage Airflow DAG chain (raw scrape → fuzzy match → factor extraction), runs every 4 h since 2026-05-15
/predict/cards/ API — FastAPI endpoint merging model probabilities with live Fonbet odds

Component	Type	Schedule
`batch_inference`	DVC stage	on-demand / after `final_train`
`soccer_etl_odds_fonbet_01_raw`	Airflow DAG	every 4 h
`soccer_etl_odds_fonbet_02_link`	Airflow DAG	triggered after DAG 01
`soccer_etl_odds_fonbet_03_odds`	Airflow DAG	triggered after DAG 02

Key questions: What does the model predict for upcoming matches? How do model probs compare to live bookmaker odds?
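
One way to put model probabilities and bookmaker odds on the same scale is to convert decimal odds to implied probabilities and remove the bookmaker margin. The sketch below uses proportional normalisation, which is one common convention and an assumption here; the dictionary layout is hypothetical and does not describe the actual /predict/cards/ response.

```python
def implied_probs(odds_home, odds_draw, odds_away):
    """Convert decimal 1X2 odds to margin-free probabilities.

    Returns (probabilities summing to 1, bookmaker margin).
    """
    raw = [1.0 / odds_home, 1.0 / odds_draw, 1.0 / odds_away]
    overround = sum(raw)  # exceeds 1 when the book carries a margin
    return [r / overround for r in raw], overround - 1.0


def card(model_probs, odds):
    """Merge model probabilities with market odds for one match."""
    market, margin = implied_probs(*odds)
    return {
        "margin": margin,
        "outcomes": [
            {"outcome": name, "model": p, "market": q, "odds": o, "edge": p * o - 1.0}
            for name, p, q, o in zip(("1", "X", "2"), model_probs, market, odds)
        ],
    }


example = card(model_probs=[0.50, 0.27, 0.23], odds=(1.90, 3.50, 4.20))
for row in example["outcomes"]:
    print(row)
```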

→ Full report: Live Inference & Odds

Stage 07 · Live Betting Strategy

Source: batch_inference predictions × Fonbet live odds

_live_dir = _root / "data" / "analysis" / "live_betting"

try:
    _candidates07 = list(_live_dir.glob("*.parquet")) + list(_live_dir.glob("*.csv"))
    if not _candidates07:
        raise FileNotFoundError

    _dfs07 = []
    for _f in _candidates07[:5]:
        try:
            _dfs07.append(pd.read_parquet(_f) if _f.suffix == ".parquet" else pd.read_csv(_f))
        except Exception:
            pass

    if _dfs07:
        _df_live = pd.concat(_dfs07, ignore_index=True)
        _items07 = [("records", f"{len(_df_live):,}")]
        for _col in ["pnl", "profit", "roi", "kelly_pnl"]:
            _cands = [c for c in _df_live.columns if _col in c.lower()]
            if _cands:
                _val = pd.to_numeric(_df_live[_cands[0]], errors="coerce").sum()
                _items07.append((_col, f"{_val:.2f}"))
        display(HTML(_metric_row(_items07)))
    else:
        raise ValueError
except Exception:
    display(Markdown("*Run the live betting analysis after collecting Fonbet odds.*"))

Run the live betting analysis after collecting Fonbet odds.

Simulates betting strategies using model edge against live Fonbet pre-match odds:

Flat-stake — fixed 1-unit bet at various edge thresholds (0%–15%)
Fractional Kelly — bet size proportional to estimated edge (½-Kelly, ¼-Kelly)
ROI by region — per-tournament/region breakdown reveals large variance: some regions yield positive ROI, others negative
Strategy — filter out negative-ROI regions, concentrate bets where the model has consistent edge
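
For the fractional Kelly variants above, the stake as a share of bankroll follows from the Kelly criterion: with decimal odds o and win probability p, the full-Kelly fraction is (p·o − 1) / (o − 1). The sketch below applies a multiplier of 0.5 or 0.25 for the half- and quarter-Kelly variants. Clipping negative-edge bets to zero and the function name are assumptions.

```python
def kelly_fraction(p, odds, multiplier=1.0):
    """Bankroll fraction to stake under (fractional) Kelly.

    p is the estimated win probability, odds the decimal odds,
    multiplier 1.0 for full Kelly, 0.5 for half, 0.25 for quarter.
    """
    net_odds = odds - 1.0
    if net_odds <= 0:
        return 0.0
    full = (p * odds - 1.0) / net_odds
    return max(0.0, full) * multiplier  # never stake on a negative edge


for label, m in (("full", 1.0), ("half", 0.5), ("quarter", 0.25)):
    print(label, round(kelly_fraction(0.55, 2.0, m), 4))
```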

Key finding: ROI varies significantly across regions. Dropping regions with negative ROI and keeping only those with a positive edge against Fonbet odds substantially improves overall strategy P&L, which makes region-level filtering the primary lever for live betting returns.
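
The region filter can be sketched as a per-region ROI aggregation followed by a positive-ROI mask. The tuple layout, region names, and numbers below are hypothetical, and the sketch selects and evaluates regions on the same toy bets purely for brevity.

```python
from collections import defaultdict


def roi_by_region(bets):
    """Aggregate ROI (profit / stake) per region from (region, stake, profit) tuples."""
    staked = defaultdict(float)
    profit = defaultdict(float)
    for region, stake, pnl in bets:
        staked[region] += stake
        profit[region] += pnl
    return {r: profit[r] / staked[r] for r in staked if staked[r] > 0}


def keep_positive_regions(bets):
    """Return the bets from regions with positive ROI, plus the ROI table."""
    roi = roi_by_region(bets)
    keep = {r for r, value in roi.items() if value > 0}
    return [b for b in bets if b[0] in keep], roi


bets = [
    ("region_a", 1.0, 0.9),
    ("region_a", 1.0, -1.0),
    ("region_a", 1.0, 1.2),
    ("region_b", 1.0, -1.0),
    ("region_b", 1.0, 0.5),
]
kept, roi = keep_positive_regions(bets)
print(roi)
print(len(kept), "of", len(bets), "bets kept")
```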

→ Full report: Live Betting Strategy

Quick Reference

#	Report	DVC stage(s)	Status
01	EDA & Preprocessing	`load_data_from_sources`, `preprocessing`	✅ Implemented
02	Feature Engineering	`feature_engineering`	✅ Implemented
03	Experiment Studies v1.01–v1.05	`classification_models`	✅ Implemented
04	Model Analysis	`tune_xgb`, `tune_logreg`, `tune_hgb`, `select_model`, `final_train`	✅ Implemented
05	Holdout Analysis	`analysis`	✅ Implemented
06	Live Inference & Odds	`batch_inference`	✅ Implemented
07	Live Betting Strategy	`batch_inference` (standalone)	✅ Implemented

How to regenerate

# Run the full pipeline then rebuild all docs
dvc repro
make docs-build

Reports use freeze: auto — cells are re-executed only when their source changes. Individual reports can be rendered with:

quarto render reports/qmd/01_eda_and_preprocessing.qmd