Experiment Studies v1.01–v1.05 — Learning Curve, Window, Class Weight, Side, Ablation

218 FINISHED MLflow runs across 7 studies; best-config selection for tuning

Author

Dima Ivanov

Published

May 28, 2026

Overview

Branches: experiment/study_v1.01, experiment/study_v1.02, experiment/study_v1.03, experiment/study_v1.04, experiment/study_v1.05

Total runs: 218 FINISHED across 87 unique run names.

Studies:

| # | Study | Branch | Hypothesis |
|---|---|---|---|
| 0.1 | Initial exploration | v1.01 | Single-config sweep across all models for size, side, window, ablation |
| 0.2 | Learning curve | v1.02 | Logloss scales with training data fraction |
| 1 | Window sizes (base) | v1.03 | Wider window set reduces logloss |
| 2 | Class weights | v1.04 | Up-weighting the draw class (1) improves balanced accuracy |
| 3 | Side representation | v1.03 | Diff feature adds signal on top of home/away |
| 4 | Feature ablation | v1.03 | ELO dominates; rolling stats are redundant given ELO |
| 5 | Window extension | v1.05 | Windows 7 and 12 give additional logloss gain |

Primary models evaluated: hgb_numonly, xgb, logreg, sgd_logloss, baseline.

Show code
import csv
import sys
from pathlib import Path
import pandas as pd

project_root = Path().resolve().parent.parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

_runs_csv = Path("../../_temp/runs.csv")
if _runs_csv.exists():
    with _runs_csv.open(newline="") as _fh:  # close the file once the rows are read
        rows = list(csv.DictReader(_fh))
    finished = [r for r in rows if r.get("Status", "").upper() == "FINISHED"]
    df = pd.DataFrame(finished)
    print(f"Total FINISHED runs (CSV): {len(finished)}")
else:
    # Load directly from MLflow — matches_clf_v1.0_study (id=25)
    import os
    from src.pipelines._config import get_pipeline_config
    import mlflow
    _cfg = get_pipeline_config()
    os.environ.setdefault("MLFLOW_S3_ENDPOINT_URL", _cfg.minio_endpoint_url)
    os.environ.setdefault("AWS_ACCESS_KEY_ID",      _cfg.minio_access_key)
    os.environ.setdefault("AWS_SECRET_ACCESS_KEY",  _cfg.minio_secret_key)
    mlflow.set_tracking_uri(_cfg.mlflow_tracking_uri)
    _client = mlflow.tracking.MlflowClient()
    _runs = _client.search_runs(
        experiment_ids=["25"],
        filter_string="status = 'FINISHED'",
        max_results=1000,
    )
    _rows = []
    for _r in _runs:
        _row = {"Name": _r.data.tags.get("mlflow.runName", ""), "Status": "FINISHED"}
        _row.update({k: v for k, v in _r.data.params.items()})
        _row.update({k: v for k, v in _r.data.metrics.items()})
        _rows.append(_row)
    df = pd.DataFrame(_rows)
    print(f"Total FINISHED runs (MLflow): {len(df)}")

if not df.empty and "Name" in df.columns:
    print(f"Unique names: {df['Name'].nunique()}")
    if "model.name" in df.columns:
        print(f"Models: {sorted(df['model.name'].dropna().unique())}")
Total FINISHED runs (MLflow): 218
Unique names: 87
Models: ['baseline', 'hgb_numonly', 'logreg', 'sgd_logloss', 'xgb']

Study 0.1 (v1.01) — Initial Exploration

Hypothesis: Can any off-the-shelf model beat a frequency-prior baseline using the default feature set? Four mini-sweeps (size, side, window, ablation) — one configuration per model, 5 models each.

Show code
import pandas as pd

_v1_prefixes = ["size | ", "ablation | ", "side | ", "window | "]
_df_v1 = df[
    df["Name"].str.startswith(tuple(_v1_prefixes), na=False)
    & ~df["Name"].str.startswith("window_", na=False)
].copy()
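# regex=False: a multi-character pattern is otherwise treated as a regex, where
# " | " means "space OR space", so the split happens on every single space
# and the model column ends up holding the literal "|".
_df_v1["study"] = _df_v1["Name"].str.split(" | ", regex=False).str[0]
_df_v1["model"] = _df_v1["Name"].str.split(" | ", regex=False).str[1]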
_df_v1["logloss"] = pd.to_numeric(_df_v1.get("final.logloss"), errors="coerce")
_df_v1["bal_acc"] = pd.to_numeric(_df_v1.get("final.balanced_accuracy"), errors="coerce")

_out_v1 = (
    _df_v1[["study", "model", "logloss", "bal_acc"]]
    .dropna(subset=["logloss"])
    .sort_values(["study", "logloss"])
    .reset_index(drop=True)
)
_out_v1["logloss"] = _out_v1["logloss"].map("{:.4f}".format)
_out_v1["bal_acc"] = _out_v1["bal_acc"].map(lambda x: f"{x*100:.1f}%" if pd.notna(x) else "—")
_out_v1.to_html(index=False)
study model logloss bal_acc
ablation | 1.0587 42.1%
ablation | 1.0587 42.1%
ablation | 1.0587 42.1%
ablation | 1.0587 42.1%
ablation | 1.0699 41.2%
ablation | 1.0699 41.2%
ablation | 1.0712 33.3%
ablation | 1.0712 33.3%
ablation | 1.0712 33.3%
ablation | 1.0712 33.3%
ablation | 1.0712 33.3%
ablation | 1.0712 33.3%
ablation | 1.0846 41.5%
ablation | 1.0846 41.5%
ablation | 1.0846 41.5%
ablation | 1.0846 41.5%
ablation | 1.0950 40.7%
ablation | 1.0950 40.7%
ablation | 1.1308 41.0%
ablation | 1.1308 41.0%
ablation | 1.1308 41.0%
ablation | 1.1308 41.0%
ablation | 1.1404 40.3%
ablation | 1.1404 40.3%
ablation | 7.9452 38.8%
ablation | 7.9452 38.8%
ablation | 8.0116 41.6%
ablation | 8.0116 41.6%
ablation | 8.0116 41.6%
ablation | 8.0116 41.6%
side | 1.0136 41.7%
side | 1.0139 41.7%
side | 1.0146 41.7%
side | 1.0148 41.6%
side | 1.0165 41.6%
side | 1.0172 41.5%
side | 1.0254 42.3%
side | 1.0265 42.5%
side | 1.0286 41.8%
side | 1.0428 37.8%
side | 1.0439 37.7%
side | 1.0449 38.3%
side | 1.0461 38.4%
side | 1.0475 42.2%
side | 1.0513 38.9%
side | 1.0587 42.1%
side | 1.0600 38.9%
side | 1.0635 40.1%
side | 1.0657 38.2%
side | 1.0712 33.3%
side | 1.0712 33.3%
side | 1.0712 33.3%
side | 1.0712 33.3%
side | 1.0712 33.3%
side | 1.0756 41.2%
side | 1.0844 41.3%
side | 1.0846 41.5%
side | 1.1074 41.2%
side | 1.1076 41.2%
side | 1.1199 37.3%
side | 1.1212 37.8%
side | 1.1264 41.1%
side | 1.1308 41.0%
side | 1.1589 37.2%
side | 1.1607 37.5%
side | 1.1883 41.7%
side | 3.5700 38.7%
side | 3.8249 39.9%
side | 8.0116 41.6%
side | 10.0231 41.3%
size | 1.0063 42.5%
size | 1.0070 42.8%
size | 1.0086 42.4%
size | 1.0099 42.7%
size | 1.0213 42.4%
size | 1.0353 42.1%
size | 1.0587 42.1%
size | 1.0711 33.3%
size | 1.0712 33.3%
size | 1.0712 33.3%
size | 1.0716 33.3%
size | 1.0727 33.3%
size | 1.0846 41.5%
size | 1.0987 38.2%
size | 1.1107 41.4%
size | 1.1308 41.0%
size | 1.1568 40.8%
size | 1.2038 40.8%
size | 1.2398 38.4%
size | 1.5534 39.9%
size | 1.6704 39.4%
size | 1.9156 38.2%
size | 8.0116 41.6%
size | 18.4348 40.5%
size | 24.3228 36.3%
window | 1.0136 41.7%
window | 1.0146 41.7%
window | 1.0212 42.1%
window | 1.0258 40.5%
window | 1.0267 40.5%
window | 1.0320 42.1%
window | 1.0358 39.3%
window | 1.0363 39.4%
window | 1.0429 42.1%
window | 1.0432 38.2%
window | 1.0439 38.3%
window | 1.0489 39.0%
window | 1.0513 42.1%
window | 1.0528 36.6%
window | 1.0533 36.7%
window | 1.0587 42.1%
window | 1.0679 36.7%
window | 1.0712 33.3%
window | 1.0712 33.3%
window | 1.0712 33.3%
window | 1.0712 33.3%
window | 1.0712 33.3%
window | 1.0780 39.4%
window | 1.0795 41.1%
window | 1.0796 41.2%
window | 1.0815 41.0%
window | 1.0846 41.5%
window | 1.0847 41.2%
window | 1.1002 40.5%
window | 1.1074 41.2%
window | 1.1108 40.8%
window | 1.1157 41.0%
window | 1.1263 40.8%
window | 1.1266 38.8%
window | 1.1308 41.0%
window | 2.5225 42.8%
window | 4.0185 42.6%
window | 5.8618 39.5%
window | 8.0116 41.6%
window | 11.1841 36.0%

Finding: hgb_numonly and xgb consistently beat the baseline by ~0.04–0.06 logloss. These single-configuration results confirm that the task is learnable and motivate the systematic studies in v1.03+.
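For context on the baseline rows above, here is a minimal sketch of what a frequency-prior model scores. A model that predicts the same class shares for every match has a logloss equal to the entropy of those shares (when the holdout shares match) and, because it always picks the same class, a balanced accuracy of exactly 1/3 on three classes. The class shares and function names below are hypothetical and only illustrate the mechanism; they are not the project's data or code.

```python
import math

def prior_logloss(priors, holdout_counts):
    """Log-loss of a model that predicts `priors` for every match."""
    total = sum(holdout_counts)
    return -sum(n * math.log(p) for p, n in zip(priors, holdout_counts)) / total

def prior_balanced_accuracy(priors):
    """Always predicting the most frequent class gives recall 1 on it, 0 elsewhere."""
    n_classes = len(priors)
    recalls = [0.0] * n_classes
    recalls[max(range(n_classes), key=lambda k: priors[k])] = 1.0
    return sum(recalls) / n_classes

# Hypothetical class shares (home win, draw, away win), for illustration only.
priors = [0.45, 0.26, 0.29]
holdout_counts = [450, 260, 290]  # hypothetical holdout with the same shares

baseline_ll = prior_logloss(priors, holdout_counts)
baseline_ba = prior_balanced_accuracy(priors)
print(f"baseline logloss={baseline_ll:.4f}, balanced accuracy={baseline_ba:.3f}")
```

Any model worth keeping has to land below this logloss, which is why the baseline serves as the reference row in every study.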


Study 0.2 (v1.02) — Learning Curve

Hypothesis: Logloss decreases monotonically with training data fraction; models benefit from more data up to frac=1.0.

Show code
import pandas as pd
import matplotlib.pyplot as plt

_df_lc = df[df["Name"].str.startswith("frac=", na=False)].copy()
_df_lc["frac"] = pd.to_numeric(
    _df_lc["Name"].str.extract(r"frac=(\S+) \|")[0], errors="coerce"
)
# regex=False keeps " | " a literal separator instead of a regex alternation
_df_lc["model"] = _df_lc["Name"].str.split(" | ", regex=False).str[-1]
_df_lc["logloss"] = pd.to_numeric(_df_lc.get("final.logloss"), errors="coerce")
_df_lc = _df_lc.dropna(subset=["frac", "logloss"]).sort_values(["model", "frac"])

_models_lc = [m for m in ["baseline", "sgd_logloss", "logreg", "hgb_numonly", "xgb"] if m in _df_lc["model"].unique()]
_colors = {"baseline": "gray", "sgd_logloss": "#9C27B0", "logreg": "#FF9800",
           "hgb_numonly": "#2196F3", "xgb": "#4CAF50"}

fig, ax = plt.subplots(figsize=(9, 5))
for _m in _models_lc:
    _sub = _df_lc[_df_lc["model"] == _m].sort_values("frac")
    ax.plot(_sub["frac"], _sub["logloss"], marker="o", label=_m, color=_colors.get(_m))
ax.set_xscale("log")
ax.set_xlabel("Training fraction (log scale)")
ax.set_ylabel("Log-loss (holdout)")
ax.set_title("Learning curve — logloss vs training data fraction")
ax.legend(fontsize=9)
plt.tight_layout()
plt.show()

_summary = (
    _df_lc.groupby("model")["logloss"]
    .agg(best_logloss="min", worst_logloss="max")
)
print("Logloss range per model (min = full data, max = 0.1% data):")
for _m, _row in _summary.iterrows():
    print(f"  {_m:<15} {_row['best_logloss']:.4f} → {_row['worst_logloss']:.4f}")

Logloss range per model (min = full data, max = 0.1% data):
  baseline        1.0711 → 1.0727
  hgb_numonly     1.0087 → 1.6771
  logreg          1.0163 → 1.8958
  sgd_logloss     1.0389 → 23.6461
  xgb             1.0098 → 1.5723
Show code
# reuse _df_lc from cell above — same data, SGD excluded
_models_lc_nosgd = [m for m in ["baseline", "logreg", "hgb_numonly", "xgb"] if m in _df_lc["model"].unique()]
_colors_nosgd    = {"baseline": "gray", "logreg": "#FF9800", "hgb_numonly": "#2196F3", "xgb": "#4CAF50"}

fig, ax = plt.subplots(figsize=(9, 5))
for _m in _models_lc_nosgd:
    _sub = _df_lc[_df_lc["model"] == _m].sort_values("frac")
    ax.plot(_sub["frac"], _sub["logloss"], marker="o", label=_m, color=_colors_nosgd.get(_m))
ax.set_xscale("log")
ax.set_xlabel("Training fraction (log scale)")
ax.set_ylabel("Log-loss (holdout)")
ax.set_title("Learning curve — logloss vs training data fraction (without SGD)")
ax.legend(fontsize=9)
plt.tight_layout()
plt.show()

Finding: hgb_numonly and xgb continue to improve through frac=1.0 — no saturation. baseline is flat (frequency prior is data-independent). logreg and sgd_logloss plateau earlier (~frac=0.25). Decision: all subsequent studies use frac=1.0.
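The saturation check behind this finding can be sketched as a small helper that looks for the first training fraction after which logloss stops improving by more than a tolerance. The function name, the tolerance, and both curves are assumptions made for illustration; they are not the MLflow results.

```python
def plateau_fraction(curve, tol=0.001):
    """Return the smallest fraction after which logloss never improves by more than `tol`.

    `curve` maps training fraction -> holdout logloss. Returns None when the
    curve is still improving by more than `tol` at the largest fraction.
    """
    fracs = sorted(curve)
    for i, frac in enumerate(fracs[:-1]):
        later_best = min(curve[f] for f in fracs[i + 1:])
        if curve[frac] - later_best <= tol:
            return frac
    return None

# Hypothetical curves, for illustration only.
still_improving = {0.1: 1.060, 0.25: 1.040, 0.5: 1.025, 1.0: 1.010}
saturated = {0.1: 1.060, 0.25: 1.0400, 0.5: 1.0398, 1.0: 1.0396}

print(plateau_fraction(still_improving))
print(plateau_fraction(saturated))
```

A `None` result corresponds to the "no saturation" case described for hgb_numonly and xgb; a returned fraction corresponds to the earlier plateau described for the linear models.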


Study 1 (v1.03) — Window Sizes

Hypothesis: Increasing the rolling-window set beyond [1] reduces logloss; effect saturates around 5 windows.

Show code
import pandas as pd

df_w = df[df["Name"].str.contains("window", na=False)].copy()
df_w = df_w[df_w["model.name"].isin(["hgb_numonly", "xgb"])]
df_w = df_w[df_w["params.classification.class_weight"].isna() | (df_w["params.classification.class_weight"] == "")]

cols = {
    "Name": "Name",
    "model.name": "model",
    "params.classification.window_sizes": "windows",
    "final.logloss": "logloss",
    "final.recall_class_1": "recall_draw",
    "final.balanced_accuracy": "bal_acc",
}
df_w2 = df_w[list(cols.keys())].rename(columns=cols).copy()
df_w2["logloss"] = pd.to_numeric(df_w2["logloss"], errors="coerce")
df_w2["recall_draw"] = pd.to_numeric(df_w2["recall_draw"], errors="coerce")
df_w2["bal_acc"] = pd.to_numeric(df_w2["bal_acc"], errors="coerce")

df_w2 = df_w2.dropna(subset=["logloss"])
df_w2 = df_w2.sort_values(["model", "logloss"])

# Keep best run per (name, model) to remove duplicates
df_w2 = df_w2.sort_values("logloss").drop_duplicates(subset=["Name", "model"])

# Format for display
out = df_w2[["model", "windows", "logloss", "recall_draw", "bal_acc"]].copy()
out["logloss"] = out["logloss"].map("{:.4f}".format)
out["recall_draw"] = out["recall_draw"].map(lambda x: f"{x*100:.1f}%")
out["bal_acc"] = out["bal_acc"].map(lambda x: f"{x*100:.1f}%")
out.sort_values(["model", "logloss"]).to_html(index=False)
model windows logloss recall_draw bal_acc
hgb_numonly [1, 2, 3, 5, 10] 1.0136 0.2% 41.7%
xgb [1, 2, 3, 5, 10] 1.0146 0.6% 41.7%

Finding:

| windows | HGB logloss | XGB logloss | Δ HGB vs [1,2,3,5,10] |
|---|---|---|---|
| [1] | 1.0528 | 1.0533 | +0.0392 |
| [1,2,3,5] | 1.0258 | 1.0267 | +0.0122 |
| [1,2,3,5,10] | 1.0136 | 1.0146 | 0 (baseline) |

  • Each additional window gives a clear monotonic improvement — no plateau at 5 windows.
  • Preliminary decision: window_sizes = [1, 2, 3, 5, 10] — extended in Study 5 (v1.05).
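As a rough illustration of what a window set means in practice, the sketch below builds one rolling-mean column per window size with pandas. The column names, the `team` grouping, and the toy data are assumptions; the project's actual feature pipeline is not shown in this report and may differ.

```python
import pandas as pd

def add_rolling_features(matches, stat, window_sizes):
    """Add one rolling-mean column per window, using only earlier matches.

    Assumes `matches` has a `team` column and is sorted chronologically per team.
    """
    out = matches.copy()
    for w in window_sizes:
        out[f"{stat}_roll{w}"] = (
            out.groupby("team")[stat]
            # shift(1) drops the current match so the feature cannot leak the result
            .transform(lambda s, w=w: s.shift(1).rolling(w, min_periods=1).mean())
        )
    return out

# Hypothetical toy data: goals scored by one team over five matches.
toy = pd.DataFrame({"team": ["A"] * 5, "goals": [1, 3, 0, 2, 4]})
features = add_rolling_features(toy, "goals", window_sizes=[1, 2, 3])
print(features)
```

Each extra window adds one more column per statistic, so a wider window set gives the model form signals at several time scales at once.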

Study 2 (v1.04) — Class Weights

Hypothesis: Up-weighting the draw class (class 1) improves balanced accuracy at acceptable logloss cost.

Show code
df_cw = df[df["model.name"].isin(["hgb_numonly", "xgb"])].copy()
df_cw = df_cw[df_cw["params.classification.frac"].astype(str) == "1.0"]

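def cw_label(r):
    cw = r.get("params.classification.class_weight", "")
    c1 = r.get("params.classification.class_weight.1", "")
    if str(cw) == "balanced":
        return "balanced"
    # NaN is truthy, so test for it explicitly; a bare `if c1:` labels
    # unweighted runs "cw1=nan" instead of "null".
    if pd.notna(c1) and str(c1) != "":
        return f"cw1={c1}"
    return "null"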

df_cw["cw_label"] = df_cw.apply(cw_label, axis=1)

metrics_cols = {
    "model.name": "model",
    "cw_label": "class_weight",
    "final.logloss": "logloss",
    "final.recall_class_0": "r0_home",
    "final.recall_class_1": "r1_draw",
    "final.recall_class_2": "r2_away",
    "final.balanced_accuracy": "bal_acc",
    "final.accuracy": "acc",
}

df_cw2 = df_cw[list(metrics_cols.keys())].rename(columns=metrics_cols)
for col in ["logloss", "r0_home", "r1_draw", "r2_away", "bal_acc", "acc"]:
    df_cw2[col] = pd.to_numeric(df_cw2[col], errors="coerce")

df_cw2 = df_cw2.dropna(subset=["logloss"]).drop_duplicates(subset=["model", "class_weight"])
df_cw2 = df_cw2.sort_values(["model", "logloss"])

out = df_cw2.copy()
for c in ["r0_home", "r1_draw", "r2_away", "bal_acc", "acc"]:
    out[c] = out[c].map(lambda x: f"{x*100:.1f}%")
out["logloss"] = out["logloss"].map("{:.4f}".format)
out.to_html(index=False)
model class_weight logloss r0_home r1_draw r2_away bal_acc acc
hgb_numonly cw1=1.25 1.0087 80.2% 7.7% 41.5% 43.2% 50.1%
hgb_numonly cw1=nan 1.0136 84.3% 0.2% 40.7% 41.7% 49.8%
hgb_numonly cw1=1.5 1.0202 68.1% 29.7% 34.3% 44.0% 48.0%
hgb_numonly balanced 1.0283 51.1% 32.0% 51.9% 45.0% 46.6%
hgb_numonly cw1=1.75 1.0355 54.0% 51.1% 26.8% 44.0% 44.8%
hgb_numonly cw1=2.0 1.0530 42.7% 66.0% 21.0% 43.2% 41.7%
hgb_numonly cw1=2.5 1.0918 28.4% 81.6% 13.8% 41.3% 37.0%
xgb cw1=1.25 1.0098 79.5% 9.1% 40.9% 43.2% 49.9%
xgb cw1=nan 1.0126 83.7% 0.6% 41.2% 41.8% 49.7%
xgb cw1=1.5 1.0215 67.6% 30.4% 34.1% 44.0% 47.8%
xgb balanced 1.0293 51.0% 32.0% 51.4% 44.8% 46.4%
xgb cw1=1.75 1.0362 54.1% 50.2% 27.1% 43.8% 44.7%
xgb cw1=2.0 1.0532 43.4% 64.4% 21.6% 43.1% 41.8%
xgb cw1=2.5 1.0908 29.5% 79.9% 14.8% 41.4% 37.4%

Finding: Only {1: 1.25} improves logloss (HGB 1.0136 → 1.0087) while also lifting draw recall, from 0.2% to 7.7% for HGB and from 0.6% to 9.1% for XGB. All higher weights (1.5+) worsen logloss. The balanced config maximises bal_acc (+3.3 pp for HGB) but raises logloss by about 0.015.
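The +3.3 pp figure can be reproduced from the recall columns, since balanced accuracy is the unweighted mean of per-class recall. The sketch below uses the hgb_numonly recalls from the table above for the unweighted run (listed as cw1=nan) and the balanced run; the function name is illustrative.

```python
def balanced_accuracy(recalls):
    """Balanced accuracy is the unweighted mean of per-class recall."""
    return sum(recalls) / len(recalls)

# Per-class recall (home, draw, away) for hgb_numonly, copied from the table above.
unweighted = [0.843, 0.002, 0.407]
balanced = [0.511, 0.320, 0.519]

gain_pp = (balanced_accuracy(balanced) - balanced_accuracy(unweighted)) * 100
print(f"unweighted: {balanced_accuracy(unweighted):.3f}")
print(f"balanced:   {balanced_accuracy(balanced):.3f}")
print(f"gain:       {gain_pp:+.1f} pp")
```

The unweighted model reaches its score almost entirely through home and away recall, which is why a near-zero draw recall still yields a balanced accuracy above 40%.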

Decision for production: class_weight = {0: 1.0, 1: 1.25, 2: 1.0} — best logloss + partial draw coverage.
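One common way to apply such a class-weight mapping is to expand it into per-sample weights and pass those to the estimator's `fit`. The sketch below shows only the expansion step; the helper name and the toy labels are assumptions, and the project's training code may pass the weights differently.

```python
def sample_weights(labels, class_weight):
    """Map each training label to its class weight (default 1.0)."""
    return [class_weight.get(y, 1.0) for y in labels]

# Class weights from the decision above: 0 = home win, 1 = draw, 2 = away win.
CLASS_WEIGHT = {0: 1.0, 1: 1.25, 2: 1.0}

# Hypothetical training labels, for illustration only.
y_train = [0, 1, 2, 0, 1, 0]
weights = sample_weights(y_train, CLASS_WEIGHT)
print(weights)

# Estimators with a scikit-learn-style interface typically accept these via
#   model.fit(X_train, y_train, sample_weight=weights)
```

Weighting draws by 1.25 makes each misclassified draw cost 25% more during training, which nudges the model toward predicting draws without changing the features.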


Study 3 (v1.03) — Side Representation

Hypothesis: Including a difference feature (home_stat − away_stat) adds signal on top of raw home/away values.

Show code
df_s = df[df["model.name"].isin(["hgb_numonly", "xgb"])].copy()
df_s = df_s[df_s["Name"].str.contains("side|window", na=False)]

metrics_cols_s = {
    "model.name": "model",
    "params.classification.side": "side",
    "final.logloss": "logloss",
    "final.recall_class_1": "r1_draw",
    "final.balanced_accuracy": "bal_acc",
}

df_s2 = df_s[list(metrics_cols_s.keys())].rename(columns=metrics_cols_s)
df_s2["logloss"] = pd.to_numeric(df_s2["logloss"], errors="coerce")
df_s2 = df_s2.dropna(subset=["logloss", "side"])
df_s2 = df_s2.sort_values(["model", "logloss"]).drop_duplicates(subset=["model", "side"])

out = df_s2.copy()
out["logloss"] = out["logloss"].map("{:.4f}".format)
out["r1_draw"] = pd.to_numeric(df_s2["r1_draw"], errors="coerce").map(lambda x: f"{x*100:.1f}%")
out["bal_acc"] = pd.to_numeric(df_s2["bal_acc"], errors="coerce").map(lambda x: f"{x*100:.1f}%")
out.sort_values(["model", "logloss"]).to_html(index=False)
model side logloss r1_draw bal_acc
hgb_numonly ['home', 'diff', 'away'] 1.0112 0.2% 41.9%
hgb_numonly ['home', 'away'] 1.0139 0.2% 41.7%
hgb_numonly ['diff'] 1.0165 0.0% 41.6%
hgb_numonly ['home'] 1.0428 0.0% 37.8%
hgb_numonly ['away'] 1.0449 0.0% 38.3%
xgb ['home', 'diff', 'away'] 1.0126 0.6% 41.8%
xgb ['home', 'away'] 1.0148 0.5% 41.6%
xgb ['diff'] 1.0172 0.2% 41.5%
xgb ['home'] 1.0439 0.2% 37.7%
xgb ['away'] 1.0461 0.2% 38.4%

Finding:

| side | HGB logloss | note |
|---|---|---|
| ['home'] | 1.0428 | one-sided, loses context |
| ['away'] | 1.0449 | one-sided |
| ['diff'] | 1.0165 | diff alone beats home or away individually |
| ['home','away'] | 1.0139 | +0.0003 vs full (slightly worse) |
| ['home','diff','away'] | 1.0136 | best overall |

  • diff alone outperforms each of home or away individually — the delta captures form.
  • Decision: side = ['home', 'diff', 'away'] (already current default).
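To make the three side representations concrete, the following sketch builds the feature row for one match from per-team statistics. The function name, the `home_`/`away_`/`diff_` column prefixes, and the example values are hypothetical; only the `side` options themselves come from the study.

```python
def side_features(home, away, sides=("home", "diff", "away")):
    """Build the feature dict for one match from per-team stats.

    `home` and `away` map stat name -> value; `diff` is home minus away.
    """
    row = {}
    for name in home:
        if "home" in sides:
            row[f"home_{name}"] = home[name]
        if "away" in sides:
            row[f"away_{name}"] = away[name]
        if "diff" in sides:
            row[f"diff_{name}"] = home[name] - away[name]
    return row

# Hypothetical stats for one match, for illustration only.
home = {"elo": 1600, "goals_roll5": 2}
away = {"elo": 1500, "goals_roll5": 1}

print(side_features(home, away))
print(side_features(home, away, sides=("diff",)))
```

With `sides=("diff",)` the row has a third of the columns of the full representation, which is consistent with diff alone staying close to the full set in the table above.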

Study 4 (v1.03) — Feature Ablation

Hypothesis: Not all feature groups contribute equally; ELO may dominate.

Show code
df_a = df[df["Name"].str.startswith("abl_", na=False)].copy()
df_a = df_a[df_a["model.name"].isin(["hgb_numonly", "xgb", "sgd_logloss"])]

metrics_cols_a = {
    "Name": "ablation",
    "model.name": "model",
    "final.logloss": "logloss",
    "final.recall_class_1": "r1_draw",
    "final.balanced_accuracy": "bal_acc",
}

df_a2 = df_a[list(metrics_cols_a.keys())].rename(columns=metrics_cols_a)
df_a2["logloss"] = pd.to_numeric(df_a2["logloss"], errors="coerce")
df_a2 = df_a2.dropna(subset=["logloss"])
# regex=False keeps " | " a literal separator instead of a regex alternation
df_a2["ablation"] = df_a2["ablation"].str.split(" | ", regex=False).str[0]
df_a2 = df_a2.sort_values(["model", "logloss"]).drop_duplicates(subset=["ablation", "model"])

out = df_a2.copy()
out["logloss"] = out["logloss"].map("{:.4f}".format)
out["r1_draw"] = pd.to_numeric(df_a2["r1_draw"], errors="coerce").map(lambda x: f"{x*100:.1f}%")
out["bal_acc"] = pd.to_numeric(df_a2["bal_acc"], errors="coerce").map(lambda x: f"{x*100:.1f}%")
out.sort_values(["model", "ablation"]).to_html(index=False)
ablation model logloss r1_draw bal_acc
abl_elo_only hgb_numonly 1.0035 0.2% 42.6%
abl_full hgb_numonly 1.0035 0.2% 42.6%
abl_h2h_only hgb_numonly 1.0136 0.2% 41.7%
abl_no_elo hgb_numonly 1.0136 0.2% 41.7%
abl_no_h2h hgb_numonly 1.0035 0.2% 42.6%
abl_no_rest hgb_numonly 1.0035 0.2% 42.6%
abl_rest_only hgb_numonly 1.0136 0.2% 41.7%
abl_stats_only hgb_numonly 1.0136 0.2% 41.7%
abl_elo_only sgd_logloss 1.1182 32.2% 43.0%
abl_full sgd_logloss 1.1182 32.2% 43.0%
abl_h2h_only sgd_logloss 1.1074 0.0% 41.2%
abl_no_elo sgd_logloss 1.1074 0.0% 41.2%
abl_no_h2h sgd_logloss 1.1182 32.2% 43.0%
abl_no_rest sgd_logloss 1.1182 32.2% 43.0%
abl_rest_only sgd_logloss 1.1074 0.0% 41.2%
abl_stats_only sgd_logloss 1.1074 0.0% 41.2%
abl_elo_only xgb 1.0048 0.6% 42.5%
abl_full xgb 1.0048 0.6% 42.5%
abl_h2h_only xgb 1.0146 0.6% 41.7%
abl_no_elo xgb 1.0146 0.6% 41.7%
abl_no_h2h xgb 1.0048 0.6% 42.5%
abl_no_rest xgb 1.0048 0.6% 42.5%
abl_rest_only xgb 1.0146 0.6% 41.7%
abl_stats_only xgb 1.0146 0.6% 41.7%

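Critical finding: for the tree models (HGB, XGB), ELO is the only source of signal. sgd_logloss behaves differently: its logloss is higher with ELO (1.1182) than without (1.1074).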

| Ablation config | Contains ELO | HGB logloss | XGB logloss |
|---|---|---|---|
| abl_full (elo+stats+h2h+rest) | ✅ | 1.0035 | 1.0048 |
| abl_elo_only | ✅ | 1.0035 | 1.0048 |
| abl_no_h2h (elo+stats+rest) | ✅ | 1.0035 | 1.0048 |
| abl_no_rest (elo+stats+h2h) | ✅ | 1.0035 | 1.0048 |
| abl_no_elo (stats+h2h+rest) | ❌ | 1.0136 | 1.0146 |
| abl_h2h_only | ❌ | 1.0136 | 1.0146 |
| abl_rest_only | ❌ | 1.0136 | 1.0146 |
| abl_stats_only | ❌ | 1.0136 | 1.0146 |

  • ELO presence/absence acts as a binary switch: logloss shifts by 0.0101 for HGB and 0.0098 for XGB.
  • H2H, rest_days, rolling stats have zero marginal contribution when ELO is present.
  • Stats without ELO = same logloss as h2h_only — confirms stats and h2h are not independent signal sources.
  • Possible explanation: ELO already encodes cumulative form — rolling stats are redundant given ELO.
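The binary-switch reading can be checked directly from the table by grouping the configurations on ELO presence. The sketch below uses the HGB logloss values reported above; the variable names are illustrative.

```python
from statistics import mean

# HGB logloss per ablation config, copied from the table above.
hgb_logloss = {
    "abl_full": 1.0035, "abl_elo_only": 1.0035,
    "abl_no_h2h": 1.0035, "abl_no_rest": 1.0035,
    "abl_no_elo": 1.0136, "abl_h2h_only": 1.0136,
    "abl_rest_only": 1.0136, "abl_stats_only": 1.0136,
}
# Configs whose feature set includes ELO, per the "Contains ELO" column.
with_elo = {"abl_full", "abl_elo_only", "abl_no_h2h", "abl_no_rest"}

ll_with = [v for k, v in hgb_logloss.items() if k in with_elo]
ll_without = [v for k, v in hgb_logloss.items() if k not in with_elo]

delta = mean(ll_without) - mean(ll_with)
spread_with = max(ll_with) - min(ll_with)
print(f"ELO delta: {delta:.4f}, spread among ELO configs: {spread_with:.4f}")
```

A zero spread inside each group combined with a non-zero gap between groups is what "binary switch" means here: at four decimals, no other feature group moves the metric.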

Decisions:

  • include_h2h = false: zero contribution confirmed.
  • include_rest_days = false: zero contribution confirmed.
  • include_elo = true: critical, must remain.
  • ⚠️ Next investigation: Why do rolling stats add nothing given ELO? Consider feature importance / SHAP analysis.


Study 5 (v1.05) — Window Extension

Hypothesis: Adding windows 7 and 12 on top of the base [1,2,3,5,10] set further reduces logloss; effect is monotonic and plateau has not been reached.

Show code
df_we = df[df["Name"].str.contains("window_1_2_3_5_7", na=False)].copy()
df_we = df_we[df_we["model.name"].isin(["hgb_numonly", "xgb"])]
df_we["logloss"] = pd.to_numeric(df_we.get("final.logloss"), errors="coerce")
df_we = df_we.dropna(subset=["logloss"]).sort_values(["model.name", "logloss"])
df_we[["model.name", "params.classification.window_sizes", "logloss"]].to_html(index=False)
model.name params.classification.window_sizes logloss

Finding:

| windows | HGB logloss | XGB logloss | Δ HGB vs [1,2,3,5,10] |
|---|---|---|---|
| [1,2,3,5,10] | 1.0136 | 1.0146 | 0 (Study 1 baseline) |
| [1,2,3,5,7,10] | 1.0136 | 1.0151 | +0.0000 |
| [1,2,3,5,7,10,12] | 1.0112 | 1.0126 | −0.0024 |

  • Window 7 adds no improvement (XGB is marginally worse at 1.0151); window 12 lowers logloss by 0.0024 on HGB and 0.0020 on XGB.
  • No plateau yet — window 20 was not pursued (see Outcomes).
  • Final decision: window_sizes = [1, 2, 3, 5, 7, 10, 12].

Summary — Selected Configuration for Tuning & Final Training

| Parameter | Study | Old value | New value | Δ logloss |
|---|---|---|---|---|
| window_sizes | Window (v1.03) + Window Extension (v1.05) | [1,2,3,5,10] | [1,2,3,5,7,10,12] | −0.0024 (HGB) |
| class_weight | Class Weights (v1.04) | null | {0:1.0, 1:1.25, 2:1.0} | −0.0049 (HGB) |
| side | Side (v1.03) | ['home','diff','away'] | ['home','diff','away'] | 0 (already optimal) |
| include_h2h | Ablation (v1.03) | true | false | 0 (zero contribution) |
| include_rest_days | Ablation (v1.03) | true | false | 0 (zero contribution) |
| include_elo | Ablation (v1.03) | true | true | critical, keep |

Combined estimated improvement over the current baseline (HGB, no class weights, windows=[1,2,3,5,10]): 1.0136 → ~1.0063 (−0.0073), from window extension (−0.0024) plus the optimal class weight (−0.0049), assuming the two gains are additive.
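The estimate is simple arithmetic on the HGB numbers from Study 2 and Study 5, sketched below. The additivity of the two effects is an assumption, not a measured result.

```python
baseline = 1.0136                  # HGB, no class weight, windows [1,2,3,5,10]
window_gain = 1.0112 - baseline    # windows [1,2,3,5,7,10,12], Study 5
cw_gain = 1.0087 - baseline        # class_weight {1: 1.25}, Study 2

# Assumption: the two effects are independent, so their deltas simply add.
estimate = baseline + window_gain + cw_gain
print(f"window: {window_gain:+.4f}, class weight: {cw_gain:+.4f}, estimate: {estimate:.4f}")
```

If the two changes interact, for example if class weighting already captures part of what the longer windows add, the real combined logloss will sit above this estimate.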

Note: class_weight affects training only, not feature engineering. It should be applied at the final_train stage.
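One way to keep that separation explicit is to hold feature-engineering and training parameters in separate mappings, as in the sketch below. The values are the selected configuration from the summary table; the dictionary names, the `stage_params` helper, and the `"features"` stage label are hypothetical (only `final_train` is named in this report).

```python
# Selected values from the summary table; the split into two mappings is an
# assumption made for illustration.
FEATURE_CONFIG = {
    "window_sizes": [1, 2, 3, 5, 7, 10, 12],
    "side": ["home", "diff", "away"],
    "include_elo": True,
    "include_h2h": False,
    "include_rest_days": False,
}
TRAIN_CONFIG = {
    "class_weight": {0: 1.0, 1: 1.25, 2: 1.0},
}

def stage_params(stage):
    """Return the parameters a pipeline stage should see."""
    if stage == "final_train":
        return {**FEATURE_CONFIG, **TRAIN_CONFIG}
    return dict(FEATURE_CONFIG)

print(sorted(stage_params("features")))
print(sorted(stage_params("final_train")))
```

Keeping class_weight out of the feature-stage parameters means changing it never invalidates cached features.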


Outcomes

  1. ✅ Full-scale production run completed with updated features_selected (see report 05, Holdout Analysis).
  2. ✅ Hyperparameter tuning completed for HGB and XGB (see report 04, Model Analysis).
  3. ✅ SHAP analysis conducted: rolling stats are indeed redundant given ELO coverage (see report 04, Model Analysis, feature importance section).
  4. ⏸ window_sizes=[..., 20] not pursued — plateau hypothesis held low priority after ablation results.