Experiment Studies v1.01–v1.05 — Learning Curve, Window, Class Weight, Side, Ablation

218 FINISHED MLflow runs across 7 studies; best-config selection for tuning

Author

Dima Ivanov

Published

May 29, 2026

Overview

Branches: experiment/study_v1.01, experiment/study_v1.02, experiment/study_v1.03, experiment/study_v1.04, experiment/study_v1.05

Total runs: 218 FINISHED across 87 unique run names.

Studies:

#	Study	Branch	Hypothesis
0.1	Initial exploration	v1.01	Single-config sweep across all models for size, side, window, ablation
0.2	Learning curve	v1.02	Logloss scales with training data fraction
1	Window sizes (base)	v1.03	Wider window set reduces logloss
2	Class weights	v1.04	Up-weighting draw class (1) improves balanced accuracy
3	Side representation	v1.03	Diff feature adds signal on top of home/away
4	Feature ablation	v1.03	ELO dominates; rolling stats are redundant given ELO
5	Window extension	v1.05	Windows 7 and 12 give additional logloss gain

Primary models evaluated: hgb_numonly, xgb, logreg, sgd_logloss, baseline.

import csv
import sys
from pathlib import Path
import pandas as pd

project_root = Path().resolve().parent.parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

_runs_csv = Path("../../_temp/runs.csv")
if _runs_csv.exists():
    rows = list(csv.DictReader(_runs_csv.open()))
    finished = [r for r in rows if r.get("Status", "").upper() == "FINISHED"]
    df = pd.DataFrame(finished)
    print(f"Total FINISHED runs (CSV): {len(finished)}")
else:
    # Load directly from MLflow — matches_clf_v1.0_study (id=25)
    import os
    from src.pipelines._config import get_pipeline_config
    import mlflow
    _cfg = get_pipeline_config()
    os.environ.setdefault("MLFLOW_S3_ENDPOINT_URL", _cfg.minio_endpoint_url)
    os.environ.setdefault("AWS_ACCESS_KEY_ID",      _cfg.minio_access_key)
    os.environ.setdefault("AWS_SECRET_ACCESS_KEY",  _cfg.minio_secret_key)
    mlflow.set_tracking_uri(_cfg.mlflow_tracking_uri)
    _client = mlflow.tracking.MlflowClient()
    _runs = _client.search_runs(
        experiment_ids=["25"],
        filter_string="status = 'FINISHED'",
        max_results=1000,
    )
    _rows = []
    for _r in _runs:
        _row = {"Name": _r.data.tags.get("mlflow.runName", ""), "Status": "FINISHED"}
        _row.update({k: v for k, v in _r.data.params.items()})
        _row.update({k: v for k, v in _r.data.metrics.items()})
        _rows.append(_row)
    df = pd.DataFrame(_rows)
    print(f"Total FINISHED runs (MLflow): {len(df)}")

if not df.empty and "Name" in df.columns:
    print(f"Unique names: {df['Name'].nunique()}")
    if "model.name" in df.columns:
        print(f"Models: {sorted(df['model.name'].dropna().unique())}")

Total FINISHED runs (MLflow): 218
Unique names: 87
Models: ['baseline', 'hgb_numonly', 'logreg', 'sgd_logloss', 'xgb']
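
With 218 runs but only 87 unique names, most names were run more than once. Several later cells resolve this by sorting on logloss and dropping duplicates. A minimal standard-library sketch of that keep-best-per-name step follows; the rows are made up for illustration and `best_run_per_name` is a hypothetical helper, not part of the project code.

```python
from typing import Iterable


def best_run_per_name(rows: Iterable[dict]) -> dict:
    """Keep the lowest-logloss FINISHED run for each run name."""
    best = {}
    for row in rows:
        if row.get("Status", "").upper() != "FINISHED":
            continue
        try:
            logloss = float(row["final.logloss"])
        except (KeyError, TypeError, ValueError):
            continue  # skip runs without a usable metric
        name = row["Name"]
        if name not in best or logloss < float(best[name]["final.logloss"]):
            best[name] = row
    return best


# Illustrative rows only (not taken from the tracking server)
rows = [
    {"Name": "side | xgb", "Status": "FINISHED", "final.logloss": "1.02"},
    {"Name": "side | xgb", "Status": "FINISHED", "final.logloss": "1.01"},
    {"Name": "side | xgb", "Status": "FAILED", "final.logloss": "0.50"},
    {"Name": "size | logreg", "Status": "FINISHED", "final.logloss": ""},
]
best = best_run_per_name(rows)
print({name: row["final.logloss"] for name, row in best.items()})
```

The FAILED run and the run with an empty metric are ignored, so only the better of the two finished `side | xgb` runs survives.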

Study 0.1 (v1.01) — Initial Exploration

Hypothesis: At least one off-the-shelf model beats a frequency-prior baseline on the default feature set. Setup: four mini-sweeps (size, side, window, ablation), each with one configuration per model across the 5 models.
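
For reference, a frequency-prior baseline predicts the same class probabilities for every match, so its logloss is the cross-entropy between the holdout labels and that fixed prior. The sketch below shows the calculation. The class shares are assumed for illustration and are not the dataset's actual frequencies.

```python
import math


def constant_prior_logloss(prior, labels):
    """Mean negative log-probability when every sample gets the same `prior`."""
    return -sum(math.log(prior[y]) for y in labels) / len(labels)


# Assumed class shares: 0 = home win, 1 = draw, 2 = away win (illustrative only)
labels = [0] * 45 + [1] * 25 + [2] * 30
prior = {0: 0.45, 1: 0.25, 2: 0.30}
uniform = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}

print(f"frequency prior: {constant_prior_logloss(prior, labels):.4f}")
print(f"uniform prior:   {constant_prior_logloss(uniform, labels):.4f}")  # equals ln(3)
```

Because such a baseline always picks the same class, its balanced accuracy over three classes is 1/3, which is consistent with the 33.3% rows in the results.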

import pandas as pd

_v1_prefixes = ["size | ", "ablation | ", "side | ", "window | "]
_df_v1 = df[
    df["Name"].str.startswith(tuple(_v1_prefixes), na=False)
    & ~df["Name"].str.startswith("window_", na=False)
].copy()
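# regex=False: pandas otherwise treats the multi-character pattern " | " as a regex,
# which splits on every space and leaves "|" in the model column.
_name_parts = _df_v1["Name"].str.split(" | ", regex=False)
_df_v1["study"] = _name_parts.str[0]
_df_v1["model"] = _name_parts.str[1]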
_df_v1["logloss"] = pd.to_numeric(_df_v1.get("final.logloss"), errors="coerce")
_df_v1["bal_acc"] = pd.to_numeric(_df_v1.get("final.balanced_accuracy"), errors="coerce")

_out_v1 = (
    _df_v1[["study", "model", "logloss", "bal_acc"]]
    .dropna(subset=["logloss"])
    .sort_values(["study", "logloss"])
    .reset_index(drop=True)
)
_out_v1["logloss"] = _out_v1["logloss"].map("{:.4f}".format)
_out_v1["bal_acc"] = _out_v1["bal_acc"].map(lambda x: f"{x*100:.1f}%" if pd.notna(x) else "—")
_out_v1.to_html(index=False)

study	model	logloss	bal_acc
ablation	\|	1.0587	42.1%
ablation	\|	1.0587	42.1%
ablation	\|	1.0587	42.1%
ablation	\|	1.0587	42.1%
ablation	\|	1.0699	41.2%
ablation	\|	1.0699	41.2%
ablation	\|	1.0712	33.3%
ablation	\|	1.0712	33.3%
ablation	\|	1.0712	33.3%
ablation	\|	1.0712	33.3%
ablation	\|	1.0712	33.3%
ablation	\|	1.0712	33.3%
ablation	\|	1.0846	41.5%
ablation	\|	1.0846	41.5%
ablation	\|	1.0846	41.5%
ablation	\|	1.0846	41.5%
ablation	\|	1.0950	40.7%
ablation	\|	1.0950	40.7%
ablation	\|	1.1308	41.0%
ablation	\|	1.1308	41.0%
ablation	\|	1.1308	41.0%
ablation	\|	1.1308	41.0%
ablation	\|	1.1404	40.3%
ablation	\|	1.1404	40.3%
ablation	\|	7.9452	38.8%
ablation	\|	7.9452	38.8%
ablation	\|	8.0116	41.6%
ablation	\|	8.0116	41.6%
ablation	\|	8.0116	41.6%
ablation	\|	8.0116	41.6%
side	\|	1.0136	41.7%
side	\|	1.0139	41.7%
side	\|	1.0146	41.7%
side	\|	1.0148	41.6%
side	\|	1.0165	41.6%
side	\|	1.0172	41.5%
side	\|	1.0254	42.3%
side	\|	1.0265	42.5%
side	\|	1.0286	41.8%
side	\|	1.0428	37.8%
side	\|	1.0439	37.7%
side	\|	1.0449	38.3%
side	\|	1.0461	38.4%
side	\|	1.0475	42.2%
side	\|	1.0513	38.9%
side	\|	1.0587	42.1%
side	\|	1.0600	38.9%
side	\|	1.0635	40.1%
side	\|	1.0657	38.2%
side	\|	1.0712	33.3%
side	\|	1.0712	33.3%
side	\|	1.0712	33.3%
side	\|	1.0712	33.3%
side	\|	1.0712	33.3%
side	\|	1.0756	41.2%
side	\|	1.0844	41.3%
side	\|	1.0846	41.5%
side	\|	1.1074	41.2%
side	\|	1.1076	41.2%
side	\|	1.1199	37.3%
side	\|	1.1212	37.8%
side	\|	1.1264	41.1%
side	\|	1.1308	41.0%
side	\|	1.1589	37.2%
side	\|	1.1607	37.5%
side	\|	1.1883	41.7%
side	\|	3.5700	38.7%
side	\|	3.8249	39.9%
side	\|	8.0116	41.6%
side	\|	10.0231	41.3%
size	\|	1.0063	42.5%
size	\|	1.0070	42.8%
size	\|	1.0086	42.4%
size	\|	1.0099	42.7%
size	\|	1.0213	42.4%
size	\|	1.0353	42.1%
size	\|	1.0587	42.1%
size	\|	1.0711	33.3%
size	\|	1.0712	33.3%
size	\|	1.0712	33.3%
size	\|	1.0716	33.3%
size	\|	1.0727	33.3%
size	\|	1.0846	41.5%
size	\|	1.0987	38.2%
size	\|	1.1107	41.4%
size	\|	1.1308	41.0%
size	\|	1.1568	40.8%
size	\|	1.2038	40.8%
size	\|	1.2398	38.4%
size	\|	1.5534	39.9%
size	\|	1.6704	39.4%
size	\|	1.9156	38.2%
size	\|	8.0116	41.6%
size	\|	18.4348	40.5%
size	\|	24.3228	36.3%
window	\|	1.0136	41.7%
window	\|	1.0146	41.7%
window	\|	1.0212	42.1%
window	\|	1.0258	40.5%
window	\|	1.0267	40.5%
window	\|	1.0320	42.1%
window	\|	1.0358	39.3%
window	\|	1.0363	39.4%
window	\|	1.0429	42.1%
window	\|	1.0432	38.2%
window	\|	1.0439	38.3%
window	\|	1.0489	39.0%
window	\|	1.0513	42.1%
window	\|	1.0528	36.6%
window	\|	1.0533	36.7%
window	\|	1.0587	42.1%
window	\|	1.0679	36.7%
window	\|	1.0712	33.3%
window	\|	1.0712	33.3%
window	\|	1.0712	33.3%
window	\|	1.0712	33.3%
window	\|	1.0712	33.3%
window	\|	1.0780	39.4%
window	\|	1.0795	41.1%
window	\|	1.0796	41.2%
window	\|	1.0815	41.0%
window	\|	1.0846	41.5%
window	\|	1.0847	41.2%
window	\|	1.1002	40.5%
window	\|	1.1074	41.2%
window	\|	1.1108	40.8%
window	\|	1.1157	41.0%
window	\|	1.1263	40.8%
window	\|	1.1266	38.8%
window	\|	1.1308	41.0%
window	\|	2.5225	42.8%
window	\|	4.0185	42.6%
window	\|	5.8618	39.5%
window	\|	8.0116	41.6%
window	\|	11.1841	36.0%

Finding: hgb_numonly and xgb consistently beat the baseline by roughly 0.04–0.06 logloss. The single-config results confirm that the task is learnable and motivate the systematic studies in v1.03 onward.
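
The `\|` entries in the model column of the table above are consistent with how pandas `Series.str.split` handles a multi-character pattern: unless `regex=False` is passed, the pattern is interpreted as a regular expression. The same effect can be reproduced with the standard library. This is a minimal illustration, not the notebook's code, and the run name is an assumed example of the `study | model` format.

```python
import re

name = "ablation | hgb_numonly"  # assumed example of the run-name format

# As a regex, " | " means "a space OR a space", so every space is a separator.
print(re.split(" | ", name))   # ['ablation', '|', 'hgb_numonly']

# A literal split on the three-character separator gives the intended parts.
print(name.split(" | "))       # ['ablation', 'hgb_numonly']
```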

Study 0.2 (v1.02) — Learning Curve

Hypothesis: Logloss decreases monotonically with training data fraction; models benefit from more data up to frac=1.0.

import pandas as pd
import matplotlib.pyplot as plt

_df_lc = df[df["Name"].str.startswith("frac=", na=False)].copy()
_df_lc["frac"] = pd.to_numeric(
    _df_lc["Name"].str.extract(r"frac=(\S+) \|")[0], errors="coerce"
)
# regex=False so " | " is matched literally rather than as a regex alternation
_df_lc["model"] = _df_lc["Name"].str.split(" | ", regex=False).str[-1]
_df_lc["logloss"] = pd.to_numeric(_df_lc.get("final.logloss"), errors="coerce")
_df_lc = _df_lc.dropna(subset=["frac", "logloss"]).sort_values(["model", "frac"])

_models_lc = [m for m in ["baseline", "sgd_logloss", "logreg", "hgb_numonly", "xgb"] if m in _df_lc["model"].unique()]
_colors = {"baseline": "gray", "sgd_logloss": "#9C27B0", "logreg": "#FF9800",
           "hgb_numonly": "#2196F3", "xgb": "#4CAF50"}

fig, ax = plt.subplots(figsize=(9, 5))
for _m in _models_lc:
    _sub = _df_lc[_df_lc["model"] == _m].sort_values("frac")
    ax.plot(_sub["frac"], _sub["logloss"], marker="o", label=_m, color=_colors.get(_m))
ax.set_xscale("log")
ax.set_xlabel("Training fraction (log scale)")
ax.set_ylabel("Log-loss (holdout)")
ax.set_title("Learning curve — logloss vs training data fraction")
ax.legend(fontsize=9)
plt.tight_layout()
plt.show()

_summary = (
    _df_lc.groupby("model")["logloss"]
    .agg(best_logloss="min", worst_logloss="max")
)
print("Logloss range per model (min = full data, max = 0.1% data):")
for _m, _row in _summary.iterrows():
    print(f"  {_m:<15} {_row['best_logloss']:.4f} → {_row['worst_logloss']:.4f}")

Logloss range per model (min = full data, max = 0.1% data):
  baseline        1.0711 → 1.0727
  hgb_numonly     1.0087 → 1.6771
  logreg          1.0163 → 1.8958
  sgd_logloss     1.0389 → 23.6461
  xgb             1.0098 → 1.5723

# reuse _df_lc from cell above — same data, SGD excluded
_models_lc_nosgd = [m for m in ["baseline", "logreg", "hgb_numonly", "xgb"] if m in _df_lc["model"].unique()]
_colors_nosgd    = {"baseline": "gray", "logreg": "#FF9800", "hgb_numonly": "#2196F3", "xgb": "#4CAF50"}

fig, ax = plt.subplots(figsize=(9, 5))
for _m in _models_lc_nosgd:
    _sub = _df_lc[_df_lc["model"] == _m].sort_values("frac")
    ax.plot(_sub["frac"], _sub["logloss"], marker="o", label=_m, color=_colors_nosgd.get(_m))
ax.set_xscale("log")
ax.set_xlabel("Training fraction (log scale)")
ax.set_ylabel("Log-loss (holdout)")
ax.set_title("Learning curve — logloss vs training data fraction (without SGD)")
ax.legend(fontsize=9)
plt.tight_layout()
plt.show()

Finding: hgb_numonly and xgb continue to improve through frac=1.0 — no saturation. baseline is flat (frequency prior is data-independent). logreg and sgd_logloss plateau earlier (~frac=0.25). Decision: all subsequent studies use frac=1.0.
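
The monotonicity part of the hypothesis can be checked mechanically per model. The sketch below assumes each curve is available as (frac, logloss) pairs. The two curves are hypothetical and `is_monotone_decreasing` is an illustrative helper name.

```python
def is_monotone_decreasing(points, tol=0.0):
    """True if logloss never rises by more than `tol` as frac increases."""
    ordered = sorted(points)  # sort by frac
    losses = [loss for _, loss in ordered]
    return all(b <= a + tol for a, b in zip(losses, losses[1:]))


# Hypothetical (frac, logloss) pairs, not measured values
curve_still_improving = [(0.001, 1.60), (0.01, 1.20), (0.1, 1.05), (1.0, 1.01)]
curve_with_bump = [(0.001, 1.60), (0.01, 1.20), (0.1, 1.25), (1.0, 1.04)]

print(is_monotone_decreasing(curve_still_improving))  # True
print(is_monotone_decreasing(curve_with_bump))        # False
```

A small positive `tol` may be useful in practice so that run-to-run noise is not counted as a violation.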

Study 1 (v1.03) — Window Sizes

Hypothesis: Increasing the rolling-window set beyond [1] reduces logloss; effect saturates around 5 windows.

import pandas as pd

df_w = df[df["Name"].str.contains("window", na=False)].copy()
df_w = df_w[df_w["model.name"].isin(["hgb_numonly", "xgb"])]
df_w = df_w[df_w["params.classification.class_weight"].isna() | (df_w["params.classification.class_weight"] == "")]

cols = {
    "Name": "Name",
    "model.name": "model",
    "params.classification.window_sizes": "windows",
    "final.logloss": "logloss",
    "final.recall_class_1": "recall_draw",
    "final.balanced_accuracy": "bal_acc",
}
df_w2 = df_w[list(cols.keys())].rename(columns=cols).copy()
df_w2["logloss"] = pd.to_numeric(df_w2["logloss"], errors="coerce")
df_w2["recall_draw"] = pd.to_numeric(df_w2["recall_draw"], errors="coerce")
df_w2["bal_acc"] = pd.to_numeric(df_w2["bal_acc"], errors="coerce")

df_w2 = df_w2.dropna(subset=["logloss"])

# Keep the best (lowest-logloss) run per (Name, model) to remove duplicates
df_w2 = df_w2.sort_values("logloss").drop_duplicates(subset=["Name", "model"])

# Format for display
out = df_w2[["model", "windows", "logloss", "recall_draw", "bal_acc"]].copy()
out["logloss"] = out["logloss"].map("{:.4f}".format)
out["recall_draw"] = out["recall_draw"].map(lambda x: f"{x*100:.1f}%")
out["bal_acc"] = out["bal_acc"].map(lambda x: f"{x*100:.1f}%")
out.sort_values(["model", "logloss"]).to_html(index=False)

model	windows	logloss	recall_draw	bal_acc
hgb_numonly	[1, 2, 3, 5, 10]	1.0136	0.2%	41.7%
xgb	[1, 2, 3, 5, 10]	1.0146	0.6%	41.7%

Finding:

windows	HGB logloss	XGB logloss	Δ HGB vs [1,2,3,5,10]
[1]	1.0528	1.0533	+0.0392
[1,2,3,5]	1.0258	1.0267	+0.0122
[1,2,3,5,10]	1.0136	1.0146	0 (reference)

Each additional window gives a clear monotonic improvement — no plateau at 5 windows.
Preliminary decision: window_sizes = [1, 2, 3, 5, 10] — extended in Study 5 (v1.05).
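
This report does not show how `window_sizes` is turned into features. As a sketch under the assumption that each window size w produces a trailing mean over the previous w matches (excluding the current one), the feature construction could look like this. The function name, feature names and input values are hypothetical.

```python
from statistics import mean


def rolling_features(values, window_sizes):
    """For each match i, average the previous w values for every w in window_sizes.

    The current match is excluded so the feature only uses past information.
    Returns one dict per match; a window with no history yields None.
    """
    rows = []
    for i in range(len(values)):
        row = {}
        for w in window_sizes:
            past = values[max(0, i - w):i]
            row[f"roll_mean_{w}"] = mean(past) if past else None
        rows.append(row)
    return rows


# Hypothetical goals scored by one team in consecutive matches
goals = [2, 0, 1, 3, 1]
features = rolling_features(goals, window_sizes=[1, 2, 3])
print(features[-1])
```

Under this reading, a wider window set gives the model several time scales of form at once, which is one plausible reason the logloss keeps improving as windows are added.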

Study 2 (v1.04) — Class Weights

Hypothesis: Up-weighting the draw class (class 1) improves balanced accuracy at acceptable logloss cost.

df_cw = df[df["model.name"].isin(["hgb_numonly", "xgb"])].copy()
df_cw = df_cw[df_cw["params.classification.frac"].astype(str) == "1.0"]

def cw_label(r):
    cw = r.get("params.classification.class_weight", "")
    c1 = r.get("params.classification.class_weight.1", "")
    if str(cw) == "balanced":
        return "balanced"
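    # NaN is truthy, so test it explicitly; otherwise unweighted runs get the label "cw1=nan"
    if pd.notna(c1) and str(c1) != "":
        return f"cw1={c1}"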
    return "null"

df_cw["cw_label"] = df_cw.apply(cw_label, axis=1)

metrics_cols = {
    "model.name": "model",
    "cw_label": "class_weight",
    "final.logloss": "logloss",
    "final.recall_class_0": "r0_home",
    "final.recall_class_1": "r1_draw",
    "final.recall_class_2": "r2_away",
    "final.balanced_accuracy": "bal_acc",
    "final.accuracy": "acc",
}

df_cw2 = df_cw[list(metrics_cols.keys())].rename(columns=metrics_cols)
for col in ["logloss", "r0_home", "r1_draw", "r2_away", "bal_acc", "acc"]:
    df_cw2[col] = pd.to_numeric(df_cw2[col], errors="coerce")

df_cw2 = df_cw2.dropna(subset=["logloss"]).drop_duplicates(subset=["model", "class_weight"])
df_cw2 = df_cw2.sort_values(["model", "logloss"])

out = df_cw2.copy()
for c in ["r0_home", "r1_draw", "r2_away", "bal_acc", "acc"]:
    out[c] = out[c].map(lambda x: f"{x*100:.1f}%")
out["logloss"] = out["logloss"].map("{:.4f}".format)
out.to_html(index=False)

model	class_weight	logloss	r0_home	r1_draw	r2_away	bal_acc	acc
hgb_numonly	cw1=1.25	1.0087	80.2%	7.7%	41.5%	43.2%	50.1%
hgb_numonly	cw1=nan	1.0136	84.3%	0.2%	40.7%	41.7%	49.8%
hgb_numonly	cw1=1.5	1.0202	68.1%	29.7%	34.3%	44.0%	48.0%
hgb_numonly	balanced	1.0283	51.1%	32.0%	51.9%	45.0%	46.6%
hgb_numonly	cw1=1.75	1.0355	54.0%	51.1%	26.8%	44.0%	44.8%
hgb_numonly	cw1=2.0	1.0530	42.7%	66.0%	21.0%	43.2%	41.7%
hgb_numonly	cw1=2.5	1.0918	28.4%	81.6%	13.8%	41.3%	37.0%
xgb	cw1=1.25	1.0098	79.5%	9.1%	40.9%	43.2%	49.9%
xgb	cw1=nan	1.0126	83.7%	0.6%	41.2%	41.8%	49.7%
xgb	cw1=1.5	1.0215	67.6%	30.4%	34.1%	44.0%	47.8%
xgb	balanced	1.0293	51.0%	32.0%	51.4%	44.8%	46.4%
xgb	cw1=1.75	1.0362	54.1%	50.2%	27.1%	43.8%	44.7%
xgb	cw1=2.0	1.0532	43.4%	64.4%	21.6%	43.1%	41.8%
xgb	cw1=2.5	1.0908	29.5%	79.9%	14.8%	41.4%	37.4%

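Finding: Only {1: 1.25} improves logloss (HGB 1.0136 → 1.0087) while also lifting draw recall, from 0.2% to 7.7% on HGB and from 0.6% to 9.1% on XGB. All higher weights (1.5 and above) worsen logloss. The balanced config maximises bal_acc (+3.3pp on HGB) but raises logloss by about 0.015 (1.0136 → 1.0283).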

Decision for production: class_weight = {0: 1.0, 1: 1.25, 2: 1.0} — best logloss + partial draw coverage.
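
How the pipeline passes these weights to the estimators is not shown here. One common approach, sketched below as an assumption, is to expand the class-weight mapping into per-sample weights. The second helper reproduces the formula behind scikit-learn's "balanced" option, n_samples / (n_classes * class_count). The labels are hypothetical.

```python
from collections import Counter


def sample_weights(labels, class_weight):
    """Expand a {class: weight} mapping into one weight per training sample."""
    return [class_weight.get(y, 1.0) for y in labels]


def balanced_class_weight(labels):
    """The "balanced" heuristic: n_samples / (n_classes * count of the class)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


# Hypothetical labels: 0 = home win, 1 = draw, 2 = away win
y = [0, 0, 0, 0, 1, 1, 2, 2]

weights = sample_weights(y, {0: 1.0, 1: 1.25, 2: 1.0})
balanced = balanced_class_weight(y)
print(weights)
print(balanced)
```

The fixed {1: 1.25} mapping touches only the draw class, whereas the balanced heuristic rescales every class, which matches the larger shifts in home and away recall seen for the balanced rows.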

Study 3 (v1.03) — Side Representation

Hypothesis: Including a difference feature (home_stat − away_stat) adds signal on top of raw home/away values.

df_s = df[df["model.name"].isin(["hgb_numonly", "xgb"])].copy()
df_s = df_s[df_s["Name"].str.contains("side|window", na=False)]

metrics_cols_s = {
    "model.name": "model",
    "params.classification.side": "side",
    "final.logloss": "logloss",
    "final.recall_class_1": "r1_draw",
    "final.balanced_accuracy": "bal_acc",
}

df_s2 = df_s[list(metrics_cols_s.keys())].rename(columns=metrics_cols_s)
df_s2["logloss"] = pd.to_numeric(df_s2["logloss"], errors="coerce")
df_s2 = df_s2.dropna(subset=["logloss", "side"])
df_s2 = df_s2.sort_values(["model", "logloss"]).drop_duplicates(subset=["model", "side"])

out = df_s2.copy()
out["logloss"] = out["logloss"].map("{:.4f}".format)
out["r1_draw"] = pd.to_numeric(df_s2["r1_draw"], errors="coerce").map(lambda x: f"{x*100:.1f}%")
out["bal_acc"] = pd.to_numeric(df_s2["bal_acc"], errors="coerce").map(lambda x: f"{x*100:.1f}%")
out.sort_values(["model", "logloss"]).to_html(index=False)

model	side	logloss	r1_draw	bal_acc
hgb_numonly	['home', 'diff', 'away']	1.0112	0.2%	41.9%
hgb_numonly	['home', 'away']	1.0139	0.2%	41.7%
hgb_numonly	['diff']	1.0165	0.0%	41.6%
hgb_numonly	['home']	1.0428	0.0%	37.8%
hgb_numonly	['away']	1.0449	0.0%	38.3%
xgb	['home', 'diff', 'away']	1.0126	0.6%	41.8%
xgb	['home', 'away']	1.0148	0.5%	41.6%
xgb	['diff']	1.0172	0.2%	41.5%
xgb	['home']	1.0439	0.2%	37.7%
xgb	['away']	1.0461	0.2%	38.4%

Finding:

side	HGB logloss	note
['home']	1.0428	one-sided, loses context
['away']	1.0449	one-sided
['diff']	1.0165	diff alone beats home or away individually
['home','away']	1.0139	+0.0003 vs full (slightly worse)
['home','diff','away']	1.0136	best overall

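diff alone outperforms each of home or away individually, which suggests the delta captures relative form.
The output table above lists 1.0112 (HGB) and 1.0126 (XGB) for ['home', 'diff', 'away']. These equal the Study 5 results for windows [1,2,3,5,7,10,12] and probably come from a window-extension run matched by the side|window name filter; the like-for-like HGB value at windows [1,2,3,5,10] is 1.0136.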
Decision: side = ['home', 'diff', 'away'] (already current default).
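
To make the three side options concrete, the sketch below builds home, away and diff features from per-team stats, with diff defined as home minus away as in the hypothesis. The stat names, feature prefixes and `build_side_features` helper are hypothetical; the project's actual feature builder is not shown in this report.

```python
def build_side_features(home, away, sides=("home", "diff", "away")):
    """Combine per-team stats into model features for the requested sides.

    `home` and `away` map stat name -> value; "diff" is home minus away.
    """
    features = {}
    for stat in home:
        if "home" in sides:
            features[f"home_{stat}"] = home[stat]
        if "away" in sides:
            features[f"away_{stat}"] = away[stat]
        if "diff" in sides:
            features[f"diff_{stat}"] = home[stat] - away[stat]
    return features


# Hypothetical rolling stats for one match
home = {"goals_w5": 2.0, "shots_w5": 12.0}
away = {"goals_w5": 1.5, "shots_w5": 9.0}

print(build_side_features(home, away))
print(build_side_features(home, away, sides=("diff",)))
```

The diff-only variant halves the feature count relative to home plus away while keeping the relative comparison between the two teams.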

Study 4 (v1.03) — Feature Ablation

Hypothesis: Not all feature groups contribute equally; ELO may dominate.

df_a = df[df["Name"].str.startswith("abl_", na=False)].copy()
df_a = df_a[df_a["model.name"].isin(["hgb_numonly", "xgb", "sgd_logloss"])]

metrics_cols_a = {
    "Name": "ablation",
    "model.name": "model",
    "final.logloss": "logloss",
    "final.recall_class_1": "r1_draw",
    "final.balanced_accuracy": "bal_acc",
}

df_a2 = df_a[list(metrics_cols_a.keys())].rename(columns=metrics_cols_a)
df_a2["logloss"] = pd.to_numeric(df_a2["logloss"], errors="coerce")
df_a2 = df_a2.dropna(subset=["logloss"])
# regex=False so " | " is matched literally rather than as a regex alternation
df_a2["ablation"] = df_a2["ablation"].str.split(" | ", regex=False).str[0]
df_a2 = df_a2.sort_values(["model", "logloss"]).drop_duplicates(subset=["ablation", "model"])

out = df_a2.copy()
out["logloss"] = out["logloss"].map("{:.4f}".format)
out["r1_draw"] = pd.to_numeric(df_a2["r1_draw"], errors="coerce").map(lambda x: f"{x*100:.1f}%")
out["bal_acc"] = pd.to_numeric(df_a2["bal_acc"], errors="coerce").map(lambda x: f"{x*100:.1f}%")
out.sort_values(["model", "ablation"]).to_html(index=False)

ablation	model	logloss	r1_draw	bal_acc
abl_elo_only	hgb_numonly	1.0035	0.2%	42.6%
abl_full	hgb_numonly	1.0035	0.2%	42.6%
abl_h2h_only	hgb_numonly	1.0136	0.2%	41.7%
abl_no_elo	hgb_numonly	1.0136	0.2%	41.7%
abl_no_h2h	hgb_numonly	1.0035	0.2%	42.6%
abl_no_rest	hgb_numonly	1.0035	0.2%	42.6%
abl_rest_only	hgb_numonly	1.0136	0.2%	41.7%
abl_stats_only	hgb_numonly	1.0136	0.2%	41.7%
abl_elo_only	sgd_logloss	1.1182	32.2%	43.0%
abl_full	sgd_logloss	1.1182	32.2%	43.0%
abl_h2h_only	sgd_logloss	1.1074	0.0%	41.2%
abl_no_elo	sgd_logloss	1.1074	0.0%	41.2%
abl_no_h2h	sgd_logloss	1.1182	32.2%	43.0%
abl_no_rest	sgd_logloss	1.1182	32.2%	43.0%
abl_rest_only	sgd_logloss	1.1074	0.0%	41.2%
abl_stats_only	sgd_logloss	1.1074	0.0%	41.2%
abl_elo_only	xgb	1.0048	0.6%	42.5%
abl_full	xgb	1.0048	0.6%	42.5%
abl_h2h_only	xgb	1.0146	0.6%	41.7%
abl_no_elo	xgb	1.0146	0.6%	41.7%
abl_no_h2h	xgb	1.0048	0.6%	42.5%
abl_no_rest	xgb	1.0048	0.6%	42.5%
abl_rest_only	xgb	1.0146	0.6%	41.7%
abl_stats_only	xgb	1.0146	0.6%	41.7%

Critical finding — ELO is the only source of signal:

Ablation config	Contains ELO	HGB logloss	XGB logloss
abl_full (elo+stats+h2h+rest)	✅	1.0035	1.0048
abl_elo_only	✅	1.0035	1.0048
abl_no_h2h (elo+stats+rest)	✅	1.0035	1.0048
abl_no_rest (elo+stats+h2h)	✅	1.0035	1.0048
abl_no_elo (stats+h2h+rest)	❌	1.0136	1.0146
abl_h2h_only	❌	1.0136	1.0146
abl_rest_only	❌	1.0136	1.0146
abl_stats_only	❌	1.0136	1.0146

ELO presence/absence acts as a binary switch for the tree models: Δ logloss = 0.0101 for HGB and 0.0098 for XGB.
H2H, rest_days and rolling stats have zero marginal contribution when ELO is present.
Stats without ELO give the same logloss as h2h_only, so stats and h2h do not act as independent signal sources.
sgd_logloss behaves differently: with ELO its logloss is higher (1.1182 vs 1.1074), although draw recall rises from 0.0% to 32.2%.
Possible explanation: ELO already encodes cumulative form, so rolling stats are redundant given ELO.

Decisions:
- include_h2h = false: zero contribution confirmed.
- include_rest_days = false: zero contribution confirmed.
- include_elo = true: critical, must remain.
- ⚠️ Next investigation: why do rolling stats add nothing given ELO? Consider feature importance / SHAP analysis.
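
The "binary switch" reading can be checked by grouping the HGB results on whether the config contains ELO and comparing the spread within each group to the gap between groups. The values below are copied from the ablation table; the grouping code itself is only an illustrative sketch.

```python
# HGB logloss per ablation config, copied from the table above
hgb_logloss = {
    "abl_full": 1.0035, "abl_elo_only": 1.0035,
    "abl_no_h2h": 1.0035, "abl_no_rest": 1.0035,
    "abl_no_elo": 1.0136, "abl_h2h_only": 1.0136,
    "abl_rest_only": 1.0136, "abl_stats_only": 1.0136,
}
# Which configs include ELO, per the "Contains ELO" column
with_elo = {"abl_full", "abl_elo_only", "abl_no_h2h", "abl_no_rest"}

groups = {True: [], False: []}
for config, loss in hgb_logloss.items():
    groups[config in with_elo].append(loss)

for has_elo, losses in groups.items():
    spread = max(losses) - min(losses)
    print(f"ELO={has_elo}: mean={sum(losses) / len(losses):.4f}, spread={spread:.4f}")

gap = min(groups[False]) - max(groups[True])
print(f"gap between groups: {gap:.4f}")
```

A within-group spread of zero next to a between-group gap of 0.0101 is what the table shows to four decimals.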

Study 5 (v1.05) — Window Extension

Hypothesis: Adding windows 7 and 12 on top of the base [1,2,3,5,10] set further reduces logloss; effect is monotonic and plateau has not been reached.

df_we = df[df["Name"].str.contains("window_1_2_3_5_7", na=False)].copy()
df_we = df_we[df_we["model.name"].isin(["hgb_numonly", "xgb"])]
df_we["logloss"] = pd.to_numeric(df_we.get("final.logloss"), errors="coerce")
df_we = df_we.dropna(subset=["logloss"]).sort_values(["model.name", "logloss"])
df_we[["model.name", "params.classification.window_sizes", "logloss"]].to_html(index=False)

model.name	params.classification.window_sizes	logloss

Finding:

windows	HGB logloss	XGB logloss	Δ HGB vs [1,2,3,5,10]	Δ XGB vs [1,2,3,5,10]
[1,2,3,5,10]	1.0136	1.0146	0 (Study 1 reference)	0
[1,2,3,5,7,10]	1.0136	1.0151	0.0000	+0.0005
[1,2,3,5,7,10,12]	1.0112	1.0126	−0.0024	−0.0020

Window 7 adds no improvement (XGB is 0.0005 worse); window 12 lowers logloss by 0.0024 on HGB and 0.0020 on XGB.
No plateau yet — window 20 was not pursued (see Outcomes).
Final decision: window_sizes = [1, 2, 3, 5, 7, 10, 12].
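
Putting the Study 1 and Study 5 numbers side by side makes the trend easier to read. The sketch below recomputes each window set's delta against the [1,2,3,5,10] reference, using only the logloss values reported in the two finding tables.

```python
# (HGB, XGB) holdout logloss per window set, copied from Studies 1 and 5
results = {
    (1,): (1.0528, 1.0533),
    (1, 2, 3, 5): (1.0258, 1.0267),
    (1, 2, 3, 5, 10): (1.0136, 1.0146),
    (1, 2, 3, 5, 7, 10): (1.0136, 1.0151),
    (1, 2, 3, 5, 7, 10, 12): (1.0112, 1.0126),
}
reference = (1, 2, 3, 5, 10)
ref_hgb, ref_xgb = results[reference]

# Negative delta = lower (better) logloss than the reference window set
deltas = {
    windows: (round(hgb - ref_hgb, 4), round(xgb - ref_xgb, 4))
    for windows, (hgb, xgb) in results.items()
}
for windows, (d_hgb, d_xgb) in deltas.items():
    print(f"{list(windows)}: HGB {d_hgb:+.4f}, XGB {d_xgb:+.4f}")
```

The gain from adding window 12 is roughly a fifth of the step from [1,2,3,5] to [1,2,3,5,10].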

Summary — Selected Configuration for Tuning & Final Training

Parameter	Study	Old value	New value	Δ logloss
`window_sizes`	Window (v1.03) + Window Extension (v1.05)	[1,2,3,5,10]	[1,2,3,5,7,10,12]	−0.0024 (HGB)
`class_weight`	Class weights (v1.04)	null	{0:1.0, 1:1.25, 2:1.0}	−0.0049 (HGB)
`side`	Side representation (v1.03)	['home','diff','away']	['home','diff','away']	0 (already optimal)
`include_h2h`	Ablation	true	false	0 (zero contribution)
`include_rest_days`	Ablation	true	false	0 (zero contribution)
`include_elo`	Ablation	true	true	critical — keep

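Combined estimated improvement from the current baseline (HGB, no CW, windows=[1,2,3,5,10]): 1.0136 → ~1.0063 (−0.0073). The estimate adds the window-extension delta (−0.0024) to the class-weight delta (−0.0049), so it assumes the two effects are additive.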
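
The arithmetic behind the combined figure, written out as a short sketch. The logloss values are the HGB numbers from Studies 1, 2 and 5; treating the two deltas as additive is an assumption of the estimate, not a measured result.

```python
baseline_hgb = 1.0136            # HGB, no class weight, windows [1, 2, 3, 5, 10]
delta_window = 1.0112 - 1.0136   # Study 5: add windows 7 and 12
delta_cw = 1.0087 - 1.0136       # Study 2: class_weight {1: 1.25}

# Assumes the two effects are independent and simply add up
estimate = baseline_hgb + delta_window + delta_cw
print(f"{estimate:.4f}")  # 1.0063
```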

Note: class_weight affects training only, not feature engineering. It should be applied at the final_train stage.

Outcomes

✅ Full-scale production run completed with updated features_selected (see report 04 — holdout analysis).
✅ Hyperparameter tuning completed for HGB and XGB (see report 05 — model analysis).
✅ SHAP analysis conducted — rolling stats are indeed redundant given ELO coverage (see report 05 — feature importance section).
⏸ window_sizes=[..., 20] not pursued — plateau hypothesis held low priority after ablation results.