Overview
Branches: experiment/study_v1.01, experiment/study_v1.02, experiment/study_v1.03, experiment/study_v1.04, experiment/study_v1.05
Total runs: 218 FINISHED across 87 unique run names.
Studies:

| Study | Topic | Branch | Key result |
|---|---|---|---|
| 0.1 | Initial exploration | v1.01 | Single-config sweep across all models for size, side, window, ablation |
| 0.2 | Learning curve | v1.02 | Logloss scales with training data fraction |
| 1 | Window sizes (base) | v1.03 | Wider window set reduces logloss |
| 2 | Class weights | v1.04 | Up-weighting draw class (1) improves balanced accuracy |
| 3 | Side representation | v1.03 | Diff feature adds signal on top of home/away |
| 4 | Feature ablation | v1.03 | ELO dominates; rolling stats are redundant given ELO |
| 5 | Window extension | v1.05 | Windows 7 and 12 give additional logloss gain |
Primary models evaluated: hgb_numonly, xgb, logreg, sgd_logloss, baseline.
import csv
import sys
from pathlib import Path

import pandas as pd

# Make the project root importable when running from the notebook directory.
project_root = Path().resolve().parent.parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

_runs_csv = Path("../../_temp/runs.csv")
if _runs_csv.exists():
    # Prefer the cached CSV export when it is available.
    with _runs_csv.open(newline="") as _fh:
        rows = list(csv.DictReader(_fh))
    finished = [r for r in rows if r.get("Status", "").upper() == "FINISHED"]
    df = pd.DataFrame(finished)
    print(f"Total FINISHED runs (CSV): {len(finished)}")
else:
    # Load directly from MLflow — matches_clf_v1.0_study (id=25)
    import os
    from src.pipelines._config import get_pipeline_config
    import mlflow

    _cfg = get_pipeline_config()
    os.environ.setdefault("MLFLOW_S3_ENDPOINT_URL", _cfg.minio_endpoint_url)
    os.environ.setdefault("AWS_ACCESS_KEY_ID", _cfg.minio_access_key)
    os.environ.setdefault("AWS_SECRET_ACCESS_KEY", _cfg.minio_secret_key)
    mlflow.set_tracking_uri(_cfg.mlflow_tracking_uri)
    _client = mlflow.tracking.MlflowClient()
    _runs = _client.search_runs(
        experiment_ids=["25"],
        filter_string="status = 'FINISHED'",
        max_results=1000,
    )
    # Flatten each run into one row: name, status, params, metrics.
    _rows = []
    for _r in _runs:
        _row = {"Name": _r.data.tags.get("mlflow.runName", ""), "Status": "FINISHED"}
        _row.update(_r.data.params)
        _row.update(_r.data.metrics)
        _rows.append(_row)
    df = pd.DataFrame(_rows)
    print(f"Total FINISHED runs (MLflow): {len(df)}")

if not df.empty and "Name" in df.columns:
    print(f"Unique names: {df['Name'].nunique()}")
if "model.name" in df.columns:
    print(f"Models: {sorted(df['model.name'].dropna().unique())}")
Total FINISHED runs (MLflow): 218
Unique names: 87
Models: ['baseline', 'hgb_numonly', 'logreg', 'sgd_logloss', 'xgb']
Study 0.1 (v1.01) — Initial Exploration
Hypothesis: Can any off-the-shelf model beat a frequency-prior baseline using the default feature set? Four mini-sweeps (size, side, window, ablation) — one configuration per model, 5 models each.
import pandas as pd

_v1_prefixes = ["size | ", "ablation | ", "side | ", "window | "]
_df_v1 = df[
    df["Name"].str.startswith(tuple(_v1_prefixes), na=False)
    & ~df["Name"].str.startswith("window_", na=False)
].copy()
# regex=False: a multi-character pattern is otherwise treated as a regex,
# and " | " would then split on every single space (model would become "|").
_name_parts = _df_v1["Name"].str.split(" | ", regex=False)
_df_v1["study"] = _name_parts.str[0]
_df_v1["model"] = _name_parts.str[1]
_df_v1["logloss"] = pd.to_numeric(_df_v1.get("final.logloss"), errors="coerce")
_df_v1["bal_acc"] = pd.to_numeric(_df_v1.get("final.balanced_accuracy"), errors="coerce")
_out_v1 = (
    _df_v1[["study", "model", "logloss", "bal_acc"]]
    .dropna(subset=["logloss"])
    .sort_values(["study", "logloss"])
    .reset_index(drop=True)
)
# Format for display.
_out_v1["logloss"] = _out_v1["logloss"].map("{:.4f}".format)
_out_v1["bal_acc"] = _out_v1["bal_acc"].map(lambda x: f"{x*100:.1f}%" if pd.notna(x) else "—")
_out_v1.to_html(index=False)
The model column is omitted below: in the rendered output every cell held only the separator character "|", most likely because the run name was split on single spaces. Rows are sorted by logloss within each study.

| study | logloss | bal_acc |
|---|---|---|
| ablation | 1.0587 | 42.1% |
| ablation | 1.0587 | 42.1% |
| ablation | 1.0587 | 42.1% |
| ablation | 1.0587 | 42.1% |
| ablation | 1.0699 | 41.2% |
| ablation | 1.0699 | 41.2% |
| ablation | 1.0712 | 33.3% |
| ablation | 1.0712 | 33.3% |
| ablation | 1.0712 | 33.3% |
| ablation | 1.0712 | 33.3% |
| ablation | 1.0712 | 33.3% |
| ablation | 1.0712 | 33.3% |
| ablation | 1.0846 | 41.5% |
| ablation | 1.0846 | 41.5% |
| ablation | 1.0846 | 41.5% |
| ablation | 1.0846 | 41.5% |
| ablation | 1.0950 | 40.7% |
| ablation | 1.0950 | 40.7% |
| ablation | 1.1308 | 41.0% |
| ablation | 1.1308 | 41.0% |
| ablation | 1.1308 | 41.0% |
| ablation | 1.1308 | 41.0% |
| ablation | 1.1404 | 40.3% |
| ablation | 1.1404 | 40.3% |
| ablation | 7.9452 | 38.8% |
| ablation | 7.9452 | 38.8% |
| ablation | 8.0116 | 41.6% |
| ablation | 8.0116 | 41.6% |
| ablation | 8.0116 | 41.6% |
| ablation | 8.0116 | 41.6% |
| side | 1.0136 | 41.7% |
| side | 1.0139 | 41.7% |
| side | 1.0146 | 41.7% |
| side | 1.0148 | 41.6% |
| side | 1.0165 | 41.6% |
| side | 1.0172 | 41.5% |
| side | 1.0254 | 42.3% |
| side | 1.0265 | 42.5% |
| side | 1.0286 | 41.8% |
| side | 1.0428 | 37.8% |
| side | 1.0439 | 37.7% |
| side | 1.0449 | 38.3% |
| side | 1.0461 | 38.4% |
| side | 1.0475 | 42.2% |
| side | 1.0513 | 38.9% |
| side | 1.0587 | 42.1% |
| side | 1.0600 | 38.9% |
| side | 1.0635 | 40.1% |
| side | 1.0657 | 38.2% |
| side | 1.0712 | 33.3% |
| side | 1.0712 | 33.3% |
| side | 1.0712 | 33.3% |
| side | 1.0712 | 33.3% |
| side | 1.0712 | 33.3% |
| side | 1.0756 | 41.2% |
| side | 1.0844 | 41.3% |
| side | 1.0846 | 41.5% |
| side | 1.1074 | 41.2% |
| side | 1.1076 | 41.2% |
| side | 1.1199 | 37.3% |
| side | 1.1212 | 37.8% |
| side | 1.1264 | 41.1% |
| side | 1.1308 | 41.0% |
| side | 1.1589 | 37.2% |
| side | 1.1607 | 37.5% |
| side | 1.1883 | 41.7% |
| side | 3.5700 | 38.7% |
| side | 3.8249 | 39.9% |
| side | 8.0116 | 41.6% |
| side | 10.0231 | 41.3% |
| size | 1.0063 | 42.5% |
| size | 1.0070 | 42.8% |
| size | 1.0086 | 42.4% |
| size | 1.0099 | 42.7% |
| size | 1.0213 | 42.4% |
| size | 1.0353 | 42.1% |
| size | 1.0587 | 42.1% |
| size | 1.0711 | 33.3% |
| size | 1.0712 | 33.3% |
| size | 1.0712 | 33.3% |
| size | 1.0716 | 33.3% |
| size | 1.0727 | 33.3% |
| size | 1.0846 | 41.5% |
| size | 1.0987 | 38.2% |
| size | 1.1107 | 41.4% |
| size | 1.1308 | 41.0% |
| size | 1.1568 | 40.8% |
| size | 1.2038 | 40.8% |
| size | 1.2398 | 38.4% |
| size | 1.5534 | 39.9% |
| size | 1.6704 | 39.4% |
| size | 1.9156 | 38.2% |
| size | 8.0116 | 41.6% |
| size | 18.4348 | 40.5% |
| size | 24.3228 | 36.3% |
| window | 1.0136 | 41.7% |
| window | 1.0146 | 41.7% |
| window | 1.0212 | 42.1% |
| window | 1.0258 | 40.5% |
| window | 1.0267 | 40.5% |
| window | 1.0320 | 42.1% |
| window | 1.0358 | 39.3% |
| window | 1.0363 | 39.4% |
| window | 1.0429 | 42.1% |
| window | 1.0432 | 38.2% |
| window | 1.0439 | 38.3% |
| window | 1.0489 | 39.0% |
| window | 1.0513 | 42.1% |
| window | 1.0528 | 36.6% |
| window | 1.0533 | 36.7% |
| window | 1.0587 | 42.1% |
| window | 1.0679 | 36.7% |
| window | 1.0712 | 33.3% |
| window | 1.0712 | 33.3% |
| window | 1.0712 | 33.3% |
| window | 1.0712 | 33.3% |
| window | 1.0712 | 33.3% |
| window | 1.0780 | 39.4% |
| window | 1.0795 | 41.1% |
| window | 1.0796 | 41.2% |
| window | 1.0815 | 41.0% |
| window | 1.0846 | 41.5% |
| window | 1.0847 | 41.2% |
| window | 1.1002 | 40.5% |
| window | 1.1074 | 41.2% |
| window | 1.1108 | 40.8% |
| window | 1.1157 | 41.0% |
| window | 1.1263 | 40.8% |
| window | 1.1266 | 38.8% |
| window | 1.1308 | 41.0% |
| window | 2.5225 | 42.8% |
| window | 4.0185 | 42.6% |
| window | 5.8618 | 39.5% |
| window | 8.0116 | 41.6% |
| window | 11.1841 | 36.0% |
Finding: hgb_numonly and xgb consistently outperform the baseline (logloss ≈ 1.071) by ~0.04–0.06 logloss. The single-config results confirm that the models learn signal beyond the frequency prior, which motivates the systematic studies in v1.03+.
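To make the baseline concrete: a frequency-prior model predicts the same class probabilities for every match, so on data with those same class shares its logloss equals the entropy of the class distribution, and its balanced accuracy is exactly 1/3 for three classes (consistent with the 33.3% rows above). The sketch below illustrates this with hypothetical class frequencies; the real home/draw/away shares are not given in this report, and the helper names are illustrative only.

```python
import math
from collections import Counter


def prior_logloss(y_true, prior):
    """Mean negative log-likelihood of one constant probability vector."""
    return -sum(math.log(prior[y]) for y in y_true) / len(y_true)


def balanced_accuracy(y_true, y_pred, classes):
    """Unweighted mean of per-class recall."""
    recalls = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical labels: 0 = home win, 1 = draw, 2 = away win.
y = [0] * 45 + [1] * 25 + [2] * 30
prior = {c: n / len(y) for c, n in Counter(y).items()}

loss = prior_logloss(y, prior)
entropy = -sum(p * math.log(p) for p in prior.values())

# A constant predictor always outputs the majority class.
majority = max(prior, key=prior.get)
bal_acc = balanced_accuracy(y, [majority] * len(y), classes=[0, 1, 2])

print(f"logloss={loss:.4f} entropy={entropy:.4f} bal_acc={bal_acc:.3f}")
```

Any model whose logloss falls below this entropy is extracting information from the features rather than from the class shares alone.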
Study 0.2 (v1.02) — Learning Curve
Hypothesis: Logloss decreases monotonically with training data fraction; models benefit from more data up to frac=1.0.
import pandas as pd
import matplotlib.pyplot as plt

_df_lc = df[df["Name"].str.startswith("frac=", na=False)].copy()
# Extract the numeric fraction from names of the form "frac=<value> | ...".
_df_lc["frac"] = pd.to_numeric(
    _df_lc["Name"].str.extract(r"frac=(\S+) \|")[0], errors="coerce"
)
# The model name is the last " | "-separated part of the run name.
_df_lc["model"] = _df_lc["Name"].str.split(" | ", regex=False).str[-1]
_df_lc["logloss"] = pd.to_numeric(_df_lc.get("final.logloss"), errors="coerce")
_df_lc = _df_lc.dropna(subset=["frac", "logloss"]).sort_values(["model", "frac"])

_models_lc = [
    m for m in ["baseline", "sgd_logloss", "logreg", "hgb_numonly", "xgb"]
    if m in _df_lc["model"].unique()
]
_colors = {"baseline": "gray", "sgd_logloss": "#9C27B0", "logreg": "#FF9800",
           "hgb_numonly": "#2196F3", "xgb": "#4CAF50"}

fig, ax = plt.subplots(figsize=(9, 5))
for _m in _models_lc:
    _sub = _df_lc[_df_lc["model"] == _m].sort_values("frac")
    ax.plot(_sub["frac"], _sub["logloss"], marker="o", label=_m, color=_colors.get(_m))
ax.set_xscale("log")
ax.set_xlabel("Training fraction (log scale)")
ax.set_ylabel("Log-loss (holdout)")
ax.set_title("Learning curve — logloss vs training data fraction")
ax.legend(fontsize=9)
plt.tight_layout()
plt.show()

# Best and worst logloss per model across all fractions.
_summary = (
    _df_lc.groupby("model")["logloss"]
    .agg(best_logloss="min", worst_logloss="max")
)
print("Logloss range per model (min = full data, max = 0.1% data):")
for _m, _row in _summary.iterrows():
    print(f"{_m:<15} {_row['best_logloss']:.4f} → {_row['worst_logloss']:.4f}")
Logloss range per model (min = full data, max = 0.1% data):
baseline 1.0711 → 1.0727
hgb_numonly 1.0087 → 1.6771
logreg 1.0163 → 1.8958
sgd_logloss 1.0389 → 23.6461
xgb 1.0098 → 1.5723
# reuse _df_lc from cell above — same data, SGD excluded
_models_lc_nosgd = [
    m for m in ["baseline", "logreg", "hgb_numonly", "xgb"]
    if m in _df_lc["model"].unique()
]
_colors_nosgd = {"baseline": "gray", "logreg": "#FF9800",
                 "hgb_numonly": "#2196F3", "xgb": "#4CAF50"}

fig, ax = plt.subplots(figsize=(9, 5))
for _m in _models_lc_nosgd:
    _sub = _df_lc[_df_lc["model"] == _m].sort_values("frac")
    ax.plot(_sub["frac"], _sub["logloss"], marker="o", label=_m, color=_colors_nosgd.get(_m))
ax.set_xscale("log")
ax.set_xlabel("Training fraction (log scale)")
ax.set_ylabel("Log-loss (holdout)")
ax.set_title("Learning curve — logloss vs training data fraction (without SGD)")
ax.legend(fontsize=9)
plt.tight_layout()
plt.show()
Finding: hgb_numonly and xgb continue to improve through frac=1.0 — no saturation. baseline is flat (frequency prior is data-independent). logreg and sgd_logloss plateau earlier (~frac=0.25). Decision: all subsequent studies use frac=1.0.
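The "no saturation" reading can be checked mechanically rather than by eye. The following is a minimal sketch, assuming the learning curve is available as (frac, logloss) pairs; the function names are hypothetical, and only the two end points of the example curve (1.6771 at frac=0.001 and 1.0087 at frac=1.0, the hgb_numonly range printed above) come from this report. The intermediate values are made up for illustration.

```python
def is_monotone_decreasing(points, tol=0.0):
    """True if logloss never rises by more than `tol` as frac grows."""
    losses = [loss for _, loss in sorted(points)]
    return all(b <= a + tol for a, b in zip(losses, losses[1:]))


def last_step_gain(points):
    """Logloss reduction between the two largest fractions (> 0: still improving)."""
    ordered = sorted(points)
    return ordered[-2][1] - ordered[-1][1]


# End points from the report; intermediate values are hypothetical.
curve = [
    (0.001, 1.6771),
    (0.01, 1.2500),
    (0.1, 1.0500),
    (0.25, 1.0300),
    (0.5, 1.0180),
    (1.0, 1.0087),
]

print("monotone:", is_monotone_decreasing(curve))
print(f"gain from last step: {last_step_gain(curve):.4f}")
```

A positive gain on the last step is what "no saturation at frac=1.0" means in practice; a small `tol` can absorb run-to-run noise.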
Study 1 (v1.03) — Window Sizes
Hypothesis: Increasing the rolling-window set beyond [1] reduces logloss; effect saturates around 5 windows.
import pandas as pd

df_w = df[df["Name"].str.contains("window", na=False)].copy()
df_w = df_w[df_w["model.name"].isin(["hgb_numonly", "xgb"])]
# Keep only runs without class weights.
df_w = df_w[
    df_w["params.classification.class_weight"].isna()
    | (df_w["params.classification.class_weight"] == "")
]
cols = {
    "Name": "Name",
    "model.name": "model",
    "params.classification.window_sizes": "windows",
    "final.logloss": "logloss",
    "final.recall_class_1": "recall_draw",
    "final.balanced_accuracy": "bal_acc",
}
df_w2 = df_w[list(cols.keys())].rename(columns=cols).copy()
df_w2["logloss"] = pd.to_numeric(df_w2["logloss"], errors="coerce")
df_w2["recall_draw"] = pd.to_numeric(df_w2["recall_draw"], errors="coerce")
df_w2["bal_acc"] = pd.to_numeric(df_w2["bal_acc"], errors="coerce")
df_w2 = df_w2.dropna(subset=["logloss"])
df_w2 = df_w2.sort_values(["model", "logloss"])
# Keep best run per (name, model) to remove duplicates
df_w2 = df_w2.sort_values("logloss").drop_duplicates(subset=["Name", "model"])
# Format for display
out = df_w2[["model", "windows", "logloss", "recall_draw", "bal_acc"]].copy()
out["logloss"] = out["logloss"].map("{:.4f}".format)
out["recall_draw"] = out["recall_draw"].map(lambda x: f"{x*100:.1f}%")
out["bal_acc"] = out["bal_acc"].map(lambda x: f"{x*100:.1f}%")
out.sort_values(["model", "logloss"]).to_html(index=False)
| model | windows | logloss | recall_draw | bal_acc |
|---|---|---|---|---|
| hgb_numonly | [1, 2, 3, 5, 10] | 1.0136 | 0.2% | 41.7% |
| xgb | [1, 2, 3, 5, 10] | 1.0146 | 0.6% | 41.7% |
Finding:

| Windows | HGB logloss | XGB logloss | Note |
|---|---|---|---|
| [1] | 1.0528 | 1.0533 | — |
| [1,2,3,5] | 1.0258 | 1.0267 | — |
| [1,2,3,5,10] | 1.0136 | 1.0146 | baseline |
Each larger window set lowers logloss for both models (HGB: 1.0528 → 1.0258 → 1.0136; XGB: 1.0533 → 1.0267 → 1.0146). The improvement is monotonic, with no plateau at 5 windows.
Preliminary decision: window_sizes = [1, 2, 3, 5, 10] — extended in Study 5 (v1.05).
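For readers unfamiliar with the window_sizes parameter, the sketch below shows one plausible way a window set turns into features: one aggregate per window size over a team's most recent matches. This report does not show how the pipeline actually builds these features, so the use of a plain mean, the function name, and the feature names are all assumptions.

```python
def rolling_features(history, windows):
    """Mean of the last w values in `history` for each window size w.

    `history` holds a team's past per-match values, oldest first, and must
    exclude the match being predicted so that no target information leaks.
    """
    feats = {}
    for w in windows:
        recent = history[-w:]
        feats[f"mean_last_{w}"] = sum(recent) / len(recent) if recent else float("nan")
    return feats


# Hypothetical goals scored in a team's previous 12 matches, oldest first.
goals = [2, 0, 1, 3, 1, 0, 2, 1, 1, 4, 0, 2]
feats = rolling_features(goals, windows=[1, 2, 3, 5, 10])

for name, value in feats.items():
    print(f"{name:<13} {value:.2f}")
```

Short windows react quickly to recent form while long windows smooth out noise, which is one reason a wider set can carry more information than any single window.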
Study 2 (v1.04) — Class Weights
Hypothesis: Up-weighting the draw class (class 1) improves balanced accuracy at acceptable logloss cost.
import pandas as pd

df_cw = df[df["model.name"].isin(["hgb_numonly", "xgb"])].copy()
df_cw = df_cw[df_cw["params.classification.frac"].astype(str) == "1.0"]


def cw_label(r):
    """Return a short label for the class-weight setting of a run."""
    cw = r.get("params.classification.class_weight", "")
    c1 = r.get("params.classification.class_weight.1", "")
    if str(cw) == "balanced":
        return "balanced"
    # NaN is truthy, so test for missing values explicitly; otherwise
    # unweighted runs are labelled "cw1=nan" instead of "null".
    if pd.notna(c1) and str(c1) != "":
        return f"cw1={c1}"
    return "null"


df_cw["cw_label"] = df_cw.apply(cw_label, axis=1)
metrics_cols = {
    "model.name": "model",
    "cw_label": "class_weight",
    "final.logloss": "logloss",
    "final.recall_class_0": "r0_home",
    "final.recall_class_1": "r1_draw",
    "final.recall_class_2": "r2_away",
    "final.balanced_accuracy": "bal_acc",
    "final.accuracy": "acc",
}
df_cw2 = df_cw[list(metrics_cols.keys())].rename(columns=metrics_cols)
for col in ["logloss", "r0_home", "r1_draw", "r2_away", "bal_acc", "acc"]:
    df_cw2[col] = pd.to_numeric(df_cw2[col], errors="coerce")
df_cw2 = df_cw2.dropna(subset=["logloss"]).drop_duplicates(subset=["model", "class_weight"])
df_cw2 = df_cw2.sort_values(["model", "logloss"])
# Format for display.
out = df_cw2.copy()
for c in ["r0_home", "r1_draw", "r2_away", "bal_acc", "acc"]:
    out[c] = out[c].map(lambda x: f"{x*100:.1f}%")
out["logloss"] = out["logloss"].map("{:.4f}".format)
out.to_html(index=False)
| model | class_weight | logloss | r0_home | r1_draw | r2_away | bal_acc | acc |
|---|---|---|---|---|---|---|---|
| hgb_numonly | cw1=1.25 | 1.0087 | 80.2% | 7.7% | 41.5% | 43.2% | 50.1% |
| hgb_numonly | cw1=nan | 1.0136 | 84.3% | 0.2% | 40.7% | 41.7% | 49.8% |
| hgb_numonly | cw1=1.5 | 1.0202 | 68.1% | 29.7% | 34.3% | 44.0% | 48.0% |
| hgb_numonly | balanced | 1.0283 | 51.1% | 32.0% | 51.9% | 45.0% | 46.6% |
| hgb_numonly | cw1=1.75 | 1.0355 | 54.0% | 51.1% | 26.8% | 44.0% | 44.8% |
| hgb_numonly | cw1=2.0 | 1.0530 | 42.7% | 66.0% | 21.0% | 43.2% | 41.7% |
| hgb_numonly | cw1=2.5 | 1.0918 | 28.4% | 81.6% | 13.8% | 41.3% | 37.0% |
| xgb | cw1=1.25 | 1.0098 | 79.5% | 9.1% | 40.9% | 43.2% | 49.9% |
| xgb | cw1=nan | 1.0126 | 83.7% | 0.6% | 41.2% | 41.8% | 49.7% |
| xgb | cw1=1.5 | 1.0215 | 67.6% | 30.4% | 34.1% | 44.0% | 47.8% |
| xgb | balanced | 1.0293 | 51.0% | 32.0% | 51.4% | 44.8% | 46.4% |
| xgb | cw1=1.75 | 1.0362 | 54.1% | 50.2% | 27.1% | 43.8% | 44.7% |
| xgb | cw1=2.0 | 1.0532 | 43.4% | 64.4% | 21.6% | 43.1% | 41.8% |
| xgb | cw1=2.5 | 1.0908 | 29.5% | 79.9% | 14.8% | 41.4% | 37.4% |
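Finding: Only {1: 1.25} improves logloss relative to the unweighted run (labelled cw1=nan in the table) while also lifting draw recall: from 0.2% to 7.7% for HGB and from 0.6% to 9.1% for XGB. All higher weights (1.5+) worsen logloss. The balanced config maximises bal_acc (+3.3pp for HGB) but raises HGB logloss by about 0.015 (1.0136 → 1.0283).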
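The bal_acc column is the unweighted mean of the three per-class recalls, which is why raising draw recall can lift it even when overall accuracy falls. As a quick check, the sketch below recomputes it for three hgb_numonly rows of the table; the recalls are the rounded values shown there, so agreement is only expected up to rounding. The helper name is illustrative.

```python
def balanced_accuracy_from_recalls(recalls):
    """Balanced accuracy = unweighted mean of per-class recall."""
    return sum(recalls) / len(recalls)


# (r0_home, r1_draw, r2_away) and reported bal_acc, copied from the table.
rows = {
    "unweighted (cw1=nan)": ((0.843, 0.002, 0.407), 0.417),
    "cw1=1.25": ((0.802, 0.077, 0.415), 0.432),
    "balanced": ((0.511, 0.320, 0.519), 0.450),
}

computed = {
    name: balanced_accuracy_from_recalls(recalls)
    for name, (recalls, _) in rows.items()
}
for name, (_, reported) in rows.items():
    print(f"{name:<22} computed={computed[name]:.3f} reported={reported:.3f}")
```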
Decision for production: class_weight = {0: 1.0, 1: 1.25, 2: 1.0} — best logloss + partial draw coverage.
Study 3 (v1.03) — Side Representation
Hypothesis: Including a difference feature (home_stat − away_stat) adds signal on top of raw home/away values.
import pandas as pd

df_s = df[df["model.name"].isin(["hgb_numonly", "xgb"])].copy()
# Caution: "window" also matches the Study 5 window_* runs, so the best row
# per side below may come from an extended window set.
df_s = df_s[df_s["Name"].str.contains("side|window", na=False)]
metrics_cols_s = {
    "model.name": "model",
    "params.classification.side": "side",
    "final.logloss": "logloss",
    "final.recall_class_1": "r1_draw",
    "final.balanced_accuracy": "bal_acc",
}
df_s2 = df_s[list(metrics_cols_s.keys())].rename(columns=metrics_cols_s)
df_s2["logloss"] = pd.to_numeric(df_s2["logloss"], errors="coerce")
df_s2 = df_s2.dropna(subset=["logloss", "side"])
# Keep the best run per (model, side).
df_s2 = df_s2.sort_values(["model", "logloss"]).drop_duplicates(subset=["model", "side"])
# Format for display.
out = df_s2.copy()
out["logloss"] = out["logloss"].map("{:.4f}".format)
out["r1_draw"] = pd.to_numeric(df_s2["r1_draw"], errors="coerce").map(lambda x: f"{x*100:.1f}%")
out["bal_acc"] = pd.to_numeric(df_s2["bal_acc"], errors="coerce").map(lambda x: f"{x*100:.1f}%")
out.sort_values(["model", "logloss"]).to_html(index=False)
| model | side | logloss | r1_draw | bal_acc |
|---|---|---|---|---|
| hgb_numonly | ['home', 'diff', 'away'] | 1.0112 | 0.2% | 41.9% |
| hgb_numonly | ['home', 'away'] | 1.0139 | 0.2% | 41.7% |
| hgb_numonly | ['diff'] | 1.0165 | 0.0% | 41.6% |
| hgb_numonly | ['home'] | 1.0428 | 0.0% | 37.8% |
| hgb_numonly | ['away'] | 1.0449 | 0.0% | 38.3% |
| xgb | ['home', 'diff', 'away'] | 1.0126 | 0.6% | 41.8% |
| xgb | ['home', 'away'] | 1.0148 | 0.5% | 41.6% |
| xgb | ['diff'] | 1.0172 | 0.2% | 41.5% |
| xgb | ['home'] | 1.0439 | 0.2% | 37.7% |
| xgb | ['away'] | 1.0461 | 0.2% | 38.4% |
Finding:

| Side | HGB logloss | Note |
|---|---|---|
| ['home'] | 1.0428 | one-sided — loses context |
| ['away'] | 1.0449 | one-sided |
| ['diff'] | 1.0165 | diff alone beats home or away individually |
| ['home','away'] | 1.0139 | +0.0003 vs full |
| ['home','diff','away'] | 1.0136 | best overall |
diff alone outperforms each of home or away individually — the delta captures form.
Decision: side = ['home', 'diff', 'away'] (already current default).
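As an illustration of what the side setting controls, the sketch below builds home, away and diff columns from per-team statistics, with diff defined as home_stat minus away_stat as in the hypothesis. The function and the feature names are hypothetical; the pipeline's actual naming scheme is not shown in this report.

```python
def build_side_features(home, away, sides=("home", "diff", "away")):
    """Combine per-team stats into one feature dict for a single match."""
    feats = {}
    for name in home:
        if "home" in sides:
            feats[f"home_{name}"] = home[name]
        if "away" in sides:
            feats[f"away_{name}"] = away[name]
        if "diff" in sides:
            # The difference feature from the hypothesis: home_stat - away_stat.
            feats[f"diff_{name}"] = home[name] - away[name]
    return feats


# Hypothetical per-team statistics for one match.
home_stats = {"goals_last_5": 3, "shots_last_5": 12}
away_stats = {"goals_last_5": 1, "shots_last_5": 15}

full = build_side_features(home_stats, away_stats)
diff_only = build_side_features(home_stats, away_stats, sides=("diff",))
print(full)
print(diff_only)
```

Tree-based models split on one feature at a time, so an explicit difference column hands them a comparison they would otherwise have to approximate with many separate splits on the home and away columns.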
Study 4 (v1.03) — Feature Ablation
Hypothesis: Not all feature groups contribute equally; ELO may dominate.
import pandas as pd

df_a = df[df["Name"].str.startswith("abl_", na=False)].copy()
df_a = df_a[df_a["model.name"].isin(["hgb_numonly", "xgb", "sgd_logloss"])]
metrics_cols_a = {
    "Name": "ablation",
    "model.name": "model",
    "final.logloss": "logloss",
    "final.recall_class_1": "r1_draw",
    "final.balanced_accuracy": "bal_acc",
}
df_a2 = df_a[list(metrics_cols_a.keys())].rename(columns=metrics_cols_a)
df_a2["logloss"] = pd.to_numeric(df_a2["logloss"], errors="coerce")
df_a2 = df_a2.dropna(subset=["logloss"])
# The ablation label is the part of the run name before the first " | ".
df_a2["ablation"] = df_a2["ablation"].str.split(" | ", regex=False).str[0]
# Keep the best run per (ablation, model).
df_a2 = df_a2.sort_values(["model", "logloss"]).drop_duplicates(subset=["ablation", "model"])
# Format for display.
out = df_a2.copy()
out["logloss"] = out["logloss"].map("{:.4f}".format)
out["r1_draw"] = pd.to_numeric(df_a2["r1_draw"], errors="coerce").map(lambda x: f"{x*100:.1f}%")
out["bal_acc"] = pd.to_numeric(df_a2["bal_acc"], errors="coerce").map(lambda x: f"{x*100:.1f}%")
out.sort_values(["model", "ablation"]).to_html(index=False)
| ablation | model | logloss | r1_draw | bal_acc |
|---|---|---|---|---|
| abl_elo_only | hgb_numonly | 1.0035 | 0.2% | 42.6% |
| abl_full | hgb_numonly | 1.0035 | 0.2% | 42.6% |
| abl_h2h_only | hgb_numonly | 1.0136 | 0.2% | 41.7% |
| abl_no_elo | hgb_numonly | 1.0136 | 0.2% | 41.7% |
| abl_no_h2h | hgb_numonly | 1.0035 | 0.2% | 42.6% |
| abl_no_rest | hgb_numonly | 1.0035 | 0.2% | 42.6% |
| abl_rest_only | hgb_numonly | 1.0136 | 0.2% | 41.7% |
| abl_stats_only | hgb_numonly | 1.0136 | 0.2% | 41.7% |
| abl_elo_only | sgd_logloss | 1.1182 | 32.2% | 43.0% |
| abl_full | sgd_logloss | 1.1182 | 32.2% | 43.0% |
| abl_h2h_only | sgd_logloss | 1.1074 | 0.0% | 41.2% |
| abl_no_elo | sgd_logloss | 1.1074 | 0.0% | 41.2% |
| abl_no_h2h | sgd_logloss | 1.1182 | 32.2% | 43.0% |
| abl_no_rest | sgd_logloss | 1.1182 | 32.2% | 43.0% |
| abl_rest_only | sgd_logloss | 1.1074 | 0.0% | 41.2% |
| abl_stats_only | sgd_logloss | 1.1074 | 0.0% | 41.2% |
| abl_elo_only | xgb | 1.0048 | 0.6% | 42.5% |
| abl_full | xgb | 1.0048 | 0.6% | 42.5% |
| abl_h2h_only | xgb | 1.0146 | 0.6% | 41.7% |
| abl_no_elo | xgb | 1.0146 | 0.6% | 41.7% |
| abl_no_h2h | xgb | 1.0048 | 0.6% | 42.5% |
| abl_no_rest | xgb | 1.0048 | 0.6% | 42.5% |
| abl_rest_only | xgb | 1.0146 | 0.6% | 41.7% |
| abl_stats_only | xgb | 1.0146 | 0.6% | 41.7% |
Critical finding — ELO is the only source of signal:
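
| Configuration | ELO included | HGB logloss | XGB logloss |
|---|---|---|---|
| abl_full (elo+stats+h2h+rest) | ✅ | 1.0035 | 1.0048 |
| abl_elo_only | ✅ | 1.0035 | 1.0048 |
| abl_no_h2h (elo+stats+rest) | ✅ | 1.0035 | 1.0048 |
| abl_no_rest (elo+stats+h2h) | ✅ | 1.0035 | 1.0048 |
| abl_no_elo (stats+h2h+rest) | ❌ | 1.0136 | 1.0146 |
| abl_h2h_only | ❌ | 1.0136 | 1.0146 |
| abl_rest_only | ❌ | 1.0136 | 1.0146 |
| abl_stats_only | ❌ | 1.0136 | 1.0146 |
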
ELO presence/absence is a binary switch (delta = 0.0101 in logloss for HGB).
H2H, rest_days, rolling stats have zero marginal contribution when ELO is present.
Stats without ELO = same logloss as h2h_only — confirms stats and h2h are not independent signal sources.
Possible explanation: ELO already encodes cumulative form — rolling stats are redundant given ELO.
Decisions:
- include_h2h = false — zero contribution confirmed.
- include_rest_days = false — zero contribution confirmed.
- include_elo = true — critical, must remain.
- ⚠️ Next investigation: Why do rolling stats add nothing given ELO? Consider feature importance / SHAP analysis.
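The "ELO already encodes cumulative form" explanation is easier to see from the rating update itself: every result nudges both teams' ratings, so a rating is a running summary of all past results weighted by opponent strength. The sketch below uses the textbook Elo formulas; the K-factor of 20, the 400-point scale, and the absence of a home-advantage term are standard defaults assumed here, not the project's implementation.

```python
def expected_score(rating_a, rating_b):
    """Textbook Elo expectation for A against B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(home, away, result, k=20.0):
    """Return new (home, away) ratings after one match.

    result: 1.0 = home win, 0.5 = draw, 0.0 = away win.
    """
    delta = k * (result - expected_score(home, away))
    return home + delta, away - delta


# Two hypothetical teams starting level; the home side wins three in a row.
home, away = 1500.0, 1500.0
for _ in range(3):
    home, away = update_elo(home, away, result=1.0)
    print(f"home={home:.1f} away={away:.1f}")
```

Each successive win moves the ratings by less, because the stronger side is increasingly expected to win. A rolling average of recent results carries much of the same information, which is consistent with the redundancy observed in the ablation.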
Study 5 (v1.05) — Window Extension
Hypothesis: Adding windows 7 and 12 on top of the base [1,2,3,5,10] set further reduces logloss; effect is monotonic and plateau has not been reached.
import pandas as pd

df_we = df[df["Name"].str.contains("window_1_2_3_5_7", na=False)].copy()
df_we = df_we[df_we["model.name"].isin(["hgb_numonly", "xgb"])]
df_we["logloss"] = pd.to_numeric(df_we.get("final.logloss"), errors="coerce")
df_we = df_we.dropna(subset=["logloss"]).sort_values(["model.name", "logloss"])
df_we[["model.name", "params.classification.window_sizes", "logloss"]].to_html(index=False)
| model.name | params.classification.window_sizes | logloss |
|---|---|---|
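Finding:

| Windows | HGB logloss | XGB logloss | Δ logloss (HGB) |
|---|---|---|---|
| [1,2,3,5,10] | 1.0136 | 1.0146 | 0 (Study 1 baseline) |
| [1,2,3,5,7,10] | 1.0136 | 1.0151 | +0.000 |
| [1,2,3,5,7,10,12] | 1.0112 | 1.0126 | −0.0024 |
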
Adding window 7 gives no improvement (HGB unchanged, XGB +0.0005); adding window 12 on top lowers logloss by 0.0024 on HGB and by 0.0020 on XGB relative to the Study 1 baseline.
No plateau yet — window 20 was not pursued (see Outcomes).
Final decision: window_sizes = [1, 2, 3, 5, 7, 10, 12].
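The deltas quoted in this study are plain differences against the [1,2,3,5,10] reference row. A small sketch that reproduces them from the finding table (the values are copied from the table; the variable names are illustrative):

```python
# Logloss of the Study 1 reference window set [1,2,3,5,10].
reference = {"hgb_numonly": 1.0136, "xgb": 1.0146}

# Logloss of the extended window sets, from the finding table.
candidates = {
    "[1,2,3,5,7,10]": {"hgb_numonly": 1.0136, "xgb": 1.0151},
    "[1,2,3,5,7,10,12]": {"hgb_numonly": 1.0112, "xgb": 1.0126},
}

# Negative delta = improvement over the reference.
deltas = {
    windows: {m: round(loss - reference[m], 4) for m, loss in scores.items()}
    for windows, scores in candidates.items()
}

for windows, per_model in deltas.items():
    print(windows, per_model)
```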
Summary — Selected Configuration for Tuning & Final Training
| Parameter | Study | Before | Selected | Δ logloss |
|---|---|---|---|---|
| window_sizes | Window (v1.03) + Window Extension (v1.05) | [1,2,3,5,10] | [1,2,3,5,7,10,12] | −0.0024 (HGB) |
| class_weight | CW | null | {0:1.0, 1:1.25, 2:1.0} | −0.0049 (HGB) |
| side | Side | ['home','diff','away'] | ['home','diff','away'] | 0 (already optimal) |
| include_h2h | Ablation | true | false | 0 (zero contribution) |
| include_rest_days | Ablation | true | false | 0 (zero contribution) |
| include_elo | Ablation | true | true | critical — keep |

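Combined estimated improvement over the current baseline (HGB, no class weights, window_sizes = [1,2,3,5,10]): 1.0136 → ~1.0063 (−0.0073), i.e. the sum of the window-extension gain (−0.0024) and the class-weight gain (−0.0049). The estimate assumes the two gains are additive.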
Note: class_weight affects training only, not feature engineering. It should be applied at the final_train stage.
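One common way to apply a class-weight mapping at training time is to expand it into one weight per training row and pass the result as a sample_weight argument to the estimator's fit call, which scikit-learn and XGBoost classifiers generally accept. The sketch below shows only the expansion step; whether the final_train stage does it this way is an assumption, and the helper name is hypothetical.

```python
def sample_weights(y, class_weight, default=1.0):
    """Expand a {class: weight} mapping into one weight per training row."""
    return [class_weight.get(label, default) for label in y]


# Selected configuration from the summary table: only draws (class 1) are up-weighted.
class_weight = {0: 1.0, 1: 1.25, 2: 1.0}

# Hypothetical training labels: 0 = home win, 1 = draw, 2 = away win.
y_train = [0, 1, 2, 1, 0]
weights = sample_weights(y_train, class_weight)
print(weights)
```

Because the weights only rescale each row's contribution to the training loss, the feature matrix is untouched, which is why the setting can be deferred to the final training step.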
Outcomes
✅ Full-scale production run completed with updated features_selected (see report 04 — holdout analysis).
✅ Hyperparameter tuning completed for HGB and XGB (see report 05 — model analysis).
✅ SHAP analysis conducted — rolling stats are indeed redundant given ELO coverage (see report 05 — feature importance section).
⏸ window_sizes=[..., 20] not pursued — plateau hypothesis held low priority after ablation results.