Features
Feature Selector
Shared feature-column selector for all model-training and serving stages.
Every stage that trains, evaluates, or serves predictions must use
select_model_features to guarantee a consistent feature contract between
classification, tuning, final training, ablation, and batch inference.
Feature columns are derived from features_meta.parquet, which is produced
by the feature_engineering DVC stage and is the single source of truth for
available features.
select_model_features(features_meta, side, window_sizes, include_elo=True, include_rest_days=True, include_h2h=False)
Return numeric feature column names matching the given profile.
Parameters
features_meta:
features_meta.parquet loaded as a DataFrame.
Expected columns: name, side, scope, metric, window.
side:
Which perspective(s) to select: "home", "away", or "diff".
Accepts a list of strings, e.g. ["diff"] or
["home", "away", "diff"].
window_sizes:
Rolling windows to include for windowed stat features (e.g. [1, 3]).
Pass an empty list to exclude windowed stats (e.g. for elo_only).
include_elo:
Include ELO pre-match ratings (metric == "elo_pre").
include_rest_days:
Include rest-days features (metric == "rest_days").
Currently not present in features_meta; kept for forward compatibility.
include_h2h:
Include head-to-head features (scope == "h2h").
Currently not present in features_meta; kept for forward compatibility.
Returns
list[str]
Ordered list of numeric feature column names.
Source code in src/features/select.py
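The selection rules described above can be sketched as a handful of boolean masks over the metadata table. This is a hypothetical re-implementation for illustration, not the code in src/features/select.py: the function name select_features_sketch, the toy metadata rows, and the rule that anything which is not ELO, rest-days, or h2h counts as a windowed stat are all assumptions.

```python
import pandas as pd


def select_features_sketch(
    features_meta: pd.DataFrame,
    side: list[str],
    window_sizes: list[int],
    include_elo: bool = True,
    include_rest_days: bool = True,
    include_h2h: bool = False,
) -> list[str]:
    """Hypothetical sketch of the documented selection rules."""
    # Keep only the requested perspectives.
    meta = features_meta[features_meta["side"].isin(side)]

    is_elo = meta["metric"] == "elo_pre"
    is_rest = meta["metric"] == "rest_days"
    is_h2h = meta["scope"] == "h2h"
    # Assumption: every remaining row is a windowed stat, filtered by window size.
    keep = ~(is_elo | is_rest | is_h2h) & meta["window"].isin(window_sizes)

    if include_elo:
        keep = keep | is_elo
    if include_rest_days:
        keep = keep | is_rest
    if include_h2h:
        keep = keep | is_h2h
    return meta.loc[keep, "name"].tolist()


# Toy metadata; the rows (including the scope value on the ELO row) are made up.
meta = pd.DataFrame(
    {
        "name": [
            "diff_all_win_mean_w3",
            "diff_all_win_mean_w5",
            "home_all_win_mean_w3",
            "diff_elo_pre",
        ],
        "side": ["diff", "diff", "home", "diff"],
        "scope": ["all", "all", "all", "all"],
        "metric": ["win", "win", "win", "elo_pre"],
        "window": [3, 5, 3, None],
    }
)

selected = select_features_sketch(meta, side=["diff"], window_sizes=[1, 3])
elo_only = select_features_sketch(meta, side=["diff"], window_sizes=[])
```

With this toy table, selected holds the w3 diff feature plus the ELO column, and elo_only holds only the ELO column, which mirrors the "pass an empty list" behaviour described for window_sizes.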
ELO Ratings
ELO rating features for football match prediction.
ELO ratings are computed per tournament and propagated forward in time. The value attached to each match row is the pre-match rating, taken before the result is known, so the feature carries no information about the outcome it is used to predict.
Implementation notes
- Ratings are maintained per (tournament, team) pair so that clubs that play in multiple competitions carry independent ratings.
- Home advantage is modelled as a fixed additive bonus to the home team's expected score calculation, not as a separate rating.
- The algorithm is intentionally sequential (Python loop) because ELO has a strict temporal dependency. Vectorisation is not applicable here.
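The home-advantage note above corresponds to the usual Elo update, written out here as a sketch. The 400-point scale and the symmetric update are the textbook formulation and are an assumption about this module, not something this page states. Here $R$ is a pre-match rating, $H$ the home_advantage bonus, $K$ the k_factor, and $S_{\text{home}} \in \{1, 0.5, 0\}$ the home result.

$$
E_{\text{home}} = \frac{1}{1 + 10^{(R_{\text{away}} - (R_{\text{home}} + H))/400}}, \qquad
R'_{\text{home}} = R_{\text{home}} + K\,(S_{\text{home}} - E_{\text{home}}), \qquad
R'_{\text{away}} = R_{\text{away}} - K\,(S_{\text{home}} - E_{\text{home}})
$$

Because $H$ enters only the expected score, the stored ratings stay comparable between home and away fixtures.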
References
Elo, A. E. (1978). The Rating of Chessplayers, Past and Present. Arco Publishing.
Football application overview: https://www.eloratings.net/about
compute_elo_ratings(df, k_factor=_DEFAULT_K, initial_rating=_DEFAULT_INITIAL, home_advantage=_DEFAULT_HOME_ADV, group_col='tournamentId')
Compute pre-match ELO ratings and attach them to each match row.
Parameters
df:
Match-level DataFrame. Required columns: id, startTimeUtc,
homeTeamId, awayTeamId, outcome_1x2, and group_col.
Rows corresponding to future matches (outcome_1x2 is NaN)
receive pre-match ELO values but do not trigger a rating update.
k_factor:
ELO update step size. Larger K means faster adaptation to recent
results. Typical range: 20–40.
initial_rating:
Starting ELO assigned to teams with no prior history in the group.
home_advantage:
Rating-point bonus applied to the home team's expected score.
Set to 0 to disable.
group_col:
Column used to scope ELO state. Default tournamentId keeps
league-specific rating histories independent.
Returns
pd.DataFrame
A copy of df with three new columns:
- home_elo_pre – home team ELO before this match (float32)
- away_elo_pre – away team ELO before this match (float32)
- diff_elo_pre – home_elo_pre − away_elo_pre (float32)
Notes
Because each value is a pre-match snapshot, the features contain no information about the result being predicted, so the ELO columns cannot leak the target.
Source code in src/features/elo.py
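The snapshot-then-update order described above can be sketched in plain Python. This is a minimal illustration, not the code in src/features/elo.py: the function name elo_pre_match, the default constants, the "1"/"X"/"2" outcome encoding, and the use of dict rows instead of a DataFrame are all assumptions.

```python
from collections import defaultdict

# Assumed illustrative defaults; the real _DEFAULT_* values are not shown on this page.
K, INITIAL, HOME_ADV = 20.0, 1500.0, 100.0

# Assumed outcome encoding: "1" home win, "X" draw, "2" away win, None if unplayed.
HOME_SCORE = {"1": 1.0, "X": 0.5, "2": 0.0}


def elo_pre_match(matches, k=K, initial=INITIAL, home_adv=HOME_ADV,
                  group_col="tournamentId"):
    """Return copies of the match dicts with pre-match ELO values attached."""
    ratings = defaultdict(lambda: initial)  # keyed by (group, team)
    out = []
    # Process strictly in time order; each update depends on all earlier ones.
    for m in sorted(matches, key=lambda r: (r["startTimeUtc"], r["id"])):
        home_key = (m[group_col], m["homeTeamId"])
        away_key = (m[group_col], m["awayTeamId"])
        r_home, r_away = ratings[home_key], ratings[away_key]

        # Snapshot before looking at the result: this is what a model may see.
        out.append({**m, "home_elo_pre": r_home, "away_elo_pre": r_away,
                    "diff_elo_pre": r_home - r_away})

        score = HOME_SCORE.get(m["outcome_1x2"])
        if score is None:
            continue  # future match: keep the snapshot, skip the update

        # Home advantage shifts the expected score only, not the stored rating.
        expected = 1.0 / (1.0 + 10 ** ((r_away - (r_home + home_adv)) / 400.0))
        delta = k * (score - expected)
        ratings[home_key] = r_home + delta
        ratings[away_key] = r_away - delta
    return out


matches = [
    {"id": 1, "startTimeUtc": "2024-01-01", "tournamentId": 7,
     "homeTeamId": "A", "awayTeamId": "B", "outcome_1x2": "1"},
    {"id": 2, "startTimeUtc": "2024-01-08", "tournamentId": 7,
     "homeTeamId": "B", "awayTeamId": "A", "outcome_1x2": None},
]
rows = elo_pre_match(matches, home_adv=0.0)
```

In this toy run the first match is scored at the initial ratings, and the unplayed second match still receives pre-match values that reflect the first result.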
elo_feature_meta(side='diff')
Return feature metadata rows compatible with the rolling-features schema.
Parameters
side:
Which perspective to surface: "home", "away", or "diff".
Passing "diff" returns only the home-minus-away column, which is
what the default classification pipeline uses.
Returns
list of dicts with keys: name, side, scope, metric, agg, window.
Source code in src/features/elo.py
Rolling Match Statistics
build_team_match_table(df)
Reshape a match-level DataFrame into a team-match-level table.
Each match in df produces two rows — one for the home team and
one for the away team — with is_home, goals_for,
goals_against, and win/draw/loss outcome columns attached.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Match-level DataFrame with homeTeamId, awayTeamId, homeScore, awayScore, outcome_1x2, and optional context columns. | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame sorted by (teamId, startTimeUtc, id) with one row per team-match combination. |
Source code in src/features/stats_matches.py
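The reshape described above amounts to stacking a home view and an away view of the same matches. The following is a hedged sketch, not the code in src/features/stats_matches.py: the function name team_match_sketch, the column names win, draw, and loss, and deriving outcomes from the scores rather than from outcome_1x2 (whose encoding is not given here) are assumptions.

```python
import pandas as pd


def team_match_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: two rows per match, one from each team's view."""
    shared = ["id", "startTimeUtc"]
    home = df[shared].assign(
        teamId=df["homeTeamId"], is_home=1,
        goals_for=df["homeScore"], goals_against=df["awayScore"],
    )
    away = df[shared].assign(
        teamId=df["awayTeamId"], is_home=0,
        goals_for=df["awayScore"], goals_against=df["homeScore"],
    )
    out = pd.concat([home, away], ignore_index=True)
    # Outcome flags from each team's own perspective (derived from scores here).
    out["win"] = (out["goals_for"] > out["goals_against"]).astype(int)
    out["draw"] = (out["goals_for"] == out["goals_against"]).astype(int)
    out["loss"] = (out["goals_for"] < out["goals_against"]).astype(int)
    return out.sort_values(["teamId", "startTimeUtc", "id"]).reset_index(drop=True)


matches = pd.DataFrame(
    {
        "id": [1, 2],
        "startTimeUtc": ["2024-01-01", "2024-01-08"],
        "homeTeamId": [10, 20],
        "awayTeamId": [20, 10],
        "homeScore": [2, 0],
        "awayScore": [1, 0],
    }
)
table = team_match_sketch(matches)
```

Sorting by team and then by time puts each team's matches in chronological order, which is the layout a per-team rolling window needs.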
add_rolling_features(df_team_match, group_keys, windows, stats_cols, prefix)
Add lagged rolling-mean features to a team-match DataFrame.
For each window size, computes the rolling sum of previous matches (shifted by one to prevent leakage) and divides by the rolling count to produce per-column means. A coverage column (fraction of window filled) is also appended.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df_team_match | DataFrame | Team-match-level DataFrame (output of …). | required |
| group_keys | list[str] | Column(s) to group by before rolling (e.g. …). | required |
| windows | list[int] | List of look-back window sizes (e.g. …). | required |
| stats_cols | list[str] | Columns to aggregate with rolling mean. | required |
| prefix | str | Column-name prefix for all generated features. | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | Copy of … |
Source code in src/features/stats_matches.py
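The shift-then-roll pattern described above can be sketched with a grouped transform. This is an illustration under stated assumptions, not the project's implementation: the function name rolling_sketch, the generated column names, the coverage definition (previous matches available divided by window size), and the requirement that rows are already in chronological order within each group are all assumptions.

```python
import pandas as pd


def rolling_sketch(df_team_match, group_keys, windows, stats_cols, prefix):
    """Hypothetical sketch: lagged rolling means plus one coverage column per window."""
    out = df_team_match.copy()
    grouped = df_team_match.groupby(group_keys, sort=False)
    for w in windows:
        for col in stats_cols:
            # shift(1) drops the current match, so only earlier rows feed the window.
            total = grouped[col].transform(
                lambda s, w=w: s.shift(1).rolling(w, min_periods=1).sum()
            )
            # Number of non-missing previous values inside the same window.
            count = grouped[col].transform(
                lambda s, w=w: s.shift(1).notna().astype(float)
                .rolling(w, min_periods=1).sum()
            )
            out[f"{prefix}_{col}_mean_w{w}"] = total / count
        # Fraction of the window that earlier matches can fill.
        out[f"{prefix}_coverage_w{w}"] = grouped.cumcount().clip(upper=w) / w
    return out


team_rows = pd.DataFrame(
    {"teamId": [10, 10, 10, 20, 20], "goals_for": [2, 0, 1, 3, 1]}
)
feats = rolling_sketch(team_rows, ["teamId"], [2], ["goals_for"], "all")
```

The first row of each team has no history, so its mean is missing and its coverage is 0; a downstream step can use the coverage column to down-weight or drop thinly supported rows.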
to_match_level(df_team_match, leaky_cols)
Pivot a team-match DataFrame back to match-level with home/away/diff columns.
Numeric columns (excluding leaky_cols) are split into
home_{col} / away_{col} views and joined on id.
Difference columns diff_{col} = home_{col} - away_{col} are
appended for each numeric feature.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df_team_match | DataFrame | Team-match-level DataFrame with an … | required |
| leaky_cols | set | Set of column names to exclude from the pivot (e.g. outcome or raw score columns that must not appear as features). | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame indexed by match … prefixed numeric feature columns. |
Source code in src/features/stats_matches.py
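The pivot described above can be sketched as two prefixed views joined on the match id. This is a hypothetical illustration, not the project's code: the function name to_match_level_sketch, the reliance on an is_home flag to split the rows, and the exclusion of the identifier columns id, teamId, and is_home from the feature set are assumptions.

```python
import pandas as pd


def to_match_level_sketch(df_team_match: pd.DataFrame, leaky_cols: set) -> pd.DataFrame:
    """Hypothetical sketch: home_/away_/diff_ columns per numeric feature, indexed by id."""
    identifiers = {"id", "teamId", "is_home"}  # assumed non-feature columns
    numeric = df_team_match.select_dtypes("number").columns
    feats = [c for c in numeric if c not in leaky_cols and c not in identifiers]

    is_home = df_team_match["is_home"] == 1
    home = df_team_match.loc[is_home, ["id", *feats]].set_index("id").add_prefix("home_")
    away = df_team_match.loc[~is_home, ["id", *feats]].set_index("id").add_prefix("away_")

    out = home.join(away, how="inner")
    for c in feats:
        out[f"diff_{c}"] = out[f"home_{c}"] - out[f"away_{c}"]
    return out


team_rows = pd.DataFrame(
    {
        "id": [1, 1, 2, 2],
        "teamId": [10, 20, 20, 10],
        "is_home": [1, 0, 1, 0],
        "goals_for": [2, 1, 0, 0],  # raw score: treated as leaky
        "all_win_mean_w3": [0.5, 0.25, 1.0, 0.0],
    }
)
wide = to_match_level_sketch(team_rows, leaky_cols={"goals_for"})
```

Passing the raw score column in leaky_cols keeps it out of the home_, away_, and diff_ views, so only pre-match features reach the model.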
parse_feature(col)
Parse a feature column name into its metadata components.
Expected format:
{side}_{scope}_{metric}_{agg}_w{window}
e.g. home_all_win_mean_w3 → side="home", scope="all",
metric="win", agg="mean", window=3.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| col | str | Feature column name following the project naming convention. | required |
Returns:
| Type | Description |
|---|---|
| dict | Dict with keys name, side, scope, metric, agg, window. |
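The naming convention above lends itself to a single regular expression. The sketch below is one possible parser, not the project's parse_feature: the function name parse_feature_sketch, the assumption that scope and agg contain no underscores while metric may, and raising ValueError on names that do not fit the pattern are all assumptions.

```python
import re

# {side}_{scope}_{metric}_{agg}_w{window}; metric may itself contain underscores.
_FEATURE_RE = re.compile(
    r"^(?P<side>home|away|diff)_(?P<scope>[^_]+)_(?P<metric>.+)"
    r"_(?P<agg>[^_]+)_w(?P<window>\d+)$"
)


def parse_feature_sketch(col: str) -> dict:
    """Hypothetical sketch: split a feature name into its metadata components."""
    match = _FEATURE_RE.match(col)
    if match is None:
        # Assumed behaviour; the real function may handle such names differently.
        raise ValueError(f"not a windowed feature name: {col!r}")
    parts = match.groupdict()
    return {"name": col, **parts, "window": int(parts["window"])}


parsed = parse_feature_sketch("home_all_win_mean_w3")
```

For the documented example, parsed carries side "home", scope "all", metric "win", agg "mean", and the integer window 3.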