Feature Engineering & Offline/Online Parity
Purpose
Document the implemented feature families, how leakage is prevented at the feature level, how feature logic is shared between offline training and online serving, and what feature types are excluded and why.
Design constraints
Every feature in this system must satisfy all of the following before it can be used in training:
- Pre-match only: computable from data available before kick-off.
- Point-in-time correct: no information from the current match (or any future match) enters the feature.
- Deterministic: same input data + same params.yaml → same feature values.
- Parity-safe: the same logic path is used at training time and at inference time.
A feature that violates any of these constraints is excluded. This is not a best-practice recommendation — it is a hard requirement enforced by tests.
Implemented feature families
1. Rolling match statistics
Source: src/features/stats_matches.py — build_team_match_table + add_rolling_features
Each match produces per-team rolling aggregates of recent performance. The pipeline first reshapes match-level data into a long team-match table, then computes rolling windows per team.
Metrics aggregated:
| Metric | Meaning |
|---|---|
| win | Win flag (1/0) |
| draw | Draw flag (1/0) |
| loss | Loss flag (1/0) |
| goals_for | Goals scored |
| goals_against | Goals conceded |
Window sizes (configurable via params.yaml → features.window_sizes): [1, 2, 3, 5, 7, 10, 12]
Stats are computed across 5 rolling scopes, each capturing a different slice of history:
| Scope | Description |
|---|---|
| all | All historical matches, cross-tournament and cross-season |
| season | Current season only; captures in-season form changes |
| tournament | Current tournament (league/cup); removes cross-competition noise |
| ha | Home/Away split; stats computed separately for home and away contexts |
| h2h | Head-to-Head; rolling stats from prior meetings between the same two teams |
For each metric × window × scope, three columns are produced:
{side}_{scope}_{metric}_mean_w{window} where side ∈ {home, away, diff}
Example: home_all_win_mean_w3, diff_tournament_goals_for_mean_w5
A coverage column {side}_{scope}_coverage_w{window} tracks how many matches contributed to each
rolling value — important for teams with short history.
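As a rough illustration of the naming scheme, the sketch below enumerates the column names implied by the tables above. It assumes the full cross product of sides, scopes, metrics, and windows is produced; the helper name `expected_rolling_columns` is hypothetical and not part of the codebase.

```python
from itertools import product

# Values copied from the tables above; the real pipeline reads windows from params.yaml.
SIDES = ["home", "away", "diff"]
SCOPES = ["all", "season", "tournament", "ha", "h2h"]
METRICS = ["win", "draw", "loss", "goals_for", "goals_against"]
WINDOWS = [1, 2, 3, 5, 7, 10, 12]


def expected_rolling_columns():
    """Return (mean_columns, coverage_columns) implied by the naming scheme."""
    mean_cols = [
        f"{side}_{scope}_{metric}_mean_w{w}"
        for side, scope, metric, w in product(SIDES, SCOPES, METRICS, WINDOWS)
    ]
    coverage_cols = [
        f"{side}_{scope}_coverage_w{w}"
        for side, scope, w in product(SIDES, SCOPES, WINDOWS)
    ]
    return mean_cols, coverage_cols


mean_cols, coverage_cols = expected_rolling_columns()
print(len(mean_cols), len(coverage_cols))
```

Under that assumption the scheme yields 3 × 5 × 5 × 7 mean columns and 3 × 5 × 7 coverage columns, which is why the column list is driven by metadata rather than written by hand.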
Leakage guard: shift(1) is applied to the team-match series before the rolling window.
Match N's feature uses only matches N−1, N−2, … — never match N itself.
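A minimal pandas sketch of the shift-then-roll pattern follows. It is not the body of add_rolling_features; the toy table, the `team` column name, and the helper `rolling_mean_pre_match` are assumptions made for illustration.

```python
import pandas as pd

# Toy long team-match table; only `win` and `startTimeUtc` are names from this document.
team_matches = pd.DataFrame({
    "team": ["A"] * 5,
    "startTimeUtc": pd.to_datetime(
        ["2024-01-01", "2024-01-08", "2024-01-15", "2024-01-22", "2024-01-29"]
    ),
    "win": [1, 0, 1, 1, 0],
})


def rolling_mean_pre_match(df, metric, window):
    """Rolling mean of `metric` per team, using only matches before the current row."""
    df = df.sort_values(["team", "startTimeUtc"])
    return df.groupby("team")[metric].transform(
        # shift(1) drops the current match from its own window
        lambda s: s.shift(1).rolling(window, min_periods=1).mean()
    )


team_matches["win_mean_w3"] = rolling_mean_pre_match(team_matches, "win", 3)
print(team_matches)
```

The first match of a team has no history, so its value is NaN. The second match sees only the first result, and from the fourth match onward the window holds the three preceding results.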
All three sides (home, diff, away) are included in training
(classification.side: [home, diff, away]). The diff features capture the relative
strength signal; home/away features retain the absolute context.
2. ELO ratings
Source: src/features/elo.py — compute_elo_ratings
ELO ratings capture relative team strength within a tournament. Ratings are:
- Computed per tournamentId — each competition maintains independent state.
- Updated after each match; the value attached to a row is the pre-match rating.
- Scoped to the team's history in that tournament only (no cross-competition bleed).
Columns produced per match:
| Column | Meaning |
|---|---|
| home_elo_pre | Home team ELO before this match |
| away_elo_pre | Away team ELO before this match |
| diff_elo_pre | home_elo_pre − away_elo_pre |
ELO configuration (from params.yaml → features.elo):
- k_factor: 32.0 — update step size
- initial_rating: 1500.0 — rating for teams with no prior history
- home_advantage: 50.0 — additive bonus in expected-score calculation
Teams with no history in a tournament receive the initial_rating. The home advantage
factor is applied in the expected-score formula, not as a feature column.
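The update rule can be sketched in plain Python as below. This is an illustration under stated assumptions, not the code in compute_elo_ratings: it assumes the conventional logistic expected-score curve with a 400-point scale, a draw scored as 0.5, and input that is already in chronological order. The function names and the `home`, `away`, and `home_result` keys are hypothetical.

```python
def expected_home_score(home_elo, away_elo, home_advantage=50.0):
    """Expected score of the home side; the 400-point scale is an assumption."""
    return 1.0 / (1.0 + 10 ** ((away_elo - (home_elo + home_advantage)) / 400.0))


def update_elo(home_elo, away_elo, home_result, k_factor=32.0, home_advantage=50.0):
    """Return post-match (home, away) ratings; home_result is 1, 0.5, or 0."""
    expected = expected_home_score(home_elo, away_elo, home_advantage)
    delta = k_factor * (home_result - expected)
    return home_elo + delta, away_elo - delta


def compute_pre_match_elo(matches, initial_rating=1500.0):
    """Attach pre-match ratings to each match, keeping state per (tournamentId, team)."""
    ratings = {}
    rows = []
    for m in matches:  # assumed to be sorted by kick-off time
        home_key = (m["tournamentId"], m["home"])
        away_key = (m["tournamentId"], m["away"])
        h = ratings.get(home_key, initial_rating)
        a = ratings.get(away_key, initial_rating)
        # Record ratings before the update, so the row never sees its own result.
        rows.append({**m, "home_elo_pre": h, "away_elo_pre": a, "diff_elo_pre": h - a})
        ratings[home_key], ratings[away_key] = update_elo(h, a, m["home_result"])
    return rows


matches = [
    {"tournamentId": 1, "home": "A", "away": "B", "home_result": 1.0},
    {"tournamentId": 1, "home": "B", "away": "A", "home_result": 0.5},
    {"tournamentId": 2, "home": "A", "away": "B", "home_result": 0.0},
]
rows = compute_pre_match_elo(matches)
```

In this toy run the third match belongs to a different tournament, so both teams start again from the initial rating there, while the second match carries the ratings updated by the first.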
3. Rest days
Source: src/features/stats_matches.py — add_rest_days
Days elapsed since each team's previous match, computed from startTimeUtc.
Captures fixture congestion and recovery time.
Columns produced: home_rest_days, away_rest_days, diff_rest_days
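One way to derive these columns with pandas is sketched below. It is not the body of add_rest_days: the `matchId`, `home`, and `away` column names are assumptions, and so are the choices to report fractional days and to leave a team's first match as NaN.

```python
import pandas as pd

matches = pd.DataFrame({
    "matchId": [1, 2, 3],
    "home": ["A", "C", "B"],
    "away": ["B", "A", "A"],
    "startTimeUtc": pd.to_datetime(
        ["2024-03-01 18:00", "2024-03-04 18:00", "2024-03-10 18:00"], utc=True
    ),
})

# Long table: one row per (match, team), ordered by team and kick-off time.
team_rows = pd.concat(
    [
        matches[["matchId", "startTimeUtc"]].assign(team=matches["home"], side="home"),
        matches[["matchId", "startTimeUtc"]].assign(team=matches["away"], side="away"),
    ],
    ignore_index=True,
).sort_values(["team", "startTimeUtc"])

# Gap to the same team's previous match, in days; NaN when there is none.
team_rows["rest_days"] = (
    team_rows.groupby("team")["startTimeUtc"].diff().dt.total_seconds() / 86400.0
)

# Back to one row per match with home/away/diff columns.
wide = team_rows.pivot(index="matchId", columns="side", values="rest_days")
out = matches.set_index("matchId").assign(
    home_rest_days=wide["home"],
    away_rest_days=wide["away"],
)
out["diff_rest_days"] = out["home_rest_days"] - out["away_rest_days"]
print(out)
```

Because the gap is measured backward from each row to the previous kick-off, the value for a match depends only on earlier fixtures.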
4. Head-to-head (H2H) statistics
H2H is the h2h scope within the rolling stats pipeline (see scope table above).
It provides rolling historical statistics between the two specific teams playing each other,
regardless of venue, and uses the same shift(1) + rolling approach as general stats.
Columns follow the standard naming: {side}_h2h_{metric}_mean_w{window}.
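The "regardless of venue" part can be illustrated with a venue-independent pair key. The plain-Python sketch below is an assumption about one possible approach, not the pipeline's implementation; the function name and the `home`, `away`, and `winner` keys are hypothetical.

```python
from collections import defaultdict


def h2h_home_win_rate(matches, window):
    """For each match, the home team's win rate over its last `window`
    prior meetings with the same opponent (None if they never met)."""
    history = defaultdict(list)  # pair key -> winners of earlier meetings
    out = []
    for m in matches:  # assumed to be sorted by kick-off time
        # Sorting the two names makes A-vs-B and B-vs-A share one history.
        key = tuple(sorted((m["home"], m["away"])))
        prior = history[key][-window:]
        if prior:
            out.append(sum(w == m["home"] for w in prior) / len(prior))
        else:
            out.append(None)
        # Append only after the feature is computed, mirroring shift(1).
        history[key].append(m["winner"])
    return out


matches = [
    {"home": "A", "away": "B", "winner": "A"},
    {"home": "B", "away": "A", "winner": "A"},
    {"home": "A", "away": "C", "winner": "C"},
    {"home": "A", "away": "B", "winner": None},  # draw
]
rates = h2h_home_win_rate(matches, window=3)
```

In the second match B is at home but lost the only earlier meeting, so its rate is 0.0 even though that meeting was played at A's ground.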
5. Categorical context
Source: params.yaml → classification.cat_cols
Currently: sex (men's vs. women's competition). Passed as a categorical column to the model.
Feature sides used in training
All three sides (home, diff, away) are included (classification.side: [home, diff, away]).
diff = home − away captures relative strength; home/away retain absolute context.
The final feature set is determined by features_meta.parquet.
Excluded feature types
| Feature type | Reason for exclusion |
|---|---|
| In-match events (goals scored, cards) | Post-kickoff data; strict pre-match cutoff violated |
| Player-level stats | Not available in current data source; planned future improvement |
| Bookmaker odds as input features | Clean separation of prediction from market data |
| Weather / pitch conditions | Not available in current source |
| Live standings / table position at match time | Requires careful point-in-time join; not yet implemented safely |
Offline/online parity
At training time: features are computed by the DVC feature_engineering stage and stored
in data/features/features.parquet. These are the features the model is trained on.
At inference time (batch): the DVC batch_inference stage runs the same feature code
(src/features/stats_matches.py, src/features/elo.py) on upcoming matches to produce
data/predictions/match_features.parquet. The serving layer reads from this artifact.
Parity is maintained because:
- The same source modules are used in both paths.
- No ad-hoc transformations are applied at inference.
- Feature column names and dtypes are recorded in features_meta.parquet and validated.
If parity cannot be guaranteed for a feature, that feature is excluded from the model. This is the governing rule for all serving-path decisions.
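A schema comparison of the kind described above might look like the following sketch. It assumes the recorded metadata can be reduced to a mapping from column name to dtype string; the function name `schema_mismatches` and the example columns are hypothetical.

```python
def schema_mismatches(expected, actual):
    """Compare two {column_name: dtype_string} mappings and list every difference."""
    problems = []
    for name, dtype in expected.items():
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name] != dtype:
            problems.append(
                f"dtype mismatch for {name}: expected {dtype}, got {actual[name]}"
            )
    for name in actual:
        if name not in expected:
            problems.append(f"unexpected column: {name}")
    return problems


# Illustrative schemas only; real values would come from the recorded metadata.
recorded = {"home_elo_pre": "float64", "home_rest_days": "float64"}
training = {"home_elo_pre": "float64", "home_rest_days": "float64"}
inference = {
    "home_elo_pre": "float64",
    "home_rest_days": "int64",
    "bookmaker_odds": "float64",
}

print(schema_mismatches(recorded, training))
print(schema_mismatches(recorded, inference))
```

For a pandas frame `df`, the `actual` mapping could be built with `{c: str(t) for c, t in df.dtypes.items()}`. An empty result means the frame matches the recorded contract.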
Feature metadata
data/features/features_meta.parquet records each feature's name, type (numeric/categorical),
and origin family. The training pipeline reads features_meta.parquet to determine X_cols,
num_cols, and cat_cols — no hardcoded column lists in model code.
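Deriving the column lists from metadata could look like the sketch below. The record keys `name`, `type`, and `family` are assumed from the description above and may differ from the actual parquet schema; the records are held in memory here rather than read from disk, and `split_feature_columns` is a hypothetical helper.

```python
# Illustrative metadata records; the real ones live in features_meta.parquet.
meta_records = [
    {"name": "home_all_win_mean_w3", "type": "numeric", "family": "rolling"},
    {"name": "diff_elo_pre", "type": "numeric", "family": "elo"},
    {"name": "home_rest_days", "type": "numeric", "family": "rest_days"},
    {"name": "sex", "type": "categorical", "family": "context"},
]


def split_feature_columns(records):
    """Return (X_cols, num_cols, cat_cols) derived purely from metadata records."""
    unknown = [r["name"] for r in records if r["type"] not in ("numeric", "categorical")]
    if unknown:
        raise ValueError(f"features with unknown type: {unknown}")
    num_cols = [r["name"] for r in records if r["type"] == "numeric"]
    cat_cols = [r["name"] for r in records if r["type"] == "categorical"]
    return num_cols + cat_cols, num_cols, cat_cols


X_cols, num_cols, cat_cols = split_feature_columns(meta_records)
print(X_cols)
```

Failing loudly on an unrecognised type keeps a malformed metadata row from silently dropping a feature out of training.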
Implementation status
| Feature family | Status |
|---|---|
| Rolling stats (win/draw/loss/goals) | ✅ Implemented |
| ELO ratings per tournament | ✅ Implemented |
| Rest days | ✅ Implemented |
| H2H rolling statistics | ✅ Implemented |
| Categorical context (sex) | ✅ Implemented |
| Feature metadata contract | ✅ Implemented |
| Player-level features | 📋 Planned |
| Live standings join | 📋 Planned |
Related
- Validation Strategy — how features are tested for leakage
- Training Pipeline — how features flow into training
- Model Contract — how feature names are encoded in model signature
- Data: Schemas
- Serving — how batch features are used at inference