Features¶

Feature Selector¶

Shared feature-column selector for all model-training and serving stages.

Every stage that trains, evaluates, or serves predictions must use select_model_features to guarantee a consistent feature contract between classification, tuning, final training, ablation, and batch inference.

Feature columns are derived from features_meta.parquet, which is produced by the feature_engineering DVC stage and is the single source of truth for available features.

`select_model_features(features_meta, side, window_sizes, include_elo=True, include_rest_days=True, include_h2h=False)` ¶

Return numeric feature column names matching the given profile.

Parameters¶

- `features_meta`: `features_meta.parquet` loaded as a DataFrame. Expected columns: `name`, `side`, `scope`, `metric`, `window`.
- `side`: Which perspective(s) to select: `"home"`, `"away"`, or `"diff"`. Accepts a list of strings, e.g. `["diff"]` or `["home", "away", "diff"]`.
- `window_sizes`: Rolling windows to include for windowed stat features (e.g. `[1, 3]`). Pass an empty list to exclude windowed stats (e.g. for `elo_only`).
- `include_elo`: Include ELO pre-match ratings (`metric == "elo_pre"`).
- `include_rest_days`: Include rest-days features (`metric == "rest_days"`). Currently not present in `features_meta`; kept for forward compatibility.
- `include_h2h`: Include head-to-head features (`scope == "h2h"`). Currently not present in `features_meta`; kept for forward compatibility.

Returns¶

list[str] Ordered list of numeric feature column names.

Source code in src/features/select.py

import operator
from functools import reduce

import pandas as pd


def select_model_features(
    features_meta: pd.DataFrame,
    side: list[str],
    window_sizes: list[int],
    include_elo: bool = True,
    include_rest_days: bool = True,
    include_h2h: bool = False,
) -> list[str]:
    """Return numeric feature column names matching the given profile.

    Parameters
    ----------
    features_meta:
        ``features_meta.parquet`` loaded as a DataFrame.
        Expected columns: ``name``, ``side``, ``scope``, ``metric``, ``window``.
    side:
        Which perspective(s) to select: ``"home"``, ``"away"``, or ``"diff"``.
        Accepts a list of strings, e.g. ``["diff"]`` or
        ``["home", "away", "diff"]``.
    window_sizes:
        Rolling windows to include for windowed stat features (e.g. ``[1, 3]``).
        Pass an empty list to exclude windowed stats (e.g. for ``elo_only``).
    include_elo:
        Include ELO pre-match ratings (``metric == "elo_pre"``).
    include_rest_days:
        Include rest-days features (``metric == "rest_days"``).
        Currently not present in ``features_meta``; kept for forward compatibility.
    include_h2h:
        Include head-to-head features (``scope == "h2h"``).
        Currently not present in ``features_meta``; kept for forward compatibility.

    Returns
    -------
    list[str]
        Ordered list of numeric feature column names.
    """
    side_mask: pd.Series = features_meta["side"].isin(side)
    window_mask: pd.Series = features_meta["window"].isin(window_sizes)

    extra_parts: list[pd.Series] = []
    if include_elo:
        extra_parts.append(features_meta["metric"].eq("elo_pre"))
    if include_rest_days:
        extra_parts.append(features_meta["metric"].eq("rest_days"))
    if include_h2h:
        extra_parts.append(features_meta["scope"].eq("h2h"))

    if extra_parts:
        extra_mask: pd.Series = reduce(operator.or_, extra_parts)
        combined = side_mask & (window_mask | extra_mask)
    else:
        combined = side_mask & window_mask

    return features_meta[combined]["name"].tolist()
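
A minimal usage sketch may help show how the masks combine. The function body below is a condensed copy of the source above so that the block runs on its own, and the metadata rows are hypothetical; real rows come from features_meta.parquet.

```python
import operator
from functools import reduce

import pandas as pd


def select_model_features(features_meta, side, window_sizes,
                          include_elo=True, include_rest_days=True,
                          include_h2h=False):
    """Condensed copy of the selector documented above."""
    side_mask = features_meta["side"].isin(side)
    window_mask = features_meta["window"].isin(window_sizes)
    extra_parts = []
    if include_elo:
        extra_parts.append(features_meta["metric"].eq("elo_pre"))
    if include_rest_days:
        extra_parts.append(features_meta["metric"].eq("rest_days"))
    if include_h2h:
        extra_parts.append(features_meta["scope"].eq("h2h"))
    if extra_parts:
        combined = side_mask & (window_mask | reduce(operator.or_, extra_parts))
    else:
        combined = side_mask & window_mask
    return features_meta[combined]["name"].tolist()


# Hypothetical metadata rows, for illustration only.
features_meta = pd.DataFrame(
    [
        ("diff_all_win_mean_w1", "diff", "all", "win", 1),
        ("diff_all_win_mean_w3", "diff", "all", "win", 3),
        ("diff_all_win_mean_w5", "diff", "all", "win", 5),
        ("home_all_win_mean_w3", "home", "all", "win", 3),
        ("diff_elo_pre", "diff", "tournament", "elo_pre", 0),
    ],
    columns=["name", "side", "scope", "metric", "window"],
)

# Windowed stats for windows 1 and 3, plus ELO, from the "diff" perspective.
with_stats = select_model_features(features_meta, side=["diff"], window_sizes=[1, 3])
print(with_stats)  # ['diff_all_win_mean_w1', 'diff_all_win_mean_w3', 'diff_elo_pre']

# An empty window list leaves only the non-windowed extras (here: ELO).
elo_only = select_model_features(features_meta, side=["diff"], window_sizes=[])
print(elo_only)  # ['diff_elo_pre']
```

The side filter applies to every row, while the window filter and the `include_*` flags are alternatives: a row is kept if its side matches and it either has a requested window or matches one of the enabled extras.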

ELO Ratings¶

ELO rating features for football match prediction.

ELO ratings are computed per tournament and propagated forward in time. The value attached to each match row is the pre-match rating (taken before the result is known), so the feature never includes the outcome of the match it describes.

Implementation notes¶

- Ratings are maintained per (tournament, team) pair, so clubs that play in multiple competitions carry an independent rating in each.
- Home advantage is modelled as a fixed bonus added to the home team's rating inside the expected-score calculation; it is never stored in the rating itself.
- The algorithm is intentionally sequential (a Python loop) because each ELO update depends on the ratings left by all earlier matches, so the loop cannot be vectorised.
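
The update rule in these notes can be sketched as follows. The helper `_expected_score` is not shown on this page, so the sketch assumes the standard Elo logistic curve on a 400-point scale and a home score of 1.0 / 0.5 / 0.0 for win / draw / loss. The names `expected_score` and `update_pair`, and the values `k_factor=20.0` and `home_advantage=50.0`, are hypothetical and are not the project defaults.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation for side A (assumed form of the hidden helper)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_pair(r_home: float, r_away: float, score_home: float,
                k_factor: float = 20.0, home_advantage: float = 50.0):
    """Return updated (home, away) ratings after one completed match."""
    # The bonus only shifts the expectation; the stored rating stays unbiased.
    exp_home = expected_score(r_home + home_advantage, r_away)
    delta = k_factor * (score_home - exp_home)
    # The away update is the mirror image, so the pair's rating sum is conserved.
    return r_home + delta, r_away - delta


new_home, new_away = update_pair(1500.0, 1500.0, score_home=1.0)
print(round(new_home, 2), round(new_away, 2))
```

Because the home side was already favoured by the bonus, a home win between equally rated teams moves the ratings by less than half of `k_factor`.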

References¶

- Elo, A. E. (1978). The Rating of Chessplayers, Past and Present. Arco Publishing.
- Football application overview: https://www.eloratings.net/about

`compute_elo_ratings(df, k_factor=_DEFAULT_K, initial_rating=_DEFAULT_INITIAL, home_advantage=_DEFAULT_HOME_ADV, group_col='tournamentId')` ¶

Compute pre-match ELO ratings and attach them to each match row.

Parameters¶

- `df`: Match-level DataFrame. Required columns: `id`, `startTimeUtc`, `homeTeamId`, `awayTeamId`, `outcome_1x2`, and `group_col`. Rows corresponding to future matches (`outcome_1x2` is NaN) receive pre-match ELO values but do not trigger a rating update.
- `k_factor`: ELO update step size. Larger K means faster adaptation to recent results. Typical range: 20–40.
- `initial_rating`: Starting ELO assigned to teams with no prior history in the group.
- `home_advantage`: Rating-point bonus applied to the home team's expected score. Set to 0 to disable.
- `group_col`: Column used to scope ELO state. Default `tournamentId` keeps league-specific rating histories independent.

Returns¶

pd.DataFrame A copy of df with three new columns:

- ``home_elo_pre``  – home team ELO before this match (float32)
- ``away_elo_pre``  – away team ELO before this match (float32)
- ``diff_elo_pre``  – ``home_elo_pre`` − ``away_elo_pre`` (float32)

Notes¶

Each row stores the ratings as they stood before the match, and the function processes rows in chronological order (sorted by startTimeUtc, then id), so the ELO features contain no information about the result being predicted.

Source code in src/features/elo.py

def compute_elo_ratings(
    df: pd.DataFrame,
    k_factor: float = _DEFAULT_K,
    initial_rating: float = _DEFAULT_INITIAL,
    home_advantage: float = _DEFAULT_HOME_ADV,
    group_col: str = "tournamentId",
) -> pd.DataFrame:
    """Compute pre-match ELO ratings and attach them to each match row.

    Parameters
    ----------
    df:
        Match-level DataFrame.  Required columns: ``id``, ``startTimeUtc``,
        ``homeTeamId``, ``awayTeamId``, ``outcome_1x2``, and ``group_col``.
        Rows corresponding to *future* matches (``outcome_1x2`` is NaN)
        receive pre-match ELO values but do not trigger a rating update.
    k_factor:
        ELO update step size.  Larger K means faster adaptation to recent
        results.  Typical range: 20–40.
    initial_rating:
        Starting ELO assigned to teams with no prior history in the group.
    home_advantage:
        Rating-point bonus applied to the home team's expected score.
        Set to 0 to disable.
    group_col:
        Column used to scope ELO state.  Default ``tournamentId`` keeps
        league-specific rating histories independent.

    Returns
    -------
    pd.DataFrame
        A copy of ``df`` with three new columns:

        - ``home_elo_pre``  – home team ELO before this match (float32)
        - ``away_elo_pre``  – away team ELO before this match (float32)
        - ``diff_elo_pre``  – ``home_elo_pre`` − ``away_elo_pre`` (float32)

    Notes
    -----
    The *pre-match* snapshot ensures the features contain no information
    about the result being predicted — use-after-fit leakage cannot occur.
    """
    required = {
        "id",
        "startTimeUtc",
        "homeTeamId",
        "awayTeamId",
        "outcome_1x2",
        group_col,
    }
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"compute_elo_ratings: missing columns {missing}")

    # Normalise: drop any named index so 'id' exists only as a column.
    # Parquet round-trips can leave 'id' as both index and column simultaneously.
    df = df.reset_index(drop=True).copy()
    if "id" not in df.columns:
        raise ValueError("compute_elo_ratings: 'id' column not found after index reset")

    df_sorted = df.sort_values(["startTimeUtc", "id"]).reset_index(drop=True)

    # ratings[group_value][team_id] = current_elo
    ratings: dict = defaultdict(lambda: defaultdict(lambda: initial_rating))

    home_elo_pre: list[float] = []
    away_elo_pre: list[float] = []

    for row in df_sorted.itertuples(index=False):
        group = getattr(row, group_col)
        home_id: int = row.homeTeamId
        away_id: int = row.awayTeamId

        # Snapshot PRE-match ratings — these become the feature values.
        r_home: float = ratings[group][home_id]
        r_away: float = ratings[group][away_id]
        home_elo_pre.append(r_home)
        away_elo_pre.append(r_away)

        # Update ratings only for completed matches (outcome is known).
        outcome = row.outcome_1x2
        if pd.isna(outcome):
            continue

        # Skip self-play rows: a team cannot face itself.
        if home_id == away_id:
            logger.warning("Skipping ELO update for self-play match id=%s", row.id)
            continue

        exp_home = _expected_score(r_home + home_advantage, r_away)
        score_home = _outcome_score_home(int(outcome))

        ratings[group][home_id] = r_home + k_factor * (score_home - exp_home)
        ratings[group][away_id] = r_away + k_factor * (
            (1.0 - score_home) - (1.0 - exp_home)
        )

    result = df_sorted.copy()
    result["home_elo_pre"] = np.array(home_elo_pre, dtype=np.float32)
    result["away_elo_pre"] = np.array(away_elo_pre, dtype=np.float32)
    result["diff_elo_pre"] = (result["home_elo_pre"] - result["away_elo_pre"]).astype(
        np.float32
    )

    logger.info(
        "ELO computed: %d matches, k_factor=%.1f, home_adv=%.1f, scope=%r",
        len(df_sorted),
        k_factor,
        home_advantage,
        group_col,
    )

    return result
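
The snapshot-before-update ordering can be illustrated with a stripped-down loop. This is a sketch rather than the function above: it omits home advantage, validation, and logging, uses hypothetical values for `K` and `INITIAL`, and takes a hypothetical `home_score` column (1.0 home win, 0.5 draw, missing for a fixture not yet played) in place of `outcome_1x2`.

```python
from collections import defaultdict

import pandas as pd

K, INITIAL = 20.0, 1500.0  # hypothetical values, not the project defaults


def expected(r_a: float, r_b: float) -> float:
    """Standard Elo expectation for side A (assumed form)."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


# Three matches in one tournament, already in chronological order.
matches = pd.DataFrame({
    "id": [1, 2, 3],
    "tournamentId": [10, 10, 10],
    "homeTeamId": [100, 200, 100],
    "awayTeamId": [200, 100, 200],
    "home_score": [1.0, 0.5, None],  # the last match has no result yet
})

# ratings[tournament][team] -> current rating, created lazily at INITIAL
ratings = defaultdict(lambda: defaultdict(lambda: INITIAL))
home_pre, away_pre = [], []

for row in matches.itertuples(index=False):
    group = ratings[row.tournamentId]
    r_home, r_away = group[row.homeTeamId], group[row.awayTeamId]
    # 1. Record the feature values first ...
    home_pre.append(r_home)
    away_pre.append(r_away)
    # 2. ... then update state, but only when a result exists.
    if pd.isna(row.home_score):
        continue
    delta = K * (row.home_score - expected(r_home, r_away))
    group[row.homeTeamId] = r_home + delta
    group[row.awayTeamId] = r_away - delta

matches["home_elo_pre"] = home_pre
matches["away_elo_pre"] = away_pre
print(matches[["id", "home_elo_pre", "away_elo_pre"]])
```

The pending third match still receives pre-match ratings, which is what allows the same code path to produce features for batch inference on upcoming fixtures.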

`elo_feature_meta(side='diff')` ¶

Return feature metadata rows compatible with the rolling-features schema.

Parameters¶

side: Which perspective to surface. Passing "diff" returns only the home-minus-away row, which is what the default classification pipeline uses. As implemented, any other value (including "home" or "away") returns all three rows: home, away, and diff.

Returns¶

list of dicts with keys: name, side, scope, metric, agg, window.

Source code in src/features/elo.py

def elo_feature_meta(side: str = "diff") -> list[dict]:
    """Return feature metadata rows compatible with the rolling-features schema.

    Parameters
    ----------
    side:
        Which perspective to surface: ``"home"``, ``"away"``, or ``"diff"``.
        Passing ``"diff"`` returns only the home-minus-away column, which is
        what the default classification pipeline uses.

    Returns
    -------
    list of dicts with keys: name, side, scope, metric, agg, window.
    """
    candidates = [
        {
            "name": "home_elo_pre",
            "side": "home",
            "scope": "tournament",
            "metric": "elo_pre",
            "agg": "elo",
            "window": 0,
        },
        {
            "name": "away_elo_pre",
            "side": "away",
            "scope": "tournament",
            "metric": "elo_pre",
            "agg": "elo",
            "window": 0,
        },
        {
            "name": "diff_elo_pre",
            "side": "diff",
            "scope": "tournament",
            "metric": "elo_pre",
            "agg": "elo",
            "window": 0,
        },
    ]
    if side == "diff":
        return [r for r in candidates if r["side"] == "diff"]
    return candidates

Rolling Match Statistics¶

`build_team_match_table(df)` ¶

Reshape a match-level DataFrame into a team-match-level table.

Each match in df produces two rows, one for the home team and one for the away team, with is_home, goals_for, goals_against, and win/draw/loss outcome columns attached. The outcome flags assume outcome_1x2 is encoded as 0 = home win, 1 = draw, 2 = away win.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Match-level DataFrame with homeTeamId, awayTeamId, homeScore, awayScore, outcome_1x2, and optional context columns.	required

Returns:

Type	Description
`DataFrame`	DataFrame sorted by (teamId, startTimeUtc, id) with one row per team-match combination.

Source code in src/features/stats_matches.py

def build_team_match_table(df: pd.DataFrame) -> pd.DataFrame:
    """Reshape a match-level DataFrame into a team-match-level table.

    Each match in ``df`` produces two rows — one for the home team and
    one for the away team — with ``is_home``, ``goals_for``,
    ``goals_against``, and win/draw/loss outcome columns attached.

    Args:
        df: Match-level DataFrame with homeTeamId, awayTeamId,
            homeScore, awayScore, outcome_1x2, and optional context
            columns.

    Returns:
        DataFrame sorted by (teamId, startTimeUtc, id) with one row
        per team-match combination.
    """
    common_cols = [
        "id",
        "startTimeUtc",
        "tournamentId",
        "stageId",
        "regionId",
        "seasonId",
    ]
    common_cols = [c for c in common_cols if c in df.columns]

    df_base = df[common_cols].copy()

    df_home = df_base.copy()
    df_home["teamId"] = df["homeTeamId"].values
    df_home["opponentId"] = df["awayTeamId"].values
    df_home["is_home"] = True
    df_home["goals_for"] = df["homeScore"].values
    df_home["goals_against"] = df["awayScore"].values
    df_home["win"] = df["outcome_1x2"].values == 0
    df_home["draw"] = df["outcome_1x2"].values == 1
    df_home["loss"] = df["outcome_1x2"].values == 2

    df_away = df_base.copy()
    df_away["teamId"] = df["awayTeamId"].values
    df_away["opponentId"] = df["homeTeamId"].values
    df_away["is_home"] = False
    df_away["goals_for"] = df["awayScore"].values
    df_away["goals_against"] = df["homeScore"].values
    df_away["win"] = df["outcome_1x2"].values == 2
    df_away["draw"] = df["outcome_1x2"].values == 1
    df_away["loss"] = df["outcome_1x2"].values == 0

    df_team_match = pd.concat([df_home, df_away], axis=0, ignore_index=True)
    df_team_match = df_team_match.sort_values(
        ["teamId", "startTimeUtc", "id"], ascending=[True, True, True]
    )

    return df_team_match
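
The reshape can be tried on a toy frame. The sketch below follows the same logic in condensed form; the helper `one_side` is hypothetical and exists only to avoid repeating the home and away branches, and the match data is made up.

```python
import pandas as pd

matches = pd.DataFrame({
    "id": [1, 2],
    "startTimeUtc": pd.to_datetime(["2024-01-01", "2024-01-08"]),
    "homeTeamId": [100, 200],
    "awayTeamId": [200, 100],
    "homeScore": [2, 1],
    "awayScore": [0, 1],
    "outcome_1x2": [0, 1],  # 0 = home win, 1 = draw, 2 = away win
})


def one_side(df, team_col, opp_col, gf_col, ga_col, is_home, win_code):
    """Build the rows for one perspective (hypothetical helper)."""
    out = df[["id", "startTimeUtc"]].copy()
    out["teamId"] = df[team_col].values
    out["opponentId"] = df[opp_col].values
    out["is_home"] = is_home
    out["goals_for"] = df[gf_col].values
    out["goals_against"] = df[ga_col].values
    out["win"] = df["outcome_1x2"].values == win_code
    out["draw"] = df["outcome_1x2"].values == 1
    out["loss"] = df["outcome_1x2"].values == (2 - win_code)
    return out


team_match = pd.concat(
    [
        one_side(matches, "homeTeamId", "awayTeamId", "homeScore", "awayScore", True, 0),
        one_side(matches, "awayTeamId", "homeTeamId", "awayScore", "homeScore", False, 2),
    ],
    ignore_index=True,
).sort_values(["teamId", "startTimeUtc", "id"])

print(team_match[["id", "teamId", "is_home", "goals_for", "win", "draw", "loss"]])
```

Two matches become four rows, and each team's rows appear in chronological order, which is the ordering the rolling step relies on.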

`add_rolling_features(df_team_match, group_keys, windows, stats_cols, prefix)` ¶

Add lagged rolling-mean features to a team-match DataFrame.

For each window size, computes the rolling sum of previous matches (shifted by one to prevent leakage) and divides by the rolling count to produce per-column means. A coverage column (fraction of window filled) is also appended.

Parameters:

Name	Type	Description	Default
`df_team_match`	`DataFrame`	Team-match-level DataFrame (output of `build_team_match_table`).	required
`group_keys`	`list[str]`	Column(s) to group by before rolling (e.g. `["teamId"]` or `["teamId", "tournamentId"]`).	required
`windows`	`list[int]`	List of look-back window sizes (e.g. `[1, 3, 5]`).	required
`stats_cols`	`list[str]`	Columns to aggregate with rolling mean.	required
`prefix`	`str`	Column-name prefix for all generated features.	required

Returns:

Type	Description
`DataFrame`	Copy of `df_team_match` with new columns `{prefix}_{col}_mean_w{window}` and `{prefix}_coverage_w{window}`.

Source code in src/features/stats_matches.py

def add_rolling_features(
    df_team_match: pd.DataFrame,
    group_keys: list[str],
    windows: list[int],
    stats_cols: list[str],
    prefix: str,
) -> pd.DataFrame:
    """Add lagged rolling-mean features to a team-match DataFrame.

    For each window size, computes the rolling sum of *previous* matches
    (shifted by one to prevent leakage) and divides by the rolling
    count to produce per-column means.  A coverage column (fraction of
    window filled) is also appended.

    Args:
        df_team_match: Team-match-level DataFrame (output of
            ``build_team_match_table``).
        group_keys: Column(s) to group by before rolling (e.g.
            ``["teamId"]`` or ``["teamId", "tournamentId"]``).
        windows: List of look-back window sizes (e.g. ``[1, 3, 5]``).
        stats_cols: Columns to aggregate with rolling mean.
        prefix: Column-name prefix for all generated features.

    Returns:
        Copy of ``df_team_match`` with new columns
        ``{prefix}_{col}_mean_w{window}`` and
        ``{prefix}_coverage_w{window}``.
    """
    df_team_match = df_team_match.copy()

    df_grouped = df_team_match.groupby(group_keys, sort=False)

    df_shifted = df_grouped[stats_cols].shift(1)

    df_count_shifted = df_grouped[stats_cols[0]].shift(1)

    for window in windows:
        roll_sum = (
            df_shifted.groupby(
                df_team_match[group_keys].apply(tuple, axis=1)
                if len(group_keys) > 1
                else df_team_match[group_keys[0]]
            )
            .rolling(window, min_periods=1)
            .sum()
            .reset_index(level=0, drop=True)
        )

        roll_cnt = (
            df_count_shifted.groupby(
                df_team_match[group_keys].apply(tuple, axis=1)
                if len(group_keys) > 1
                else df_team_match[group_keys[0]]
            )
            .rolling(window, min_periods=1)
            .count()
            .reset_index(level=0, drop=True)
        )

        for col in stats_cols:
            df_team_match[f"{prefix}_{col}_mean_w{window}"] = (
                roll_sum[col] / roll_cnt
            ).astype(np.float32)

        df_team_match[f"{prefix}_coverage_w{window}"] = (roll_cnt / window).astype(
            np.float32
        )

    logger.debug(
        "Added rolling features: prefix=%r, windows=%s, stats=%s",
        prefix,
        windows,
        stats_cols,
    )

    return df_team_match
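
The shift-then-roll pattern is easiest to check on a single statistic. The following sketch assumes one group key, one stats column, and one window; the data and the `all` prefix are hypothetical.

```python
import pandas as pd

# Rows are assumed to be in chronological order within each team.
team_match = pd.DataFrame({
    "teamId": [100, 100, 100, 100, 200, 200],
    "goals_for": [2.0, 0.0, 1.0, 3.0, 1.0, 1.0],
})

window = 3

# Shift within each team so a row only ever sees earlier matches.
shifted = team_match.groupby("teamId", sort=False)["goals_for"].shift(1)

# Roll over the shifted values, again per team.
rolled = shifted.groupby(team_match["teamId"]).rolling(window, min_periods=1)
roll_sum = rolled.sum().reset_index(level=0, drop=True)
roll_cnt = rolled.count().reset_index(level=0, drop=True)

team_match["all_goals_for_mean_w3"] = roll_sum / roll_cnt
team_match["all_coverage_w3"] = roll_cnt / window

print(team_match)
```

For team 100 the fourth row gets a mean of 1.0, the average of the three earlier values 2, 0, and 1; its own value of 3 is excluded. Each team's first row has no history, so its mean is NaN.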

`to_match_level(df_team_match, leaky_cols)` ¶

Pivot a team-match DataFrame back to match-level with home/away/diff columns.

Numeric columns (excluding leaky_cols) are split into home_{col} / away_{col} views and joined on id. Difference columns diff_{col} = home_{col} - away_{col} are appended for each numeric feature.

Parameters:

Name	Type	Description	Default
`df_team_match`	`DataFrame`	Team-match-level DataFrame with an `id` column and an `is_home` flag.	required
`leaky_cols`	`set`	Set of column names to exclude from the pivot (e.g. outcome or raw score columns that must not appear as features).	required

Returns:

Type	Description
`DataFrame`	DataFrame indexed by match `id` with home_, away_, and diff_ prefixed numeric feature columns.

Source code in src/features/stats_matches.py

def to_match_level(df_team_match: pd.DataFrame, leaky_cols: set) -> pd.DataFrame:
    """Pivot a team-match DataFrame back to match-level with home/away/diff columns.

    Numeric columns (excluding ``leaky_cols``) are split into
    ``home_{col}`` / ``away_{col}`` views and joined on ``id``.
    Difference columns ``diff_{col} = home_{col} - away_{col}`` are
    appended for each numeric feature.

    Args:
        df_team_match: Team-match-level DataFrame with an ``id`` column
            and an ``is_home`` flag.
        leaky_cols: Set of column names to exclude from the pivot (e.g.
            outcome or raw score columns that must not appear as
            features).

    Returns:
        DataFrame indexed by match ``id`` with home_, away_, and diff_
        prefixed numeric feature columns.
    """
    candidate_cols = [c for c in df_team_match.columns if c not in leaky_cols]

    numeric_cols = (
        df_team_match[candidate_cols]
        .select_dtypes(include=[np.number])
        .columns.tolist()
    )
    home = (
        df_team_match[df_team_match["is_home"] == 1][["id"] + numeric_cols]
        .set_index("id")
        .add_prefix("home_")
    )
    away = (
        df_team_match[df_team_match["is_home"] == 0][["id"] + numeric_cols]
        .set_index("id")
        .add_prefix("away_")
    )

    df_feat = home.join(away, how="inner")

    diff_cols = {
        f"diff_{c}": df_feat[f"home_{c}"] - df_feat[f"away_{c}"] for c in numeric_cols
    }
    df_feat = pd.concat([df_feat, pd.DataFrame(diff_cols, index=df_feat.index)], axis=1)

    return df_feat
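
A small pivot shows what the output looks like. This sketch mirrors the steps above on hypothetical data. It lists `id` in `leaky_cols` so that `id` serves only as the join key; whether the pipeline passes it that way is an assumption.

```python
import numpy as np
import pandas as pd

team_match = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "is_home": [True, False, True, False],
    "goals_for": [2, 0, 1, 1],                 # raw result, must not become a feature
    "all_win_mean_w3": [0.5, 0.25, 1.0, 0.0],  # lagged feature, safe to keep
})
leaky_cols = {"id", "goals_for"}  # hypothetical exclusion set

# Keep numeric columns that are not excluded (bool columns such as is_home drop out).
numeric_cols = [
    c for c in team_match.select_dtypes(include=[np.number]).columns
    if c not in leaky_cols
]

home = (
    team_match[team_match["is_home"]][["id"] + numeric_cols]
    .set_index("id")
    .add_prefix("home_")
)
away = (
    team_match[~team_match["is_home"]][["id"] + numeric_cols]
    .set_index("id")
    .add_prefix("away_")
)

features = home.join(away, how="inner")
for c in numeric_cols:
    features[f"diff_{c}"] = features[f"home_{c}"] - features[f"away_{c}"]

print(features)
```

The inner join keeps only matches that have both a home and an away row, so a match with one side missing does not reach the feature table.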

`parse_feature(col)` ¶

Parse a feature column name into its metadata components.

Expected format::

{side}_{scope}_{metric}_{agg}_w{window}

e.g. home_all_win_mean_w3 → side="home", scope="all", metric="win", agg="mean", window=3.

Parameters:

Name	Type	Description	Default
`col`	`str`	Feature column name following the project naming convention.	required

Returns:

Type	Description
`dict`	Dict with keys: name, side, scope, metric, agg, window.

Source code in src/features/stats_matches.py

def parse_feature(col: str) -> dict:
    """Parse a feature column name into its metadata components.

    Expected format::

        {side}_{scope}_{metric}_{agg}_w{window}

    e.g. ``home_all_win_mean_w3`` → side="home", scope="all",
    metric="win", agg="mean", window=3.

    Args:
        col: Feature column name following the project naming
            convention.

    Returns:
        Dict with keys: name, side, scope, metric, agg, window.
    """
    base, window_str = col.rsplit("_w", 1)
    window = int(window_str)

    base, agg = base.rsplit("_", 1)

    side, scope, *metric_parts = base.split("_")
    metric = "_".join(metric_parts)

    return {
        "name": col,
        "side": side,
        "scope": scope,
        "metric": metric,
        "agg": agg,
        "window": window,
    }
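
The parser can be exercised on a couple of names, including one whose metric contains an underscore. The sketch below uses a condensed copy of the function so it runs on its own. The final step, stacking parsed rows with a hand-written ELO row into a metadata frame, is an assumption about how features_meta might be assembled and is not taken from the pipeline code.

```python
import pandas as pd


def parse_feature(col: str) -> dict:
    """Condensed copy of the parser documented above."""
    base, window_str = col.rsplit("_w", 1)         # split off the window suffix
    base, agg = base.rsplit("_", 1)                # last token is the aggregation
    side, scope, *metric_parts = base.split("_")   # the rest after scope is the metric
    return {
        "name": col,
        "side": side,
        "scope": scope,
        "metric": "_".join(metric_parts),
        "agg": agg,
        "window": int(window_str),
    }


rows = [parse_feature(c) for c in ["home_all_win_mean_w3", "diff_all_goals_for_mean_w5"]]
print(rows[1]["metric"])  # goals_for

# ELO columns have no _w{window} suffix, so their row is written out directly
# (values copied from elo_feature_meta above).
rows.append({
    "name": "diff_elo_pre",
    "side": "diff",
    "scope": "tournament",
    "metric": "elo_pre",
    "agg": "elo",
    "window": 0,
})

features_meta = pd.DataFrame(rows)
print(features_meta)
```

Splitting from the right is what keeps multi-word metrics intact: only the window and aggregation tokens are peeled off the end before the remaining name is divided into side, scope, and metric.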
