Features

Feature Selector

Shared feature-column selector for all model-training and serving stages.

Every stage that trains, evaluates, or serves predictions must use select_model_features to guarantee a consistent feature contract between classification, tuning, final training, ablation, and batch inference.

Feature columns are derived from features_meta.parquet, which is produced by the feature_engineering DVC stage and is the single source of truth for available features.

select_model_features(features_meta, side, window_sizes, include_elo=True, include_rest_days=True, include_h2h=False)

Return numeric feature column names matching the given profile.

Parameters

features_meta: features_meta.parquet loaded as a DataFrame. Expected columns: name, side, scope, metric, window.

side: Which perspective(s) to select: "home", "away", or "diff". Accepts a list of strings, e.g. ["diff"] or ["home", "away", "diff"].

window_sizes: Rolling windows to include for windowed stat features (e.g. [1, 3]). Pass an empty list to exclude windowed stats (e.g. for elo_only).

include_elo: Include ELO pre-match ratings (metric == "elo_pre").

include_rest_days: Include rest-days features (metric == "rest_days"). Currently not present in features_meta; kept for forward compatibility.

include_h2h: Include head-to-head features (scope == "h2h"). Currently not present in features_meta; kept for forward compatibility.

Returns

list[str] Ordered list of numeric feature column names.

Source code in src/features/select.py
def select_model_features(
    features_meta: pd.DataFrame,
    side: list[str],
    window_sizes: list[int],
    include_elo: bool = True,
    include_rest_days: bool = True,
    include_h2h: bool = False,
) -> list[str]:
    """Return numeric feature column names matching the given profile.

    Parameters
    ----------
    features_meta:
        ``features_meta.parquet`` loaded as a DataFrame.
        Expected columns: ``name``, ``side``, ``scope``, ``metric``, ``window``.
    side:
        Which perspective(s) to select: ``"home"``, ``"away"``, or ``"diff"``.
        Accepts a list of strings, e.g. ``["diff"]`` or
        ``["home", "away", "diff"]``.
    window_sizes:
        Rolling windows to include for windowed stat features (e.g. ``[1, 3]``).
        Pass an empty list to exclude windowed stats (e.g. for ``elo_only``).
    include_elo:
        Include ELO pre-match ratings (``metric == "elo_pre"``).
    include_rest_days:
        Include rest-days features (``metric == "rest_days"``).
        Currently not present in ``features_meta``; kept for forward compatibility.
    include_h2h:
        Include head-to-head features (``scope == "h2h"``).
        Currently not present in ``features_meta``; kept for forward compatibility.

    Returns
    -------
    list[str]
        Ordered list of numeric feature column names.
    """
    side_mask: pd.Series = features_meta["side"].isin(side)
    window_mask: pd.Series = features_meta["window"].isin(window_sizes)

    extra_parts: list[pd.Series] = []
    if include_elo:
        extra_parts.append(features_meta["metric"].eq("elo_pre"))
    if include_rest_days:
        extra_parts.append(features_meta["metric"].eq("rest_days"))
    if include_h2h:
        extra_parts.append(features_meta["scope"].eq("h2h"))

    if extra_parts:
        extra_mask: pd.Series = reduce(operator.or_, extra_parts)
        combined = side_mask & (window_mask | extra_mask)
    else:
        combined = side_mask & window_mask

    return features_meta[combined]["name"].tolist()
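
The following sketch shows how the profile arguments interact. The metadata rows are made up for illustration (the real table comes from features_meta.parquet), and the function body is a condensed copy of the listing above, repeated only so that the snippet runs on its own.

```python
import operator
from functools import reduce

import pandas as pd


def select_model_features(features_meta, side, window_sizes,
                          include_elo=True, include_rest_days=True,
                          include_h2h=False):
    """Condensed copy of the selector above, so the example is self-contained."""
    side_mask = features_meta["side"].isin(side)
    window_mask = features_meta["window"].isin(window_sizes)
    extra_parts = []
    if include_elo:
        extra_parts.append(features_meta["metric"].eq("elo_pre"))
    if include_rest_days:
        extra_parts.append(features_meta["metric"].eq("rest_days"))
    if include_h2h:
        extra_parts.append(features_meta["scope"].eq("h2h"))
    if extra_parts:
        combined = side_mask & (window_mask | reduce(operator.or_, extra_parts))
    else:
        combined = side_mask & window_mask
    return features_meta[combined]["name"].tolist()


# Hypothetical metadata rows, for illustration only.
features_meta = pd.DataFrame(
    [
        ("home_all_win_mean_w3", "home", "all", "win", 3),
        ("diff_all_win_mean_w3", "diff", "all", "win", 3),
        ("diff_all_win_mean_w5", "diff", "all", "win", 5),
        ("diff_elo_pre", "diff", "tournament", "elo_pre", 0),
    ],
    columns=["name", "side", "scope", "metric", "window"],
)

# Window-3 diff features plus ELO (ELO rows are matched by metric, not window).
diff_w3 = select_model_features(features_meta, side=["diff"], window_sizes=[3])

# An empty window list leaves only the non-windowed extras, i.e. an ELO-only profile.
elo_only = select_model_features(features_meta, side=["diff"], window_sizes=[])

# Switching ELO off leaves only the windowed stats.
no_elo = select_model_features(
    features_meta, side=["diff"], window_sizes=[3], include_elo=False
)
```

Because the side mask is applied to both windowed and extra features, an ELO column is only returned when its side is also requested.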

ELO Ratings

ELO rating features for football match prediction.

ELO ratings are computed per tournament and propagated forward in time. The value attached to each match row is the PRE-match rating (taken before the result is known), so the feature never contains the outcome of the match it describes.

Implementation notes

  • Ratings are maintained per (tournament, team) pair so that clubs that play in multiple competitions carry independent ratings.
  • Home advantage is modelled as a fixed bonus added to the home team's rating when its expected score is calculated, not as a separate rating.
  • The algorithm is intentionally sequential (Python loop) because ELO has a strict temporal dependency. Vectorisation is not applicable here.
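
As a rough illustration of the update described in these notes, the sketch below assumes the standard logistic Elo expectation with a scale of 400. The helpers `_expected_score` and `_outcome_score_home` are not shown on this page, so `expected_score` and `update` here are hypothetical stand-ins, and the K and home-advantage values are arbitrary.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation for side A (assumed form, scale 400)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update(r_home: float, r_away: float, score_home: float,
           k_factor: float, home_advantage: float) -> tuple[float, float]:
    """Return the post-match (home, away) ratings.

    The bonus is added to the home rating only inside the expectation;
    the stored ratings themselves never include it.
    """
    exp_home = expected_score(r_home + home_advantage, r_away)
    delta = k_factor * (score_home - exp_home)
    # The away update is the mirror image, so the rating sum is preserved.
    return r_home + delta, r_away - delta


# Two equally rated teams, no home bonus, home win (score 1.0).
new_home, new_away = update(1500.0, 1500.0, score_home=1.0,
                            k_factor=20.0, home_advantage=0.0)

# With a home bonus the home side is the favourite and gains less for a win.
adv_home, adv_away = update(1500.0, 1500.0, score_home=1.0,
                            k_factor=20.0, home_advantage=100.0)
```

In the first call each side's expectation is 0.5, so the winner gains exactly half of K and the loser gives up the same amount.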

References

Elo, A. E. (1978). The Rating of Chessplayers, Past and Present. Arco Publishing.

Football application overview: https://www.eloratings.net/about

compute_elo_ratings(df, k_factor=_DEFAULT_K, initial_rating=_DEFAULT_INITIAL, home_advantage=_DEFAULT_HOME_ADV, group_col='tournamentId')

Compute pre-match ELO ratings and attach them to each match row.

Parameters

df: Match-level DataFrame. Required columns: id, startTimeUtc, homeTeamId, awayTeamId, outcome_1x2, and group_col. Rows corresponding to future matches (outcome_1x2 is NaN) receive pre-match ELO values but do not trigger a rating update.

k_factor: ELO update step size. Larger K means faster adaptation to recent results. Typical range: 20–40.

initial_rating: Starting ELO assigned to teams with no prior history in the group.

home_advantage: Rating-point bonus added to the home team's rating when its expected score is computed. Set to 0 to disable.

group_col: Column used to scope ELO state. Default tournamentId keeps league-specific rating histories independent.

Returns

pd.DataFrame A copy of df, sorted by (startTimeUtc, id) with a fresh integer index, with three new columns:

- ``home_elo_pre``  – home team ELO before this match (float32)
- ``away_elo_pre``  – away team ELO before this match (float32)
- ``diff_elo_pre``  – ``home_elo_pre`` − ``away_elo_pre`` (float32)

Notes

The pre-match snapshot ensures the features contain no information about the result being predicted — use-after-fit leakage cannot occur.

Source code in src/features/elo.py
def compute_elo_ratings(
    df: pd.DataFrame,
    k_factor: float = _DEFAULT_K,
    initial_rating: float = _DEFAULT_INITIAL,
    home_advantage: float = _DEFAULT_HOME_ADV,
    group_col: str = "tournamentId",
) -> pd.DataFrame:
    """Compute pre-match ELO ratings and attach them to each match row.

    Parameters
    ----------
    df:
        Match-level DataFrame.  Required columns: ``id``, ``startTimeUtc``,
        ``homeTeamId``, ``awayTeamId``, ``outcome_1x2``, and ``group_col``.
        Rows corresponding to *future* matches (``outcome_1x2`` is NaN)
        receive pre-match ELO values but do not trigger a rating update.
    k_factor:
        ELO update step size.  Larger K means faster adaptation to recent
        results.  Typical range: 20–40.
    initial_rating:
        Starting ELO assigned to teams with no prior history in the group.
    home_advantage:
        Rating-point bonus applied to the home team's expected score.
        Set to 0 to disable.
    group_col:
        Column used to scope ELO state.  Default ``tournamentId`` keeps
        league-specific rating histories independent.

    Returns
    -------
    pd.DataFrame
        A copy of ``df`` with three new columns:

        - ``home_elo_pre``  – home team ELO before this match (float32)
        - ``away_elo_pre``  – away team ELO before this match (float32)
        - ``diff_elo_pre``  – ``home_elo_pre`` − ``away_elo_pre`` (float32)

    Notes
    -----
    The *pre-match* snapshot ensures the features contain no information
    about the result being predicted — use-after-fit leakage cannot occur.
    """
    required = {
        "id",
        "startTimeUtc",
        "homeTeamId",
        "awayTeamId",
        "outcome_1x2",
        group_col,
    }
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"compute_elo_ratings: missing columns {missing}")

    # Normalise: drop any named index so 'id' exists only as a column.
    # Parquet round-trips can leave 'id' as both index and column simultaneously.
    df = df.reset_index(drop=True).copy()
    if "id" not in df.columns:
        raise ValueError("compute_elo_ratings: 'id' column not found after index reset")

    df_sorted = df.sort_values(["startTimeUtc", "id"]).reset_index(drop=True)

    # ratings[group_value][team_id] = current_elo
    ratings: dict = defaultdict(lambda: defaultdict(lambda: initial_rating))

    home_elo_pre: list[float] = []
    away_elo_pre: list[float] = []

    for row in df_sorted.itertuples(index=False):
        group = getattr(row, group_col)
        home_id: int = row.homeTeamId
        away_id: int = row.awayTeamId

        # Snapshot PRE-match ratings — these become the feature values.
        r_home: float = ratings[group][home_id]
        r_away: float = ratings[group][away_id]
        home_elo_pre.append(r_home)
        away_elo_pre.append(r_away)

        # Update ratings only for completed matches (outcome is known).
        outcome = row.outcome_1x2
        if pd.isna(outcome):
            continue

        # Skip self-play rows: a team cannot face itself.
        if home_id == away_id:
            logger.warning("Skipping ELO update for self-play match id=%s", row.id)
            continue

        exp_home = _expected_score(r_home + home_advantage, r_away)
        score_home = _outcome_score_home(int(outcome))

        ratings[group][home_id] = r_home + k_factor * (score_home - exp_home)
        ratings[group][away_id] = r_away + k_factor * (
            (1.0 - score_home) - (1.0 - exp_home)
        )

    result = df_sorted.copy()
    result["home_elo_pre"] = np.array(home_elo_pre, dtype=np.float32)
    result["away_elo_pre"] = np.array(away_elo_pre, dtype=np.float32)
    result["diff_elo_pre"] = (result["home_elo_pre"] - result["away_elo_pre"]).astype(
        np.float32
    )

    logger.info(
        "ELO computed: %d matches, k_factor=%.1f, home_adv=%.1f, scope=%r",
        len(df_sorted),
        k_factor,
        home_advantage,
        group_col,
    )

    return result
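
To make the snapshot-before-update order concrete, here is a minimal sketch of the same loop on four invented matches. The starting rating, K, the outcome-to-score mapping (0 = home win, 1 = draw, 2 = away win) and the logistic expectation are assumptions made for the example, not values taken from the module.

```python
from collections import defaultdict

import pandas as pd

INITIAL = 1500.0  # hypothetical starting rating
K = 20.0          # hypothetical step size

# Invented fixtures, already in chronological order.
matches = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "tournamentId": [10, 10, 10, 20],
        "homeTeamId": [1, 2, 1, 1],
        "awayTeamId": [2, 1, 2, 2],
        "outcome_1x2": [0, None, None, None],  # only match 1 has been played
    }
)

# ratings[tournament][team] -> current rating, created lazily at INITIAL.
ratings = defaultdict(lambda: defaultdict(lambda: INITIAL))
snapshots = []

for row in matches.itertuples(index=False):
    state = ratings[row.tournamentId]
    r_home, r_away = state[row.homeTeamId], state[row.awayTeamId]
    snapshots.append((r_home, r_away))  # feature values, read before any update

    if pd.isna(row.outcome_1x2):
        continue  # future match: it gets ratings but does not change them

    score_home = {0: 1.0, 1: 0.5, 2: 0.0}[int(row.outcome_1x2)]
    exp_home = 1.0 / (1.0 + 10.0 ** ((r_away - r_home) / 400.0))
    delta = K * (score_home - exp_home)
    state[row.homeTeamId] = r_home + delta
    state[row.awayTeamId] = r_away - delta
```

Match 1 is rated at the initial values and then moves both teams; matches 2 and 3 see the updated ratings but leave them unchanged; match 4 is in a different tournament and starts from the initial rating again.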

elo_feature_meta(side='diff')

Return feature metadata rows compatible with the rolling-features schema.

Parameters

side: Which perspective to surface: "home", "away", or "diff". Passing "diff" returns only the home-minus-away row, which is what the default classification pipeline uses. As implemented, any other value returns all three rows (home, away, and diff).

Returns

list of dicts with keys: name, side, scope, metric, agg, window.

Source code in src/features/elo.py
def elo_feature_meta(side: str = "diff") -> list[dict]:
    """Return feature metadata rows compatible with the rolling-features schema.

    Parameters
    ----------
    side:
        Which perspective to surface: ``"home"``, ``"away"``, or ``"diff"``.
        Passing ``"diff"`` returns only the home-minus-away row, which is
        what the default classification pipeline uses.  Any other value
        currently returns all three rows.

    Returns
    -------
    list of dicts with keys: name, side, scope, metric, agg, window.
    """
    candidates = [
        {
            "name": "home_elo_pre",
            "side": "home",
            "scope": "tournament",
            "metric": "elo_pre",
            "agg": "elo",
            "window": 0,
        },
        {
            "name": "away_elo_pre",
            "side": "away",
            "scope": "tournament",
            "metric": "elo_pre",
            "agg": "elo",
            "window": 0,
        },
        {
            "name": "diff_elo_pre",
            "side": "diff",
            "scope": "tournament",
            "metric": "elo_pre",
            "agg": "elo",
            "window": 0,
        },
    ]
    if side == "diff":
        return [r for r in candidates if r["side"] == "diff"]
    return candidates

Rolling Match Statistics

build_team_match_table(df)

Reshape a match-level DataFrame into a team-match-level table.

Each match in df produces two rows — one for the home team and one for the away team — with is_home, goals_for, goals_against, and win/draw/loss outcome columns attached.

Parameters:

df (DataFrame, required): Match-level DataFrame with homeTeamId, awayTeamId, homeScore, awayScore, outcome_1x2, and optional context columns.

Returns:

DataFrame: DataFrame sorted by (teamId, startTimeUtc, id) with one row per team-match combination.

Source code in src/features/stats_matches.py
def build_team_match_table(df: pd.DataFrame) -> pd.DataFrame:
    """Reshape a match-level DataFrame into a team-match-level table.

    Each match in ``df`` produces two rows — one for the home team and
    one for the away team — with ``is_home``, ``goals_for``,
    ``goals_against``, and win/draw/loss outcome columns attached.

    Args:
        df: Match-level DataFrame with homeTeamId, awayTeamId,
            homeScore, awayScore, outcome_1x2, and optional context
            columns.

    Returns:
        DataFrame sorted by (teamId, startTimeUtc, id) with one row
        per team-match combination.
    """
    common_cols = [
        "id",
        "startTimeUtc",
        "tournamentId",
        "stageId",
        "regionId",
        "seasonId",
    ]
    common_cols = [c for c in common_cols if c in df.columns]

    df_base = df[common_cols].copy()

    df_home = df_base.copy()
    df_home["teamId"] = df["homeTeamId"].values
    df_home["opponentId"] = df["awayTeamId"].values
    df_home["is_home"] = True
    df_home["goals_for"] = df["homeScore"].values
    df_home["goals_against"] = df["awayScore"].values
    df_home["win"] = df["outcome_1x2"].values == 0
    df_home["draw"] = df["outcome_1x2"].values == 1
    df_home["loss"] = df["outcome_1x2"].values == 2

    df_away = df_base.copy()
    df_away["teamId"] = df["awayTeamId"].values
    df_away["opponentId"] = df["homeTeamId"].values
    df_away["is_home"] = False
    df_away["goals_for"] = df["awayScore"].values
    df_away["goals_against"] = df["homeScore"].values
    df_away["win"] = df["outcome_1x2"].values == 2
    df_away["draw"] = df["outcome_1x2"].values == 1
    df_away["loss"] = df["outcome_1x2"].values == 0

    df_team_match = pd.concat([df_home, df_away], axis=0, ignore_index=True)
    df_team_match = df_team_match.sort_values(
        ["teamId", "startTimeUtc", "id"], ascending=[True, True, True]
    )

    return df_team_match
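
The reshape can be pictured on a single invented match. The helper `side_view` below is a hypothetical, condensed restatement of the two halves of the function above; it relies on the outcome encoding visible in the source (0 = home win, 1 = draw, 2 = away win).

```python
import pandas as pd

# One invented match: team 1 beats team 2 by 3-1 at home.
matches = pd.DataFrame(
    {
        "id": [100],
        "startTimeUtc": pd.to_datetime(["2024-01-01"]),
        "homeTeamId": [1],
        "awayTeamId": [2],
        "homeScore": [3],
        "awayScore": [1],
        "outcome_1x2": [0],
    }
)


def side_view(df: pd.DataFrame, is_home: bool) -> pd.DataFrame:
    """Build the rows seen from one side of each match (hypothetical helper)."""
    own, opp = ("home", "away") if is_home else ("away", "home")
    win_code = 0 if is_home else 2  # outcome code that counts as a win for this side
    return pd.DataFrame(
        {
            "id": df["id"],
            "startTimeUtc": df["startTimeUtc"],
            "teamId": df[f"{own}TeamId"],
            "opponentId": df[f"{opp}TeamId"],
            "is_home": is_home,
            "goals_for": df[f"{own}Score"],
            "goals_against": df[f"{opp}Score"],
            "win": df["outcome_1x2"] == win_code,
            "draw": df["outcome_1x2"] == 1,
            "loss": df["outcome_1x2"] == 2 - win_code,
        }
    )


team_match = pd.concat(
    [side_view(matches, True), side_view(matches, False)], ignore_index=True
).sort_values(["teamId", "startTimeUtc", "id"])
```

The single match yields two rows: team 1 with a win and goals_for of 3, and team 2 with a loss and goals_for of 1.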

add_rolling_features(df_team_match, group_keys, windows, stats_cols, prefix)

Add lagged rolling-mean features to a team-match DataFrame.

For each window size, computes the rolling sum of previous matches (shifted by one to prevent leakage) and divides by the rolling count to produce per-column means. A coverage column (fraction of window filled) is also appended.

Parameters:

df_team_match (DataFrame, required): Team-match-level DataFrame (output of build_team_match_table).

group_keys (list[str], required): Column(s) to group by before rolling (e.g. ["teamId"] or ["teamId", "tournamentId"]).

windows (list[int], required): List of look-back window sizes (e.g. [1, 3, 5]).

stats_cols (list[str], required): Columns to aggregate with rolling mean.

prefix (str, required): Column-name prefix for all generated features.

Returns:

DataFrame: Copy of df_team_match with new columns {prefix}_{col}_mean_w{window} and {prefix}_coverage_w{window}.

Source code in src/features/stats_matches.py
def add_rolling_features(
    df_team_match: pd.DataFrame,
    group_keys: list[str],
    windows: list[int],
    stats_cols: list[str],
    prefix: str,
) -> pd.DataFrame:
    """Add lagged rolling-mean features to a team-match DataFrame.

    For each window size, computes the rolling sum of *previous* matches
    (shifted by one to prevent leakage) and divides by the rolling
    count to produce per-column means.  A coverage column (fraction of
    window filled) is also appended.

    Args:
        df_team_match: Team-match-level DataFrame (output of
            ``build_team_match_table``).
        group_keys: Column(s) to group by before rolling (e.g.
            ``["teamId"]`` or ``["teamId", "tournamentId"]``).
        windows: List of look-back window sizes (e.g. ``[1, 3, 5]``).
        stats_cols: Columns to aggregate with rolling mean.
        prefix: Column-name prefix for all generated features.

    Returns:
        Copy of ``df_team_match`` with new columns
        ``{prefix}_{col}_mean_w{window}`` and
        ``{prefix}_coverage_w{window}``.
    """
    df_team_match = df_team_match.copy()

    df_grouped = df_team_match.groupby(group_keys, sort=False)

    df_shifted = df_grouped[stats_cols].shift(1)

    df_count_shifted = df_grouped[stats_cols[0]].shift(1)

    # Grouping key for the shifted frames, built once and reused for every window.
    group_key = (
        df_team_match[group_keys].apply(tuple, axis=1)
        if len(group_keys) > 1
        else df_team_match[group_keys[0]]
    )

    for window in windows:
        roll_sum = (
            df_shifted.groupby(group_key)
            .rolling(window, min_periods=1)
            .sum()
            .reset_index(level=0, drop=True)
        )

        roll_cnt = (
            df_count_shifted.groupby(group_key)
            .rolling(window, min_periods=1)
            .count()
            .reset_index(level=0, drop=True)
        )

        for col in stats_cols:
            df_team_match[f"{prefix}_{col}_mean_w{window}"] = (
                roll_sum[col] / roll_cnt
            ).astype(np.float32)

        df_team_match[f"{prefix}_coverage_w{window}"] = (roll_cnt / window).astype(
            np.float32
        )

    logger.debug(
        "Added rolling features: prefix=%r, windows=%s, stats=%s",
        prefix,
        windows,
        stats_cols,
    )

    return df_team_match
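
A small sketch of the shift-then-roll pattern may help. The data is invented, only one statistic and one window are used, and the rows are assumed to be sorted chronologically within each team already.

```python
import pandas as pd

# Invented team-match rows, in chronological order within each team.
team_match = pd.DataFrame(
    {
        "teamId": [1, 1, 1, 2, 2],
        "goals_for": [2.0, 0.0, 4.0, 5.0, 1.0],
    }
)
window = 2

# Shift by one inside each team so a row only ever sees earlier matches.
prev = team_match.groupby("teamId")["goals_for"].shift(1)

# Roll over the shifted values, again per team.
roll = prev.groupby(team_match["teamId"]).rolling(window, min_periods=1)
roll_sum = roll.sum().reset_index(level=0, drop=True)
roll_cnt = roll.count().reset_index(level=0, drop=True)

# Mean of the previous matches, and the share of the window that was filled.
team_match["goals_for_mean_w2"] = roll_sum / roll_cnt
team_match["coverage_w2"] = roll_cnt / window
```

The first match of each team has no history, so its mean is NaN; the second match sees exactly one earlier game (coverage 0.5), and from the third match on the window is full.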

to_match_level(df_team_match, leaky_cols)

Pivot a team-match DataFrame back to match-level with home/away/diff columns.

Numeric columns (excluding leaky_cols) are split into home_{col} / away_{col} views and joined on id. Difference columns diff_{col} = home_{col} - away_{col} are appended for each numeric feature.

Parameters:

df_team_match (DataFrame, required): Team-match-level DataFrame with an id column and an is_home flag.

leaky_cols (set, required): Set of column names to exclude from the pivot (e.g. outcome or raw score columns that must not appear as features).

Returns:

DataFrame: DataFrame indexed by match id with home_, away_, and diff_ prefixed numeric feature columns.

Source code in src/features/stats_matches.py
def to_match_level(df_team_match: pd.DataFrame, leaky_cols: set) -> pd.DataFrame:
    """Pivot a team-match DataFrame back to match-level with home/away/diff columns.

    Numeric columns (excluding ``leaky_cols``) are split into
    ``home_{col}`` / ``away_{col}`` views and joined on ``id``.
    Difference columns ``diff_{col} = home_{col} - away_{col}`` are
    appended for each numeric feature.

    Args:
        df_team_match: Team-match-level DataFrame with an ``id`` column
            and an ``is_home`` flag.
        leaky_cols: Set of column names to exclude from the pivot (e.g.
            outcome or raw score columns that must not appear as
            features).

    Returns:
        DataFrame indexed by match ``id`` with home_, away_, and diff_
        prefixed numeric feature columns.
    """
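    # Exclude ``id`` explicitly: it is the join key, not a feature, and a numeric
    # ``id`` would otherwise be selected twice in the ``["id"] + numeric_cols``
    # lookups below.
    candidate_cols = [
        c for c in df_team_match.columns if c not in leaky_cols and c != "id"
    ]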

    numeric_cols = (
        df_team_match[candidate_cols]
        .select_dtypes(include=[np.number])
        .columns.tolist()
    )
    home = (
        df_team_match[df_team_match["is_home"] == 1][["id"] + numeric_cols]
        .set_index("id")
        .add_prefix("home_")
    )
    away = (
        df_team_match[df_team_match["is_home"] == 0][["id"] + numeric_cols]
        .set_index("id")
        .add_prefix("away_")
    )

    df_feat = home.join(away, how="inner")

    diff_cols = {
        f"diff_{c}": df_feat[f"home_{c}"] - df_feat[f"away_{c}"] for c in numeric_cols
    }
    df_feat = pd.concat([df_feat, pd.DataFrame(diff_cols, index=df_feat.index)], axis=1)

    return df_feat
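
The pivot can be illustrated with one invented match and a single feature column. The column name all_win_mean_w3 and its values are hypothetical; the steps mirror the split, join, and difference described above.

```python
import pandas as pd

# Two team-match rows for the same invented match id.
team_match = pd.DataFrame(
    {
        "id": [100, 100],
        "is_home": [True, False],
        "all_win_mean_w3": [0.75, 0.25],
    }
)
feature_cols = ["all_win_mean_w3"]

# Split into home and away views keyed by match id.
home = (
    team_match[team_match["is_home"]]
    .set_index("id")[feature_cols]
    .add_prefix("home_")
)
away = (
    team_match[~team_match["is_home"]]
    .set_index("id")[feature_cols]
    .add_prefix("away_")
)

# Inner join keeps only matches that have both sides present.
match_level = home.join(away, how="inner")

# Difference features are always home minus away.
for col in feature_cols:
    match_level[f"diff_{col}"] = match_level[f"home_{col}"] - match_level[f"away_{col}"]
```

The result has one row, indexed by match id 100, with home, away, and diff versions of the feature; the diff is 0.75 minus 0.25.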

parse_feature(col)

Parse a feature column name into its metadata components.

Expected format:

{side}_{scope}_{metric}_{agg}_w{window}

e.g. home_all_win_mean_w3 → side="home", scope="all", metric="win", agg="mean", window=3.

Parameters:

col (str, required): Feature column name following the project naming convention.

Returns:

dict: Dict with keys name, side, scope, metric, agg, window.

Source code in src/features/stats_matches.py
def parse_feature(col: str) -> dict:
    """Parse a feature column name into its metadata components.

    Expected format::

        {side}_{scope}_{metric}_{agg}_w{window}

    e.g. ``home_all_win_mean_w3`` → side="home", scope="all",
    metric="win", agg="mean", window=3.

    Args:
        col: Feature column name following the project naming
            convention.

    Returns:
        Dict with keys: name, side, scope, metric, agg, window.
    """
    base, window_str = col.rsplit("_w", 1)
    window = int(window_str)

    base, agg = base.rsplit("_", 1)

    side, scope, *metric_parts = base.split("_")
    metric = "_".join(metric_parts)

    return {
        "name": col,
        "side": side,
        "scope": scope,
        "metric": metric,
        "agg": agg,
        "window": window,
    }
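
Two edge cases are worth seeing in action. The function below is a copy of the parser above, repeated only so that the snippet runs on its own; the first column name is a hypothetical example of a metric that itself contains an underscore.

```python
def parse_feature(col: str) -> dict:
    """Copy of the parser above, so the example is self-contained."""
    base, window_str = col.rsplit("_w", 1)
    window = int(window_str)
    base, agg = base.rsplit("_", 1)
    side, scope, *metric_parts = base.split("_")
    return {
        "name": col,
        "side": side,
        "scope": scope,
        "metric": "_".join(metric_parts),
        "agg": agg,
        "window": window,
    }


# A metric with an inner underscore is reassembled from the middle parts.
parsed = parse_feature("away_all_goals_for_mean_w5")

# A name without a "_w{window}" suffix cannot be unpacked and raises ValueError.
try:
    parse_feature("diff_elo_pre")
except ValueError:
    elo_parse_failed = True
else:
    elo_parse_failed = False
```

The second case suggests why ELO columns get their metadata from elo_feature_meta rather than from this parser, although that reading is an inference from the code shown on this page.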