Data Quality

Great Expectations suite builders — pure functions, no IO.

Raw Matches

Expectation suite for raw match data (match_raw.parquet).

Pure functions only — no IO, no side effects. All expectations operate on a pandas DataFrame passed at call time.

Raw data is the source exactly as downloaded from MinIO, before any preprocessing. It does NOT contain derived columns such as outcome_1x2, which is computed in preprocessing.

build_raw_suite(context)

Build and return the expectation suite for raw match data.

Parameters:

Name     Type                 Description                                          Default
context  AbstractDataContext  An active GX DataContext (ephemeral or file-based).  required

Returns:

Type              Description
ExpectationSuite  An ExpectationSuite with all raw-data expectations added.

Source code in src/data_quality/raw.py
def build_raw_suite(context: AbstractDataContext) -> ExpectationSuite:
    """Build and return the expectation suite for raw match data.

    Args:
        context: An active GX DataContext (ephemeral or file-based).

    Returns:
        An ExpectationSuite with all raw-data expectations added.
    """
    suite = context.suites.add(ExpectationSuite(name="raw_match_suite"))

    # Schema
    # exact_match=False: allows upstream schema additions without breaking this stage.
    suite.add_expectation(
        gx.expectations.ExpectTableColumnsToMatchSet(
            column_set=REQUIRED_COLUMNS,
            exact_match=False,
        )
    )

    # Row count
    suite.add_expectation(gx.expectations.ExpectTableRowCountToBeBetween(min_value=1))

    # Nullability
    for col in NOT_NULL_COLUMNS:
        suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column=col))

    # Uniqueness
    # Duplicate match IDs would cause row duplication and leakage in all downstream stages.
    suite.add_expectation(gx.expectations.ExpectColumnValuesToBeUnique(column="id"))

    # Date range
    # Timestamps outside this range indicate a scraping bug or timezone error.
    suite.add_expectation(
        gx.expectations.ExpectColumnValuesToBeBetween(
            column="startTimeUtc",
            min_value=DATE_MIN,
            max_value=DATE_MAX,
        )
    )

    return suite
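
To make the uniqueness and date-range expectations concrete, here is a minimal plain-Python sketch of the same two checks on a list of row dicts. It does not use Great Expectations. The helper names and the DATE_MIN / DATE_MAX values are hypothetical placeholders, since the real constants are defined in the project and are not shown on this page.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical bounds for illustration only; the project's real
# DATE_MIN / DATE_MAX are not shown on this page.
DATE_MIN = datetime(2000, 1, 1, tzinfo=timezone.utc)
DATE_MAX = datetime(2100, 1, 1, tzinfo=timezone.utc)


def duplicate_ids(rows):
    """Return the sorted list of ids that occur more than once."""
    counts = Counter(row["id"] for row in rows)
    return sorted(i for i, n in counts.items() if n > 1)


def out_of_range(rows, lo=DATE_MIN, hi=DATE_MAX):
    """Return ids whose startTimeUtc falls outside [lo, hi]."""
    return [r["id"] for r in rows if not (lo <= r["startTimeUtc"] <= hi)]


rows = [
    {"id": 1, "startTimeUtc": datetime(2024, 5, 1, 18, 0, tzinfo=timezone.utc)},
    {"id": 2, "startTimeUtc": datetime(2024, 5, 2, 18, 0, tzinfo=timezone.utc)},
    {"id": 2, "startTimeUtc": datetime(2024, 5, 2, 18, 0, tzinfo=timezone.utc)},
    {"id": 3, "startTimeUtc": datetime(1970, 1, 1, tzinfo=timezone.utc)},
]

print(duplicate_ids(rows))  # [2]
print(out_of_range(rows))   # [3]
```

A 1970 timestamp like the one on id 3 is the kind of value a zero or missing epoch field would produce, which is why a lower bound is useful even when the upper bound is generous.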

Interim (Preprocessed) Matches

Expectation suite for preprocessed finished matches (interim/finished.parquet).

Pure functions only — no IO, no side effects.

finished.parquet is the output of preprocessing: status=6 matches only, with computed targets (outcome_1x2, sumScore, diffScore) and dropped raw columns. Index is id (set in preprocess_and_split via df_matches.index = df_matches["id"]).

Key invariants vs. raw:
- outcome_1x2 ∈ {0, 1, 2} (0=home win, 1=draw, 2=away win)
- homeScore / awayScore are non-negative int8 values
- startTimeUtc is tz-aware (UTC) and monotonically non-decreasing (sorted)
- No nulls in any column (all null-bearing raw columns are dropped)

GX 1.12.3 known limitations:
- ExpectColumnValuesToBeIncreasing is broken for tz-aware datetime columns; it attempts series_diff[null] = 1 which raises FutureError in pandas 2.x.
- ExpectColumnValuesToBeBetween with tz-aware Timestamp bounds is unreliable.

Both are omitted; the sort contract is enforced by preprocessing (sort_values) and temporal range is covered upstream by validate_raw.
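
Because the sort and timezone checks are left out of the suite, they have to be asserted somewhere else. The following is a minimal stdlib sketch of what such a check could look like on a list of datetimes. The function names are hypothetical and this is not the project's unit test.

```python
from datetime import datetime, timedelta, timezone


def is_utc_aware(ts):
    """True if the datetime carries tzinfo with a zero UTC offset."""
    return ts.tzinfo is not None and ts.utcoffset() == timedelta(0)


def is_non_decreasing(timestamps):
    """True if every timestamp is >= its predecessor (ties allowed)."""
    return all(a <= b for a, b in zip(timestamps, timestamps[1:]))


kickoffs = [
    datetime(2024, 5, 1, 18, 0, tzinfo=timezone.utc),
    datetime(2024, 5, 1, 18, 0, tzinfo=timezone.utc),  # simultaneous kickoffs are fine
    datetime(2024, 5, 2, 20, 0, tzinfo=timezone.utc),
]

assert all(is_utc_aware(t) for t in kickoffs)
assert is_non_decreasing(kickoffs)
assert not is_non_decreasing(list(reversed(kickoffs)))
```

On a pandas Series the non-strict ordering check is available as the is_monotonic_increasing property, which also accepts ties.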

build_interim_suite(context)

Build and return the expectation suite for preprocessed finished matches.

Parameters:

Name     Type                 Description                                          Default
context  AbstractDataContext  An active GX DataContext (ephemeral or file-based).  required

Returns:

Type              Description
ExpectationSuite  An ExpectationSuite with all interim-data expectations added.

Source code in src/data_quality/interim.py
def build_interim_suite(context: AbstractDataContext) -> ExpectationSuite:
    """Build and return the expectation suite for preprocessed finished matches.

    Args:
        context: An active GX DataContext (ephemeral or file-based).

    Returns:
        An ExpectationSuite with all interim-data expectations added.
    """
    suite = context.suites.add(ExpectationSuite(name="interim_finished_suite"))

    # Schema
    suite.add_expectation(
        gx.expectations.ExpectTableColumnsToMatchSet(
            column_set=REQUIRED_COLUMNS,
            exact_match=False,
        )
    )

    # Row count
    # Empty DataFrame means the status=6 filter removed everything —
    # scraping or API status-code issue, not a valid pipeline state.
    suite.add_expectation(gx.expectations.ExpectTableRowCountToBeBetween(min_value=1))

    # Nullability
    # Preprocessing drops all null-bearing columns; no nulls are expected.
    for col in NOT_NULL_COLUMNS:
        suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column=col))

    # Uniqueness
    suite.add_expectation(gx.expectations.ExpectColumnValuesToBeUnique(column="id"))

    # Target integrity
    # outcome_1x2 must only contain {0, 1, 2}. Any other value means the
    # argmax logic in preprocess_and_split is broken (e.g., all-False row).
    suite.add_expectation(
        gx.expectations.ExpectColumnValuesToBeInSet(
            column="outcome_1x2",
            value_set=OUTCOME_VALUES,
        )
    )

    # Score ranges
    # Scores are non-negative; upper bound guards against overflow/corruption.
    for col in ("homeScore", "awayScore"):
        suite.add_expectation(
            gx.expectations.ExpectColumnValuesToBeBetween(
                column=col,
                min_value=0,
                max_value=SCORE_MAX,
            )
        )

    # NOTE: ExpectColumnValuesToBeIncreasing on tz-aware datetime columns is
    # broken in GX 1.12.3 — internally it attempts `series_diff[null] = 1`
    # which raises a FutureError against timedelta64[ns] dtype in pandas 2.x.
    # The sort contract is enforced by preprocessing (df.sort_values) and
    # tested at the unit level in tests/test_data/test_preprocess.py.
    #
    # NOTE: ExpectColumnValuesToBeBetween with tz-aware Timestamp bounds is
    # also unreliable in GX 1.12.3. Date range integrity is implicitly
    # covered by the null check and the upstream raw validation (validate_raw).

    return suite
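
The target-integrity expectation relies on the 0/1/2 encoding described above. As a rough illustration, the encoding can be sketched as an argmax over three indicator flags. This is an assumption about the shape of the logic in preprocess_and_split, not a copy of it, and the function name is hypothetical.

```python
def outcome_1x2(home_score, away_score):
    """Encode a result as 0 (home win), 1 (draw) or 2 (away win)."""
    flags = [
        home_score > away_score,
        home_score == away_score,
        home_score < away_score,
    ]
    # Exactly one flag is True for comparable scores, so index(True)
    # behaves like argmax over the three indicators.
    return flags.index(True)


OUTCOME_VALUES = {0, 1, 2}  # assumed to mirror the project's constant

results = [(2, 1), (0, 0), (1, 3)]
codes = [outcome_1x2(h, a) for h, a in results]
print(codes)  # [0, 1, 2]
assert set(codes) <= OUTCOME_VALUES

# A missing score makes every comparison False (the "all-False row" case).
try:
    outcome_1x2(float("nan"), 1)
except ValueError:
    print("no indicator is True for a missing score")
```

In this sketch the all-False case fails loudly. A vectorised implementation may behave differently, which is one reason to validate the resulting column as well.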

Finished Matches

Expectation suite for preprocessed finished matches (interim/finished.parquet).

Pure functions only — no IO, no side effects.

finished.parquet is the output of preprocessing: status=6 matches only, with computed targets (outcome_1x2, sumScore, diffScore) and dropped raw columns. Index is id (set in preprocess_and_split via df_matches.index = df_matches["id"]).

Key invariants vs. raw:
- outcome_1x2 ∈ {0, 1, 2} (0=home win, 1=draw, 2=away win)
- homeScore / awayScore are non-negative int8 values
- startTimeUtc is tz-aware (UTC) and monotonically non-decreasing (sorted)
- No nulls in any column (all null-bearing raw columns are dropped)

GX 1.12.3 known limitations:
- ExpectColumnValuesToBeIncreasing is broken for tz-aware datetime columns; it attempts series_diff[null] = 1 which raises FutureError in pandas 2.x.
- ExpectColumnValuesToBeBetween with tz-aware Timestamp bounds is unreliable.

Both are omitted; the sort contract is enforced by preprocessing (sort_values) and temporal range is covered upstream by validate_raw.

build_finished_suite(context)

Build and return the expectation suite for preprocessed finished matches.

Parameters:

Name     Type                 Description                                          Default
context  AbstractDataContext  An active GX DataContext (ephemeral or file-based).  required

Returns:

Type              Description
ExpectationSuite  An ExpectationSuite with all finished-data expectations added.

Source code in src/data_quality/finished.py
def build_finished_suite(context: AbstractDataContext) -> ExpectationSuite:
    """Build and return the expectation suite for preprocessed finished matches.

    Args:
        context: An active GX DataContext (ephemeral or file-based).

    Returns:
        An ExpectationSuite with all finished-data expectations added.
    """
    suite = context.suites.add(ExpectationSuite(name="finished_suite"))

    # Schema
    suite.add_expectation(
        gx.expectations.ExpectTableColumnsToMatchSet(
            column_set=REQUIRED_COLUMNS,
            exact_match=False,
        )
    )

    # Row count
    # Empty DataFrame means the status=6 filter removed everything —
    # scraping or API status-code issue, not a valid pipeline state.
    suite.add_expectation(gx.expectations.ExpectTableRowCountToBeBetween(min_value=1))

    # Nullability
    # Preprocessing drops all null-bearing columns; no nulls are expected.
    for col in NOT_NULL_COLUMNS:
        suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column=col))

    # Uniqueness
    suite.add_expectation(gx.expectations.ExpectColumnValuesToBeUnique(column="id"))

    # Target integrity
    # outcome_1x2 must only contain {0, 1, 2}. Any other value means the
    # argmax logic in preprocess_and_split is broken (e.g., all-False row).
    suite.add_expectation(
        gx.expectations.ExpectColumnValuesToBeInSet(
            column="outcome_1x2",
            value_set=OUTCOME_VALUES,
        )
    )

    # Score ranges
    # Scores are non-negative; upper bound guards against overflow/corruption.
    for col in ("homeScore", "awayScore"):
        suite.add_expectation(
            gx.expectations.ExpectColumnValuesToBeBetween(
                column=col,
                min_value=0,
                max_value=SCORE_MAX,
            )
        )

    # NOTE: ExpectColumnValuesToBeIncreasing on tz-aware datetime columns is
    # broken in GX 1.12.3 — internally it attempts `series_diff[null] = 1`
    # which raises a FutureError against timedelta64[ns] dtype in pandas 2.x.
    # The sort contract is enforced by preprocessing (df.sort_values) and
    # tested at the unit level in tests/test_data/test_preprocess.py.
    #
    # NOTE: ExpectColumnValuesToBeBetween with tz-aware Timestamp bounds is
    # also unreliable in GX 1.12.3. Date range integrity is implicitly
    # covered by the null check and the upstream raw validation (validate_raw).

    return suite
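
The score-range comment mentions overflow. Since scores are stored as int8, a value above 127 would wrap to a negative number, and the lower bound of 0 is what catches it. The sketch below illustrates this with stdlib integers only. SCORE_MAX here is a hypothetical value and the helper names are not part of the project.

```python
SCORE_MAX = 50  # hypothetical; the project's real SCORE_MAX is not shown here


def as_int8(value):
    """Reinterpret the low byte of an integer as a signed 8-bit value."""
    return int.from_bytes((value & 0xFF).to_bytes(1, "little"), "little", signed=True)


def scores_in_range(scores, max_value=SCORE_MAX):
    """True if every score lies in [0, max_value]."""
    return all(0 <= s <= max_value for s in scores)


print(as_int8(3))    # 3
print(as_int8(200))  # -56: wraps past the int8 maximum of 127

assert scores_in_range([0, 3, 7])
assert not scores_in_range([as_int8(200)])  # negative after wraparound
assert not scores_in_range([120])           # fits in int8 but implausibly large
```

The upper bound covers the complementary case: values that are valid int8 but far outside any realistic score.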

Future Matches

Expectation suite for preprocessed future matches (interim/future.parquet).

Pure functions only — no IO, no side effects.

future.parquet is produced by preprocessing: status=1 matches whose kickoff is strictly after the last finished match. Score and target columns are absent by design — this is the anti-leakage guarantee checked here.

Key invariants vs. finished:
- No score columns (homeScore, awayScore and all extra-time/penalty columns are dropped in preprocess_and_split to prevent leakage).
- No derived target columns (outcome_1x2, sumScore, diffScore are absent).
- All match identity columns must be non-null.
- Unique match IDs (no duplicates from the scraper).

build_future_suite(context)

Build and return the expectation suite for preprocessed future matches.

Parameters:

Name     Type                 Description                                          Default
context  AbstractDataContext  An active GX DataContext (ephemeral or file-based).  required

Returns:

Type              Description
ExpectationSuite  An ExpectationSuite with all future-data expectations added.

Source code in src/data_quality/future.py
def build_future_suite(context: AbstractDataContext) -> ExpectationSuite:
    """Build and return the expectation suite for preprocessed future matches.

    Args:
        context: An active GX DataContext (ephemeral or file-based).

    Returns:
        An ExpectationSuite with all future-data expectations added.
    """
    suite = context.suites.add(ExpectationSuite(name="future_match_suite"))

    # Schema — exact_match=True verifies that score/target columns are absent
    # (anti-leakage guarantee: future split must not contain homeScore, awayScore,
    # outcome_1x2, sumScore, diffScore, or any extra-time/penalty columns).
    suite.add_expectation(
        gx.expectations.ExpectTableColumnsToMatchSet(
            column_set=REQUIRED_COLUMNS,
            exact_match=True,
        )
    )

    # Row count — the pipeline should always produce at least one upcoming fixture.
    # Zero rows indicates a stale snapshot or upstream scraping failure.
    suite.add_expectation(gx.expectations.ExpectTableRowCountToBeBetween(min_value=1))

    # Nullability — all identity columns must be populated.
    for col in NOT_NULL_COLUMNS:
        suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column=col))

    # Uniqueness — each future match must appear exactly once.
    suite.add_expectation(gx.expectations.ExpectColumnValuesToBeUnique(column="id"))

    return suite
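
The future suite is the only one that sets exact_match=True, and the difference matters for leakage. The sketch below mimics the pass/fail logic of a column-set check with plain Python sets to show why. The column names are hypothetical examples and compare_columns is not a Great Expectations function.

```python
def compare_columns(actual, required, exact_match):
    """Mimic the pass/fail logic of a column-set check.

    exact_match=False: every required column must be present; extras are allowed.
    exact_match=True: the two sets must be identical.
    """
    actual, required = set(actual), set(required)
    missing = sorted(required - actual)
    unexpected = sorted(actual - required)
    ok = not missing and (not exact_match or not unexpected)
    return ok, missing, unexpected


# Hypothetical column names for illustration.
required = ["id", "startTimeUtc", "homeTeam", "awayTeam"]
leaky = required + ["homeScore"]

print(compare_columns(leaky, required, exact_match=False))  # (True, [], ['homeScore'])
print(compare_columns(leaky, required, exact_match=True))   # (False, [], ['homeScore'])
```

With exact_match=False a stray homeScore column would pass unnoticed, which is acceptable for the raw and finished stages but not for a split that must contain no results.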

Engineered Features

Expectation suite for engineered features (features.parquet).

Pure functions only — no IO, no side effects.

features.parquet is indexed by match id (set via set_index in to_match_level). All columns are numeric feature columns; no label/target columns are present.

Naming convention (from stats_matches.py / parse_feature):
{side}_{scope}_{agg}_w{window}: stat rolling columns
{side}_{scope}_coverage_w{window}: coverage rolling columns

where side ∈ {home, away, diff}, scope ∈ {all, season, tournament, ha}. After to_match_level(), home_/away_/diff_ prefixes are applied per-side.
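
As an illustration of the naming convention, the sketch below splits a column name into side, scope, aggregate and window with a regular expression. It assumes the parts are joined by underscores in the form {side}_{scope}_{agg}_w{window}. The regex, the function name parse_feature_name and the sample column names are hypothetical and are not the project's parse_feature.

```python
import re

# Assumed layout: {side}_{scope}_{agg}_w{window}
FEATURE_RE = re.compile(
    r"^(?P<side>home|away|diff)_"
    r"(?P<scope>all|season|tournament|ha)_"
    r"(?P<agg>.+)_w(?P<window>\d+)$"
)


def parse_feature_name(name):
    """Split a feature column name into its parts, or return None."""
    m = FEATURE_RE.match(name)
    if m is None:
        return None
    parts = m.groupdict()
    parts["window"] = int(parts["window"])
    return parts


print(parse_feature_name("home_season_goals_for_w5"))
# {'side': 'home', 'scope': 'season', 'agg': 'goals_for', 'window': 5}
print(parse_feature_name("startTimeUtc"))  # None
```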

build_features_suite(context)

Build the static schema expectation suite for engineered features.

Validates structural properties that do not depend on the specific set of feature columns (which varies with params.yaml settings).

Parameters:

Name     Type                 Description                                          Default
context  AbstractDataContext  An active GX DataContext (ephemeral or file-based).  required

Returns:

Type              Description
ExpectationSuite  An ExpectationSuite with schema-level expectations.

Source code in src/data_quality/features.py
def build_features_suite(context: AbstractDataContext) -> ExpectationSuite:
    """Build the static schema expectation suite for engineered features.

    Validates structural properties that do not depend on the specific
    set of feature columns (which varies with params.yaml settings).

    Args:
        context: An active GX DataContext (ephemeral or file-based).

    Returns:
        An ExpectationSuite with schema-level expectations.
    """
    suite = context.suites.add(ExpectationSuite(name="features_schema_suite"))

    # Row count
    suite.add_expectation(gx.expectations.ExpectTableRowCountToBeBetween(min_value=1))

    return suite

build_features_column_expectations(context, feature_columns)

Build per-column completeness and range expectations for feature columns.

Called at runtime with the actual column list from features_meta.parquet so that it adapts automatically to changes in params.yaml (window sizes, stats columns, etc.) without requiring code changes.
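
The idea of choosing expectations from the column name alone can be sketched as a small dispatch function that maps a name to a (min, max) pair. The three regular expressions below are hypothetical stand-ins: the project's _RATE_PATTERN, _COVERAGE_PATTERN and _GOALS_PATTERN are not shown on this page, and the sample column names are invented for the example.

```python
import re

# Hypothetical patterns, for illustration only.
RATE = re.compile(r".*_(win|draw|loss)_w\d+$")
COVERAGE = re.compile(r".*_coverage_w\d+$")
GOALS = re.compile(r".*_goals\w*_w\d+$")


def bounds_for(col):
    """Return (min, max) for a feature column, or None if no range check applies."""
    is_diff = col.startswith("diff_")
    if RATE.match(col) or COVERAGE.match(col):
        # Fractions lie in [0, 1]; a difference of two fractions lies in [-1, 1].
        return (-1.0 if is_diff else 0.0, 1.0)
    if GOALS.match(col):
        # Goal averages have no upper bound; differences have no bounds at all.
        return None if is_diff else (0.0, None)
    return None


columns = [
    "home_all_win_w5",
    "diff_all_coverage_w3",
    "away_season_goals_w5",
    "diff_season_goals_w5",
]
for col in columns:
    print(col, bounds_for(col))
```

Because the bounds depend only on the name, adding a window size or a stat in params.yaml yields new columns that are covered without touching this logic, provided they follow the naming convention.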

Parameters:

Name             Type                 Description                                               Default
context          AbstractDataContext  An active GX DataContext.                                 required
feature_columns  list[str]            List of feature column names from features_meta["name"].  required

Returns:

Type              Description
ExpectationSuite  An ExpectationSuite with per-column expectations.

Source code in src/data_quality/features.py
def build_features_column_expectations(
    context: AbstractDataContext,
    feature_columns: list[str],
) -> ExpectationSuite:
    """Build per-column completeness and range expectations for feature columns.

    Called at runtime with the actual column list from features_meta.parquet
    so that it adapts automatically to changes in params.yaml (window sizes,
    stats columns, etc.) without requiring code changes.

    Args:
        context: An active GX DataContext.
        feature_columns: List of feature column names from features_meta["name"].

    Returns:
        An ExpectationSuite with per-column expectations.
    """
    suite = context.suites.add(ExpectationSuite(name="features_columns_suite"))

    for col in feature_columns:
        # Completeness
        # Cold-start: early matches for a new season/tournament produce NaN
        # because the shifted rolling window has no prior observations.
        # Observed maximum null fraction: ~8.4% (diff_season columns, w=3).
        # mostly=0.9 allows up to 10% null, catching catastrophic null injection
        # (e.g., broken join or dropped column) while tolerating cold-start.
        # NOTE: ge_sample_rows must be large enough for stable statistics.
        # At 500 rows the null-fraction variance (~6.2σ) causes spurious
        # failures for diff_season columns; smoke.yaml uses 5000 rows instead.
        suite.add_expectation(
            gx.expectations.ExpectColumnValuesToNotBeNull(column=col, mostly=0.9)
        )

        # Value ranges
        if _RATE_PATTERN.match(col):
            # Rolling win/draw/loss rates are weighted means, so home_/away_
            # values lie in [0.0, 1.0]. _RATE_PATTERN is expected to exclude
            # diff_ columns; the conditional lower bound of -1.0 below is a
            # safeguard in case one matches, since a difference of two rates
            # lies in [-1.0, 1.0]. mostly=0.99 tolerates up to 1% out-of-range values.
            suite.add_expectation(
                gx.expectations.ExpectColumnValuesToBeBetween(
                    column=col,
                    min_value=0.0 if not col.startswith("diff_") else -1.0,
                    max_value=1.0,
                    mostly=0.99,
                )
            )

        elif _COVERAGE_PATTERN.match(col):
            # Coverage: fraction of window filled.
            # home_/away_: always in [0, 1].
            # diff_: home_coverage - away_coverage, so range is [-1, 1].
            suite.add_expectation(
                gx.expectations.ExpectColumnValuesToBeBetween(
                    column=col,
                    min_value=0.0 if not col.startswith("diff_") else -1.0,
                    max_value=1.0,
                )
            )

        elif _GOALS_PATTERN.match(col):
            # Rolling goal averages: non-negative for home_/away_; diff_ can be
            # negative (away scored more than home). Unbounded range: no max check.
            if not col.startswith("diff_"):
                suite.add_expectation(
                    gx.expectations.ExpectColumnValuesToBeBetween(
                        column=col,
                        min_value=0.0,
                        max_value=None,
                    )
                )
            # diff_ goals: only null-check applies (added above); skip range.

    return suite
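
The completeness check depends on how mostly is interpreted: the expectation passes when the fraction of non-null values is at least mostly. The sketch below reproduces that arithmetic in plain Python for the 8.4% cold-start figure quoted in the comments. The helper name and the synthetic columns are hypothetical, and treating an empty column as passing is an assumption of this sketch.

```python
import math


def passes_not_null(values, mostly=1.0):
    """True if the fraction of non-null values is at least `mostly`."""
    if not values:
        return True  # assumption: an empty column trivially passes
    non_null = sum(
        1 for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    )
    return non_null / len(values) >= mostly


nan = float("nan")
cold_start = [nan] * 84 + [0.5] * 916    # 8.4% null, the observed maximum
broken_join = [nan] * 400 + [0.5] * 600  # 40% null

assert passes_not_null(cold_start, mostly=0.9)
assert not passes_not_null(broken_join, mostly=0.9)
assert not passes_not_null(cold_start)  # a strict check would reject cold-start rows
```

The gap between 8.4% and the 10% threshold is small, which is consistent with the note above about small samples producing spurious failures.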