Data Quality
Great Expectations suite builders — pure functions, no IO.
Raw Matches
Expectation suite for raw match data (match_raw.parquet).
Pure functions only — no IO, no side effects. All expectations operate on a pandas DataFrame passed at call time.
Raw data is the source as downloaded from MinIO, before any preprocessing. It does NOT contain derived columns such as outcome_1x2 (computed in preprocessing).
build_raw_suite(context)
Build and return the expectation suite for raw match data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| context | AbstractDataContext | An active GX DataContext (ephemeral or file-based). | required |
Returns:
| Type | Description |
|---|---|
| ExpectationSuite | An ExpectationSuite with all raw-data expectations added. |
Source code in src/data_quality/raw.py
Interim (Preprocessed) Matches
Expectation suite for preprocessed finished matches (interim/finished.parquet).
Pure functions only — no IO, no side effects.
finished.parquet is the output of preprocessing: status=6 matches only,
with computed targets (outcome_1x2, sumScore, diffScore) and dropped raw columns.
Index is id (set in preprocess_and_split via df_matches.index = df_matches["id"]).
Key invariants vs. raw:
- outcome_1x2 ∈ {0, 1, 2} (0=home win, 1=draw, 2=away win)
- homeScore / awayScore are non-negative int8 values
- startTimeUtc is tz-aware (UTC) and monotonically non-decreasing (sorted)
- No nulls in any column (all null-bearing raw columns are dropped)
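The outcome_1x2 encoding listed above can be illustrated with a minimal sketch. The function name `encode_outcome_1x2` is hypothetical and not taken from the preprocessing code; only the 0/1/2 mapping and the non-negative score invariant come from the text.

```python
def encode_outcome_1x2(home_score: int, away_score: int) -> int:
    """Map a final score to the documented encoding.

    0 = home win, 1 = draw, 2 = away win.
    Hypothetical helper; the real preprocessing code may differ.
    """
    # Scores are documented as non-negative, so reject anything else.
    if home_score < 0 or away_score < 0:
        raise ValueError("scores must be non-negative")
    if home_score > away_score:
        return 0
    if home_score == away_score:
        return 1
    return 2


# Every encoded value must fall inside the documented set {0, 1, 2}.
scores = [(2, 1), (0, 0), (1, 3)]
outcomes = [encode_outcome_1x2(home, away) for home, away in scores]
all_valid = set(outcomes) <= {0, 1, 2}
```

A set-membership expectation on outcome_1x2 checks the same property on the whole column rather than on one row at a time.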
GX 1.12.3 known limitations:
- ExpectColumnValuesToBeIncreasing is broken for tz-aware datetime columns;
it attempts series_diff[null] = 1 which raises FutureError in pandas 2.x.
- ExpectColumnValuesToBeBetween with tz-aware Timestamp bounds is unreliable.
Both are omitted; the sort contract is enforced by preprocessing (sort_values)
and temporal range is covered upstream by validate_raw.
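Because the sort contract is not covered by a GX expectation here, it can also be checked directly in pandas if needed. The following is a sketch only, not part of the suite: the helper name `is_sorted_utc` is hypothetical, and it assumes startTimeUtc is passed in as a pandas datetime Series.

```python
import pandas as pd


def is_sorted_utc(start_times: pd.Series) -> bool:
    """Return True if the series is tz-aware UTC and non-decreasing.

    Hypothetical helper; assumes a datetime-typed Series.
    """
    tz = start_times.dt.tz
    # Naive timestamps (tz is None) or non-UTC zones violate the invariant.
    if tz is None or str(tz) != "UTC":
        return False
    # is_monotonic_increasing allows ties, i.e. it tests "non-decreasing".
    return bool(start_times.is_monotonic_increasing)


raw_times = ["2024-01-01 12:00", "2024-01-01 12:00", "2024-01-02 15:30"]
sorted_times = pd.Series(pd.to_datetime(raw_times, utc=True))
unsorted_times = sorted_times.iloc[::-1].reset_index(drop=True)
naive_times = pd.Series(pd.to_datetime(raw_times))
```

Such a check would involve no GX expectation at all, so it sidesteps the tz-aware limitations listed above.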
build_interim_suite(context)
Build and return the expectation suite for preprocessed finished matches.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| context | AbstractDataContext | An active GX DataContext (ephemeral or file-based). | required |
Returns:
| Type | Description |
|---|---|
| ExpectationSuite | An ExpectationSuite with all interim-data expectations added. |
Source code in src/data_quality/interim.py
Finished Matches
Expectation suite for preprocessed finished matches (interim/finished.parquet).
Pure functions only — no IO, no side effects.
finished.parquet is the output of preprocessing: status=6 matches only,
with computed targets (outcome_1x2, sumScore, diffScore) and dropped raw columns.
Index is id (set in preprocess_and_split via df_matches.index = df_matches["id"]).
Key invariants vs. raw:
- outcome_1x2 ∈ {0, 1, 2} (0=home win, 1=draw, 2=away win)
- homeScore / awayScore are non-negative int8 values
- startTimeUtc is tz-aware (UTC) and monotonically non-decreasing (sorted)
- No nulls in any column (all null-bearing raw columns are dropped)
GX 1.12.3 known limitations:
- ExpectColumnValuesToBeIncreasing is broken for tz-aware datetime columns;
it attempts series_diff[null] = 1 which raises FutureError in pandas 2.x.
- ExpectColumnValuesToBeBetween with tz-aware Timestamp bounds is unreliable.
Both are omitted; the sort contract is enforced by preprocessing (sort_values)
and temporal range is covered upstream by validate_raw.
build_finished_suite(context)
Build and return the expectation suite for preprocessed finished matches.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| context | AbstractDataContext | An active GX DataContext (ephemeral or file-based). | required |
Returns:
| Type | Description |
|---|---|
| ExpectationSuite | An ExpectationSuite with all finished-data expectations added. |
Source code in src/data_quality/finished.py
Future Matches
Expectation suite for preprocessed future matches (interim/future.parquet).
Pure functions only — no IO, no side effects.
future.parquet is produced by preprocessing: status=1 matches whose kickoff is strictly after the last finished match. Score and target columns are absent by design — this is the anti-leakage guarantee checked here.
Key invariants vs. finished:
- No score columns (homeScore, awayScore and all extra-time/penalty columns are dropped in preprocess_and_split to prevent leakage).
- No derived target columns (outcome_1x2, sumScore, diffScore are absent).
- All match identity columns must be non-null.
- Unique match IDs (no duplicates from the scraper).
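The anti-leakage and uniqueness invariants above amount to two simple checks, sketched here in plain Python. The helper names are hypothetical, and the forbidden list holds only the columns named in this section; the extra-time/penalty column names are not given here, so they are left out.

```python
from collections import Counter
from typing import Iterable, List

# Columns this section says must be absent from future.parquet.
# Extra-time/penalty column names are not listed in the text, so omitted.
FORBIDDEN_COLUMNS = ("homeScore", "awayScore", "outcome_1x2", "sumScore", "diffScore")


def leaked_columns(columns: Iterable[str]) -> List[str]:
    """Return the forbidden columns that are present (empty list = no leakage)."""
    present = set(columns)
    return [name for name in FORBIDDEN_COLUMNS if name in present]


def duplicated_ids(ids: Iterable[int]) -> List[int]:
    """Return match IDs that occur more than once, in ascending order."""
    counts = Counter(ids)
    return sorted(match_id for match_id, n in counts.items() if n > 1)


clean = leaked_columns(["id", "startTimeUtc"])
leaky = leaked_columns(["id", "startTimeUtc", "homeScore", "sumScore"])
dupes = duplicated_ids([101, 102, 102, 103, 101])
```

An empty result from both helpers corresponds to the future dataset satisfying the first, second, and fourth invariants.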
build_future_suite(context)
Build and return the expectation suite for preprocessed future matches.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| context | AbstractDataContext | An active GX DataContext (ephemeral or file-based). | required |
Returns:
| Type | Description |
|---|---|
| ExpectationSuite | An ExpectationSuite with all future-data expectations added. |
Source code in src/data_quality/future.py
Engineered Features
Expectation suite for engineered features (features.parquet).
Pure functions only — no IO, no side effects.
features.parquet is indexed by match id (set via set_index in to_match_level).
All columns are numeric feature columns; no label/target columns are present.
Naming convention (from stats_matches.py / parse_feature):
- {side}_{scope}_{agg}_w{window} for stat rolling columns
- {side}_{scope}_coverage_w{window} for coverage rolling columns
where side ∈ {home, away, diff}, scope ∈ {all, season, tournament, ha}. After to_match_level(), home_/away_/diff_ prefixes are applied per-side.
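To make the naming convention concrete, here is a minimal parsing sketch. It is not the project's parse_feature: the function `parse_feature_name` is hypothetical, it assumes underscore-separated names, and the example names (such as `home_all_mean_w5`) are illustrative rather than taken from features.parquet. Only the side prefix and the window suffix are interpreted; the middle part is kept as an opaque string.

```python
import re
from typing import NamedTuple, Optional

# Side prefix and window suffix are the only parts interpreted here.
_FEATURE_RE = re.compile(r"^(?P<side>home|away|diff)_(?P<body>.+)_w(?P<window>\d+)$")


class FeatureName(NamedTuple):
    side: str          # home, away, or diff
    body: str          # everything between the side prefix and the window suffix
    window: int        # rolling window size
    is_coverage: bool  # True for coverage rolling columns


def parse_feature_name(name: str) -> Optional[FeatureName]:
    """Split a feature column name; return None if it does not fit the pattern."""
    match = _FEATURE_RE.match(name)
    if match is None:
        return None
    body = match.group("body")
    return FeatureName(
        side=match.group("side"),
        body=body,
        window=int(match.group("window")),
        is_coverage=body.endswith("coverage"),
    )


stat_col = parse_feature_name("home_all_mean_w5")
coverage_col = parse_feature_name("away_season_coverage_w10")
not_a_feature = parse_feature_name("id")
```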
build_features_suite(context)
Build the static schema expectation suite for engineered features.
Validates structural properties that do not depend on the specific set of feature columns (which varies with params.yaml settings).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
context
| context | AbstractDataContext | An active GX DataContext (ephemeral or file-based). | required |
Returns:
| Type | Description |
|---|---|
| ExpectationSuite | An ExpectationSuite with schema-level expectations. |
Source code in src/data_quality/features.py
build_features_column_expectations(context, feature_columns)
Build per-column completeness and range expectations for feature columns.
Called at runtime with the actual column list from features_meta.parquet so that it adapts automatically to changes in params.yaml (window sizes, stats columns, etc.) without requiring code changes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| context | AbstractDataContext | An active GX DataContext. | required |
| feature_columns | list[str] | List of feature column names from features_meta["name"]. | required |
Returns:
| Type | Description |
|---|---|
| ExpectationSuite | An ExpectationSuite with per-column expectations. |
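The idea of generating expectations from a runtime column list can be sketched without GX, using plain dictionaries as stand-ins for expectation objects. This is not the implementation in src/data_quality/features.py. The helper name is hypothetical, and the [0, 1] bound on coverage columns is an assumption made for illustration (it treats coverage as a fraction).

```python
from typing import Dict, List


def column_expectation_specs(feature_columns: List[str]) -> List[Dict[str, object]]:
    """Build one or more expectation specs per feature column.

    Hypothetical stand-in: real code would add GX expectation objects
    to an ExpectationSuite instead of returning dictionaries.
    """
    specs: List[Dict[str, object]] = []
    for column in feature_columns:
        # Completeness check for every feature column.
        specs.append({"type": "expect_column_values_to_not_be_null", "column": column})
        # ASSUMPTION: coverage columns are fractions bounded by [0, 1].
        if "_coverage_" in column:
            specs.append(
                {
                    "type": "expect_column_values_to_be_between",
                    "column": column,
                    "min_value": 0.0,
                    "max_value": 1.0,
                }
            )
    return specs


specs = column_expectation_specs(["home_all_mean_w5", "home_all_coverage_w5"])
```

Because the specs are derived from the list passed in, changing window sizes or stat columns in params.yaml changes the output without touching this function.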