Data
Parameters
load_params(params_path='params.yaml')
Load and return the project params.yaml as a dict.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| params_path | str \| Path | Path to the params file. Relative paths are resolved against the project root (two levels above | 'params.yaml' |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Parsed YAML content as a plain |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | When the resolved path does not exist. |
| ValueError | When the file does not contain a YAML mapping. |
Source code in src/data/params.py
Source
load_data_from_source(output_path_matches, output_path_matches_raw)
Download match data files from the configured MinIO source.
Calls export_data_raw for each path; files are only
re-downloaded when the remote ETag or size has changed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| output_path_matches | Path | Local destination for the processed matches parquet file. | required |
| output_path_matches_raw | Path | Local destination for the raw matches parquet file. | required |
Source code in src/data/source.py
Preprocessing
export_matches_metadata(df_match_raw, metadata_path)
Export ID-to-name mapping JSON files from a raw match DataFrame.
Writes one JSON file per entity type (tournamentId, regionId,
stageId, seasonId, homeTeamId, awayTeamId, homeTeamCountryName,
awayTeamCountryName) to metadata_path. Each file maps the
numeric ID (or name) to its corresponding string name or code.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df_match_raw | DataFrame | Raw match DataFrame containing name columns such as | required |
| metadata_path | Path | Directory where JSON files are written. | required |
Source code in src/data/preprocess.py
load_matches_metadata(metadata_path)
Load all ID-to-name mapping JSONs from the metadata directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metadata_path | Path | Directory containing the JSON files produced by | required |
Returns:
| Type | Description |
|---|---|
| dict | Dict keyed by entity type (e.g. |
| dict | values are the corresponding id-to-name mapping dicts. |
Source code in src/data/preprocess.py
preprocess_and_split(df_matches, score_outlier_pct=0.9999, reference_date=None)
Preprocess raw match data and split into finished and future sets.
Drops irrelevant columns, downcasts dtypes, derives classification and regression targets, clips extreme score outliers, and splits by match status (6=finished, 1=upcoming).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df_matches | DataFrame | Raw match DataFrame from | required |
| score_outlier_pct | float | Upper quantile threshold for clipping homeScore and awayScore (e.g. 0.9999 removes the top 0.01% of scores, which are typically forfeits). | 0.9999 |
| reference_date | datetime \| None | Override for "now" used only in logging. Defaults to UTC now minus 3 hours. | None |
Returns:
Tuple of (df_finished, df_future):
- df_finished: status=6 matches with targets and downcast dtypes.
- df_future: status=1 matches without score/target columns.
Source code in src/data/preprocess.py
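The clipping and status split described above can be sketched in plain Python. This is a minimal illustration, not the code in src/data/preprocess.py. It assumes rows are dicts with a `status` key (a hypothetical column name), that the quantile is computed over finished matches only, and that values above the threshold are capped rather than dropped. `upper_quantile` and `clip_and_split` are hypothetical helpers.

```python
STATUS_FINISHED = 6  # status codes as given in the description above
STATUS_UPCOMING = 1
SCORE_COLS = ("homeScore", "awayScore")


def upper_quantile(values, q):
    """Quantile with linear interpolation between neighbouring order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def clip_and_split(rows, score_outlier_pct=0.9999):
    """Cap extreme scores on finished matches and separate upcoming ones."""
    finished = [dict(r) for r in rows if r["status"] == STATUS_FINISHED]
    # Upcoming matches carry no score columns (assumption based on the Returns text).
    future = [
        {k: v for k, v in r.items() if k not in SCORE_COLS}
        for r in rows
        if r["status"] == STATUS_UPCOMING
    ]
    if finished:
        for col in SCORE_COLS:
            cap = upper_quantile([r[col] for r in finished], score_outlier_pct)
            for r in finished:
                r[col] = min(r[col], cap)
    return finished, future
```

With a lower threshold such as 0.75 the effect is easy to see on a handful of rows: a single 50-goal outlier is capped at the value of the 75th percentile.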
Splitting
split_time_based_on(df, date_test_start)
Split a match DataFrame into train/val and test sets by date.
All rows with startTimeUtc < date_test_start go to the
train/val set; the remainder form the test set.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Match DataFrame with | required |
| date_test_start | Timestamp | Cutoff timestamp (exclusive for train, inclusive for test). | required |
Returns:
| Type | Description |
|---|---|
| DataFrame, DataFrame | Tuple of (df_train_ids, df_test_ids); each contains only the |
Source code in src/data/splitting.py
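The cutoff semantics (strictly before the cutoff goes to train/val, the cutoff itself and later goes to test) can be illustrated with a small sketch. It uses plain dicts instead of a DataFrame to stay dependency-free; `split_by_date` is a hypothetical name, and only the `startTimeUtc` field comes from the description above.

```python
from datetime import datetime


def split_by_date(rows, date_test_start):
    """Return (train_val, test) with the cutoff exclusive for train, inclusive for test."""
    train_val = [r for r in rows if r["startTimeUtc"] < date_test_start]
    test = [r for r in rows if r["startTimeUtc"] >= date_test_start]
    return train_val, test


rows = [
    {"id": 1, "startTimeUtc": datetime(2024, 5, 31, 18, 0)},
    {"id": 2, "startTimeUtc": datetime(2024, 6, 1, 0, 0)},  # exactly on the cutoff
    {"id": 3, "startTimeUtc": datetime(2024, 6, 2, 15, 0)},
]
train_val, test = split_by_date(rows, datetime(2024, 6, 1))
```

A match that starts exactly at the cutoff lands in the test set, so no row is ever in both sets.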
make_year_folds(df_train_val, valid_years)
Create walk-forward CV fold definitions, one fold per calendar year.
For each year in valid_years, all data before that year is used
as the training fold and the named year is used as the validation
fold. Folds where either split is empty are silently skipped.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df_train_val | DataFrame | Training/validation DataFrame with a | required |
| valid_years | list[int] | List of calendar years to use as validation windows (e.g. | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with columns fold, year, train_start, train_end, valid_start, valid_end, train_rows, valid_rows. |
Source code in src/data/splitting.py
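The walk-forward scheme can be sketched as follows. This is an illustration under assumptions, not the implementation in src/data/splitting.py: it takes a plain list of match start times rather than a DataFrame, numbers folds from 0, and returns a list of dicts whose keys mirror the documented columns. `year_folds` is a hypothetical name.

```python
from datetime import datetime


def year_folds(start_times, valid_years):
    """Build one fold per validation year, training on everything before that year."""
    folds = []
    for year in valid_years:
        train = [t for t in start_times if t.year < year]
        valid = [t for t in start_times if t.year == year]
        if not train or not valid:
            continue  # folds with an empty split are skipped silently
        folds.append(
            {
                "fold": len(folds),
                "year": year,
                "train_start": min(train),
                "train_end": max(train),
                "valid_start": min(valid),
                "valid_end": max(valid),
                "train_rows": len(train),
                "valid_rows": len(valid),
            }
        )
    return folds


times = [
    datetime(2021, 3, 1),
    datetime(2022, 4, 1),
    datetime(2022, 9, 1),
    datetime(2023, 2, 1),
]
folds = year_folds(times, [2021, 2022, 2023, 2024])
```

Here 2021 is skipped because nothing precedes it, and 2024 is skipped because it has no matches, which leaves two folds.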
Storage (DVC / MinIO)
create_client_s3()
Create a boto3 S3 client configured from environment variables.
Reads MINIO_ENDPOINT_URL, MINIO_USER, and
MINIO_PASSWORD from the environment.
Returns:
| Type | Description |
|---|---|
| client | Configured boto3 S3 client pointed at the MinIO endpoint. |
Source code in src/data/storage.py
get_file_from_minio(bucket_name, object_name, file_path=None)
Download a file from MinIO to local disk.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| bucket_name | | Name of the MinIO bucket. | required |
| object_name | | S3 key (path) of the object to download. | required |
| file_path | | Local destination path. Defaults to | None |
Returns:
| Type | Description |
|---|---|
| | Local path where the file was saved. |
Raises:
| Type | Description |
|---|---|
| Exception | Re-raises any boto3 download error after logging it. |
Source code in src/data/storage.py
get_metadata_from_minio(bucket_name, object_name)
Retrieve object metadata (HEAD) from MinIO without downloading.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| bucket_name | | Name of the MinIO bucket. | required |
| object_name | | S3 key (path) of the object. | required |
Returns:
| Type | Description |
|---|---|
| | HEAD response dict from boto3 (ETag, ContentLength, LastModified, etc.). |
Raises:
| Type | Description |
|---|---|
| Exception | Re-raises any boto3 error after logging it. |
Source code in src/data/storage.py
export_data_raw(local_path, bucket=None)
Download a raw data file from MinIO only when it has changed.
Compares ETag and size from a local .minio.json sidecar file
against the remote object. Downloads and updates the sidecar only
when the remote has changed or the local file is absent.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| local_path | Path | Local destination path for the downloaded file. | required |
| bucket | str \| None | MinIO bucket name. Defaults to the | None |
Returns:
| Type | Description |
|---|---|
| dict | HEAD response dict for the remote object (always the live MinIO response, regardless of whether a download occurred). |
Source code in src/data/storage.py
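The change detection that export_data_raw performs can be approximated with a sidecar comparison like the one below. This is a sketch under assumptions: the sidecar is taken to sit next to the data file as `<name>.minio.json`, and it is taken to store the `ETag` and `ContentLength` fields of the HEAD response. The helper names are hypothetical, and the actual download step is left out.

```python
import json
from pathlib import Path


def sidecar_path(local_path: Path) -> Path:
    # Assumed naming: "matches.parquet" -> "matches.parquet.minio.json".
    return local_path.with_name(local_path.name + ".minio.json")


def needs_download(local_path: Path, remote_head: dict) -> bool:
    """True when the local file or sidecar is missing, or ETag/size differ."""
    sidecar = sidecar_path(local_path)
    if not local_path.exists() or not sidecar.exists():
        return True
    cached = json.loads(sidecar.read_text(encoding="utf-8"))
    return (
        cached.get("ETag") != remote_head.get("ETag")
        or cached.get("ContentLength") != remote_head.get("ContentLength")
    )


def write_sidecar(local_path: Path, remote_head: dict) -> None:
    """Record the remote ETag and size after a successful download."""
    payload = {
        "ETag": remote_head.get("ETag"),
        "ContentLength": remote_head.get("ContentLength"),
    }
    sidecar_path(local_path).write_text(json.dumps(payload), encoding="utf-8")
```

A caller would fetch the HEAD response first, download only when `needs_download` returns True, and then call `write_sidecar` so the next run can skip an unchanged object.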
get_dvc_hash(out_path, lock_path=None)
Look up the DVC MD5 hash for a pipeline output in dvc.lock.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| out_path | str | DVC output path string as written in | required |
| lock_path | Path \| None | Path to the | None |
Returns:
| Type | Description |
|---|---|
| str | MD5 hash string for the matched output. |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
Source code in src/data/storage.py
ORM Models
Match
Bases: SQLModel
Compact match record used by the internal ORM (match table).
Source code in src/data/models.py
MatchRaw
Bases: SQLModel
Full raw match record mirroring the WhoScored livescores payload.
Source code in src/data/models.py
Scraper
create_webdriver()
Create a Selenoid Remote WebDriver with a randomised user-agent.
Returns:
| Type | Description |
|---|---|
| Remote | Configured |
| Remote | Selenoid server at |
Source code in src/data/scraper/driver.py
Livescores Validation
get_list_livescore_matches(livescores_raw)
Parse a raw WhoScored livescores payload into ORM objects.
Strips HTML tags from livescores_raw, validates the JSON against
the Pydantic Whoscored.livescores schema, and returns two
parallel lists of validated match objects.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| livescores_raw | str | Raw HTML/JSON string from the livescores scraper. | required |
Returns:
| Type | Description |
|---|---|
| list[Match], list[MatchRaw] | Tuple of (matches, matches_raw) where each element is a list of SQLModel instances ready to be upserted into the database. |
Source code in src/data/validation/livescores.py
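The first step, stripping HTML tags before parsing the JSON, might look like the sketch below. It is only an approximation of what the function does: the regular expression, the helper name `strip_tags_and_parse`, and the sample payload shape are assumptions, and the Pydantic validation and ORM conversion are not shown.

```python
import json
import re

# Naive tag matcher; it would also remove "<...>" sequences inside JSON strings.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags_and_parse(livescores_raw: str):
    """Remove HTML tags around a JSON document and parse what is left."""
    text = TAG_RE.sub("", livescores_raw).strip()
    return json.loads(text)


raw = '<html><body><pre>{"matches": [{"id": 1}]}</pre></body></html>'
payload = strip_tags_and_parse(raw)
```

`json.loads` raises `json.JSONDecodeError` when the remaining text is not valid JSON, which gives a natural failure point before schema validation.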
Odds (Football-Data.co.uk)
Download and normalize closing odds from football-data.co.uk.
Two data sources are supported:
- Standard leagues (mmz4281):
  URL: https://www.football-data.co.uk/mmz4281/{season}/{league_code}.csv
  Odds: B365H/D/A (Bet365 opening/closing)
- Extra leagues (/new/):
  URL: https://www.football-data.co.uk/new/{country_code}.csv
  Odds: PSCH/PSCD/PSCA (Pinnacle closing), ~95% row coverage; B365CH/CD/CA (Bet365 closing), ~5-15% row coverage.
  Uses Pinnacle closing as the p_home/p_draw/p_away reference. b365h/d/a are set from B365CH/CD/CA where available, else NaN.
Output schema (one row per match):
| Column | Type | Description |
|---|---|---|
| season | str | e.g. "2425" or "2024" (extra leagues) |
| league_code | str | e.g. "E0" or "NOR" |
| league_name | str | human-readable name from params.yaml |
| date | datetime64 | match date (UTC midnight) |
| home_team | str | home team name as per FDCO |
| away_team | str | away team name as per FDCO |
| ftr | str | full-time result (H/D/A), if available |
| b365h | float | Bet365 decimal odds home win (NaN for extra w/o B365) |
| b365d | float | Bet365 decimal odds draw (NaN for extra w/o B365) |
| b365a | float | Bet365 decimal odds away win (NaN for extra w/o B365) |
| vig | float | bookmaker margin (1/h + 1/d + 1/a) |
| p_home | float | vig-stripped implied probability home win |
| p_draw | float | vig-stripped implied probability draw |
| p_away | float | vig-stripped implied probability away win |
fetch_league_csv(season, league_code, timeout=30)
Download a single season/league CSV from football-data.co.uk.
Returns the raw DataFrame with all available columns. Raises requests.HTTPError on non-200 responses.
Source code in src/data/odds_fdco.py
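For reference, the two URL templates from the module description can be filled in as shown below. The helper names are hypothetical and the sketch only builds strings; it does not perform the HTTP request.

```python
BASE_URL = "https://www.football-data.co.uk"


def standard_league_url(season: str, league_code: str) -> str:
    """URL for one season of a standard league, e.g. season "2425", league "E0"."""
    return f"{BASE_URL}/mmz4281/{season}/{league_code}.csv"


def extra_league_url(country_code: str) -> str:
    """URL for an extra league; a single file covers all available seasons."""
    return f"{BASE_URL}/new/{country_code}.csv"
```

Season codes are passed as strings so that a code with a leading zero, such as "0910", would survive formatting unchanged.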
normalize_fdco(df, season, league_code)
Extract and normalize key columns from a raw FDCO DataFrame.
Drops rows with missing Bet365 odds or team names. Computes vig-stripped implied probabilities.
Returns DataFrame with schema defined in module docstring. Returns empty DataFrame if any required odds column is missing.
Source code in src/data/odds_fdco.py
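A worked example of the vig and implied-probability columns may help. The sketch assumes proportional normalisation, where each inverse odd is divided by their sum. The schema gives `vig` as 1/h + 1/d + 1/a, but the exact stripping method used in src/data/odds_fdco.py is not stated, so treat this as one plausible reading. `strip_vig` is a hypothetical helper.

```python
def strip_vig(home: float, draw: float, away: float):
    """Return (vig, (p_home, p_draw, p_away)) from decimal odds."""
    inverse = (1.0 / home, 1.0 / draw, 1.0 / away)
    vig = sum(inverse)  # equals 1.0 for a book with no margin
    probabilities = tuple(x / vig for x in inverse)
    return vig, probabilities


vig, (p_home, p_draw, p_away) = strip_vig(1.9, 3.8, 3.8)
```

With odds of 1.9 / 3.8 / 3.8 the inverse odds sum to roughly 1.053, and after normalisation the three probabilities sum to 1 with the home win at about 0.5.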
fetch_extra_league_csv(country_code, timeout=30)
Download the extra-league CSV for a country from football-data.co.uk/new/.
The /new/ endpoint returns a single file covering all available seasons. Raises requests.HTTPError on non-200 responses.
Source code in src/data/odds_fdco.py
normalize_fdco_extra(df, country_code, league_name, season_years)
Normalize a raw extra-league DataFrame from the /new/ endpoint.
Extra-league CSVs differ from standard ones:
- Team columns are Home/Away (not HomeTeam/AwayTeam).
- Odds columns are Pinnacle closing (PSCH/PSCD/PSCA) and optionally Bet365 closing (B365CH/B365CD/B365CA).
- A numeric Season column (year) is used for date filtering.
Reference probabilities are vig-stripped from Pinnacle closing odds. b365h/d/a are filled from B365CH/CD/CA where available, else NaN.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | raw DataFrame from | required |
| country_code | str | country code used as | required |
| league_name | str | human-readable name (e.g. "Norway Eliteserien"). | required |
| season_years | list[int] | list of calendar years to keep (e.g. [2023, 2024, 2025, 2026]). | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | Normalized DataFrame with the same schema as |
| DataFrame | Empty DataFrame if Pinnacle odds columns are absent. |
Source code in src/data/odds_fdco.py
load_odds_fdco(output_path, seasons, leagues, extra_leagues=None)
Download, normalize, and save closing odds to parquet.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| output_path | Path | destination .parquet file path. | required |
| seasons | list[str] | list of season codes for standard leagues, e.g. ["2425", "2324"]. | required |
| leagues | list[dict[str, str]] | list of dicts with | required |
| extra_leagues | list[dict[str, str]] \| None | optional list of dicts with "code" and "name" keys for extra-league countries (fetched from /new/{code}.csv). These use Pinnacle closing odds as reference probabilities. | None |
Raises:
| Type | Description |
|---|---|
| RuntimeError | if no data could be fetched from any source. |
Source code in src/data/odds_fdco.py
Odds (Fonbet)
Production module for collecting fon.bet odds snapshots via Selenium.
fon.bet is a single-page application (SPA) — there is no public REST API available via a simple HTTP GET. All data is loaded via XHR/fetch requests made by the browser JavaScript bundle. This module injects a Chrome DevTools Protocol (CDP) script into the page before navigation, intercepts every JSON response the SPA fetches, and stores the complete snapshot.
Provides
save_daily_snapshot(output_dir) -> str
Launch a headless Chrome via Selenoid, navigate to the fon.bet football page, capture all JSON API responses, and persist a single compressed archive per run:
Writes to MinIO (primary):
s3://{MINIO_BUCKET_DATA_RAW}/odds_fonbet/date=YYYY-MM-DD/HHMMSS.json.gz
Falls back to local filesystem:
{output_dir}/date=YYYY-MM-DD/HHMMSS.json.gz
Snapshot format
A gzip-compressed JSON array of response objects::
    [
      {"url": "https://…/events/listBase?…", "body": "{…raw JSON string…}"},
      {"url": "https://…/geoCategories?…", "body": "{…}"},
      …
    ]
The body field contains the raw JSON string exactly as returned by the
server — not re-parsed. Each captured entry is a JSON API response the SPA
loaded during page initialisation. Typically 30–40 responses per snapshot,
with the main events/listBase response (~9 MB) being the largest.
Key endpoints captured (by URL fragment):
- events/listBase: all events + sports hierarchy + geo versions
- sportEvent/geoCategories: country/region lookup (ISO + internal IDs)
- sportEvent/sportCategories: league/tournament categories
- factorsCatalog/…: available bet types (factors)
- line/logos: team/league logo metadata
Usage in downstream code::
    import gzip, json
    from src.data.odds_fonbet import save_daily_snapshot, load_snapshot

    # collect a fresh snapshot
    path = save_daily_snapshot()

    # load an existing snapshot
    captured = load_snapshot(path)
    events_raw = json.loads(
        next(c["body"] for c in captured if "events/listBase" in c["url"])
    )
save_daily_snapshot(output_dir='data/raw/odds_fonbet')
Capture the full fon.bet page load and save all JSON responses.
Uses Selenium + CDP to intercept every JSON API call the SPA makes on page load, then writes the complete list as a gzip-compressed JSON file.
Partition layout (Hive-compatible)::
odds_fonbet/date=YYYY-MM-DD/HHMMSS.json.gz
Writes to MinIO when MINIO_* environment variables are configured (primary)::
s3://{MINIO_BUCKET_DATA_RAW}/odds_fonbet/date=YYYY-MM-DD/HHMMSS.json.gz
Falls back to local filesystem::
{output_dir}/date=YYYY-MM-DD/HHMMSS.json.gz
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| output_dir | Path \| str | Local fallback root directory. Ignored when MinIO is available. | 'data/raw/odds_fonbet' |
Returns:
| Type | Description |
|---|---|
| str | URI or path of the written file as a string. |
Raises:
| Type | Description |
|---|---|
| RuntimeError | if Selenium capture returns no data. |
Source code in src/data/odds_fonbet.py
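The Hive-style partition layout can be reproduced with a small formatting helper. This is a sketch: `snapshot_key` is a hypothetical name, and whether the module stamps snapshots in UTC or local time is not stated, so the caller supplies the timestamp.

```python
from datetime import datetime


def snapshot_key(ts: datetime, prefix: str = "odds_fonbet") -> str:
    """Build "<prefix>/date=YYYY-MM-DD/HHMMSS.json.gz" for a capture time."""
    return f"{prefix}/date={ts:%Y-%m-%d}/{ts:%H%M%S}.json.gz"


key = snapshot_key(datetime(2025, 3, 9, 7, 5, 3))
```

Because the date lives in a `date=` directory, query engines that understand Hive partitioning can prune by day without opening the archives.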
load_snapshot(path)
Load a previously saved fon.bet snapshot from a local or S3 path.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | Local path or | required |
Returns:
| Type | Description |
|---|---|
| | List of {"url": str, "body": str} dicts. |
Source code in src/data/odds_fonbet.py
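For local files, the snapshot format (a gzip-compressed JSON array of url/body objects) can be written and read with the standard library alone. The sketch below covers only the local case and uses hypothetical helper names; the S3 path handling of load_snapshot is not reproduced.

```python
import gzip
import json
from pathlib import Path


def write_snapshot_local(path: Path, captured: list) -> None:
    """Write a list of {"url", "body"} dicts as gzip-compressed JSON."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        json.dump(captured, fh)


def read_snapshot_local(path: Path) -> list:
    """Read the list back; each "body" is still a raw JSON string."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)
```

Since `body` holds the server response as an unparsed string, callers decode it a second time with `json.loads` when they need the structured payload.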
Odds (Join)
Join football-data.co.uk odds to the holdout set by team name and date.
The WhoScored dataset uses numeric teamIds; FDCO uses string team names. Join strategy:
1. Build teamId -> name index from data/metadata/homeTeamId.json + awayTeamId.json
2. Exact match on (date, home_team_lower, away_team_lower)
3. Fuzzy fallback via fuzzywuzzy (threshold=85) for unmatched rows
Returns a tuple of three arrays aligned to df_holdout row order:
- reference_proba: np.ndarray (n, 3), vig-stripped implied probabilities
- actual_odds: np.ndarray (n, 3), raw Bet365 decimal odds (b365h/d/a)
- league_codes: np.ndarray (n,), FDCO league_code str, or None if unmatched
Unmatched rows contain NaN in reference_proba and actual_odds.
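The exact-then-fuzzy lookup can be sketched as below. The module uses fuzzywuzzy; this illustration substitutes `difflib.SequenceMatcher` from the standard library as a stand-in scorer, so scores may differ from the real ones. Requiring both team names to clear the threshold on the same date is also an assumption, and `match_row` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale (stand-in for a fuzzywuzzy scorer)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def match_row(date, home, away, odds_rows, threshold=85.0):
    """Return the odds row for a fixture, or None when nothing matches."""
    home, away = home.lower(), away.lower()
    index = {
        (r["date"], r["home_team"].lower(), r["away_team"].lower()): r
        for r in odds_rows
    }
    exact = index.get((date, home, away))
    if exact is not None:
        return exact
    # Fuzzy fallback: same date, weakest of the two name scores must pass.
    best, best_score = None, threshold
    for r in odds_rows:
        if r["date"] != date:
            continue
        score = min(
            similarity(home, r["home_team"].lower()),
            similarity(away, r["away_team"].lower()),
        )
        if score >= best_score:
            best, best_score = r, score
    return best


odds_rows = [
    {"date": "2024-08-16", "home_team": "Manchester United", "away_team": "Fulham"},
]
```

Restricting the fuzzy pass to rows on the same date keeps the candidate set small and avoids pairing a team with a different fixture that happens to have a similar name.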
join_odds_to_holdout(df_holdout, df_odds, metadata_dir)
Return odds arrays aligned to df_holdout row order.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df_holdout | DataFrame | holdout DataFrame; must contain homeTeamId, awayTeamId, and startTimeUtc columns. | required |
| df_odds | DataFrame | normalized FDCO DataFrame; must contain date, home_team, away_team, p_home, p_draw, p_away, b365h, b365d, b365a, league_code columns. | required |
| metadata_dir | Path | path to the data/metadata/ directory. | required |
Returns:
Tuple of three arrays, all shape (len(df_holdout), ...):
- reference_proba: (n, 3) float, vig-stripped probs; NaN if unmatched
- actual_odds: (n, 3) float, B365 decimal odds; NaN if unmatched
- league_codes: (n,) object, FDCO league_code str; None if unmatched
Source code in src/data/odds_join.py