Data Ingestion Pipeline¶
This page covers the full ingestion path: from untrusted external sources to DVC-versioned, immutable parquet snapshots ready for the ML pipeline.
Sources¶
Primary source: WhoScored.com (match data)¶
All football match statistics, team records, event data, and outcomes originate from WhoScored.com — the system's only upstream match data source.
WhoScored is an untrusted external dependency. The system treats it as such:
- its HTML structure is uncontrolled and subject to change without notice,
- data may be partial, delayed, or temporarily unavailable,
- there is no API contract — scraping is fragile by nature.
Trust model
| Stage | Trust level | Consequence |
|---|---|---|
| WhoScored.com HTML | Untrusted | Validated by Great Expectations after raw export |
| Scraped records in PostgreSQL | Canonical but unvalidated | Validation deferred to the validate_raw gate |
| Raw parquet snapshot | Validated against contract | Only data that passes validate_raw proceeds |
Current limitations
| Limitation | Status |
|---|---|
| Retry with backoff on Celery task failure | ✅ Implemented |
| Airflow DAG failure visibility | ✅ Airflow UI |
| Freshness monitoring / alerts | 📋 Planned |
| Cached HTML fallback | ❌ Not implemented |
| Scrape volume anomaly detection | 📋 Planned |
Secondary source: Bookmaker odds (Pari)¶
Status: ✅ Implemented — src/data/odds_pari.py; daily Airflow DAG airflow/dags/etl_odds_01.py (schedule @daily, 00:05 UTC); output: data/raw/odds_pari/date=YYYY-MM-DD/snapshot.parquet.
Purpose: External evaluation only — bookmaker implied probabilities serve as the Tier 4 benchmark. Odds are not used as model input features.
Endpoint: GET https://line-lb61-w.bk6bba-resources.com/ma/events/list · Unofficial internal API, no public contract.
1X2 factor IDs: 921 (home win), 922 (draw), 923 (away win).
Historical odds fallback (Path B): src/data/odds_fdco.py + DVC stage load_odds_fdco — uses
football-data.co.uk closing odds (Bet365). Leagues and
seasons configured in params.yaml under odds_fdco. Team name matching via fuzzy join in
src/data/odds_join.py (threshold=85, fuzzywuzzy).
Tertiary source: Bookmaker odds (Fonbet — live serving)¶
Status: ✅ Implemented — three-stage Airflow pipeline chain producing live 1X2 odds for upcoming matches shown in the Streamlit UI.
DAG chain:
- soccer_etl_odds_fonbet_01_raw — raw scrape (every 4 h)
- soccer_etl_odds_fonbet_02_link — fuzzy-match Fonbet matches to WhoScored IDs
- soccer_etl_odds_fonbet_03_odds — extract 1X2 factors
Not used as ML input features. Not used for historical evaluation (no historical archive).
Scraping mechanism (WhoScored)¶
Airflow DAG (schedule)
→ POST /scrape → FastAPI
→ enqueue task → RabbitMQ "api" queue
→ Celery worker-api → Selenoid (headless browser)
→ WhoScored.com → PostgreSQL (canonical store)
Browser automation is required because WhoScored requires JavaScript rendering. Selenoid is an operator-managed external service, outside the K8s cluster.
Idempotency: records are upserted in PostgreSQL using natural/composite keys; safe to replay.
Status: ✅ Implemented
Ingestion boundary (Airflow → PostgreSQL)¶
Airflow's responsibility ends at PostgreSQL. It does not manage DVC, trigger ML pipelines, or write parquet files.
PostgreSQL as canonical store
PostgreSQL is the authoritative structured representation of all scraped match data. It is deduplicated, normalized, and schema-stable. It is not the ML training source — training data comes from versioned parquet snapshots exported from PostgreSQL, because:
- PostgreSQL is live and mutable (corrections, late arrivals, schema migrations),
- ML reproducibility requires immutable, content-addressed inputs.
ETL guarantees
| Guarantee | Mechanism |
|---|---|
| Idempotency | Upsert logic with dedup keys; safe to replay |
| Schema stability for downstream | DB schema changes versioned; breaking changes blocked until raw export updated |
| Separation from analytics workloads | ML pipeline reads from parquet, not from live PostgreSQL |
| No data loss on partial scrape | Failed scrape runs do not corrupt previously ingested records |
What ETL does not guarantee
- Real-time freshness — ingestion runs on a schedule; late match data is a known gap.
- GE validation at ingest time — contracts are enforced at the
validate_rawDVC gate, not at DB write. - Automated alerting on ETL failure — Airflow UI surfaces failures; Alertmanager rules are 📋 Planned.
Raw export — reproducibility boundary (PostgreSQL → MinIO → DVC)¶
DVC stage: load_data_from_sources · Status: ✅ Implemented
The raw export is the explicit handoff between live canonical storage and immutable ML inputs. Everything upstream is ingestion. Everything downstream is reproducible ML.
DVC stage: load_data_from_sources
Input: PostgreSQL canonical tables
Output: data/raw/*.parquet (DVC-tracked, stored in MinIO)
Trigger: dvc repro (manual or CI)
- DVC stage queries PostgreSQL → writes
data/raw/*.parquet. - DVC content-addresses outputs → pushes to MinIO.
- Git tracks
.dvcpointer files; MinIO stores actual data. - Any environment with
dvc pullcan retrieve the exact snapshot for a given Git commit.
The export runs on dvc repro, not automatically after an Airflow ingestion run.
This is intentional: export is an operator action, not a side effect of ingestion.
Anti-patterns this boundary prevents
| Anti-pattern | Why avoided |
|---|---|
| Training directly from PostgreSQL | Non-reproducible; couples ML to live DB |
| Mutable raw datasets | Breaks traceability; past experiments cannot be reproduced |
| Ad-hoc local data dumps | Not tracked by DVC; not accessible in CI or shared environments |
Connection to downstream pipeline
The validate_raw gate enforces data contracts before any downstream stage proceeds.
A failed validation stops the pipeline. See Contracts & Versioning.
Related¶
- Contracts & Versioning — GE gates and DVC reproducibility
- Architecture: Data & ML Flow — stage-by-stage breakdown
- Architecture: Failure Modes — scraper failure scenarios
- Backfills & Freshness — replaying ingestion for historical corrections