Data Ingestion Pipeline¶

This page covers the full ingestion path: from untrusted external sources to DVC-versioned, immutable parquet snapshots ready for the ML pipeline.

Sources¶

Primary source: WhoScored.com (match data)¶

All football match statistics, team records, event data, and outcomes originate from WhoScored.com — the system's only upstream match data source.

WhoScored is an untrusted external dependency. The system treats it as such:

its HTML structure is uncontrolled and subject to change without notice,
data may be partial, delayed, or temporarily unavailable,
there is no API contract — scraping is fragile by nature.

Trust model

Stage	Trust level	Consequence
WhoScored.com HTML	Untrusted	Validated by Great Expectations after raw export
Scraped records in PostgreSQL	Canonical but unvalidated	Validation deferred to the `validate_raw` gate
Raw parquet snapshot	Validated against contract	Only data that passes `validate_raw` proceeds

Current limitations

Limitation	Status
Retry with backoff on Celery task failure	✅ Implemented
Airflow DAG failure visibility	✅ Airflow UI
Freshness monitoring / alerts	📋 Planned
Cached HTML fallback	❌ Not implemented
Scrape volume anomaly detection	📋 Planned

Secondary source: Bookmaker odds (Pari)¶

Status: ✅ Implemented — src/data/odds_pari.py; daily Airflow DAG airflow/dags/etl_odds_01.py (schedule @daily, 00:05 UTC); output: data/raw/odds_pari/date=YYYY-MM-DD/snapshot.parquet.

Purpose: External evaluation only — bookmaker implied probabilities serve as the Tier 4 benchmark. Odds are not used as model input features.

Endpoint: GET https://line-lb61-w.bk6bba-resources.com/ma/events/list · Unofficial internal API, no public contract.

1X2 factor IDs: 921 (home win), 922 (draw), 923 (away win).

Historical odds fallback (Path B): src/data/odds_fdco.py + DVC stage load_odds_fdco — uses football-data.co.uk closing odds (Bet365). Leagues and seasons configured in params.yaml under odds_fdco. Team name matching via fuzzy join in src/data/odds_join.py (threshold=85, fuzzywuzzy).

Tertiary source: Bookmaker odds (Fonbet — live serving)¶

Status: ✅ Implemented — three-stage Airflow pipeline chain producing live 1X2 odds for upcoming matches shown in the Streamlit UI.

DAG chain: - soccer_etl_odds_fonbet_01_raw — raw scrape (every 4 h) - soccer_etl_odds_fonbet_02_link — fuzzy-match Fonbet matches to WhoScored IDs - soccer_etl_odds_fonbet_03_odds — extract 1X2 factors

Not used as ML input features. Not used for historical evaluation (no historical archive).

Scraping mechanism (WhoScored)¶

Airflow DAG (schedule)
  → POST /scrape  →  FastAPI
  → enqueue task  →  RabbitMQ "api" queue
  → Celery worker-api  →  Selenoid (headless browser)
  → WhoScored.com  →  PostgreSQL (canonical store)

Browser automation is required because WhoScored requires JavaScript rendering. Selenoid is an operator-managed external service, outside the K8s cluster.

Idempotency: records are upserted in PostgreSQL using natural/composite keys; safe to replay.

Status: ✅ Implemented

Ingestion boundary (Airflow → PostgreSQL)¶

Airflow's responsibility ends at PostgreSQL. It does not manage DVC, trigger ML pipelines, or write parquet files.

PostgreSQL as canonical store

PostgreSQL is the authoritative structured representation of all scraped match data. It is deduplicated, normalized, and schema-stable. It is not the ML training source — training data comes from versioned parquet snapshots exported from PostgreSQL, because:

PostgreSQL is live and mutable (corrections, late arrivals, schema migrations),
ML reproducibility requires immutable, content-addressed inputs.

ETL guarantees

Guarantee	Mechanism
Idempotency	Upsert logic with dedup keys; safe to replay
Schema stability for downstream	DB schema changes versioned; breaking changes blocked until raw export updated
Separation from analytics workloads	ML pipeline reads from parquet, not from live PostgreSQL
No data loss on partial scrape	Failed scrape runs do not corrupt previously ingested records

What ETL does not guarantee

Real-time freshness — ingestion runs on a schedule; late match data is a known gap.
GE validation at ingest time — contracts are enforced at the validate_raw DVC gate, not at DB write.
Automated alerting on ETL failure — Airflow UI surfaces failures; Alertmanager rules are 📋 Planned.

Raw export — reproducibility boundary (PostgreSQL → MinIO → DVC)¶

DVC stage: load_data_from_sources · Status: ✅ Implemented

The raw export is the explicit handoff between live canonical storage and immutable ML inputs. Everything upstream is ingestion. Everything downstream is reproducible ML.

DVC stage: load_data_from_sources
  Input:  PostgreSQL canonical tables
  Output: data/raw/*.parquet  (DVC-tracked, stored in MinIO)
  Trigger: dvc repro (manual or CI)

DVC stage queries PostgreSQL → writes data/raw/*.parquet.
DVC content-addresses outputs → pushes to MinIO.
Git tracks .dvc pointer files; MinIO stores actual data.
Any environment with dvc pull can retrieve the exact snapshot for a given Git commit.

The export runs on dvc repro, not automatically after an Airflow ingestion run. This is intentional: export is an operator action, not a side effect of ingestion.

Anti-patterns this boundary prevents

Anti-pattern	Why avoided
Training directly from PostgreSQL	Non-reproducible; couples ML to live DB
Mutable raw datasets	Breaks traceability; past experiments cannot be reproduced
Ad-hoc local data dumps	Not tracked by DVC; not accessible in CI or shared environments

Connection to downstream pipeline

load_data_from_sources → validate_raw → preprocessing → feature_engineering → ...

The validate_raw gate enforces data contracts before any downstream stage proceeds. A failed validation stops the pipeline. See Contracts & Versioning.

Contracts & Versioning — GE gates and DVC reproducibility
Architecture: Data & ML Flow — stage-by-stage breakdown
Architecture: Failure Modes — scraper failure scenarios
Backfills & Freshness — replaying ingestion for historical corrections