
Component View (C4 — Level 3)

This view breaks down the internal components of both the offline ML pipeline and the online runtime. Each component has a defined responsibility, contract, failure behavior, and implementation status.


Component Map

```mermaid
flowchart TB
    subgraph Offline[Offline ML Pipeline — DVC Stages]
      SE[Source Extraction]
      PP[Preprocessing]
      GE1[GE: validate_raw]
      GE2[GE: validate_finished / validate_future]
      FE[Feature Engineering]
      GE3[GE: validate_features]
      SP[Temporal Split]
      BL[Baseline Model]
      XGB[Classifier]
      AB[Ablation Study]
      TN[Hyperparameter Tuning]
      FT[Final Train + Calibration]
      BI[Batch Inference\nFeature Assembly]
      MR[Model Registration]
    end

    subgraph Contracts[Contract Layer]
      DC[Data Contract — GE Suites]
      MC[Model Contract — MLflow Signature]
      AC[API Contract — Pydantic Schemas]
    end

    subgraph Runtime[Online Runtime — FastAPI + Celery]
      RV[Request Validation]
      FA[Feature Assembly at Inference]
      IE[Inference Execution]
      TD[Task Dispatch\nSync vs Async]
      TM[Telemetry — Prometheus]
    end

    SE --> GE1 --> PP --> GE2 --> FE --> GE3 --> SP
    SP --> BL
    SP --> XGB
    SP --> AB
    SP --> TN
    TN --> FT
    FT --> MR
    BI -.feature parquet.-> IE

    DC -.gates.-> GE1
    DC -.gates.-> GE2
    DC -.gates.-> GE3
    MC -.enforced.-> MR
    AC -.enforced.-> RV

    MR --> IE
    RV --> TD
    TD --> FA
    FA --> IE
    IE --> TM
```

Offline Pipeline Components

Source Extraction — ✅ Implemented

Attribute Detail
Responsibility Scrape WhoScored.com via Selenoid; normalize data; write to PostgreSQL; export raw parquet to MinIO
Inputs Airflow schedule trigger → FastAPI HTTP → RabbitMQ → celery-worker-api
Outputs PostgreSQL tables (canonical scraped data); data/raw/*.parquet (DVC-tracked)
Contract None at input; GE validate_raw gate immediately after
Failure behavior Celery retry with backoff; Airflow marks DAG failed; data gap logged
Idempotency Upsert logic in PostgreSQL; safe to replay
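
The replay-safety claim above can be illustrated with a minimal upsert sketch. SQLite stands in for PostgreSQL here so the example runs anywhere (both accept `INSERT ... ON CONFLICT ... DO UPDATE`); the `matches` table and its columns are hypothetical and not taken from the project schema.

```python
import sqlite3

# Hypothetical table; SQLite is used only as a stand-in for PostgreSQL.
SCHEMA = """
CREATE TABLE matches (
    match_id   INTEGER PRIMARY KEY,
    home_team  TEXT NOT NULL,
    away_team  TEXT NOT NULL,
    home_goals INTEGER,
    away_goals INTEGER
)
"""

UPSERT = """
INSERT INTO matches (match_id, home_team, away_team, home_goals, away_goals)
VALUES (:match_id, :home_team, :away_team, :home_goals, :away_goals)
ON CONFLICT (match_id) DO UPDATE SET
    home_team  = excluded.home_team,
    away_team  = excluded.away_team,
    home_goals = excluded.home_goals,
    away_goals = excluded.away_goals
"""


def upsert_matches(conn, rows):
    """Insert new rows and overwrite existing ones, keyed by match_id."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT, rows)


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    batch = [
        {"match_id": 1, "home_team": "A", "away_team": "B", "home_goals": 2, "away_goals": 1},
        {"match_id": 2, "home_team": "C", "away_team": "D", "home_goals": None, "away_goals": None},
    ]
    upsert_matches(conn, batch)
    upsert_matches(conn, batch)  # replaying the same batch adds no rows
    batch[1].update(home_goals=1, away_goals=1)
    upsert_matches(conn, batch)  # a later scrape fills in the final score
    count = conn.execute("SELECT COUNT(*) FROM matches").fetchone()[0]
    score = conn.execute(
        "SELECT home_goals, away_goals FROM matches WHERE match_id = 2"
    ).fetchone()
    return count, score
```

Because the conflict target is the natural key, a retried Celery task or a re-run DAG converges to the same table state instead of duplicating rows.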

Preprocessing — ✅ Implemented

Attribute Detail
Responsibility Clean and normalize raw match records; resolve team/tournament IDs; produce finished.parquet and future.parquet
Inputs data/raw/*.parquet (DVC stage dep)
Outputs data/interim/finished.parquet, data/interim/future.parquet
Contract GE validate_finished and validate_future gates downstream
Failure behavior DVC stage failure; pipeline blocked
Idempotency Deterministic; safe to re-run

Great Expectations Validation Gates — ✅ Implemented

Three distinct GE suites act as blocking gates:

| Gate | DVC stage | Dataset | Failure action |
| --- | --- | --- | --- |
| validate_raw | After load_data_from_sources | Raw parquet | Block pipeline; raise on expectation failure |
| validate_finished / validate_future | After preprocessing | Interim parquets | Block feature engineering |
| validate_features | After feature_engineering | Feature parquet | Block training |

Suites are versioned in the repository. Schema evolution requires explicit suite updates.
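
The blocking behavior of a gate can be sketched in plain Python. This is not the Great Expectations API; the `Expectation` class, `run_gate`, and the three checks are hypothetical and only show the control flow: any failed expectation raises, the stage command exits non-zero, and DVC does not run the stages that depend on it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass(frozen=True)
class Expectation:
    name: str
    check: Callable[[List[Row]], bool]


class GateFailure(RuntimeError):
    """Raised to fail the stage so downstream stages never run."""


def run_gate(gate: str, rows: List[Row], suite: List[Expectation]) -> None:
    """Evaluate every expectation and raise if any of them fails."""
    failed = [e.name for e in suite if not e.check(rows)]
    if failed:
        raise GateFailure(f"{gate} failed: {', '.join(failed)}")


# Hypothetical expectations for the raw dataset.
RAW_SUITE = [
    Expectation("dataset is not empty", lambda rows: len(rows) > 0),
    Expectation(
        "match_id is never null",
        lambda rows: all(r.get("match_id") is not None for r in rows),
    ),
    Expectation(
        "match_id is unique",
        lambda rows: len({r.get("match_id") for r in rows}) == len(rows),
    ),
]
```

Calling `run_gate("validate_raw", rows, RAW_SUITE)` returns silently on clean data and raises `GateFailure` naming every violated expectation otherwise.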


Feature Engineering — ✅ Implemented

Attribute Detail
Responsibility Compute time-windowed match statistics and rating-based features for each team
Inputs data/interim/finished.parquet, data/interim/future.parquet
Outputs data/features/*.parquet
Contract GE validate_features gate downstream
Architectural invariant Feature logic (src/features/) is shared between the offline pipeline and online inference — no separate implementation for serving
Failure behavior DVC stage failure
Idempotency Deterministic; pure functions; no IO side-effects
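
A time-windowed feature written as a pure function might look like the sketch below. The function name, the match record fields, and the five-match window are assumptions made for illustration; the actual feature set in src/features/ is not described here. The strict `<` comparison keeps the match being predicted, and anything after it, out of its own feature.

```python
from datetime import date
from typing import Dict, List, Optional


def rolling_goal_average(
    history: List[Dict], team: str, as_of: date, window: int = 5
) -> Optional[float]:
    """Mean goals scored by `team` over its last `window` matches before `as_of`."""
    past = sorted(
        (m for m in history if m["date"] < as_of and team in (m["home"], m["away"])),
        key=lambda m: m["date"],
    )[-window:]
    if not past:
        return None  # no prior matches for this team
    goals = [m["home_goals"] if m["home"] == team else m["away_goals"] for m in past]
    return sum(goals) / len(goals)


# Hypothetical match history.
HISTORY = [
    {"date": date(2024, 1, 1), "home": "A", "away": "B", "home_goals": 2, "away_goals": 0},
    {"date": date(2024, 1, 8), "home": "C", "away": "A", "home_goals": 1, "away_goals": 3},
    {"date": date(2024, 1, 15), "home": "A", "away": "D", "home_goals": 1, "away_goals": 1},
]
```

Since the function takes its data as arguments and performs no IO, the offline stage and the inference worker can call the same code with different data sources.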

Temporal Split — ✅ Implemented

Attribute Detail
Responsibility Split data into training folds and holdout set using time-based boundaries (no random shuffling)
Inputs data/features/*.parquet
Outputs data/splits/*.parquet (folds + holdout)
Contract No data from the holdout period may appear in any training fold (leakage invariant); split boundaries come from params.yaml
Failure behavior DVC stage failure if leakage detected
Idempotency Deterministic given fixed split configuration
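
The leakage invariant can be made concrete with a small sketch. The function names and the row layout are hypothetical, and the boundaries are passed in directly rather than read from params.yaml.

```python
from datetime import date
from typing import Dict, List, Tuple


def assert_no_leakage(folds: List[List[Dict]], holdout_start: date) -> None:
    """Raise if any training fold contains a row from the holdout period."""
    for i, fold in enumerate(folds):
        leaked = [r for r in fold if r["date"] >= holdout_start]
        if leaked:
            raise ValueError(f"fold {i} contains {len(leaked)} holdout-period rows")


def temporal_split(
    rows: List[Dict], fold_ends: List[date], holdout_start: date
) -> Tuple[List[List[Dict]], List[Dict]]:
    """Split rows into consecutive training folds and a holdout set by date."""
    if not fold_ends or sorted(fold_ends) != fold_ends or fold_ends[-1] > holdout_start:
        raise ValueError("fold ends must be ascending and not later than holdout_start")
    folds, lower = [], date.min
    for upper in fold_ends:
        # Each fold covers the half-open interval [lower, upper).
        folds.append([r for r in rows if lower <= r["date"] < upper])
        lower = upper
    holdout = [r for r in rows if r["date"] >= holdout_start]
    assert_no_leakage(folds, holdout_start)  # guard the invariant explicitly
    return folds, holdout
```

Rows dated between the last fold end and `holdout_start` fall into neither set, which leaves room for a gap between training and evaluation periods if one is wanted.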

Baseline Model — ✅ Implemented

Attribute Detail
Responsibility Train a reference model to establish a minimum performance bound
Inputs data/splits/
Outputs MLflow run with baseline metrics
Contract Provides a lower-bound benchmark; all production candidates must exceed it
Failure behavior DVC stage failure

Gradient Boosting Classifier — ✅ Implemented

Attribute Detail
Responsibility Train a gradient boosting classifier for match outcome prediction
Inputs data/splits/
Outputs MLflow run; serialized model artifact
Target outcome_1x2 (match result)
Failure behavior DVC stage failure; partial metrics logged to MLflow

Ablation Study — ✅ Implemented

Attribute Detail
Responsibility Measure the contribution of individual feature groups to model performance
Inputs data/splits/
Outputs MLflow runs per feature set configuration
Contract Results inform which feature groups are retained in the production pipeline
Failure behavior DVC stage failure; individual runs logged to MLflow

Hyperparameter Tuning — ✅ Implemented

Attribute Detail
Responsibility Search the model hyperparameter space and select the configuration that maximizes holdout performance
Inputs data/splits/ + tuning configuration from params.yaml
Outputs Best hyperparameter set (artifact); MLflow runs per trial
Failure behavior DVC stage failure; partial trial results preserved in MLflow

Final Train + Calibration — ✅ Implemented

Attribute Detail
Responsibility Train the final model on the full training set with selected hyperparameters; apply probability calibration
Inputs data/splits/, best params from tuning stage
Outputs Calibrated model artifact; MLflow run
Contract Calibrated probability outputs are declared in the MLflow model signature
Failure behavior DVC stage failure

Model Registration — 🚧 Partially Implemented

Attribute Detail
Responsibility Register final model to MLflow Registry; assign version; promote to Staging
Inputs Final calibrated model artifact; MLflow run ID
Outputs MLflow registered model version
Contract MLflow pyfunc model signature enforced at registration
Current limitation Staging → Production promotion is manual; no automated metric-threshold gate
Planned Automated promotion policy (see Roadmap)
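
One possible shape for the planned metric-threshold gate is sketched below. Nothing here is implemented in the project: the function name, the choice of log loss as a lower-is-better metric, and the `min_improvement` parameter are all assumptions. The first check reuses the baseline contract stated earlier, that every production candidate must beat the baseline.

```python
from typing import Dict, Optional


def should_promote(
    candidate: Dict[str, float],
    baseline: Dict[str, float],
    production: Optional[Dict[str, float]] = None,
    metric: str = "log_loss",
    min_improvement: float = 0.0,
) -> bool:
    """Decide whether a candidate may replace the current production model.

    Assumes a lower-is-better metric such as log loss.
    """
    if candidate[metric] >= baseline[metric]:
        return False  # candidate does not beat the baseline benchmark
    if production is None:
        return True  # nothing is deployed yet, so beating the baseline suffices
    # Require a margin over the deployed model to avoid promoting on noise.
    return production[metric] - candidate[metric] > min_improvement
```

A gate like this would run after registration and before any Staging to Production transition, with the metric values read from the corresponding MLflow runs.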

Batch Inference Feature Assembly — ✅ Implemented

Attribute Detail
Responsibility Assemble feature vectors for all upcoming matches; write to data/predictions/match_features.parquet
Inputs data/features/, future match schedule
Outputs data/predictions/match_features.parquet
Contract Feature schema must match training feature schema
Failure behavior DVC stage failure
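
The schema-match contract can be checked with a comparison like the one sketched here. Schemas are represented as plain `{column: dtype}` dictionaries and the column names are hypothetical; in practice they would be read from the training and inference parquet files.

```python
from typing import Dict, List


def schema_mismatches(training: Dict[str, str], inference: Dict[str, str]) -> List[str]:
    """Describe every difference between two {column: dtype} schemas."""
    problems = []
    for col, dtype in training.items():
        if col not in inference:
            problems.append(f"missing column: {col}")
        elif inference[col] != dtype:
            problems.append(f"dtype mismatch for {col}: {inference[col]} != {dtype}")
    problems += [f"unexpected column: {col}" for col in inference if col not in training]
    # Column order matters for models that consume positional feature arrays.
    if not problems and list(training) != list(inference):
        problems.append("column order differs")
    return problems


# Hypothetical training schema.
TRAINING_SCHEMA = {"rating_diff": "float64", "goals_avg_5": "float64"}
```

An empty list means the batch feature file is safe to score; a non-empty list gives the stage a concrete reason to fail.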

Online Runtime Components

Request Validation — ✅ Implemented

Attribute Detail
Responsibility Validate all incoming API requests against Pydantic schemas before any processing
Schema PredictRequest / PredictResponse in src/app/schemas/
Failure behavior Returns 422 Unprocessable Entity with structured error details; no inference runs
Contract API contract; OpenAPI schema auto-generated by FastAPI

Feature Assembly at Inference — ✅ Implemented

Attribute Detail
Responsibility Assemble feature vectors at inference time using the same src/features/ code as the offline pipeline
Inputs Match context from the request; historical data read from the Redis cache, or recomputed when it is not cached
Outputs Feature vector matching training schema
Contract Must produce identical features to offline pipeline for the same input
Failure behavior Inference task fails; error returned to FastAPI

Inference Execution — ✅ Implemented

Attribute Detail
Responsibility Run the loaded model against assembled feature vectors; return probability distribution
Model loading Lazy, once per worker process; resolved from MLflow Registry champion alias
Outputs Probability vector [p_home_win, p_draw, p_away_win]; model version metadata
Failure behavior Task fails; FastAPI returns 500 on a task error or 504 on a sync-path timeout
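
Lazy, once-per-process loading can be sketched with a cached accessor. The registry call is replaced by a hypothetical stand-in (`load_from_registry`) that returns a dummy model, so the example runs without MLflow; only the caching pattern is the point.

```python
import functools

LOAD_LOG = []  # records each registry load, for demonstration only


def load_from_registry(alias):
    """Hypothetical stand-in for resolving a model alias in the registry."""
    LOAD_LOG.append(alias)
    # Dummy model: always predicts a uniform distribution.
    return lambda features: [1 / 3, 1 / 3, 1 / 3]


@functools.lru_cache(maxsize=1)
def get_model():
    """Load the champion model on first use; later calls reuse the same object."""
    return load_from_registry("champion")


def predict(features):
    """Return the outcome probabilities for one feature vector."""
    p_home, p_draw, p_away = get_model()(features)
    return {"p_home_win": p_home, "p_draw": p_draw, "p_away_win": p_away}
```

The first `predict` call in a worker process pays the load cost and every later call reuses the cached model. Picking up a newly promoted version would then require restarting the worker or calling `get_model.cache_clear()`.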

Task Dispatch (Sync vs Async) — ✅ Implemented (sync only)

Attribute Detail
Responsibility Route inference requests to Celery ml queue; manage timeout for sync path
Endpoint GET /predict/{match_id} — dispatches to Celery ml queue
Sync timeout 30 s (configurable)
Async path 📋 Planned — no POST /predict/async/ endpoint yet
Failure behavior Sync: 504 on timeout
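
The sync path's wait-with-timeout behavior can be sketched with the standard library. A thread pool stands in for the Celery ml queue, and `predict_sync` is a hypothetical helper; with Celery the equivalent wait would be `AsyncResult.get(timeout=...)`. Mapping task errors to 500 is an assumption; the 504 on timeout follows the table above.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_POOL = ThreadPoolExecutor(max_workers=2)  # stand-in for the Celery ml queue


def predict_sync(task, match_id, timeout_s=30.0):
    """Dispatch `task`, wait up to `timeout_s`, and return (status_code, body)."""
    future = _POOL.submit(task, match_id)
    try:
        return 200, future.result(timeout=timeout_s)
    except FutureTimeout:
        return 504, {"detail": f"prediction timed out after {timeout_s} s"}
    except Exception as exc:  # the task itself failed
        return 500, {"detail": str(exc)}


def slow_task(match_id):
    """Demo task that takes longer than a short timeout."""
    time.sleep(0.3)
    return {"match_id": match_id}
```

For example, `predict_sync(slow_task, 7, timeout_s=0.05)` yields a 504 while the task keeps running in the background, which is also how a timed-out Celery task behaves unless it is revoked.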

Precomputed Prediction Lookup — ✅ Implemented

Attribute Detail
Responsibility Serve pre-computed batch predictions without live Celery dispatch
Endpoint GET /predict/precomputed/{match_id}
Source data/predictions/predictions.parquet (written by batch_inference DVC stage)
Auth X-API-Key header required
Failure behavior 404 if match not in parquet; fast path — no model inference

Card & Region-ROI Analytics — ✅ Implemented

Attribute Detail
Responsibility Serve card-level and region-level ROI aggregates for UI consumption
Endpoints GET /predict/cards/ — top-picks card summaries; GET /predict/region-roi/ — ROI breakdown by region
Auth X-API-Key header required
Source Computed from predictions.parquet at request time

Livescores Feed — ✅ Implemented

Attribute Detail
Responsibility Proxy or serve live match scores from external source
Endpoint GET /livescores/
Auth No auth required (public read)
Failure behavior Returns empty list / error on upstream failure

Health & Readiness — ✅ Implemented

Attribute Detail
Responsibility Expose liveness and readiness probes for container orchestration
Endpoints GET /healthcheck/ — liveness; GET /healthcheck/ready — readiness
Auth No auth required
Used by Kubernetes probes, Docker Compose health checks

Drift Monitoring Report — ✅ Implemented

Attribute Detail
Responsibility Serve latest Evidently drift report
Endpoint GET /monitoring/drift
Source reports/drift/latest.json (written by monitor_drift DVC stage)
Auth No auth required
Failure behavior 404/500 if report file absent

Telemetry — ✅ Implemented

Attribute Detail
Responsibility Capture and expose Prometheus metrics for all inference requests
Metrics (8 total) Request count, request latency histograms (p50/p95/p99), error rate, active tasks, queue depth, cache hit rate
Endpoint GET /metrics
Failure behavior Non-blocking; metrics collection failure does not affect inference
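
The non-blocking guarantee can be sketched as a decorator that swallows recorder errors. The `instrumented` name and the recorder callback are hypothetical; with prometheus_client the callback could be a histogram's `observe` method.

```python
import functools
import logging
import time

log = logging.getLogger("telemetry")


def instrumented(record_latency):
    """Time the wrapped call and report it, never letting metrics break the call."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                try:
                    record_latency(time.perf_counter() - start)
                except Exception:
                    # A metrics failure is logged and otherwise ignored.
                    log.warning("metric recording failed", exc_info=True)

        return wrapper

    return decorator
```

Because the recorder runs in a `finally` block, latency is also captured for failed inferences, and an exception raised by the inference call still propagates unchanged.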