
ML Limitations & Justified Improvements

Purpose

Document current ML limitations honestly, grouped by category, and identify future improvements that are concrete and justified — not speculative.

Implementation readiness for all items is tracked in Status and the Architecture Roadmap.


Data limitations

Single data source

Training data comes from WhoScored match statistics only. There is no player-level data (injuries, transfers, form), no referee assignment, and no weather or pitch-condition data. These factors are known to influence match outcomes, so their absence places a ceiling on how much the current feature set can explain.

No bookmaker odds as input features

Bookmaker odds are used only as an external evaluation baseline, not as features. This maintains clean separation between prediction and market data. Adding odds as features would risk introducing market-calibrated information that could obscure whether the model has learned anything independently useful.
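As an illustration of how odds can serve as an evaluation-only baseline, the sketch below converts decimal odds into probabilities and scores them with log-loss. The proportional removal of the bookmaker margin is an assumption (the page does not say which normalisation is used), and the function names and odds values are hypothetical.

```python
import math

def implied_probabilities(home_odds, draw_odds, away_odds):
    """Convert decimal odds to probabilities that sum to 1.

    Raw inverse odds sum to more than 1 because of the bookmaker margin;
    dividing by the sum removes that margin proportionally (an assumption).
    """
    raw = [1.0 / home_odds, 1.0 / draw_odds, 1.0 / away_odds]
    total = sum(raw)
    return [p / total for p in raw]

def log_loss_single(probs, outcome_index):
    """Negative log-probability assigned to the observed outcome."""
    return -math.log(probs[outcome_index])

# Hypothetical odds for one match; index 0 = home win, 1 = draw, 2 = away win.
probs = implied_probabilities(2.10, 3.40, 3.60)
loss = log_loss_single(probs, 0)
```

Averaging `log_loss_single` over the held-out matches gives a baseline figure that can be compared with the model's log-loss without the odds ever entering the feature set.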

Historical depth bounded by scraper

Training data is limited to seasons covered by the WhoScored scraper. Competitions or historical periods not scraped are simply absent.


Feature limitations

No player-level or squad composition features

The current feature set is team-level only. Player absences (injury, suspension), transfer activity, and squad rotation have measurable effects on outcomes but are not yet modelled. This is a known gap and a planned improvement.

No live table / league position at prediction time

League table position at the time of prediction requires a careful point-in-time join to avoid using future standings. This join is not yet implemented safely. Adding it incorrectly would be a leakage risk.
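A minimal sketch of what a leakage-safe lookup could look like, assuming standings are stored as dated snapshots per team. The data layout and the name `position_before` are hypothetical and not part of the current pipeline.

```python
from bisect import bisect_left
from datetime import date

# Hypothetical snapshots for one team: (snapshot_date, league_position),
# sorted by date. Each snapshot includes matches played up to that date.
snapshots = [
    (date(2024, 8, 18), 12),
    (date(2024, 8, 25), 7),
    (date(2024, 9, 1), 4),
]

def position_before(snapshots, match_date):
    """Return the latest position recorded strictly before match_date, or None."""
    dates = [d for d, _ in snapshots]
    # bisect_left counts snapshots dated strictly earlier than match_date,
    # so a snapshot taken on match day itself is never used.
    idx = bisect_left(dates, match_date)
    if idx == 0:
        return None  # no standings known yet, e.g. the first matchday
    return snapshots[idx - 1][1]
```

The strict inequality is the important design choice: a snapshot dated on match day may already include that match's result. At table scale, `pandas.merge_asof` with `allow_exact_matches=False` expresses the same backward-looking join.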

H2H features are sparse for infrequently-meeting teams

Head-to-head rolling statistics are reliable only for teams with sufficient historical meetings. Clubs that rarely meet (cross-league cup ties, newly promoted teams) will have near-zero coverage for H2H features.


Validation and calibration limitations

No per-competition metric breakdown

Validation metrics are reported on the full held-out test set. A breakdown by competition type (e.g., top-flight vs. lower division, men's vs. women's) is not yet implemented. Different competition types may have systematically different calibration quality.
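A per-competition breakdown could be as simple as grouping the per-match log-loss by a competition label. The sketch below assumes each test row carries such a label; the group names and probabilities are made up for illustration.

```python
import math
from collections import defaultdict

def log_loss_by_group(rows):
    """rows: iterable of (group, probs, true_index).

    Returns {group: (n_matches, mean_log_loss)}.
    """
    totals = defaultdict(lambda: [0, 0.0])  # group -> [count, summed loss]
    for group, probs, true_index in rows:
        p = min(max(probs[true_index], 1e-15), 1.0)  # clip to avoid log(0)
        totals[group][0] += 1
        totals[group][1] += -math.log(p)
    return {g: (n, s / n) for g, (n, s) in totals.items()}

# Hypothetical held-out rows: (competition type, [home, draw, away], outcome index).
rows = [
    ("top_flight", [0.5, 0.3, 0.2], 0),
    ("top_flight", [0.4, 0.3, 0.3], 2),
    ("lower_division", [0.25, 0.25, 0.5], 1),
]
report = log_loss_by_group(rows)
```

Reporting the match count next to each mean matters here, because a small competition can show a misleadingly good or bad figure from a handful of matches.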

Single model per serving path

The serving path exposes one registered model. There is no A/B testing infrastructure or shadow-mode evaluation in place. New model versions can only be compared via offline metrics before promotion.

Post-hoc calibration is optional, not yet default

Calibrated classifier wrapping (CalibratedClassifierCV) is implemented in final_train but is not yet a required step in the pipeline. Whether calibration is applied depends on the calibration_config passed to make_final_train_run.
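One way to judge whether the optional calibration step is worth enabling is to compare expected calibration error (ECE) with and without it. The sketch below is a generic top-label ECE with equal-width bins; it is not the project's evaluation code, and the bin count is an arbitrary choice.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Top-label ECE: weighted mean of |accuracy - confidence| over confidence bins."""
    bins = [[0, 0.0, 0] for _ in range(n_bins)]  # [count, summed confidence, n correct]
    for p, y in zip(probs, labels):
        conf = max(p)                # confidence of the predicted class
        pred = p.index(conf)         # index of the predicted class
        b = min(int(conf * n_bins), n_bins - 1)
        bins[b][0] += 1
        bins[b][1] += conf
        bins[b][2] += int(pred == y)
    n = len(labels)
    # Each non-empty bin contributes its accuracy/confidence gap, weighted by size.
    return sum(abs(c / k - s / k) * k / n for k, s, c in bins if k)
```

A model that predicts 0.8 for the home win and is right four times out of five scores an ECE of zero on those matches; the same predictions with no correct outcomes would score 0.8.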


Model limitations

Match outcome prediction is an inherently noisy problem

Football match outcomes have a high stochastic component. Even with perfect information, a significant fraction of outcomes cannot be predicted from pre-match statistics alone. The model’s performance ceiling is bounded by this irreducible noise — not just feature coverage.

Log-loss gap vs. bookmaker benchmark

The current champion model (log-loss ≈ 1.006) has not yet closed the gap to the bookmaker benchmark (~0.97). This is expected at the smoke promotion tier. The gap reflects the bookmaker’s access to richer signals (market odds, injury intelligence, late team news) that the current feature set cannot match.
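To put these figures on a scale, it can help to compare both against a naive reference. The sketch below uses the uniform three-way prediction (1/3 per outcome), whose log-loss is ln 3 ≈ 1.0986. Treating the uniform predictor as the reference is an assumption made for illustration, not a baseline the project reports.

```python
import math

uniform_loss = math.log(3)   # log-loss of predicting 1/3 for every outcome
model_loss = 1.006           # champion model, figure quoted above
bookmaker_loss = 0.97        # approximate benchmark, figure quoted above

model_gain = uniform_loss - model_loss          # improvement over uniform
bookmaker_gain = uniform_loss - bookmaker_loss  # improvement over uniform
share_of_bookmaker_gain = model_gain / bookmaker_gain
```

On this scale the model recovers roughly 72% of the bookmaker's improvement over a uniform guess. Because the bookmaker figure is itself approximate, the ratio should be read as indicative only.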

Class imbalance: draws are hardest to predict

Draws are the least frequent outcome (~25%) and the hardest to predict. Class weights are applied during training, but draw precision and recall remain below home-win and away-win performance. This is a fundamental property of the task, not only a model choice.
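The page does not state which weighting scheme is applied. As an illustration, the sketch below computes the common "balanced" heuristic, weight = n_samples / (n_classes × class_count), on a hypothetical outcome split with about 25% draws.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# Hypothetical outcome split: 45% home wins, 25% draws, 30% away wins.
labels = ["H"] * 45 + ["D"] * 25 + ["A"] * 30
weights = balanced_class_weights(labels)
```

With this split, draws receive the largest weight (about 1.33) and home wins the smallest (about 0.74). Reweighting shifts the model's attention toward draws but does not add information about them, which is consistent with draw performance remaining the weakest.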

Single model architecture

Only XGBoost (gradient boosting) is used. No ensemble of diverse model types, no neural architecture experiments. This is a justified trade-off for a tabular dataset at current scale, but may be revisited if data volume grows substantially.

No uncertainty calibration per competition type

A single calibrated output is produced across all competitions. There is no per-league or per-tier recalibration. Teams in small or unusual competitions may have systematically miscalibrated probabilities due to sparse historical data.


Monitoring limitations

No automated drift detection

There is no active drift detection on prediction inputs or outputs. An Evidently integration is designed and documented but not yet implemented. Until it is, feature distribution shift would go undetected until it produces visible metric degradation in offline evaluation.
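For intuition about what a drift check on a single numeric feature computes, the sketch below implements the population stability index (PSI) by hand. This is a stand-in technique chosen for illustration; it is not the Evidently API and not the planned integration.

```python
import math

def population_stability_index(reference, current, n_bins=10):
    """PSI between two samples, using equal-width bins over the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            b = int((v - lo) / width)
            counts[min(max(b, 0), n_bins - 1)] += 1  # clamp out-of-range values
        # A small floor keeps empty bins from producing log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))
```

Identical samples give a PSI of exactly zero, and the value grows as the current distribution moves away from the reference. Any alerting threshold would need to be chosen per feature.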

No prediction confidence logging

The model’s output probabilities are served and cached, but per-request confidence logs are not persisted to a queryable store. This prevents retrospective analysis of model calibration in production without re-running offline evaluation.
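A queryable prediction log does not need heavy infrastructure to prototype. The sketch below uses SQLite from the standard library as a stand-in store; the table name, columns, and identifiers are hypothetical.

```python
import json
import sqlite3
import time

def open_log(path=":memory:"):
    """Open (or create) a prediction log; ':memory:' keeps this example self-contained."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prediction_log ("
        "ts REAL, match_id TEXT, model_version TEXT, probs TEXT)"
    )
    return conn

def log_prediction(conn, match_id, model_version, probs):
    """Persist one served prediction with its timestamp and model version."""
    conn.execute(
        "INSERT INTO prediction_log VALUES (?, ?, ?, ?)",
        (time.time(), match_id, model_version, json.dumps(probs)),
    )
    conn.commit()

conn = open_log()
log_prediction(conn, "match-001", "v1", [0.5, 0.27, 0.23])
rows = conn.execute(
    "SELECT match_id, model_version, probs FROM prediction_log "
    "WHERE model_version = ?",
    ("v1",),
).fetchall()
```

Storing the model version with each row is what makes retrospective calibration analysis possible: once results are known, logged probabilities can be joined to outcomes per model version.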

No model-level alerting

Prometheus monitors service health (request rate, latency, error rate, queue depth), but there is no alert when model predictions change unexpectedly, when calibration degrades, or when the champion model has not been updated in N days.

Redis cache staleness on model promotion

When a new model is promoted to champion, existing Redis cache entries from the previous model are not invalidated. Stale predictions from the previous model can be served until the cache TTL expires. See Architecture: Known Limitations.
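One possible mitigation, sketched below under the assumption that the serving layer knows the champion's version string, is to include that version in the cache key. A plain dict stands in for Redis, and the key format is hypothetical.

```python
def cache_key(model_version, match_id):
    """Build a cache key that is unique per model version and match."""
    return f"prediction:{model_version}:{match_id}"

cache = {}  # stand-in for Redis in this example

# A prediction cached while model v1 was champion.
cache[cache_key("v1", "match-001")] = [0.5, 0.27, 0.23]

# After promoting v2, lookups use the new version and miss the old entry.
stale_hit = cache.get(cache_key("v2", "match-001"))
```

With versioned keys, entries from the previous model are never read after promotion and simply age out through the existing TTL. The alternative, explicitly deleting keys at promotion time, keeps the key format unchanged but adds a step to the promotion procedure.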


Serving limitations

Feature freshness at inference

The serving layer reads pre-computed features from data/predictions/match_features.parquet (populated by the batch_inference DVC stage). If the pipeline has not run since the last matches were played, the features may be stale. There is no feature freshness guarantee or freshness check at request time.
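A request-time check could start from the age of the features file. The sketch below uses the file's modification time as a proxy; the function name and the age threshold are hypothetical. File age alone cannot tell whether matches have been played since the last pipeline run, so this is a coarse guard rather than a freshness guarantee.

```python
import os
import time

def features_are_fresh(path, max_age_seconds, now=None):
    """Return True if the file at `path` was modified within max_age_seconds."""
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) <= max_age_seconds
```

The serving layer could call this with the path to the features Parquet file and either log a warning or attach a staleness flag to the response when it returns False.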

Cold-start latency after pod restart

The model is lazy-loaded on the first inference request after a worker restart, which adds roughly 1–2 s to that request. Subsequent requests use the cached model. This is a known operational characteristic, not a model defect.

No online feature store

Features are file-based Parquet artifacts. For time-sensitive or high-frequency use cases a dedicated online feature store would be needed. This is not required at current scale.


Justified future improvements

These improvements are concrete, causally linked to the limitations above, and ordered by impact on model quality.

| Improvement | Limitation addressed | Priority |
|---|---|---|
| Player-level and injury features | Data / feature gap: measurable outcome effect | High |
| Per-competition metric breakdown | Validation gap: calibration may differ by tier | Medium |
| Automated promotion policy | Registry gap: manual gate is operationally fragile | Near-term |
| Live table position feature (safe join) | Feature gap: league position signal | Medium |
| Feature freshness check at inference | Serving reliability gap | Medium |
| A/B testing / shadow serving | Serving gap: no live evaluation of challengers | Medium |
| Calibration as default pipeline step | Calibration gap: ECE currently not enforced | Medium |
| Evidently drift detection integration | Monitoring gap: no active drift detection | Medium |
| Prediction log persistence | Monitoring gap: no queryable confidence log | Medium |
| Cache invalidation on model promotion | Serving correctness gap: stale predictions possible | Near-term |

Items not listed here (e.g., streaming ingestion, transformer architectures, ensemble stacking) are exploratory or long-term and are tracked in Architecture: Roadmap rather than this page.