ML Limitations & Justified Improvements
Purpose
Document current ML limitations honestly, grouped by category, and identify future improvements that are concrete and justified — not speculative.
Implementation readiness for all items is tracked in Status and the Architecture Roadmap.
Data limitations
Single data source
Training data comes from WhoScored match statistics only. There is no player-level data (injuries, transfers, form), no referee assignment, and no weather or pitch-condition data. These factors are known to influence match outcomes, so their absence caps how much of the outcome the current feature set can explain.
No bookmaker odds as input features
Bookmaker odds are used only as an external evaluation baseline, not as features. This maintains clean separation between prediction and market data. Adding odds as features would risk introducing market-calibrated information that could obscure whether the model has learned anything independently useful.
Historical depth bounded by scraper
Training data is limited to seasons covered by the WhoScored scraper. Competitions or historical periods not scraped are simply absent.
Feature limitations
No player-level or squad composition features
The current feature set is team-level only. Player absences (injury, suspension), transfer activity, and squad rotation have measurable effects on outcomes but are not yet modelled. This is a known gap and a planned improvement.
No live table / league position at prediction time
League table position at the time of prediction requires a careful point-in-time join to avoid using future standings. This join is not yet implemented safely. Adding it incorrectly would be a leakage risk.
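A leakage-safe join has to pick, for each match, the most recent standings snapshot dated strictly before kickoff. The sketch below shows that rule in isolation. The function name, the snapshot layout, and the dates are illustrative assumptions, not part of the current pipeline.

```python
from bisect import bisect_left
from datetime import date


def standing_before(snapshots, kickoff):
    """Return the latest table position dated strictly before kickoff.

    snapshots: list of (snapshot_date, position), sorted by date ascending.
    Returns None when no snapshot precedes the match.
    """
    dates = [snapshot_date for snapshot_date, _ in snapshots]
    # bisect_left finds the first snapshot on or after kickoff,
    # so everything to its left is strictly earlier.
    idx = bisect_left(dates, kickoff)
    if idx == 0:
        return None
    return snapshots[idx - 1][1]


# Hypothetical standings history for one team.
snapshots = [
    (date(2024, 8, 18), 12),
    (date(2024, 8, 25), 7),
    (date(2024, 9, 1), 4),
]
```

A snapshot dated on the match day itself is excluded on purpose, because it may already include that day's results.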
H2H features are sparse for infrequently-meeting teams
Head-to-head rolling statistics are reliable only for teams with sufficient historical meetings. Clubs that rarely meet (cross-league cup ties, newly promoted teams) will have near-zero coverage for H2H features.
Validation and calibration limitations
No per-competition metric breakdown
Validation metrics are reported on the full held-out test set. A breakdown by competition type (e.g., top-flight vs. lower division, men's vs. women's) is not yet implemented. Different competition types may have systematically different calibration quality.
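A per-competition breakdown could be as simple as grouping held-out predictions before averaging the loss. This is a minimal sketch; the row layout and the group labels are assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def log_loss_by_group(rows, eps=1e-15):
    """Mean multiclass log-loss per group.

    rows: iterable of (group, true_class_index, probs), where probs is the
    predicted probability vector for that match.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, true_idx, probs in rows:
        # Clip to avoid log(0) on overconfident wrong predictions.
        p_true = min(max(probs[true_idx], eps), 1.0)
        totals[group] += -math.log(p_true)
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical held-out predictions: (competition type, outcome index, probs).
rows = [
    ("top_flight", 0, [0.5, 0.3, 0.2]),
    ("top_flight", 1, [0.25, 0.5, 0.25]),
    ("lower_division", 2, [0.1, 0.1, 0.8]),
]
per_group = log_loss_by_group(rows)
```

Groups with few matches will produce noisy estimates, so sample counts should be reported next to each metric.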
Single model per serving path
The serving path exposes one registered model. There is no A/B testing infrastructure or shadow-mode evaluation in place. New model versions can only be compared via offline metrics before promotion.
Post-hoc calibration is optional, not yet default
Calibrated classifier wrapping (CalibratedClassifierCV) is implemented in final_train
but is not yet a required step in the pipeline. Whether calibration is applied depends on
the calibration_config passed to make_final_train_run.
Model limitations
Match outcome prediction is an inherently noisy problem
Football match outcomes have a high stochastic component. Even with perfect information, a significant fraction of outcomes cannot be predicted from pre-match statistics alone. The model’s performance ceiling is bounded by this irreducible noise — not just feature coverage.
Log-loss gap vs. bookmaker benchmark
The current champion model (log-loss ~1.006) has not yet closed the gap to the bookmaker benchmark (~0.97).
This is expected at the smoke promotion tier. The gap reflects the bookmaker’s access to
richer signals (market odds, injury intelligence, late team news) that the current feature set cannot match.
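For context on these numbers, a uniform three-way forecast scores ln 3 ≈ 1.0986 in log-loss. The sketch below shows one common way a bookmaker baseline can be derived: invert the decimal odds and normalise away the margin. Whether the project's benchmark uses this exact normalisation is an assumption; the function names and odds are illustrative.

```python
import math


def implied_probs(decimal_odds):
    """Convert decimal odds to probabilities by proportional normalisation."""
    inverse = [1.0 / o for o in decimal_odds]
    total = sum(inverse)  # exceeds 1 when the book carries a margin
    return [x / total for x in inverse]


def log_loss(y_true, probs, eps=1e-15):
    """Mean multiclass log-loss for class indices y_true."""
    losses = [-math.log(max(p[y], eps)) for y, p in zip(y_true, probs)]
    return sum(losses) / len(losses)


# Hypothetical home/draw/away odds with a bookmaker margin built in.
probs = implied_probs([1.9, 3.8, 3.8])
uniform_loss = log_loss([0], [[1 / 3, 1 / 3, 1 / 3]])
```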
Class imbalance: draws are hardest to predict
Draws are the least frequent outcome (~25%) and the hardest to predict. Class weights are applied during training, but draw precision and recall remain below home-win and away-win performance. This is a fundamental property of the task, not only a model choice.
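The page does not state which weighting scheme is used. As an illustration, the widely used "balanced" heuristic weights each class by n_samples / (n_classes * class_count), which up-weights draws relative to home wins. The label counts below are made up for the example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency in labels."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical outcome distribution: 45% home wins, 25% draws, 30% away wins.
labels = ["H"] * 45 + ["D"] * 25 + ["A"] * 30
weights = balanced_class_weights(labels)
```

Weighting changes the loss the model optimises but does not add information about draws, which is consistent with draw recall remaining low.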
Single model architecture
Only XGBoost (gradient boosting) is used. No ensemble of diverse model types, no neural architecture experiments. This is a justified trade-off for a tabular dataset at current scale, but may be revisited if data volume grows substantially.
No uncertainty calibration per competition type
A single calibrated output is produced across all competitions. There is no per-league or per-tier recalibration. Teams in small or unusual competitions may have systematically miscalibrated probabilities due to sparse historical data.
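Per-competition miscalibration could be quantified with expected calibration error (ECE) computed separately for each competition's predictions. The sketch below is a standard equal-width-bin ECE on top-label confidence. The bin count and input layout are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted mean gap between confidence and accuracy per bin.

    confidences: predicted probability of the chosen class, in [0, 1].
    correct: 1 if the chosen class was the actual outcome, else 0.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, hit))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Hypothetical cases: a calibrated bucket and a badly overconfident one.
ece_good = expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0])
ece_bad = expected_calibration_error([0.9, 0.9], [0, 0])
```

Calling this once per competition, with only that competition's predictions, gives the per-league view. Small competitions will leave many bins empty or thin, so the estimate is itself unreliable there.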
Monitoring limitations
No automated drift detection
There is no active drift detection on prediction inputs or outputs. An Evidently integration is designed and documented but not yet implemented. Feature distribution shift would not be detected until it produces visible metric degradation in offline evaluation.
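To make the idea of a drift check concrete, the sketch below computes the population stability index (PSI) for one numeric feature. It is a hand-rolled illustration and does not use or represent the Evidently API. The bucket edges, sample values, and any alert threshold are assumptions to be chosen by the project.

```python
import math


def psi(reference, current, edges, eps=1e-6):
    """Population stability index between two samples of one feature.

    edges: ascending bucket boundaries; values fall into len(edges) + 1 buckets.
    """
    def fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v > e for e in edges)] += 1
        # Floor empty buckets at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = fractions(reference), fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical training-time values and a shifted serving-time sample.
reference = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
shifted = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1]
edges = [0.25, 0.5, 0.75]
psi_same = psi(reference, reference, edges)
psi_shifted = psi(reference, shifted, edges)
```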
No prediction confidence logging
The model’s output probabilities are served and cached, but per-request confidence logs are not persisted to a queryable store. This prevents retrospective analysis of model calibration in production without re-running offline evaluation.
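A queryable log does not need much. The sketch below persists each served prediction to SQLite with the model version attached. The table name, columns, and identifiers are hypothetical, and a production deployment would likely use a different store.

```python
import json
import sqlite3
from datetime import datetime, timezone


def open_log(path=":memory:"):
    """Open (and create if needed) the prediction log table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prediction_log ("
        "logged_at TEXT NOT NULL, match_id TEXT NOT NULL, "
        "model_version TEXT NOT NULL, probs_json TEXT NOT NULL)"
    )
    return conn


def log_prediction(conn, match_id, model_version, probs):
    """Append one served prediction with a UTC timestamp."""
    conn.execute(
        "INSERT INTO prediction_log VALUES (?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), match_id, model_version,
         json.dumps(probs)),
    )
    conn.commit()


# Hypothetical request: log it, then read it back.
conn = open_log()
log_prediction(conn, "match-42", "v3", {"home": 0.5, "draw": 0.25, "away": 0.25})
logged = conn.execute(
    "SELECT match_id, model_version, probs_json FROM prediction_log"
).fetchall()
```

Joining such a log against final results later is what makes production calibration analysis possible without re-running offline evaluation.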
No model-level alerting
Prometheus monitors service health (request rate, latency, error rate, queue depth), but there is no alert when model predictions change unexpectedly, when calibration degrades, or when the champion model has not been updated in N days.
Redis cache staleness on model promotion
When a new model is promoted to champion, existing Redis cache entries from the previous
model are not invalidated. Stale predictions from the previous model can be served until the
cache TTL expires. See Architecture: Known Limitations.
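One way to avoid serving stale entries without an explicit flush is to include the model version in the cache key, so a newly promoted champion never reads its predecessor's entries. The sketch below uses a plain dict as a stand-in for Redis. The key format and version strings are assumptions, not the current implementation.

```python
def prediction_cache_key(model_version, match_id):
    """Build a cache key that is scoped to one model version."""
    return f"pred:{model_version}:{match_id}"


# A dict stands in for the Redis cache in this illustration.
cache = {}
cache[prediction_cache_key("v1", "match-42")] = {"home": 0.5, "draw": 0.25, "away": 0.25}

# After promotion to a hypothetical v2, the lookup misses instead of
# returning the v1 prediction.
hit_after_promotion = cache.get(prediction_cache_key("v2", "match-42"))
hit_same_version = cache.get(prediction_cache_key("v1", "match-42"))
```

The trade-off is that old entries linger until their TTL expires, which costs memory but not correctness.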
Serving-related ML limitations
Feature freshness at inference
The serving layer reads pre-computed features from data/predictions/match_features.parquet
(populated by the batch_inference DVC stage). If the pipeline has not run since the last
matches were played, the features may be stale. There is no feature freshness guarantee or
freshness check at request time.
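A basic freshness check could compare the feature file's modification time against a maximum age before serving. The sketch below demonstrates this on a temporary file. The function name and the one-hour limit are illustrative assumptions.

```python
import os
import tempfile
import time


def is_fresh(path, max_age_seconds, now=None):
    """Return True if the file at path was modified within max_age_seconds."""
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) <= max_age_seconds


# Demonstrate on a temporary stand-in for the features file.
with tempfile.NamedTemporaryFile(suffix=".parquet", delete=False) as tmp:
    demo_path = tmp.name
written_at = os.path.getmtime(demo_path)
fresh_now = is_fresh(demo_path, 3600, now=written_at + 60)
stale_later = is_fresh(demo_path, 3600, now=written_at + 7200)
os.remove(demo_path)
```

Modification time only shows when the batch stage last wrote the file. It does not prove that the latest played matches are included, so a stricter check would compare the newest match date inside the file against the fixture calendar.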
Cold-start latency after pod restart
The model is lazy-loaded on first inference request after a worker restart (~1–2 s). Subsequent requests use the cached model. This is a known operational characteristic, not a model defect.
No online feature store
Features are file-based Parquet artifacts. For time-sensitive or high-frequency use cases a dedicated online feature store would be needed. This is not required at current scale.
Justified future improvements
These improvements are concrete, causally linked to the limitations above, and ordered by impact on model quality.
| Improvement | Limitation addressed | Priority |
|---|---|---|
| Player-level and injury features | Data / feature gap — measurable outcome effect | High |
| Per-competition metric breakdown | Validation gap — calibration may differ by tier | Medium |
| Automated promotion policy | Registry gap — manual gate is operationally fragile | Near-term |
| Live table position feature (safe join) | Feature gap — league position signal | Medium |
| Feature freshness check at inference | Serving reliability gap | Medium |
| A/B testing / shadow serving | Serving gap — no live evaluation of challengers | Medium |
| Calibration as default pipeline step | Calibration gap — ECE currently not enforced | Medium |
| Evidently drift detection integration | Monitoring gap — no active drift detection | Medium |
| Prediction log persistence | Monitoring gap — no queryable confidence log | Medium |
| Cache invalidation on model promotion | Serving correctness gap — stale predictions possible | Near-term |
Items not listed here (e.g., streaming ingestion, transformer architectures, ensemble stacking) are exploratory or long-term and are tracked in Architecture: Roadmap rather than this page.