Architecture Roadmap¶
This page documents planned architectural improvements in priority order. All items on this page are 📋 Planned — none are implemented unless otherwise stated.
The roadmap is driven by engineering maturity gaps, not feature requests. Each item is justified by a concrete architectural need, not speculative scope expansion.
The v1.0 deliverables below were the binding scope for the initial release cycle and are now complete. The Near-term / Mid-term / Long-term sections that follow are the post-v1 backlog and were explicitly out of scope for v1.0.
v1.0 — Demo Track (✅ completed May 2026)¶
All items below were the binding Definition of Done for the initial release cycle. All are completed as of May 24, 2026. The Near-term / Mid-term / Long-term sections that follow are the post-v1 backlog.
Definition of Done — v1.0 checklist:
| # | Criterion | Verification |
|---|---|---|
| DoD-01 | Public read-only Streamlit UI lists matches with champion-model 1x2 predictions | Visit deployed UI; predictions render without operator action |
| DoD-02 | UI shows historical quality metrics (accuracy, log-loss, calibration, ROI) as information only | UI page renders metrics from MLflow / evaluation reports |
| DoD-03 | Champion model trained with production-scale parameters (not smoke mode) | params.yaml review + MLflow run tags |
| DoD-04 | docs/status.md and code agree; tests/contract/test_pipeline_contracts.py is green | pytest tests/contract/ -q + manual cross-check |
| DoD-05 | Public deployment has "demo only" disclaimer and nginx-level rate limiting | Inspect rendered UI footer + nginx config |
| DoD-06 | Quickstart executes end-to-end from a clean checkout | docs/quickstart.md dry-run |
v1.1 Public prediction UI (DoD-01, DoD-02)¶
Architectural reason: A read-only Streamlit UI that lists matches and renders the champion-model 1x2 prediction is the system's user-facing demonstration of the end-to-end pipeline. Without it, the value of the data, training, and serving layers is invisible to a non-operator visitor.
Delivered: src/ui/app/main.py — match list with champion-model 1x2 predictions, Fonbet odds, Value-bet signal (>5 pp edge), Pred accuracy per match, dynamic ROI panel (Accuracy / ROI all picks / ROI value bets), Min region ROI slider, and filters (Region / Status / Period). APIClient covers all /predict/*, /livescores/, and /predict/region-roi/ endpoints. Demo disclaimer rendered on every page.
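The Value-bet signal can be illustrated with a minimal sketch. It assumes decimal odds and takes the bookmaker's implied probability as 1/odds without removing the margin. The function names and the exact edge definition are illustrative assumptions, not the code in src/ui/app/main.py.

```python
from typing import Optional

EDGE_THRESHOLD = 0.05  # 5 percentage points, as stated for the Value-bet signal


def implied_probability(decimal_odds: float) -> float:
    """Convert decimal odds to an implied probability (bookmaker margin not removed)."""
    if decimal_odds <= 1.0:
        raise ValueError("decimal odds must be greater than 1.0")
    return 1.0 / decimal_odds


def value_bet_outcome(model_probs: dict[str, float], odds: dict[str, float]) -> Optional[str]:
    """Return the 1x2 outcome with the largest edge above the threshold, or None."""
    best_outcome, best_edge = None, EDGE_THRESHOLD
    for outcome, p_model in model_probs.items():
        # Edge = model probability minus the probability implied by the odds.
        edge = p_model - implied_probability(odds[outcome])
        if edge > best_edge:
            best_outcome, best_edge = outcome, edge
    return best_outcome


if __name__ == "__main__":
    probs = {"1": 0.55, "x": 0.25, "2": 0.20}   # hypothetical champion-model output
    odds = {"1": 2.2, "x": 3.4, "2": 3.6}       # hypothetical decimal odds
    print(value_bet_outcome(probs, odds))
```

Whether the deployed UI normalizes implied probabilities for the bookmaker margin is not stated here; doing so would raise the measured edge on every outcome.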
v1.2 Production training parameters (DoD-03)¶
Architectural reason: Before v1.0, params.yaml was in smoke mode (classification.fracs_for_train=[0.001, 0.002], tuning.n_trials=2). A model trained with these parameters cannot be honestly described as a champion, so v1.0 required the registered champion to be trained with parameters representative of the production regime.
Delivered: Production-scale parameters active (classification.frac=0.01, tuning.n_trials=20); champion registered in a non-smoke experiment via a full dvc repro cycle.
v1.3 Docs ↔ code reconciliation (DoD-04)¶
Architectural reason: Several docs/status.md claims contradicted the code (UI Streamlit predictions claim, GE-gate naming), and the contract test in tests/contract/test_pipeline_contracts.py was CI-red because EXPECTED_STAGES referenced validate_interim, which was absent from dvc.yaml. Documentation that contradicts code is worse than no documentation.
Delivered: All known contradictions resolved; tests/contract/test_pipeline_contracts.py is green; every ✅ Operational claim in docs/status.md is supported by code.
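The kind of check that such a contract test performs can be sketched as follows. This is an assumption about its shape, not the contents of tests/contract/test_pipeline_contracts.py: the stage names in EXPECTED_STAGES are hypothetical, and the helper takes an already-parsed dvc.yaml mapping so the example stays dependency-free.

```python
EXPECTED_STAGES = {"prepare", "train", "register_model"}  # hypothetical stage names


def missing_stages(expected: set[str], dvc_config: dict) -> set[str]:
    """Return expected stage names that are absent from a parsed dvc.yaml mapping."""
    declared = set(dvc_config.get("stages") or {})
    return set(expected) - declared


def test_expected_stages_exist():
    # In a real test, dvc_config would come from parsing dvc.yaml
    # (for example with yaml.safe_load); it is inlined here for illustration.
    dvc_config = {"stages": {"prepare": {}, "train": {}, "register_model": {}}}
    assert missing_stages(EXPECTED_STAGES, dvc_config) == set()
```

Reporting the set of missing names, rather than a bare boolean, makes a failure such as the validate_interim mismatch self-explanatory in the CI log.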
v1.4 Public-surface guardrails (DoD-05)¶
Architectural reason: The public deployment is intentionally unauthenticated (see Non-Goals). To make this safe, the surface must be read-only, rate-limited, and clearly labelled as a demo.
Delivered: nginx ingress rate limiting configured; "demo only — not betting advice" disclaimer rendered on every UI page; CORS narrowed to the deployed UI origin.
Explicitly deferred from v1.0 (kept here for traceability): champion-vs-challenger gate (R6), model hot-reload on alias change (R3), automated retrain DAG (R2 / R5 / D-03), Evidently drift detection (R7), Grafana dashboards + Prometheus alerting (OPS-04, OR-04), authenticated /predict/* (SRV-01), online model selection from UI, neural-network challengers. These remain in the Near-term / Mid-term / Long-term sections below.
Near-term (0–3 months, post-v1)¶
1. Automated Staging → Production Promotion Policy¶
Current state: Model promotion from Staging to Production (champion alias) is manual. A reviewer must inspect MLflow metrics and manually update the alias.
Problem: A manual gate is reliable only when it is followed consistently. A promotion made without review can degrade model quality silently.
Target: Define an explicit metric threshold policy (e.g., log_loss < X on holdout set) enforced by the register_model DVC stage or a post-training CI step. The system should block promotion if the policy is not met, and optionally notify the operator.
Scope: src/pipelines/register_model.py + MLflow client automation + CI gate.
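A threshold policy of this kind could look roughly like the sketch below. The PromotionPolicy class, the promotion_decision function, and the "log_loss" metric key are hypothetical names chosen for illustration; the actual threshold X is still to be defined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromotionPolicy:
    # Candidate must score strictly below this value on the holdout set.
    max_log_loss: float


def promotion_decision(metrics: dict[str, float], policy: PromotionPolicy) -> tuple[bool, str]:
    """Return (allowed, reason) for promoting a candidate to the champion alias."""
    log_loss = metrics.get("log_loss")
    if log_loss is None:
        # A missing metric blocks promotion rather than passing silently.
        return False, "holdout log_loss is missing; blocking promotion"
    if log_loss >= policy.max_log_loss:
        return False, f"log_loss {log_loss:.4f} is not below {policy.max_log_loss:.4f}"
    return True, "policy satisfied"


if __name__ == "__main__":
    allowed, reason = promotion_decision({"log_loss": 0.98}, PromotionPolicy(max_log_loss=1.0))
    print(allowed, reason)
```

Only when the decision is positive would the stage move the alias, for example through MLflow's `MlflowClient.set_registered_model_alias`; a negative decision can fail the DVC stage or CI step with the returned reason so the operator sees why promotion was blocked.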
2. Grafana Dashboards¶
Architectural reason: Observability is a stated quality attribute of this system. Prometheus metrics are already collected; the gap is visualization. Without dashboards, the observability layer is instrumented but not operationally usable.
Current state: Prometheus collects metrics across FastAPI, Celery workers, RabbitMQ, and cluster infrastructure. Grafana is deployed but dashboards are not defined.
Problem: Metrics are not actionable without a dashboard — an operator cannot assess service health at a glance.
Target: Define and provision at minimum:
- Inference service dashboard (request rate, p50/p95 latency, error rate, cache hit ratio)
- Celery queue dashboard (queue depth per queue, task processing rate)
- Infrastructure dashboard (CPU, memory, node metrics from kube-state-metrics + node-exporter)
Scope: Grafana dashboard JSON definitions in k8s/helm/monitoring/.
3. Prometheus Alerting Rules¶
Architectural reason: The system's reliability requirement depends on detecting failures before they become extended outages. Purely reactive detection via manual inspection does not meet single-maintainer operability requirements.
Current state: Prometheus scrapes metrics but no alerting rules are configured. Failures are detected reactively (Airflow UI, K8s events, or manual log inspection).
Target: Define alerting rules for:
- API error rate > threshold
- Celery queue depth > threshold (stuck inference)
- No scraping job completed in 24 h
- Pod CrashLoopBackOff
Scope: Prometheus alerting rules in k8s/helm/monitoring/.
Mid-term (3–9 months)¶
4. Evidently Offline Drift Reports¶
Current state: Drift detection is architecturally designed but not implemented. The system logs prediction inputs but does not analyze distribution shifts.
Target: Scheduled batch job (Airflow DAG) that:
1. Loads recent prediction inputs from PostgreSQL or MinIO.
2. Runs Evidently comparison against the training data distribution.
3. Writes HTML report to MinIO.
4. Links report from docs/evidence/monitoring.md.
Scope: New Airflow DAG + src/monitoring/drift.py + MinIO artifact store + MkDocs link.
Not yet: No automated retraining trigger based on drift (see item 5).
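Step 2 of the job above compares recent inputs with the training distribution. The sketch below is not Evidently's API: it uses a hand-rolled population stability index (PSI) for a single numeric feature as a stand-in, purely to show the shape of such a comparison. The function name, bin count, and probability floor are assumptions.

```python
import math
from typing import Sequence


def psi(reference: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population stability index between two samples of one numeric feature."""
    lo, hi = min(reference), max(reference)
    # Equal-width bins over the reference range; fall back to 1.0 if the range is zero.
    width = (hi - lo) / bins or 1.0

    def shares(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range are clipped into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


if __name__ == "__main__":
    training = [i / 100 for i in range(100)]        # hypothetical training feature values
    recent = [0.5 + i / 200 for i in range(100)]    # hypothetical recent inputs, shifted upward
    print(f"PSI = {psi(training, recent):.3f}")
```

Identical samples give a PSI of zero, and larger values indicate a stronger shift. Evidently would replace this per-feature arithmetic with its own report generation and HTML output.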
5. Formalized Retraining Triggers¶
Architectural reason: The system's prediction quality degrades over time as match statistics evolve (team form, tactical changes, new seasons). Without a defined trigger, the model training cadence is undocumented, ad hoc, and dependent on operator judgment rather than system policy.
Current state: Retraining is manual — the operator runs dvc repro when new data is available.
Target: Define and implement at least one of:
- Time-based trigger (Airflow DAG at fixed cadence: weekly/monthly).
- Data-volume trigger (new N matches ingested since last training run).
- Drift trigger (Evidently report exceeds threshold; depends on item 4).
Scope: Airflow DAG + trigger condition logic + CI/CD integration with dvc repro.
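The three trigger types can be combined in a single condition function, sketched below under stated assumptions. TriggerPolicy, retrain_reasons, and all default thresholds (weekly cadence, 500 new matches) are hypothetical placeholders, and the drift check stays disabled until a drift score exists.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class TriggerPolicy:
    max_age: timedelta = timedelta(days=7)    # time-based cadence (assumed weekly)
    min_new_matches: int = 500                # data-volume threshold (hypothetical N)
    max_drift_score: Optional[float] = None   # None disables the drift trigger


def retrain_reasons(
    now: datetime,
    last_trained: datetime,
    new_matches: int,
    drift_score: Optional[float],
    policy: TriggerPolicy,
) -> list[str]:
    """Return the names of all triggers that fire; an empty list means no retraining."""
    reasons = []
    if now - last_trained >= policy.max_age:
        reasons.append("time")
    if new_matches >= policy.min_new_matches:
        reasons.append("data_volume")
    if (
        policy.max_drift_score is not None
        and drift_score is not None
        and drift_score > policy.max_drift_score
    ):
        reasons.append("drift")
    return reasons


if __name__ == "__main__":
    fired = retrain_reasons(
        now=datetime(2026, 6, 1),
        last_trained=datetime(2026, 5, 20),
        new_matches=120,
        drift_score=None,
        policy=TriggerPolicy(),
    )
    print(fired)
```

Returning the list of reasons rather than a boolean lets the Airflow task log which policy fired before it starts dvc repro.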
6. Cache Invalidation on Model Promotion¶
Current state: Redis cache is TTL-based. When a new model is promoted to champion, stale predictions from the previous model remain in cache until TTL expires.
Target: On model promotion, emit an event (or hook) that flushes the Redis prediction cache. Mechanism: post-promotion script or Celery task triggered by registry alias change.
Scope: src/app/tasks/ + model registration script.
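A flush of this kind might be sketched as below. The "prediction:*" key pattern and the function name are assumptions, since the actual cache key scheme is not described here. The cache argument is any object exposing scan_iter and delete with the same shape as the redis-py client methods of those names.

```python
from typing import Iterable, Protocol


class KeyValueCache(Protocol):
    """Minimal interface the flush needs; redis.Redis provides both methods."""

    def scan_iter(self, match: str) -> Iterable[str]: ...
    def delete(self, *names: str) -> int: ...


PREDICTION_KEY_PATTERN = "prediction:*"  # hypothetical key namespace


def flush_prediction_cache(
    cache: KeyValueCache,
    pattern: str = PREDICTION_KEY_PATTERN,
    batch_size: int = 500,
) -> int:
    """Delete every cached key matching the pattern; return the number removed."""
    removed = 0
    batch: list[str] = []
    for key in cache.scan_iter(match=pattern):
        batch.append(key)
        if len(batch) >= batch_size:
            # Delete in batches to keep each round trip small.
            removed += cache.delete(*batch)
            batch.clear()
    if batch:
        removed += cache.delete(*batch)
    return removed
```

Scanning by pattern instead of flushing the whole database leaves unrelated keys, such as Celery results stored in the same Redis instance, untouched. A post-promotion script or Celery task could call this function right after the champion alias changes.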
Long-term (9+ months)¶
7. High-Availability Kubernetes (if scale justifies)¶
Current state: Single-node K8s on healserver. No HA.
Consideration: If prediction volume or data ingestion frequency grows significantly, or if the project moves toward multi-user / multi-tenant serving, a managed K8s cluster (GKE, EKS, or AKS) would provide automatic failover, node autoscaling, and managed control plane.
Decision criteria: Volume > ~1,000 requests/day, or sustained operational issues with single-node.
Note: Helm charts are already parameterized for portability. Migration requires only config changes.
8. Online Feature Store¶
Current state: Features are assembled at inference time from historical rolling statistics. This works for the current prediction horizon (future matches known in advance).
Consideration: If the prediction use case expands to include in-game or near-real-time events, an online feature store (e.g., Feast, Hopsworks, or Redis-backed feature registry) would provide low-latency feature retrieval without repeated computation.
Decision criteria: Use case requires features updated faster than batch pipeline cadence.
9. Streaming Ingestion¶
Current state: Data is ingested in scheduled batches (Airflow DAG → Selenoid → PostgreSQL). This matches the current prediction use case: future matches are known in advance and predictions do not need to respond to sub-hour events.
Consideration: Only justified if the prediction use case changes to require in-game or near-real-time event data. No such requirement exists today.
Decision criteria: New prediction targets requiring sub-hour data freshness AND a data provider that supports streaming delivery. Both conditions must hold; absent them, batch ingestion is correct.
What Is Not on the Roadmap¶
- Betting execution or portfolio management automation.
- Support for sports other than football.
- Multi-tenant user management or per-user prediction APIs.
- Real-time UI beyond the existing Streamlit interface.
Related¶
- Implementation Status — current state of all components
- Failure Modes — gaps addressed by near-term items
- Trade-offs — decisions that constrain or enable roadmap items
- Architecture Principles — principles that govern prioritization