Architecture Roadmap¶

This page documents architectural improvements in priority order. The v1.0 items are completed; all other items on this page are 📋 Planned and not implemented unless otherwise stated.

The roadmap is driven by engineering maturity gaps, not feature requests. Each item is justified by a concrete architectural need, not speculative scope expansion.

The v1.0 deliverables below were the binding scope for the initial release cycle and are now complete. The Near-term / Mid-term / Long-term sections that follow are the post-v1 backlog and were explicitly out of scope for v1.0.

v1.0 — Demo Track (✅ completed May 2026)¶

All items below were the binding Definition of Done for the initial release cycle, and all were completed as of May 24, 2026.

Definition of Done — v1.0 checklist:

| # | Criterion | Verification |
| --- | --- | --- |
| DoD-01 | Public read-only Streamlit UI lists matches with champion-model 1x2 predictions | Visit deployed UI; predictions render without operator action |
| DoD-02 | UI shows historical quality metrics (accuracy, log-loss, calibration, ROI) as information only | UI page renders metrics from MLflow / evaluation reports |
| DoD-03 | Champion model trained with production-scale parameters (not smoke mode) | `params.yaml` review + MLflow run tags |
| DoD-04 | `docs/status.md` and code agree; `tests/contract/test_pipeline_contracts.py` is green | `pytest tests/contract/ -q` + manual cross-check |
| DoD-05 | Public deployment has "demo only" disclaimer and nginx-level rate limiting | Inspect rendered UI footer + nginx config |
| DoD-06 | Quickstart executes end-to-end from a clean checkout | `docs/quickstart.md` dry-run |

v1.1 Public prediction UI (DoD-01, DoD-02)¶

Architectural reason: A read-only Streamlit UI that lists matches and renders the champion-model 1x2 prediction is the system's user-facing demonstration of the end-to-end pipeline. Without it, the value of the data, training, and serving layers is invisible to a non-operator visitor.

Delivered: src/streamlit/main.py — match list with champion-model 1x2 predictions, Fonbet odds, Value-bet signal (>5 pp edge), Pred accuracy per match, dynamic ROI panel (Accuracy / ROI all picks / ROI value bets), Min region ROI slider, and filters (Region / Status / Period). APIClient covers all /predict/*, /livescores/, and /predict/region-roi/ endpoints. Demo disclaimer rendered on every page.
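
The ">5 pp edge" rule for the Value-bet signal can be illustrated with a small sketch. It assumes that the edge is the model probability minus the bookmaker-implied probability (1 / decimal odds, with the bookmaker margin left in); the source does not state the exact definition used in src/streamlit/main.py, and the function names here are hypothetical.

```python
def implied_probability(decimal_odds: float) -> float:
    """Bookmaker-implied probability from decimal odds (margin not removed)."""
    if decimal_odds <= 1.0:
        raise ValueError("decimal odds must be greater than 1.0")
    return 1.0 / decimal_odds


def is_value_bet(model_prob: float, decimal_odds: float, min_edge: float = 0.05) -> bool:
    """True when the model probability exceeds the implied one by more than min_edge.

    Assumption: edge = model_prob - 1 / decimal_odds; min_edge=0.05 mirrors
    the ">5 pp" threshold named in the text.
    """
    return model_prob - implied_probability(decimal_odds) > min_edge
```

For example, a model probability of 0.50 against odds of 2.50 (implied 0.40) gives a 10 pp edge and would be flagged, while 0.42 against the same odds gives a 2 pp edge and would not.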

v1.2 Production training parameters (DoD-03)¶

Architectural reason: Before v1.0, params.yaml was in smoke mode (classification.fracs_for_train=[0.001, 0.002], tuning.n_trials=2). A model trained with these parameters cannot be honestly described as a champion. v1.0 required the registered champion to be trained with parameters representative of the production regime.

Delivered: Production-scale parameters active (classification.frac=0.01, tuning.n_trials=20); champion registered in a non-smoke experiment via a full dvc repro cycle.

v1.3 Docs ↔ code reconciliation (DoD-04)¶

Architectural reason: Several docs/status.md claims contradicted the code (UI Streamlit predictions claim, GE-gate naming), and the contract test in tests/contract/test_pipeline_contracts.py was CI-red because EXPECTED_STAGES referenced validate_interim, which was absent from dvc.yaml. Documentation that contradicts code is worse than no documentation.

Delivered: All known contradictions resolved; tests/contract/test_pipeline_contracts.py is green; every ✅ Operational claim in docs/status.md is supported by code.
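
The kind of check that fails in this situation can be sketched as follows. This is an illustration only, not the content of tests/contract/test_pipeline_contracts.py: the helper name `missing_stages` is hypothetical, and it assumes dvc.yaml has already been parsed into a dictionary with a top-level `stages` mapping.

```python
def missing_stages(expected_stages, dvc_config):
    """Return the expected stage names that the parsed dvc.yaml does not define."""
    defined = set(dvc_config.get("stages", {}))
    return sorted(set(expected_stages) - defined)


# Hypothetical parsed dvc.yaml that defines register_model but not validate_interim.
example_config = {"stages": {"register_model": {"cmd": "python src/pipelines/register_model.py"}}}
stale = missing_stages(["register_model", "validate_interim"], example_config)
```

A contract test would assert that the returned list is empty; a stale entry such as validate_interim makes the assertion fail until either the expectation or dvc.yaml is updated.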

v1.4 Public-surface guardrails (DoD-05)¶

Architectural reason: The public deployment is intentionally unauthenticated (see Non-Goals). To make this safe, the surface must be read-only, rate-limited, and clearly labelled as a demo.

Delivered: nginx ingress rate limiting configured; "demo only — not betting advice" disclaimer rendered on every UI page; CORS narrowed to the deployed UI origin.

Explicitly deferred from v1.0 (kept here for traceability): champion-vs-challenger gate (R6), model hot-reload on alias change (R3), automated retrain DAG (R2 / R5 / D-03), Evidently drift detection (R7), Grafana dashboards + Prometheus alerting (OPS-04, OR-04), authenticated /predict/* (SRV-01), online model selection from UI, neural-network challengers. These remain in the Near-term / Mid-term / Long-term sections below.

Near-term (0–3 months, post-v1)¶

1. Automated Staging → Production Promotion Policy¶

Current state: Model promotion from Staging to Production (champion alias) is manual. A reviewer must inspect MLflow metrics and manually update the alias.

Problem: Manual gates are reliable only when followed consistently. A promotion that skips review can silently degrade model quality.

Target: Define an explicit metric threshold policy (e.g., log_loss < X on holdout set) enforced by the register_model DVC stage or a post-training CI step. The system should block promotion if the policy is not met, and optionally notify the operator.

Scope: src/pipelines/register_model.py + MLflow client automation + CI gate.
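
A minimal sketch of such a policy check, assuming holdout metrics are available as a plain dictionary. `PromotionPolicy`, `check_promotion`, and the threshold value are hypothetical; the roadmap leaves the threshold X open, and the MLflow alias update itself is deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class PromotionPolicy:
    # Hypothetical threshold; the roadmap does not fix the value of X.
    max_log_loss: float


def check_promotion(holdout_metrics: Dict[str, float], policy: PromotionPolicy) -> Tuple[bool, str]:
    """Return (allowed, reason) for promoting a candidate model to champion."""
    log_loss = holdout_metrics.get("log_loss")
    if log_loss is None:
        # A missing metric blocks promotion rather than passing silently.
        return False, "log_loss missing from holdout metrics"
    if log_loss >= policy.max_log_loss:
        return False, f"log_loss {log_loss:.4f} is not below {policy.max_log_loss:.4f}"
    return True, "policy satisfied"
```

A register_model stage or CI step could call `check_promotion` and exit non-zero when the first element is False, using the reason string for the operator notification.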

2. Grafana Dashboards¶

Architectural reason: Observability is a stated quality attribute of this system. Prometheus metrics are already collected; the gap is visualization. Without dashboards, the observability layer is instrumented but not operationally usable.

Current state: Prometheus collects metrics across FastAPI, Celery workers, RabbitMQ, and cluster infrastructure. Grafana is deployed but dashboards are not defined.

Problem: Metrics are not actionable without a dashboard — an operator cannot assess service health at a glance.

Target: Define and provision at minimum:

- Inference service dashboard (request rate, p50/p95 latency, error rate, cache hit ratio)
- Celery queue dashboard (queue depth per queue, task processing rate)
- Infrastructure dashboard (CPU, memory, node metrics from kube-state-metrics + node-exporter)

Scope: Grafana dashboard JSON definitions in k8s/helm/monitoring/.

3. Prometheus Alerting Rules¶

Architectural reason: The system's reliability requirement depends on detecting failures before they become extended outages. Purely reactive detection via manual inspection does not meet single-maintainer operability requirements.

Current state: Prometheus scrapes metrics but no alerting rules are configured. Failures are detected reactively (Airflow UI, K8s events, or manual log inspection).

Target: Define alerting rules for:

- API error rate > threshold
- Celery queue depth > threshold (stuck inference)
- No scraping job completed in 24 h
- Pod CrashLoopBackOff

Scope: Prometheus alerting rules in k8s/helm/monitoring/.

Mid-term (3–9 months)¶

4. Evidently Offline Drift Reports¶

Current state: Drift detection is architecturally designed but not implemented. The system logs prediction inputs but does not analyze distribution shifts.

Target: Scheduled batch job (Airflow DAG) that:

1. Loads recent prediction inputs from PostgreSQL or MinIO.
2. Runs Evidently comparison against the training data distribution.
3. Writes HTML report to MinIO.
4. Links report from docs/evidence/monitoring.md.

Scope: New Airflow DAG + src/monitoring/drift.py + MinIO artifact store + MkDocs link.

Not yet: No automated retraining trigger based on drift (see item 5).
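
To make the comparison step concrete, here is a standard-library sketch of one common drift statistic, the Population Stability Index (PSI), for a single numeric feature. This is not Evidently's API and not the planned src/monitoring/drift.py; it is a stand-in that only illustrates what comparing recent inputs against the training distribution means. The function name, bin count, and epsilon are assumptions.

```python
import math
from typing import List, Sequence


def population_stability_index(
    reference: Sequence[float], current: Sequence[float], bins: int = 10
) -> float:
    """PSI of `current` against `reference`, using equal-width bins over the reference range."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant reference feature

    def bin_shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range are clamped into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor keeps empty bins from producing log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = bin_shares(reference), bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))
```

Identical samples give a PSI of zero, and the value grows as the recent distribution moves away from the training one. A frequently cited rule of thumb treats values above roughly 0.25 as a significant shift, but any threshold used here would need to be validated against this project's data.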

5. Formalized Retraining Triggers¶

Architectural reason: The system's prediction quality degrades over time as match statistics evolve (team form, tactical changes, new seasons). Without a defined trigger, the model training cadence is undocumented, ad hoc, and dependent on operator judgment rather than system policy.

Current state: Retraining is manual — the operator runs dvc repro when new data is available.

Target: Define and implement at least one of:

- Time-based trigger (Airflow DAG at fixed cadence: weekly/monthly).
- Data-volume trigger (new N matches ingested since last training run).
- Drift trigger (Evidently report exceeds threshold; depends on item 4).

Scope: Airflow DAG + trigger condition logic + CI/CD integration with dvc repro.
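
The three trigger types can be expressed as one small decision function, sketched below. All names and default values (`TriggerPolicy`, a 7-day cadence, 500 new matches) are hypothetical placeholders, not decided policy; the drift check stays disabled until a drift score exists (item 4).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class TriggerPolicy:
    max_model_age: timedelta = timedelta(days=7)  # hypothetical weekly cadence
    min_new_matches: int = 500                    # hypothetical value of N
    max_drift_score: Optional[float] = None       # None disables the drift trigger


def retraining_reasons(
    now: datetime,
    last_trained_at: datetime,
    new_matches: int,
    drift_score: Optional[float],
    policy: TriggerPolicy,
) -> List[str]:
    """Return the names of all triggers that currently fire (empty list means no retrain)."""
    reasons = []
    if now - last_trained_at >= policy.max_model_age:
        reasons.append("time")
    if new_matches >= policy.min_new_matches:
        reasons.append("data_volume")
    if (
        policy.max_drift_score is not None
        and drift_score is not None
        and drift_score > policy.max_drift_score
    ):
        reasons.append("drift")
    return reasons
```

An Airflow task could call this function and start dvc repro only when the returned list is non-empty, logging the reasons so that each retraining run records why it was started.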

6. Cache Invalidation on Model Promotion¶

Current state: Redis cache is TTL-based. When a new model is promoted to champion, stale predictions from the previous model remain in cache until TTL expires.

Target: On model promotion, emit an event (or hook) that flushes the Redis prediction cache. Mechanism: post-promotion script or Celery task triggered by registry alias change.

Scope: src/app/tasks/ + model registration script.
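
A sketch of the flush step, assuming prediction entries share a common key prefix. The prefix `prediction:` and the function name are hypothetical. The client is duck-typed: it only needs `scan_iter(match=...)` and `delete(*names)`, which redis-py's `Redis` client provides, so the same function can be exercised against an in-memory fake.

```python
from typing import Iterable, List, Protocol


class CacheClient(Protocol):
    """The subset of the redis-py client interface that this sketch relies on."""

    def scan_iter(self, match: str) -> Iterable: ...

    def delete(self, *names) -> int: ...


def flush_prediction_cache(
    client: CacheClient, prefix: str = "prediction:", batch_size: int = 500
) -> int:
    """Delete every key that starts with `prefix`; return the number of keys removed."""
    removed = 0
    batch: List = []
    # SCAN-based iteration avoids blocking the server the way KEYS would.
    for key in client.scan_iter(match=f"{prefix}*"):
        batch.append(key)
        if len(batch) >= batch_size:
            removed += client.delete(*batch)
            batch = []
    if batch:
        removed += client.delete(*batch)
    return removed
```

Deleting only prefixed keys, rather than flushing the whole database, leaves any unrelated data held in the same Redis instance untouched. A post-promotion script or Celery task could call this function right after the champion alias changes.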

Long-term (9+ months)¶

7. High-Availability Kubernetes (if scale justifies)¶

Current state: Single-node K8s on healserver. No HA.

Consideration: If prediction volume or data ingestion frequency grows significantly, or if the project moves toward multi-user / multi-tenant serving, a managed K8s cluster (GKE, EKS, or AKS) would provide automatic failover, node autoscaling, and managed control plane.

Decision criteria: Volume > ~1,000 requests/day, or sustained operational issues with single-node.

Note: Helm charts are already parameterized for portability. Migration requires only config changes.

8. Online Feature Store¶

Current state: Features are assembled at inference time from historical rolling statistics. This works for the current prediction horizon (future matches known in advance).

Consideration: If the prediction use case expands to include in-game or near-real-time events, an online feature store (e.g., Feast, Hopsworks, or Redis-backed feature registry) would provide low-latency feature retrieval without repeated computation.

Decision criteria: Use case requires features updated faster than batch pipeline cadence.

9. Streaming Ingestion¶

Current state: Data is ingested in scheduled batches (Airflow DAG → Selenoid → PostgreSQL). This matches the current prediction use case: future matches are known in advance and predictions do not need to respond to sub-hour events.

Consideration: Only justified if the prediction use case changes to require in-game or near-real-time event data. No such requirement exists today.

Decision criteria: New prediction targets requiring sub-hour data freshness AND a data provider that supports streaming delivery. Both conditions must hold; absent them, batch ingestion is correct.

What Is Not on the Roadmap¶

Betting execution or portfolio management automation.
Support for sports other than football.
Multi-tenant user management or per-user prediction APIs.
Real-time UI beyond the existing Streamlit interface.

Implementation Status — current state of all components
Failure Modes — gaps addressed by near-term items
Trade-offs — decisions that constrain or enable roadmap items
Architecture Principles — principles that govern prioritization