Monitoring¶

Data Drift¶

Feature-drift and prediction-drift detection using Evidently.

Pure compute module — no IO, no MLflow, no Prometheus. Callers are responsible for loading data and persisting outputs.

Public API¶

compute_drift(reference_df, current_df) -> DriftResult

compute_prediction_drift(reference_df, current_df) -> PredictionDriftResult

Note: this module imports from evidently.legacy.*, the compatibility layer in Evidently 0.7.x that keeps the older Report and metric API available.

`DriftResult` `dataclass` ¶

Results of a single feature-drift analysis run.

Attributes¶

drift_score: Share of features that have drifted (Evidently share_of_drifted_columns). Range [0, 1]. Values > 0.2 are considered significant.

n_features: Total number of features evaluated.

n_drifted: Number of features whose drift was detected.

html_report: Full Evidently HTML report as bytes. Suitable for writing to a file.

Source code in src/monitoring/drift.py

@dataclass(frozen=True)
class DriftResult:
    """Results of a single feature-drift analysis run.

    Attributes
    ----------
    drift_score:
        Share of features that have drifted
        (Evidently ``share_of_drifted_columns``).
        Range [0, 1]. Values > 0.2 are considered significant.
    n_features:
        Total number of features evaluated.
    n_drifted:
        Number of features whose drift was detected.
    html_report:
        Full Evidently HTML report as bytes. Suitable for writing to a file.
    """

    drift_score: float
    n_features: int
    n_drifted: int
    html_report: bytes
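
Because the module does no IO, acting on a DriftResult is left to the caller. The sketch below shows one way that might look: it checks drift_score against the 0.2 significance level mentioned in the attribute description and writes html_report to disk. The dataclass is a local stand-in that mirrors the documented fields so the snippet runs on its own; `DRIFT_ALERT_THRESHOLD`, `is_significant`, `persist_report`, and the output path are hypothetical names, not part of the module.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


# Local stand-in mirroring the documented DriftResult fields (assumption:
# in real code this would be imported from src/monitoring/drift.py).
@dataclass(frozen=True)
class DriftResult:
    drift_score: float
    n_features: int
    n_drifted: int
    html_report: bytes


# The 0.2 cut-off is taken from the drift_score description; the constant
# name itself is hypothetical.
DRIFT_ALERT_THRESHOLD = 0.2


def is_significant(result: DriftResult) -> bool:
    """Return True when the share of drifted features exceeds the cut-off."""
    return result.drift_score > DRIFT_ALERT_THRESHOLD


def persist_report(result: DriftResult, path: Path) -> Path:
    """Write the HTML report bytes to `path`, creating parent directories."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(result.html_report)
    return path


# Made-up values, for illustration only.
example = DriftResult(
    drift_score=0.25, n_features=8, n_drifted=2, html_report=b"<html></html>"
)
report_path = persist_report(
    example, Path(tempfile.mkdtemp()) / "reports" / "drift.html"
)
print(is_significant(example), report_path.name)
```

Since html_report is already bytes, `write_bytes` needs no encoding step.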

`PredictionDriftResult` `dataclass` ¶

Results of a prediction-distribution drift analysis.

Attributes¶

prediction_drift_score: Mean drift score across the three probability columns.

n_drifted_cols: Number of probability columns (0–3) where drift was detected.

html_report: Full Evidently HTML report as bytes.

Source code in src/monitoring/drift.py

@dataclass(frozen=True)
class PredictionDriftResult:
    """Results of a prediction-distribution drift analysis.

    Attributes
    ----------
    prediction_drift_score:
        Mean drift score across the three probability columns.
    n_drifted_cols:
        Number of probability columns (0–3) where drift was detected.
    html_report:
        Full Evidently HTML report as bytes.
    """

    prediction_drift_score: float
    n_drifted_cols: int
    html_report: bytes

`compute_drift(reference_df, current_df, *, stattest_threshold=0.05)` ¶

Run Evidently dataset drift analysis on input features.

Parameters¶

reference_df: Baseline feature DataFrame (e.g. a sample from training).

current_df: Recent production feature DataFrame to compare against the baseline.

stattest_threshold: p-value threshold for per-feature drift tests (default: 0.05).

Returns¶

DriftResult

Source code in src/monitoring/drift.py

def compute_drift(
    reference_df: pd.DataFrame,
    current_df: pd.DataFrame,
    *,
    stattest_threshold: float = 0.05,
) -> DriftResult:
    """Run Evidently dataset drift analysis on input features.

    Parameters
    ----------
    reference_df:
        Baseline feature DataFrame (e.g. a sample from training).
    current_df:
        Recent production feature DataFrame to compare against the baseline.
    stattest_threshold:
        p-value threshold for per-feature drift tests (default: 0.05).

    Returns
    -------
    DriftResult
    """
    report = Report(
        metrics=[
            DatasetDriftMetric(stattest_threshold=stattest_threshold),
        ]
    )
    report.run(reference_data=reference_df, current_data=current_df)

    result: dict = report.as_dict()["metrics"][0]["result"]

    drift_score: float = float(result["share_of_drifted_columns"])
    n_features: int = int(result["number_of_columns"])
    n_drifted: int = int(result["number_of_drifted_columns"])

    html_bytes: bytes = report.get_html().encode("utf-8")

    return DriftResult(
        drift_score=drift_score,
        n_features=n_features,
        n_drifted=n_drifted,
        html_report=html_bytes,
    )
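
The three numeric fields of DriftResult are related by simple arithmetic: drift_score is the share n_drifted / n_features. The sketch below illustrates that relationship with hand-written per-feature p-values. It is not Evidently's implementation; the feature names, the p-values, the helper `summarize_p_values`, and the rule "drifted when p < threshold" are assumptions made for illustration.

```python
def summarize_p_values(
    p_values: dict[str, float], stattest_threshold: float = 0.05
) -> tuple[float, int, int]:
    """Return (drift_score, n_features, n_drifted) for per-feature p-values.

    Assumption: a feature counts as drifted when its p-value falls below
    the threshold.
    """
    if not p_values:
        raise ValueError("No features to evaluate.")
    n_features = len(p_values)
    n_drifted = sum(1 for p in p_values.values() if p < stattest_threshold)
    return n_drifted / n_features, n_features, n_drifted


# Made-up p-values for five hypothetical features.
example_p_values = {
    "feat_a": 0.01,
    "feat_b": 0.30,
    "feat_c": 0.04,
    "feat_d": 0.72,
    "feat_e": 0.50,
}
drift_score, n_features, n_drifted = summarize_p_values(example_p_values)
print(drift_score, n_features, n_drifted)
```

With two of five features below 0.05, the share is 2 / 5, which would exceed the 0.2 significance level quoted for drift_score.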

`compute_prediction_drift(reference_df, current_df, *, stattest_threshold=0.05)` ¶

Detect distribution shift in model output probabilities.

Compares the distributions of proba_home, proba_draw, and proba_away between a reference period and the current window using Evidently ColumnDriftMetric.

Parameters¶

reference_df: Predictions DataFrame for the reference period. Must contain proba_home, proba_draw, proba_away columns.

current_df: Predictions DataFrame for the current window.

stattest_threshold: p-value threshold for per-column drift tests (default: 0.05).

Returns¶

PredictionDriftResult

Source code in src/monitoring/drift.py

def compute_prediction_drift(
    reference_df: pd.DataFrame,
    current_df: pd.DataFrame,
    *,
    stattest_threshold: float = 0.05,
) -> PredictionDriftResult:
    """Detect distribution shift in model output probabilities.

    Compares the distributions of ``proba_home``, ``proba_draw``, and
    ``proba_away`` between a reference period and the current window using
    Evidently ``ColumnDriftMetric``.

    Parameters
    ----------
    reference_df:
        Predictions DataFrame for the reference period.  Must contain
        ``proba_home``, ``proba_draw``, ``proba_away`` columns.
    current_df:
        Predictions DataFrame for the current window.
    stattest_threshold:
        p-value threshold for per-column drift tests (default: 0.05).

    Returns
    -------
    PredictionDriftResult
    """
    available = [
        c for c in _PROBA_COLS if c in reference_df.columns and c in current_df.columns
    ]
    if not available:
        raise ValueError(
            f"None of {_PROBA_COLS} found in both reference and current DataFrames."
        )

    metrics = [
        ColumnDriftMetric(column_name=col, stattest_threshold=stattest_threshold)
        for col in available
    ]
    report = Report(metrics=metrics)
    report.run(
        reference_data=reference_df[available], current_data=current_df[available]
    )

    raw = report.as_dict()["metrics"]
    scores = [float(m["result"]["drift_score"]) for m in raw]
    drifted = [bool(m["result"]["drift_detected"]) for m in raw]

    mean_score = sum(scores) / len(scores)
    n_drifted_cols = sum(drifted)
    html_bytes = report.get_html().encode("utf-8")

    return PredictionDriftResult(
        prediction_drift_score=mean_score,
        n_drifted_cols=n_drifted_cols,
        html_report=html_bytes,
    )
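
The column selection and aggregation steps in the listing above can be tried without Evidently. The sketch below reproduces them on a hand-written list shaped like the per-metric entries the function reads (`result.drift_score`, `result.drift_detected`). The contents of `_PROBA_COLS` are not shown in the listing, so the tuple here is an assumption based on the column names in the docstring; the helper names and the numbers are made up.

```python
# Assumption: _PROBA_COLS holds the three column names named in the docstring.
_PROBA_COLS = ("proba_home", "proba_draw", "proba_away")


def available_columns(reference_cols, current_cols):
    """Keep only the probability columns present in both frames."""
    available = [
        c for c in _PROBA_COLS if c in reference_cols and c in current_cols
    ]
    if not available:
        raise ValueError(
            f"None of {_PROBA_COLS} found in both reference and current DataFrames."
        )
    return available


def aggregate(raw_metrics):
    """Return (mean drift score, number of drifted columns)."""
    scores = [float(m["result"]["drift_score"]) for m in raw_metrics]
    drifted = [bool(m["result"]["drift_detected"]) for m in raw_metrics]
    return sum(scores) / len(scores), sum(drifted)


# Hand-written stand-in for the per-column metric results.
raw = [
    {"result": {"drift_score": 0.5, "drift_detected": False}},
    {"result": {"drift_score": 0.25, "drift_detected": False}},
    {"result": {"drift_score": 0.0, "drift_detected": True}},
]
mean_score, n_drifted_cols = aggregate(raw)
cols = available_columns(
    ["proba_home", "proba_draw", "proba_away"], ["proba_home", "proba_away"]
)
print(mean_score, n_drifted_cols, cols)
```

As the second call shows, the function as written only raises when none of the columns are shared. If a column is missing from either frame, the mean is taken over the remaining columns rather than over all three.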

ML Quality¶

ML quality metrics: log-loss, ECE, and hit-rate.

Pure compute module — no IO, no MLflow, no Prometheus, no side effects. Callers are responsible for loading data and persisting outputs.

Public API¶

compute_ml_quality(y_true, y_proba, label_order) -> MLQualityResult

`MLQualityResult` `dataclass` ¶

Quality metrics for a batch of finished-match predictions.

Attributes¶

n_matches: Number of finished matches in the evaluated window.

logloss: Multi-class log-loss (lower is better; random baseline ≈ 1.099).

ece: Expected Calibration Error on the predicted (argmax) outcome. Range [0, 1]. Values > 0.05 indicate significant miscalibration.

hit_rate: Fraction of matches where the predicted class equals the outcome.

hit_rate_home, hit_rate_draw, hit_rate_away: Per-outcome hit rate (correct predictions / total predictions for that class).

mean_confidence: Mean maximum probability across all predictions (model's average certainty).

Source code in src/monitoring/ml_quality.py

@dataclass(frozen=True)
class MLQualityResult:
    """Quality metrics for a batch of finished-match predictions.

    Attributes
    ----------
    n_matches:
        Number of finished matches in the evaluated window.
    logloss:
        Multi-class log-loss (lower is better; random baseline ≈ 1.099).
    ece:
        Expected Calibration Error on the predicted (argmax) outcome.
        Range [0, 1]. Values > 0.05 indicate significant miscalibration.
    hit_rate:
        Fraction of matches where the predicted class equals the outcome.
    hit_rate_home, hit_rate_draw, hit_rate_away:
        Per-outcome hit rate (correct predictions / total predictions
        for that class).
    mean_confidence:
        Mean maximum probability across all predictions
        (model's average certainty).
    """

    n_matches: int
    logloss: float
    ece: float
    hit_rate: float
    hit_rate_home: float
    hit_rate_draw: float
    hit_rate_away: float
    mean_confidence: float
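
The random baseline of about 1.099 quoted for logloss is ln 3, the log-loss of a model that assigns probability 1/3 to each of the three outcomes. The short check below confirms this with a plain-Python log-loss. `multiclass_log_loss` is a hypothetical helper written for this illustration and assumes the class labels are also the column indices.

```python
import math


def multiclass_log_loss(y_true, y_proba):
    """Mean negative log-probability assigned to the true class.

    Assumption: labels double as column indices (0, 1, 2).
    """
    total = sum(math.log(row[label]) for label, row in zip(y_true, y_proba))
    return -total / len(y_true)


# A model that is maximally uncertain on every match.
uniform_proba = [[1 / 3, 1 / 3, 1 / 3] for _ in range(4)]
outcomes = [0, 1, 2, 0]

baseline = multiclass_log_loss(outcomes, uniform_proba)
print(round(baseline, 4), round(math.log(3), 4))
```

A logloss above this value means the model does worse than predicting equal probabilities for home win, draw, and away win.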

`compute_ml_quality(y_true, y_proba, label_order)` ¶

Compute classification quality metrics.

Parameters¶

y_true: 1-D array of ground-truth class labels (integers matching label_order).

y_proba: 2-D array of predicted probabilities, shape (n_samples, n_classes). Column order must match label_order.

label_order: List of class integers in the same order as y_proba columns. For this project: [0, 1, 2] (0=home_win, 1=draw, 2=away_win).

Returns¶

MLQualityResult

Source code in src/monitoring/ml_quality.py

def compute_ml_quality(
    y_true: np.ndarray,
    y_proba: np.ndarray,
    label_order: list[int],
) -> MLQualityResult:
    """Compute classification quality metrics.

    Parameters
    ----------
    y_true:
        1-D array of ground-truth class labels (integers matching *label_order*).
    y_proba:
        2-D array of predicted probabilities, shape (n_samples, n_classes).
        Column order must match *label_order*.
    label_order:
        List of class integers in the same order as *y_proba* columns.
        For this project: ``[0, 1, 2]`` (0=home_win, 1=draw, 2=away_win).

    Returns
    -------
    MLQualityResult
    """
    if len(y_true) == 0:
        raise ValueError("y_true is empty — no finished matches to evaluate.")
    if y_proba.ndim != 2 or y_proba.shape[1] != len(label_order):
        raise ValueError(
            f"y_proba shape {y_proba.shape} does not match label_order length {len(label_order)}."
        )

    n_matches = len(y_true)
    logloss = float(log_loss(y_true, y_proba, labels=label_order))
    ece = _compute_ece(y_true, y_proba)

    y_pred = y_proba.argmax(axis=1)
    # Map argmax column index → actual class label
    y_pred_labels = np.array([label_order[i] for i in y_pred])

    hit_rate = float((y_pred_labels == y_true).mean())

    # Column index of each outcome label within label_order; falls back to the
    # conventional position (0, 1, 2) when a label is absent.
    label_to_idx = {lbl: idx for idx, lbl in enumerate(label_order)}
    home_idx = label_to_idx.get(0, 0)
    draw_idx = label_to_idx.get(1, 1)
    away_idx = label_to_idx.get(2, 2)

    hit_rate_home = _hit_rate_by_class(y_true, y_pred_labels, label_order[home_idx])
    hit_rate_draw = _hit_rate_by_class(y_true, y_pred_labels, label_order[draw_idx])
    hit_rate_away = _hit_rate_by_class(y_true, y_pred_labels, label_order[away_idx])
    mean_confidence = float(y_proba.max(axis=1).mean())

    return MLQualityResult(
        n_matches=n_matches,
        logloss=logloss,
        ece=ece,
        hit_rate=hit_rate,
        hit_rate_home=hit_rate_home,
        hit_rate_draw=hit_rate_draw,
        hit_rate_away=hit_rate_away,
        mean_confidence=mean_confidence,
    )
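
The listing calls `_compute_ece`, whose body is not shown here. As a rough guide to what "Expected Calibration Error on the predicted (argmax) outcome" usually means, the sketch below implements a common top-label ECE with equal-width confidence bins. The function name, the `n_bins=10` default, and the binning scheme are assumptions and may differ from the project's helper.

```python
import numpy as np


def expected_calibration_error(y_true, y_proba, label_order, n_bins=10):
    """Top-label ECE with equal-width bins (hypothetical stand-in).

    For each bin of predicted confidence, compare the mean confidence with
    the observed accuracy, then average the gaps weighted by bin size.
    """
    y_true = np.asarray(y_true)
    y_proba = np.asarray(y_proba, dtype=float)

    confidence = y_proba.max(axis=1)
    # Map argmax column index to the actual class label.
    predicted = np.asarray(label_order)[y_proba.argmax(axis=1)]
    correct = (predicted == y_true).astype(float)

    # Inner bin edges; digitize yields bin ids in [0, n_bins - 1].
    inner_edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    bin_ids = np.digitize(confidence, inner_edges, right=True)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the bin's sample share
    return float(ece)


# Made-up predictions for four matches (0=home_win, 1=draw, 2=away_win).
y_true_demo = np.array([0, 1, 2, 0])
y_proba_demo = np.array(
    [
        [0.8, 0.1, 0.1],
        [0.2, 0.6, 0.2],
        [0.3, 0.3, 0.4],
        [0.1, 0.8, 0.1],
    ]
)
ece_demo = expected_calibration_error(y_true_demo, y_proba_demo, [0, 1, 2])
print(ece_demo)
```

In this toy batch the model is underconfident on the matches it gets right at 0.4 and 0.6 confidence and overconfident at 0.8, so the weighted gap is far above the 0.05 level the attribute description treats as significant miscalibration.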
