Shared & Utilities¶
Shared Config¶
Shared infrastructure configuration — single source of truth for all layers.
Contains only the four env vars that every layer needs:
- MLflow tracking URI
- MinIO endpoint URL + credentials
Both src.pipelines._config.PipelineConfig and src.app.config
sub-classes extend or re-declare these same env-var aliases, so the env-var
contract is defined once here and referenced everywhere.
SharedInfraConfig¶
Bases: BaseSettings
Minimum infrastructure settings shared by all layers.
Subclass this in layer-specific settings classes; do not instantiate
directly except via get_shared_config().
Field naming uses mlflow_* / minio_* prefixes to avoid name
collisions when composed into larger settings objects.
src.app.config classes use shorter field names (tracking_uri,
access_key, ...) for ergonomics; they share only the env-var aliases.
Source code in src/shared/config.py
get_shared_config() (cached)¶
Return the shared infrastructure config singleton (lazy, cached).
Use get_shared_config.cache_clear() in tests to force re-instantiation
after changing env vars with monkeypatch.
Returns:
| Type | Description |
|---|---|
| SharedInfraConfig | The singleton shared infrastructure config instance. |
Source code in src/shared/config.py
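The caching behaviour described for get_shared_config() can be illustrated with a stdlib-only sketch. It is not the project's implementation: InfraConfigSketch is a hypothetical dataclass standing in for the BaseSettings subclass, and the env-var names and defaults are assumptions made for the example. It shows why tests need cache_clear() after changing env vars.

```python
import os
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class InfraConfigSketch:
    """Hypothetical stdlib stand-in for SharedInfraConfig."""

    mlflow_tracking_uri: str
    minio_endpoint_url: str


@lru_cache(maxsize=1)
def get_config_sketch() -> InfraConfigSketch:
    """Build the config on first call, then return the cached instance."""
    # Assumption: env-var names and defaults are illustrative only.
    return InfraConfigSketch(
        mlflow_tracking_uri=os.environ.get("MLFLOW_TRACKING_URI", "http://localhost:5000"),
        minio_endpoint_url=os.environ.get("MINIO_ENDPOINT_URL", "http://localhost:9000"),
    )


def demo() -> tuple[str, str, str]:
    """Show that a changed env var is only picked up after cache_clear()."""
    os.environ["MLFLOW_TRACKING_URI"] = "http://first:5000"
    get_config_sketch.cache_clear()
    first = get_config_sketch().mlflow_tracking_uri

    os.environ["MLFLOW_TRACKING_URI"] = "http://second:5000"
    stale = get_config_sketch().mlflow_tracking_uri  # still the cached value

    get_config_sketch.cache_clear()
    fresh = get_config_sketch().mlflow_tracking_uri  # re-read from the environment
    return first, stale, fresh
```

In a pytest test, the same pattern would typically be monkeypatch.setenv(...) followed by get_shared_config.cache_clear().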
MLflow Metadata¶
MLflow metadata utilities: data lineage, pipeline context, model config.
Provides helpers to build consistent MLflow tag/param dicts so every run in the same experiment is unambiguously identifiable by its data snapshot, run mode, and model configuration.
Naming convention¶
Tags (context / lineage / run identity, not affecting model training): data.version, data.hash, data.source_*, data.ingested_at, data.train_start, data.train_end, data.test_start, data.test_end, data.train_rows, data.test_rows, pipeline.git_sha, pipeline.dvc_exp_name, pipeline.params_hash
Params (configuration that directly affects the trained model): model.name, model.target, model.feature_count, model.hyperparams_source
Experiment structure¶
Only two experiments are used:
- matches_clf: production runs (train_eval, ablation, tuning, final_train)
- matches_clf_smoke: smoke / fast-dev runs
Note on data.source_created_at¶
The S3/MinIO HEAD-object response does NOT expose a distinct creation
timestamp — it only provides LastModified (the time the object was last
PUT/replaced). Therefore data.source_created_at cannot be determined
reliably and is intentionally NOT logged. data.source_last_modified is
the closest available proxy.
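As a rough illustration of the note above, the sketch below builds a data.source_last_modified tag from a HEAD-object style response. It assumes a response dict shaped like the one boto3's head_object returns (LastModified as a timezone-aware datetime). The function name source_tags_from_head is hypothetical and not part of this module.

```python
from datetime import datetime, timezone
from typing import Any


def source_tags_from_head(head: dict[str, Any]) -> dict[str, str]:
    """Build data.source_* tags from a HEAD-object style response dict."""
    tags: dict[str, str] = {}
    last_modified = head.get("LastModified")
    if isinstance(last_modified, datetime):
        # Normalize to UTC ISO-8601 so tag values compare consistently as strings.
        tags["data.source_last_modified"] = last_modified.astimezone(timezone.utc).isoformat()
    # No data.source_created_at: the response carries no creation timestamp.
    return tags


# Hand-written stand-in for a real HEAD response (assumption, not real data).
example_head = {
    "LastModified": datetime(2024, 5, 1, 12, 30, tzinfo=timezone.utc),
    "ContentLength": 1024,
}
example_tags = source_tags_from_head(example_head)
```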
derive_features_profile(feat_params)¶
Derive a human-readable feature-profile label from a params dict.
Logic mirrors the ablation variant naming convention.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feat_params | dict | Feature configuration dict with boolean keys | required |
Returns:
| Type | Description |
|---|---|
| str | One of the feature-profile labels |
Source code in src/utils/mlflow_meta.py
set_experiment_active(experiment_name)¶
Set the active MLflow experiment, restoring it first if it was deleted.
MLflow raises MlflowException when set_experiment is called on a
soft-deleted experiment. This helper detects that state and calls
MlflowClient.restore_experiment before delegating to the standard
mlflow.set_experiment.
Parameters¶
experiment_name: The experiment name to activate.
Returns¶
mlflow.entities.Experiment
The active experiment object.
Source code in src/utils/mlflow_meta.py
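The restore-then-activate flow described for set_experiment_active can be sketched as follows. This is a minimal sketch rather than the module's actual code: the client and the set_experiment callable are passed in so the example runs without MLflow installed. With MLflow available, an MlflowClient instance and mlflow.set_experiment would fill those roles, and the function name set_experiment_active_sketch is hypothetical.

```python
from typing import Any, Callable


def set_experiment_active_sketch(
    experiment_name: str,
    client: Any,
    set_experiment: Callable[[str], Any],
) -> Any:
    """Restore a soft-deleted experiment if needed, then activate it."""
    existing = client.get_experiment_by_name(experiment_name)
    if existing is not None and existing.lifecycle_stage == "deleted":
        # restore_experiment takes the experiment ID, not the name.
        client.restore_experiment(existing.experiment_id)
    # Delegate to the standard activation call and return its result.
    return set_experiment(experiment_name)
```

A call with real MLflow objects would look like set_experiment_active_sketch("matches_clf", MlflowClient(), mlflow.set_experiment).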
build_pipeline_context_tags()¶
Build MLflow tags describing the git/pipeline context of a run.
Returns:
| Type | Description |
|---|---|
| dict[str, str] | Dict with string tag keys and values |
Source code in src/utils/mlflow_meta.py
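One plausible way to assemble the three pipeline.* tags named in the naming convention is sketched below. How the real function obtains each value is not documented here, so every source is an assumption: the SHA comes from git rev-parse HEAD, the params hash is a SHA-256 of a params file, and the experiment name is read from a DVC_EXP_NAME environment variable. Both function names are hypothetical.

```python
import hashlib
import os
import subprocess
from pathlib import Path


def git_sha_or_unknown() -> str:
    """Return the current commit SHA, or 'unknown' outside a git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return "unknown"
    return result.stdout.strip()


def pipeline_context_tags_sketch(params_path: Path) -> dict[str, str]:
    """Assemble pipeline.* tags; every value is a plain string."""
    params_hash = hashlib.sha256(params_path.read_bytes()).hexdigest()
    return {
        "pipeline.git_sha": git_sha_or_unknown(),
        # Assumption: the DVC experiment name is exposed via DVC_EXP_NAME.
        "pipeline.dvc_exp_name": os.environ.get("DVC_EXP_NAME", "none"),
        "pipeline.params_hash": params_hash,
    }
```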
build_data_lineage_tags(dataset_path, df_train=None, df_test=None, dvc_hash=None, raw_source_path=None)¶
Build MLflow tags describing dataset lineage for a training run.
Parameters¶
dataset_path:
Path to the processed dataset parquet (used to locate the MinIO sidecar
if raw_source_path is not given).
df_train:
Training split DataFrame. If it contains startTimeUtc, temporal
bounds and row count are logged.
df_test:
Test/holdout split DataFrame. Same conditions as df_train.
dvc_hash:
DVC MD5 hash of the dataset artifact (from get_dvc_hash). Logged
as both data.version and data.hash.
raw_source_path:
Explicit path to the raw source parquet whose .minio.json sidecar
contains the MinIO object metadata. Defaults to
data/raw/match_raw.parquet relative to the project root.
Source code in src/utils/mlflow_meta.py
build_features_selected_params(feat_params)¶
Build MLflow params describing the features_selected configuration.
Logs the exact feature set used for tuning, final_train, and inference so any downstream run can be reproduced without inspecting params.yaml.
Parameters¶
feat_params:
Dict from params["features_selected"] (or params["classification"]
as a fallback). Expected keys: side, window_sizes,
include_elo, include_rest_days, include_h2h, cat_cols.
Returns¶
dict[str, Any]
MLflow param dict keyed by features.* names.
Source code in src/utils/mlflow_meta.py
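A minimal sketch of the flattening described above, assuming each expected key maps to a features.<key> param and that list values are JSON-encoded. The real param names and encoding may differ, and features_selected_params_sketch is a hypothetical name.

```python
import json
from typing import Any

# Keys listed in the docstring above.
EXPECTED_KEYS = (
    "side",
    "window_sizes",
    "include_elo",
    "include_rest_days",
    "include_h2h",
    "cat_cols",
)


def features_selected_params_sketch(feat_params: dict[str, Any]) -> dict[str, Any]:
    """Flatten the features_selected config into features.* params."""
    params: dict[str, Any] = {}
    for key in EXPECTED_KEYS:
        if key not in feat_params:
            continue  # skip missing keys rather than logging a placeholder
        value = feat_params[key]
        # Assumption: lists are JSON-encoded so the logged value stays parseable.
        if isinstance(value, (list, tuple)):
            value = json.dumps(value)
        params[f"features.{key}"] = value
    return params
```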
build_model_metadata_params(model_name, target, num_feature_count, cat_feature_count, best_params=None)¶
Build MLflow params describing model configuration.
Parameters¶
model_name:
Short algorithm identifier (e.g. 'xgb', 'logreg').
target:
Target column name.
num_feature_count:
Number of numeric feature columns.
cat_feature_count:
Number of categorical feature columns.
best_params:
Tuned hyperparameter dict. When non-empty, model.hyperparams_source
is 'tuned'; otherwise 'default'.
Returns¶
dict[str, Any]
MLflow param dict keyed by model.* names.
Note: model.family is intentionally NOT included here — it is a
contextual tag, not a reproducibility param. Use build_run_scope_tags
to log it.
Source code in src/utils/mlflow_meta.py
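The parameter rules above can be sketched as a small function. The 'tuned' / 'default' rule follows the description directly. That model.feature_count is the sum of the numeric and categorical counts is an assumption, and model_metadata_params_sketch is a hypothetical name.

```python
from typing import Any, Optional


def model_metadata_params_sketch(
    model_name: str,
    target: str,
    num_feature_count: int,
    cat_feature_count: int,
    best_params: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Build model.* params; model.family is deliberately left out."""
    return {
        "model.name": model_name,
        "model.target": target,
        # Assumption: the total is numeric plus categorical feature columns.
        "model.feature_count": num_feature_count + cat_feature_count,
        # A non-empty best_params dict marks the run as tuned.
        "model.hyperparams_source": "tuned" if best_params else "default",
    }
```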
get_features_profile(variant)¶
Return the features.profile tag value for a pipeline variant name.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| variant | str | Pipeline variant name | required |
Returns:
| Type | Description |
|---|---|
| str | Canonical features.profile value for the variant; a fallback value is returned if the variant is not in the known map. |
Source code in src/utils/mlflow_meta.py
infer_run_kind(experiment_name, stage_kind)¶
Derive pipeline.run_kind from experiment name and stage.
If 'smoke' appears in the experiment name, returns 'smoke'
regardless of stage_kind. This reflects that smoke runs are a
reduced-scale variant of any stage, not a separate stage type.
Parameters¶
experiment_name:
MLflow experiment name (e.g. 'matches_clf_smoke').
stage_kind:
Default kind when not a smoke run: 'train_eval', 'ablation',
'tuning', or 'final_train'.
Returns¶
str
'smoke' when the experiment name contains 'smoke', otherwise
stage_kind unchanged.
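The rule above is small enough to sketch in full. This version uses a case-sensitive substring check, which is an assumption, since the description does not say how case is handled. The name infer_run_kind_sketch is hypothetical.

```python
def infer_run_kind_sketch(experiment_name: str, stage_kind: str) -> str:
    """Return 'smoke' for smoke experiments, otherwise the given stage kind."""
    # Assumption: the substring match is case-sensitive.
    if "smoke" in experiment_name:
        return "smoke"
    return stage_kind
```

With the two experiments named earlier, matches_clf_smoke maps to 'smoke' for any stage, while matches_clf passes stage_kind through unchanged.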