Experiment Tracking with MLflow¶

🎯 Learning Objectives¶

  • Track ML experiments systematically

  • Log parameters, metrics, and artifacts

  • Compare different model runs

  • Register and version models

  • Reproduce experiments

Why Track Experiments?¶

Without tracking:

  • "Which hyperparameters gave the best result?"

  • "Can we reproduce last week's model?"

  • "What changed between version 1 and 2?"

With tracking:

  • All experiments logged automatically

  • Easy comparison of runs

  • Model versioning and lineage

  • Reproducible results

# Install MLflow
# !pip install mlflow scikit-learn

Basic MLflow Usage¶

MLflow organizes experiment tracking around the concept of runs: individual executions of your training code where parameters, metrics, and artifacts are logged. Before training a model, you load your data and prepare train/test splits as usual. The key difference is that every subsequent step happens inside an MLflow run context, which captures a complete snapshot of the experiment. The mlflow.start_run() context manager creates a new run and ensures all logging calls are associated with it, making it straightforward to compare dozens or hundreds of experiments later.

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
import numpy as np

# Load data
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42
)

print("Data loaded:", X_train.shape)

Experiment 1: Baseline Model¶

Every hyperparameter tuning campaign should begin with a baseline run: a simple model with reasonable defaults that establishes a performance floor. Here we use mlflow.log_param() to record hyperparameters like n_estimators and max_depth, and mlflow.log_metric() to record evaluation scores. The mlflow.sklearn.log_model() call serializes the trained model as an artifact, so you can reload it later without retraining. By capturing the baseline in MLflow, you have an objective reference point against which every subsequent experiment can be compared.

# Start MLflow run
with mlflow.start_run(run_name="baseline_rf"):
    # Set hyperparameters
    n_estimators = 50
    max_depth = 5
    
    # Log parameters
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_param("max_depth", max_depth)
    mlflow.log_param("model_type", "RandomForest")
    
    # Train model
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        random_state=42
    )
    model.fit(X_train, y_train)
    
    # Evaluate
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    
    # Log metrics
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("f1_score", f1)
    
    # Log model
    mlflow.sklearn.log_model(model, "model")
    
    print(f"Baseline - Accuracy: {accuracy:.4f}, F1: {f1:.4f}")

Experiment 2: Hyperparameter Tuning¶

With a baseline in place, the next step is a grid search over hyperparameter combinations. Each combination gets its own MLflow run, creating a structured record of every configuration tried. The nested loop below evaluates all pairs of n_estimators and max_depth, logging results for each. In the MLflow UI, you can sort runs by accuracy or F1 score, visualize how performance changes across the hyperparameter space, and quickly identify the best configuration. In production settings, this same pattern scales to hundreds of runs orchestrated by tools like Optuna or Ray Tune.

# Try different hyperparameters
for n_est in [50, 100, 200]:
    for depth in [5, 10, 15]:
        with mlflow.start_run(run_name=f"rf_nest{n_est}_depth{depth}"):
            # Log params
            mlflow.log_param("n_estimators", n_est)
            mlflow.log_param("max_depth", depth)
            mlflow.log_param("model_type", "RandomForest")
            
            # Train
            model = RandomForestClassifier(
                n_estimators=n_est,
                max_depth=depth,
                random_state=42
            )
            model.fit(X_train, y_train)
            
            # Evaluate
            y_pred = model.predict(X_test)
            accuracy = accuracy_score(y_test, y_pred)
            f1 = f1_score(y_test, y_pred, average='weighted')
            
            # Log metrics
            mlflow.log_metric("accuracy", accuracy)
            mlflow.log_metric("f1_score", f1)
            
            # Log model
            mlflow.sklearn.log_model(model, "model")
            
            print(f"n={n_est}, depth={depth}: Acc={accuracy:.4f}")

Viewing Results¶

MLflow UI¶

Launch the MLflow UI to compare experiments:

mlflow ui

Then visit: http://localhost:5000

Features:¶

  • Compare metrics across runs

  • Visualize parameter vs metric relationships

  • View logged artifacts

  • Download models

Model Registry¶

Once you identify the best-performing run, the MLflow Model Registry provides a centralized hub for managing model lifecycle stages. You register a model by pointing to a specific run's artifact, then transition it through stages like Staging and Production. The registry tracks lineage (which experiment, data, and code produced each version), making audits and rollbacks straightforward. In team settings, the registry acts as a contract between data scientists (who produce models) and engineers (who deploy them), ensuring only vetted models reach production.

# Register model (after finding best run)
model_name = "wine_classifier"

# This would be done in the MLflow UI or programmatically:
# mlflow.register_model(
#     model_uri=f"runs:/<run_id>/model",
#     name=model_name
# )

# Transition model to production
# client = mlflow.tracking.MlflowClient()
# client.transition_model_version_stage(
#     name=model_name,
#     version=1,
#     stage="Production"
# )

print("Uncomment the calls above to register and promote the model")

Advanced: Autologging¶

MLflow's autologging feature eliminates boilerplate by automatically capturing parameters, metrics, and artifacts for supported frameworks. When you call mlflow.sklearn.autolog(), every subsequent scikit-learn .fit() call will log all constructor arguments as parameters, training metrics, feature importances, and the serialized model, with zero manual log_param or log_metric calls. Autologging is available for scikit-learn, TensorFlow, PyTorch, XGBoost, LightGBM, and more. It is especially useful during rapid prototyping when you want comprehensive tracking without slowing down iteration speed.

# Enable autologging
mlflow.sklearn.autolog()

with mlflow.start_run(run_name="autolog_example"):
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    
    # MLflow automatically logs:
    # - All hyperparameters
    # - Training metrics
    # - Model artifacts
    # - Feature importance
    
    print("✓ Everything logged automatically!")

Best Practices¶

  1. Consistent naming: Use descriptive run names

  2. Log everything: Parameters, metrics, datasets, code

  3. Tag runs: Add tags for easy filtering

  4. Document: Add notes about each experiment

  5. Clean up: Delete failed or duplicate runs

Key Takeaways¶

✅ Track all experiments systematically
✅ Log parameters, metrics, and artifacts
✅ Use MLflow UI for comparison
✅ Register production-ready models
✅ Enable autologging when possible