Experiment Tracking with MLflow¶

🎯 Learning Objectives¶

  • Track ML experiments systematically

  • Log parameters, metrics, and artifacts

  • Compare different model runs

  • Register and version models

  • Reproduce experiments

Why Track Experiments?¶

Without tracking:

  • "Which hyperparameters gave the best result?"

  • "Can we reproduce last week's model?"

  • "What changed between version 1 and 2?"

With tracking:

  • All experiments logged automatically

  • Easy comparison of runs

  • Model versioning and lineage

  • Reproducible results

# Install MLflow
# !pip install mlflow scikit-learn

Basic MLflow Usage¶

MLflow organizes experiment tracking around the concept of runs: individual executions of your training code where parameters, metrics, and artifacts are logged. Before training a model, you load your data and prepare train/test splits as usual. The key difference is that every subsequent step happens inside an MLflow run context, which captures a complete snapshot of the experiment. The mlflow.start_run() context manager creates a new run and ensures all logging calls are associated with it, making it straightforward to compare dozens or hundreds of experiments later.

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
import numpy as np

# Load data
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42
)

print("Data loaded:", X_train.shape)

Experiment 1: Baseline Model¶

Every hyperparameter tuning campaign should begin with a baseline run: a simple model with reasonable defaults that establishes a performance floor. Here we use mlflow.log_param() to record hyperparameters like n_estimators and max_depth, and mlflow.log_metric() to record evaluation scores. The mlflow.sklearn.log_model() call serializes the trained model as an artifact, so you can reload it later without retraining. By capturing the baseline in MLflow, you have an objective reference point against which every subsequent experiment can be compared.

# Start MLflow run
with mlflow.start_run(run_name="baseline_rf"):
    # Set hyperparameters
    n_estimators = 50
    max_depth = 5
    
    # Log parameters
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_param("max_depth", max_depth)
    mlflow.log_param("model_type", "RandomForest")
    
    # Train model
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        random_state=42
    )
    model.fit(X_train, y_train)
    
    # Evaluate
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    
    # Log metrics
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("f1_score", f1)
    
    # Log model
    mlflow.sklearn.log_model(model, "model")
    
    print(f"Baseline - Accuracy: {accuracy:.4f}, F1: {f1:.4f}")

Experiment 2: Hyperparameter Tuning¶

With a baseline in place, the next step is a grid search over hyperparameter combinations. Each combination gets its own MLflow run, creating a structured record of every configuration tried. The nested loop below evaluates all pairs of n_estimators and max_depth, logging results for each. In the MLflow UI, you can sort runs by accuracy or F1 score, visualize how performance changes across the hyperparameter space, and quickly identify the best configuration. In production settings, this same pattern scales to hundreds of runs orchestrated by tools like Optuna or Ray Tune.

# Try different hyperparameters
for n_est in [50, 100, 200]:
    for depth in [5, 10, 15]:
        with mlflow.start_run(run_name=f"rf_nest{n_est}_depth{depth}"):
            # Log params
            mlflow.log_param("n_estimators", n_est)
            mlflow.log_param("max_depth", depth)
            mlflow.log_param("model_type", "RandomForest")
            
            # Train
            model = RandomForestClassifier(
                n_estimators=n_est,
                max_depth=depth,
                random_state=42
            )
            model.fit(X_train, y_train)
            
            # Evaluate
            y_pred = model.predict(X_test)
            accuracy = accuracy_score(y_test, y_pred)
            f1 = f1_score(y_test, y_pred, average='weighted')
            
            # Log metrics
            mlflow.log_metric("accuracy", accuracy)
            mlflow.log_metric("f1_score", f1)
            
            # Log model
            mlflow.sklearn.log_model(model, "model")
            
            print(f"n={n_est}, depth={depth}: Acc={accuracy:.4f}")

Viewing Results¶

MLflow UI¶

Launch the MLflow UI to compare experiments:

mlflow ui

Then visit: http://localhost:5000

Features:¶

  • Compare metrics across runs

  • Visualize parameter vs metric relationships

  • View logged artifacts

  • Download models

Model Registry¶

Once you identify the best-performing run, the MLflow Model Registry provides a centralized hub for managing model lifecycle stages. You register a model by pointing to a specific run's artifact, then transition it through stages like Staging and Production. The registry tracks lineage (which experiment, data, and code produced each version), making audits and rollbacks straightforward. In team settings, the registry acts as a contract between data scientists (who produce models) and engineers (who deploy them), ensuring only vetted models reach production.

# Register model (after finding best run)
model_name = "wine_classifier"

# This would be done in the MLflow UI or programmatically:
# mlflow.register_model(
#     model_uri=f"runs:/<run_id>/model",
#     name=model_name
# )

# Transition model to production
# client = mlflow.tracking.MlflowClient()
# client.transition_model_version_stage(
#     name=model_name,
#     version=1,
#     stage="Production"
# )

print("Uncomment the calls above to register and promote the model")

Advanced: Autologging¶

MLflow's autologging feature eliminates boilerplate by automatically capturing parameters, metrics, and artifacts for supported frameworks. When you call mlflow.sklearn.autolog(), every subsequent scikit-learn .fit() call will log all constructor arguments as parameters, training metrics, feature importances, and the serialized model, with zero manual log_param or log_metric calls. Autologging is available for scikit-learn, TensorFlow, PyTorch, XGBoost, LightGBM, and more. It is especially useful during rapid prototyping when you want comprehensive tracking without slowing down iteration speed.

# Enable autologging
mlflow.sklearn.autolog()

with mlflow.start_run(run_name="autolog_example"):
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    
    # MLflow automatically logs:
    # - All hyperparameters
    # - Training metrics
    # - Model artifacts
    # - Feature importance
    
    print("✓ Everything logged automatically!")

Best Practices¶

  1. Consistent naming: Use descriptive run names

  2. Log everything: Parameters, metrics, datasets, code

  3. Tag runs: Add tags for easy filtering

  4. Document: Add notes about each experiment

  5. Clean up: Delete failed or duplicate runs

Key Takeaways¶

✅ Track all experiments systematically
✅ Log parameters, metrics, and artifacts
✅ Use MLflow UI for comparison
✅ Register production-ready models
✅ Enable autologging when possible