Experiment Tracking with MLflow
🎯 Learning Objectives
Track ML experiments systematically
Log parameters, metrics, and artifacts
Compare different model runs
Register and version models
Reproduce experiments
Why Track Experiments?
Without tracking:
"Which hyperparameters gave the best result?"
"Can we reproduce last week's model?"
"What changed between version 1 and 2?"
With tracking:
All experiments logged automatically
Easy comparison of runs
Model versioning and lineage
Reproducible results
# Install MLflow
# !pip install mlflow scikit-learn
Basic MLflow Usage
MLflow organizes experiment tracking around the concept of runs: individual executions of your training code where parameters, metrics, and artifacts are logged. Before training a model, you load your data and prepare train/test splits as usual. The key difference is that every subsequent step happens inside an MLflow run context, which captures a complete snapshot of the experiment. The mlflow.start_run() context manager creates a new run and ensures all logging calls are associated with it, making it straightforward to compare dozens or hundreds of experiments later.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
import numpy as np
# Load data
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42
)
print("Data loaded:", X_train.shape)
Experiment 1: Baseline Model
Every hyperparameter tuning campaign should begin with a baseline run: a simple model with reasonable defaults that establishes a performance floor. Here we use mlflow.log_param() to record hyperparameters like n_estimators and max_depth, and mlflow.log_metric() to record evaluation scores. The mlflow.sklearn.log_model() call serializes the trained model as an artifact, so you can reload it later without retraining. By capturing the baseline in MLflow, you have an objective reference point against which every subsequent experiment can be compared.
# Start MLflow run
with mlflow.start_run(run_name="baseline_rf"):
    # Set hyperparameters
    n_estimators = 50
    max_depth = 5

    # Log parameters
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_param("max_depth", max_depth)
    mlflow.log_param("model_type", "RandomForest")

    # Train model
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        random_state=42
    )
    model.fit(X_train, y_train)

    # Evaluate
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')

    # Log metrics
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("f1_score", f1)

    # Log model
    mlflow.sklearn.log_model(model, "model")

    print(f"Baseline - Accuracy: {accuracy:.4f}, F1: {f1:.4f}")
Experiment 2: Hyperparameter Tuning
With a baseline in place, the next step is a grid search over hyperparameter combinations. Each combination gets its own MLflow run, creating a structured record of every configuration tried. The nested loop below evaluates all pairs of n_estimators and max_depth, logging results for each. In the MLflow UI, you can sort runs by accuracy or F1 score, visualize how performance changes across the hyperparameter space, and quickly identify the best configuration. In production settings, this same pattern scales to hundreds of runs orchestrated by tools like Optuna or Ray Tune.
# Try different hyperparameters
for n_est in [50, 100, 200]:
    for depth in [5, 10, 15]:
        with mlflow.start_run(run_name=f"rf_nest{n_est}_depth{depth}"):
            # Log params
            mlflow.log_param("n_estimators", n_est)
            mlflow.log_param("max_depth", depth)
            mlflow.log_param("model_type", "RandomForest")

            # Train
            model = RandomForestClassifier(
                n_estimators=n_est,
                max_depth=depth,
                random_state=42
            )
            model.fit(X_train, y_train)

            # Evaluate
            y_pred = model.predict(X_test)
            accuracy = accuracy_score(y_test, y_pred)
            f1 = f1_score(y_test, y_pred, average='weighted')

            # Log metrics
            mlflow.log_metric("accuracy", accuracy)
            mlflow.log_metric("f1_score", f1)

            # Log model
            mlflow.sklearn.log_model(model, "model")

            print(f"n={n_est}, depth={depth}: Acc={accuracy:.4f}")
Viewing Results
MLflow UI
Launch the MLflow UI to compare experiments:
mlflow ui
Then visit: http://localhost:5000
Features:
Compare metrics across runs
Visualize parameter vs metric relationships
View logged artifacts
Download models
Model Registry
Once you identify the best-performing run, the MLflow Model Registry provides a centralized hub for managing model lifecycle stages. You register a model by pointing to a specific run's artifact, then transition it through stages like Staging and Production. The registry tracks lineage (which experiment, data, and code produced each version), making audits and rollbacks straightforward. In team settings, the registry acts as a contract between data scientists (who produce models) and engineers (who deploy them), ensuring only vetted models reach production.
# Register model (after finding best run)
model_name = "wine_classifier"

# This would be done in the MLflow UI or programmatically:
# mlflow.register_model(
#     model_uri=f"runs:/<run_id>/model",
#     name=model_name
# )

# Transition model to production
# client = mlflow.tracking.MlflowClient()
# client.transition_model_version_stage(
#     name=model_name,
#     version=1,
#     stage="Production"
# )
print("Registration and promotion steps shown above (commented out)")
Advanced: Autologging
MLflow's autologging feature eliminates boilerplate by automatically capturing parameters, metrics, and artifacts for supported frameworks. When you call mlflow.sklearn.autolog(), every subsequent scikit-learn .fit() call will log all constructor arguments as parameters, training metrics, feature importances, and the serialized model, with no manual log_param or log_metric calls. Autologging is available for scikit-learn, TensorFlow, PyTorch, XGBoost, LightGBM, and more. It is especially useful during rapid prototyping when you want comprehensive tracking without slowing down iteration speed.
# Enable autologging
mlflow.sklearn.autolog()

with mlflow.start_run(run_name="autolog_example"):
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    # MLflow automatically logs:
    # - All hyperparameters
    # - Training metrics
    # - Model artifacts
    # - Feature importance
    print("✅ Everything logged automatically!")
Best Practices
Consistent naming: Use descriptive run names
Log everything: Parameters, metrics, datasets, code
Tag runs: Add tags for easy filtering
Document: Add notes about each experiment
Clean up: Delete failed or duplicate runs
Key Takeaways
✅ Track all experiments systematically
✅ Log parameters, metrics, and artifacts
✅ Use MLflow UI for comparison
✅ Register production-ready models
✅ Enable autologging when possible