# Install AutoML libraries
# Note: These have many dependencies, installation may take time
!pip install -q pycaret flaml optuna
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
# Set random seed
np.random.seed(42)
## Part 1: Introduction to AutoML

### What is AutoML?
Automated Machine Learning (AutoML) automates the end-to-end process of applying ML to real-world problems.
AutoML handles:

- Data preprocessing
- Feature engineering
- Model selection
- Hyperparameter tuning
- Ensemble methods
- Model evaluation

Benefits:

- Faster prototyping
- Finds good baselines
- Explores many models automatically
- Automates tedious tasks
- Good for learning

When to Use:

- Quick baseline needed
- Exploring new datasets
- Limited ML expertise
- Time constraints
- Model comparison

When NOT to Use:

- Highly specialized problems
- Need full control
- Production-critical systems (without review)
- Very large datasets (computationally expensive)
## PyCaret: A Low-Code ML Library for Rapid Prototyping
PyCaret wraps scikit-learn, XGBoost, LightGBM, CatBoost, and other ML libraries behind a unified low-code API. The setup() function handles the entire preprocessing pipeline (encoding categorical variables, scaling numerical features, imputing missing values, and creating train/test splits) in a single call. The compare_models() function then trains and cross-validates 15+ algorithms, returning a leaderboard ranked by your chosen metric.
Why PyCaret accelerates ML development: a typical scikit-learn workflow requires 50-100 lines of code for preprocessing, training, and evaluation. PyCaret compresses this to 5-10 lines while exploring a much broader model space. The tune_model() function performs hyperparameter search (random grid search by default, with optional backends such as Optuna for Bayesian optimization), ensemble_model() creates bagging or boosting ensembles, and finalize_model() retrains on the full dataset for deployment. For quick baselines and model selection on tabular data, PyCaret typically lands within 1-3% of hand-tuned solutions in a fraction of the time.
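For perspective on what setup() and compare_models() automate, here is a rough hand-written equivalent in plain scikit-learn. This is a sketch, not PyCaret internals: the candidate models and the 5-fold ROC-AUC scoring are illustrative choices.

```python
# Minimal scikit-learn version of the preprocess -> compare -> rank loop
# that PyCaret wraps in setup() + compare_models() (illustrative only).
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logreg": LogisticRegression(max_iter=5000),
    "rf": RandomForestClassifier(n_estimators=100, random_state=42),
    "tree": DecisionTreeClassifier(random_state=42),
}

leaderboard = []
for name, model in candidates.items():
    # Scaling + model in one pipeline, scored by 5-fold cross-validated ROC-AUC
    pipe = make_pipeline(StandardScaler(), model)
    auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
    leaderboard.append((name, auc))

# Rank models by mean ROC-AUC, best first
leaderboard.sort(key=lambda t: t[1], reverse=True)
for name, auc in leaderboard:
    print(f"{name}: {auc:.4f}")
```

Even this stripped-down version needs explicit model lists, scaling, and scoring; PyCaret's value is doing the same across a much larger model zoo with one call.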
# Load classification dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
print(f"Dataset shape: {df.shape}")
print(f"\nTarget distribution:\n{df['target'].value_counts()}")
df.head()
from pycaret.classification import *
# Setup PyCaret environment
# This does preprocessing, train/test split, and more
clf_setup = setup(
    data=df,
    target='target',
    session_id=42,
    verbose=False,
    normalize=True,          # normalize features
    remove_outliers=False,   # keep outliers for now
    train_size=0.8           # 80/20 split
)
print("Setup complete!")
# Compare all available models
# This trains and cross-validates 15+ models!
print("Comparing models... (this may take a minute)")
best_models = compare_models(n_select=5, sort='AUC')
print("\n✅ Top 5 models identified!")
# Get the best model
best_model = best_models[0]
print(f"Best model: {best_model}")
# Evaluate the model
print("\nModel evaluation:")
evaluate_model(best_model)
# Tune hyperparameters
print("Tuning hyperparameters...")
tuned_model = tune_model(
    best_model,
    optimize='AUC',
    n_iter=10  # number of search iterations
)
print("\n✅ Model tuned!")
# Make predictions
holdout_pred = predict_model(tuned_model)
print("Predictions on holdout set:")
print(holdout_pred.head())
# Get metrics
from sklearn.metrics import classification_report
print("\nClassification Report:")
print(classification_report(
    holdout_pred['target'],
    holdout_pred['prediction_label']
))
# Create a bagged ensemble of the tuned model
# (use a new variable name so we don't shadow PyCaret's ensemble_model function)
print("Creating ensemble...")
bagged_model = ensemble_model(tuned_model, method='Bagging')
print("\n✅ Ensemble created!")
# Final predictions
final_pred = predict_model(bagged_model)
print("\nFinal ensemble performance on holdout:")
print(f"Accuracy: {(final_pred['prediction_label'] == final_pred['target']).mean():.4f}")
# Save the model
save_model(bagged_model, 'pycaret_best_model')
print("Model saved as 'pycaret_best_model.pkl'")
# Load it back (to demonstrate)
loaded_model = load_model('pycaret_best_model')
print("Model loaded successfully!")
## FLAML: Fast and Lightweight AutoML from Microsoft
FLAML (Fast and Lightweight AutoML) is optimized for efficiency, using a cost-frugal optimization strategy that allocates more computational budget to promising model configurations and prunes unpromising ones early. Unlike PyCaret, which trains all models to completion before ranking them, FLAML dynamically decides which algorithm and hyperparameter combination to try next based on the expected improvement per unit of compute time. This makes FLAML particularly effective under tight time budgets (30-60 seconds) where exhaustive search is infeasible.

Key differentiator: FLAML's time_budget parameter sets a hard wall-clock limit, and the system internally allocates time across model families (LightGBM, XGBoost, Random Forest, etc.) using a multi-armed-bandit-style strategy. It starts with cheap-to-train models and progressively explores more expensive configurations. The result is that FLAML typically finds a near-optimal model in 1-2 minutes that PyCaret's exhaustive search would take 5-10 minutes to match. For production ML pipelines where retraining happens on a schedule, FLAML's predictable runtime makes it easier to integrate into CI/CD workflows.
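The allocation idea can be illustrated with a toy simulation. This is not FLAML's actual algorithm: the model families, per-trial costs, and score distributions below are made up, and the "best observed score per unit of time spent" rule is a simplified stand-in for FLAML's cost-frugal heuristic.

```python
import random

random.seed(42)

# Toy cost-frugal search (NOT FLAML's real algorithm, just the idea):
# each "arm" is a model family with a per-trial cost and an unknown score distribution.
arms = {
    "lgbm": {"cost": 1.0, "mean": 0.92, "best": 0.0, "spent": 0.0},
    "xgb":  {"cost": 2.0, "mean": 0.93, "best": 0.0, "spent": 0.0},
    "rf":   {"cost": 4.0, "mean": 0.90, "best": 0.0, "spent": 0.0},
}
budget = 30.0
spent = 0.0

def trial(arm):
    # Simulated cross-validation score for one configuration of this family
    return min(1.0, random.gauss(arm["mean"], 0.02))

# Seed each arm with one trial so every family has an estimate
for arm in arms.values():
    arm["best"] = trial(arm)
    arm["spent"] += arm["cost"]
    spent += arm["cost"]

# Spend the remaining budget on whichever affordable arm looks best per unit cost
while True:
    affordable = {n: a for n, a in arms.items() if spent + a["cost"] <= budget}
    if not affordable:
        break
    name, arm = max(affordable.items(), key=lambda kv: kv[1]["best"] / kv[1]["spent"])
    arm["best"] = max(arm["best"], trial(arm))
    arm["spent"] += arm["cost"]
    spent += arm["cost"]

for name, arm in arms.items():
    print(f"{name}: best score {arm['best']:.3f}, time spent {arm['spent']:.0f}s")
```

Cheap families get explored first; expensive ones are tried only once their score-per-cost looks competitive, which is the intuition behind FLAML's predictable runtime under time_budget.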
from flaml import AutoML
# Prepare data
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
# Initialize FLAML AutoML
automl = AutoML()
# Configure and run
settings = {
    "time_budget": 60,  # seconds
    "metric": 'roc_auc',
    "task": 'classification',
    "log_file_name": 'flaml_experiment.log',
    "seed": 42
}
print("Running FLAML AutoML (60 second budget)...")
automl.fit(X_train, y_train, **settings)
print("\n✅ AutoML complete!")
# Print best model and configuration
print("Best model found:")
print(f"  Algorithm: {automl.best_estimator}")
# FLAML minimizes loss; for the 'roc_auc' metric, loss = 1 - ROC-AUC
print(f"  ROC-AUC: {1 - automl.best_loss:.4f}")
print(f"\nBest hyperparameters:")
for param, value in automl.best_config.items():
    print(f"  {param}: {value}")
# Evaluate on test set
from sklearn.metrics import roc_auc_score, accuracy_score
y_pred = automl.predict(X_test)
y_pred_proba = automl.predict_proba(X_test)[:, 1]
print("Test set performance:")
print(f" Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f" ROC-AUC: {roc_auc_score(y_test, y_pred_proba):.4f}")
# Visualize feature importance
if hasattr(automl.model.estimator, 'feature_importances_'):
    feature_importance = pd.DataFrame({
        'feature': X.columns,
        'importance': automl.model.estimator.feature_importances_
    }).sort_values('importance', ascending=False).head(10)
    plt.figure(figsize=(10, 6))
    plt.barh(feature_importance['feature'], feature_importance['importance'])
    plt.xlabel('Importance')
    plt.title('Top 10 Feature Importances (FLAML)')
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()
else:
    print("Feature importance not available for this model type")
## Regression with AutoML: Same APIs, Different Task
AutoML platforms handle regression with the same API as classification: just change the target variable and metric. PyCaret's pycaret.regression module provides regression algorithms (Linear Regression, Ridge, Lasso, SVR, Gradient Boosting Regressors, etc.) and metrics (R², RMSE, MAE, MAPE). The compare_models(sort='R2') call ranks all models by R², providing an immediate sense of which algorithm family suits your data's structure.

Regression-specific considerations: tree-based models (Random Forest, XGBoost) often dominate on tabular regression tasks because they handle nonlinear relationships and feature interactions automatically. Linear models serve as interpretable baselines: if Ridge regression achieves R² = 0.85 and XGBoost achieves R² = 0.87, the small improvement may not justify the loss in interpretability. The actual-vs-predicted scatter plot generated below is the standard diagnostic: points clustered tightly along the diagonal indicate a good fit, while systematic deviations (curves, fans) reveal model misspecification.
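The metrics used below follow directly from their definitions: R² = 1 - SS_res/SS_tot and RMSE = sqrt(mean((y - ŷ)²)). A small sketch with toy numbers (the values are arbitrary):

```python
import numpy as np

# Computing R² and RMSE from their definitions, on toy values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 6.5, 9.4])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(f"R²:   {r2:.4f}")
print(f"RMSE: {rmse:.4f}")
```

Note that R² compares the model against the "predict the mean" baseline, which is why a constant predictor scores exactly 0 and worse-than-mean predictions go negative.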
# Load regression dataset
diabetes = load_diabetes()
df_reg = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df_reg['target'] = diabetes.target
print(f"Dataset shape: {df_reg.shape}")
print(f"\nTarget statistics:")
print(df_reg['target'].describe())
df_reg.head()
# PyCaret for regression
# Note: this overrides the classification functions imported earlier
from pycaret.regression import *
reg_setup = setup(
    data=df_reg,
    target='target',
    session_id=42,
    verbose=False,
    normalize=True,
    train_size=0.8
)
print("Regression setup complete!")
# Compare regression models
print("Comparing regression models...")
best_reg_models = compare_models(n_select=3, sort='R2')
print("\n✅ Top 3 regression models identified!")
# Tune the best model
best_reg = best_reg_models[0]
print(f"Tuning {best_reg}...")
tuned_reg = tune_model(best_reg, optimize='R2', n_iter=10)
print("\n✅ Model tuned!")
# Evaluate regression model
holdout_reg = predict_model(tuned_reg)
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
print("Regression metrics on holdout:")
print(f"  R² Score: {r2_score(holdout_reg['target'], holdout_reg['prediction_label']):.4f}")
print(f" RMSE: {np.sqrt(mean_squared_error(holdout_reg['target'], holdout_reg['prediction_label'])):.4f}")
print(f" MAE: {mean_absolute_error(holdout_reg['target'], holdout_reg['prediction_label']):.4f}")
# Visualize predictions
plt.figure(figsize=(10, 6))
plt.scatter(
    holdout_reg['target'],
    holdout_reg['prediction_label'],
    alpha=0.6,
    edgecolors='k'
)
plt.plot(
    [holdout_reg['target'].min(), holdout_reg['target'].max()],
    [holdout_reg['target'].min(), holdout_reg['target'].max()],
    'r--',
    lw=2,
    label='Perfect Prediction'
)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted (Regression AutoML)')
plt.legend()
plt.tight_layout()
plt.show()
## Comparing AutoML Platforms: Performance vs. Speed Tradeoff
Running multiple AutoML platforms on the same dataset with the same train/test split provides an objective comparison of their search strategies. FLAML's cost-frugal approach typically achieves competitive ROC-AUC in 30-60 seconds by focusing compute on the most promising configurations. PyCaret's exhaustive approach takes longer but explores more model families. A manual Random Forest baseline establishes whether AutoML provides meaningful improvement over a reasonable default configuration.

Interpreting results: if all platforms achieve similar ROC-AUC (within 0.5-1%), the dataset is likely "easy" and the choice should favor the fastest or most interpretable option. If AutoML significantly outperforms the manual baseline, the winning algorithm and hyperparameters reveal what the dataset needs (e.g., gradient boosting winning suggests important feature interactions; Lasso winning suggests many irrelevant features). Always report both performance and wall-clock time: a 0.3% accuracy improvement that takes 10x longer to train may not be worth the computational cost in production retraining pipelines.
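The "report both performance and wall-clock time" advice can be turned into a simple selection rule: among candidates whose runtime fits your budget, take the best score. A sketch with hypothetical numbers (none of these scores or runtimes come from a real run):

```python
# Picking a platform under a wall-clock cap (hypothetical numbers):
# prefer the highest score among candidates whose runtime fits the budget.
candidates = [
    {"platform": "FLAML",     "roc_auc": 0.993, "seconds": 30},
    {"platform": "PyCaret",   "roc_auc": 0.995, "seconds": 300},
    {"platform": "Manual RF", "roc_auc": 0.991, "seconds": 2},
]

def pick(candidates, max_seconds):
    # Keep only candidates that fit the budget, then take the best score
    feasible = [c for c in candidates if c["seconds"] <= max_seconds]
    return max(feasible, key=lambda c: c["roc_auc"]) if feasible else None

print(pick(candidates, max_seconds=60)["platform"])   # tight budget → FLAML
print(pick(candidates, max_seconds=600)["platform"])  # generous budget → PyCaret
```

The same rule generalizes to retraining pipelines: the budget becomes the retraining window, and the "score" can be any metric you report alongside runtime.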
import time
# Prepare data
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
results = []
# 1. FLAML
print("Testing FLAML...")
start = time.time()
flaml_automl = AutoML()
flaml_automl.fit(
    X_train, y_train,
    task='classification',
    metric='roc_auc',
    time_budget=30,
    verbose=0
)
flaml_time = time.time() - start
flaml_pred = flaml_automl.predict_proba(X_test)[:, 1]
flaml_score = roc_auc_score(y_test, flaml_pred)
results.append({
    'Platform': 'FLAML',
    'Time (s)': flaml_time,
    'ROC-AUC': flaml_score,
    'Best Model': flaml_automl.best_estimator
})
print(f"✅ FLAML: {flaml_score:.4f} in {flaml_time:.2f}s")
# 2. Manual baseline (for comparison)
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
print("Testing Manual RF baseline...")
start = time.time()
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_scaled, y_train)
manual_time = time.time() - start
manual_pred = rf.predict_proba(X_test_scaled)[:, 1]
manual_score = roc_auc_score(y_test, manual_pred)
results.append({
    'Platform': 'Manual RF',
    'Time (s)': manual_time,
    'ROC-AUC': manual_score,
    'Best Model': 'RandomForest'
})
print(f"✅ Manual RF: {manual_score:.4f} in {manual_time:.2f}s")
# Compare results
comparison_df = pd.DataFrame(results)
comparison_df = comparison_df.sort_values('ROC-AUC', ascending=False)
print("\n" + "="*60)
print("AutoML Platform Comparison")
print("="*60)
print(comparison_df.to_string(index=False))
print("="*60)
# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# ROC-AUC comparison
axes[0].barh(comparison_df['Platform'], comparison_df['ROC-AUC'], color='steelblue')
axes[0].set_xlabel('ROC-AUC Score')
axes[0].set_title('Performance Comparison')
axes[0].set_xlim(0.9, 1.0)  # zoom in; scores on this dataset sit well above 0.9
# Time comparison
axes[1].barh(comparison_df['Platform'], comparison_df['Time (s)'], color='coral')
axes[1].set_xlabel('Time (seconds)')
axes[1].set_title('Speed Comparison')
plt.tight_layout()
plt.show()
## Best Practices and Platform Selection Guide
AutoML is a powerful accelerator but not a replacement for ML expertise. The best workflow uses AutoML for rapid exploration (finding promising model families and feature interactions in minutes) and then applies targeted manual optimization to the winning approach. Common mistakes include treating AutoML as a black box (not sanity-checking the selected model's predictions), using it on poorly cleaned data (garbage in, garbage out still applies), and deploying AutoML-selected models without validating on truly held-out data (some platforms perform model selection on the validation set, which can overfit the selection criterion).

Platform selection heuristic: use PyCaret when you need a comprehensive, visual workflow for stakeholder presentations and rapid prototyping. Use FLAML when compute budget or time is constrained, or when integrating AutoML into automated retraining pipelines. Use H2O AutoML for datasets that exceed memory on a single machine (distributed computing). Use manual scikit-learn when you need full transparency, custom loss functions, or domain-specific model architectures that AutoML platforms do not support.
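To guard against the selection-overfitting pitfall mentioned above, one common pattern is a three-way split: choose the model on a validation set, then report the test score exactly once for the chosen model. A sketch in plain scikit-learn (the two candidate models are illustrative):

```python
# Guarding against selection overfitting: select on a validation split,
# report only the chosen model on a truly held-out test split.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
# 60/20/20 split: train / validation (for model selection) / test (reported once)
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
)

models = {
    "logreg": LogisticRegression(max_iter=5000),
    "rf": RandomForestClassifier(n_estimators=100, random_state=42),
}
val_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    val_scores[name] = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

best_name = max(val_scores, key=val_scores.get)
# The test score is computed exactly once, for the selected model only
test_auc = roc_auc_score(y_test, models[best_name].predict_proba(X_test)[:, 1])
print(f"Selected {best_name}; test ROC-AUC = {test_auc:.4f}")
```

The validation score of the winner is optimistically biased (you picked the max); the single test score is not, which is why it is the number to report.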
print("""
AutoML Best Practices:

1. ✅ START WITH AUTOML for baselines
   - Get quick results
   - Understand data better
   - Identify promising models

2. ✅ SET REASONABLE TIME BUDGETS
   - Start with 60-300 seconds
   - Increase if needed
   - Balance speed vs performance

3. ✅ VALIDATE RESULTS
   - Check on holdout data
   - Look for overfitting
   - Understand model decisions

4. ✅ INSPECT TOP MODELS
   - Don't just use the best
   - Consider interpretability
   - Check robustness

5. ✅ USE FOR EXPLORATION
   - Try different features
   - Test hypotheses quickly
   - Compare preprocessing steps

6. ❌ DON'T BLINDLY TRUST
   - Review model choices
   - Understand limitations
   - Test edge cases

7. ❌ DON'T SKIP DATA CLEANING
   - AutoML isn't magic
   - Clean data = better results
   - Handle domain-specific issues

Platform Selection Guide:

Use PyCaret when:
• Need comprehensive pipeline
• Want visualization tools
• Prefer low-code approach
• Building prototypes

Use FLAML when:
• Speed is critical
• Limited compute resources
• Want cost optimization
• Production deployment

Use H2O when:
• Very large datasets
• Need distributed computing
• Enterprise deployment
• Java integration

Use Manual ML when:
• Full control needed
• Custom architectures
• Specialized domains
• Learning purposes
""")
## Key Takeaways

- **AutoML accelerates ML development**: quick baselines and model exploration
- **Multiple platforms available**: PyCaret (comprehensive), FLAML (fast), H2O (scalable)
- **Not a silver bullet**: you still need data understanding and validation
- **Great for baselines**: a strong starting point before custom optimization
- **Time-performance tradeoff**: more time generally means better models
- **Interpretability matters**: don't sacrifice understanding for slight accuracy gains
## Practice Exercises

1. **Compare AutoML Platforms**
   - Load a dataset
   - Run PyCaret, FLAML, and a manual baseline
   - Compare results and insights

2. **Feature Engineering Impact**
   - Create new features
   - Run AutoML before and after
   - Measure the improvement

3. **Time Budget Experiment**
   - Try different time budgets (10s, 60s, 300s)
   - Plot performance vs. time
   - Find the optimal budget