End-to-End ML Project: Customer Churn Prediction
This notebook walks through a complete machine learning project from raw data to deployed model, the way it's actually done in industry. No shortcuts. Every step explained.
1. Problem Framing
Business Context
A telecom company is losing customers to competitors. Customer churn (when a customer cancels their subscription) is expensive. Acquiring a new customer costs 5–7× more than retaining an existing one.
The ask: Build a model that identifies customers at risk of churning within the next 30 days, so the retention team can proactively reach out.
Success Metric: Why Not Accuracy?
The dataset has a ~27% churn rate. A model that predicts "no churn" for everyone achieves 73% accuracy and is completely useless. We need metrics that reflect the class imbalance.

| Metric | Definition | When to use |
|---|---|---|
| Precision | Of predicted churners, how many actually churned | When false positives are costly (e.g., expensive interventions) |
| Recall | Of actual churners, how many did we catch | When false negatives are costly (e.g., missing a churner = lost revenue) |
| F1 | Harmonic mean of precision and recall | When you want balance |
| F2 | Recall-weighted harmonic mean | Our choice: catching churners matters more |
| ROC-AUC | Ranking quality across all thresholds | Good for model comparison |
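To make the accuracy trap concrete, here is a minimal sketch on synthetic labels with roughly our churn rate. A majority-class "model" scores well on accuracy while catching zero churners:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, fbeta_score

rng = np.random.default_rng(0)
y_true = (rng.uniform(size=1000) < 0.27).astype(int)  # ~27% positives, like our data

# A "model" that always predicts the majority class (no churn)
y_majority = np.zeros_like(y_true)

print(f"Accuracy: {accuracy_score(y_true, y_majority):.2f}")  # ~0.73 -- looks fine
print(f"Recall:   {recall_score(y_true, y_majority, zero_division=0):.2f}")  # 0.00
print(f"F2:       {fbeta_score(y_true, y_majority, beta=2, zero_division=0):.2f}")  # 0.00
```

Accuracy alone would wave this model through; recall and F2 immediately expose it.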
Cost of Errors
False Negative (missed churner):
The customer churns. We lose ~$200 in annual revenue. No chance to intervene.
Cost: HIGH. This is what we're trying to prevent.
False Positive (flagged as churner, but stays):
We offer a retention discount unnecessarily. Cost: ~$50 in promotions.
Cost: MODERATE. Wasteful but not catastrophic.
Conclusion: We prioritize recall (catch as many churners as possible) while keeping precision reasonable. The F2 score reflects this: it weights recall twice as heavily as precision.
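A tiny worked example of that weighting, using two hypothetical models on the same ten labels: one that favors precision and one that favors recall. F1 ranks them close together; F2 strongly prefers the recall-heavy model:

```python
from sklearn.metrics import fbeta_score

y_true        = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_precise     = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # P=1.00, R=0.50
y_high_recall = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # P=0.67, R=1.00

for name, y_pred in [('precision-heavy', y_precise), ('recall-heavy', y_high_recall)]:
    f1 = fbeta_score(y_true, y_pred, beta=1)
    f2 = fbeta_score(y_true, y_pred, beta=2)
    print(f"{name}: F1={f1:.2f}  F2={f2:.2f}")
# precision-heavy: F1=0.67  F2=0.56
# recall-heavy:    F1=0.80  F2=0.91
```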
Definition of Done
F2 score ≥ 0.70 on held-out test set
Recall ≥ 0.75 (we catch at least 3 in 4 churners)
Model explainable to non-technical stakeholders
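The two quantitative bars can be encoded as a release gate. This is a hypothetical helper (not part of the notebook's pipeline); `metrics` would come from the final test-set evaluation:

```python
# Hypothetical release gate mirroring the Definition of Done above.
def meets_definition_of_done(metrics: dict) -> bool:
    """Return True only if the test-set metrics clear every DoD bar."""
    return metrics.get('f2', 0.0) >= 0.70 and metrics.get('recall', 0.0) >= 0.75

print(meets_definition_of_done({'f2': 0.74, 'recall': 0.79}))  # True
print(meets_definition_of_done({'f2': 0.74, 'recall': 0.61}))  # False
```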
2. Data Generation
We create a realistic synthetic telecom dataset (~5000 customers). Real-world messiness included: missing values, outliers, string inconsistencies.
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)
n = 5000
# ── Contract type (drives churn strongly) ──────────────────────────────────
contract_raw = np.random.choice(
    ['Month-to-Month', 'One Year', 'Two Year'],
    size=n, p=[0.55, 0.25, 0.20]
)
# Introduce string inconsistencies (real data is messy).
# Each noisy row gets its own randomly chosen messy variant.
noise_idx = np.random.choice(n, size=int(n * 0.06), replace=False)
noise_variants = {
    'Month-to-Month': ['month-to-month', 'Monthly', 'month to month'],
    'One Year': ['one year', 'One-Year', '1 year'],
    'Two Year': ['two year', 'Two-Year', '2 year'],
}
contract_raw = contract_raw.astype(object)
for i in noise_idx:
    contract_raw[i] = np.random.choice(noise_variants[contract_raw[i]])
# ── Other categorical features ─────────────────────────────────────────────
internet_service = np.random.choice(
['Fiber optic', 'DSL', 'No'],
size=n, p=[0.44, 0.34, 0.22]
)
payment_method = np.random.choice(
['Electronic check', 'Mailed check', 'Bank transfer', 'Credit card'],
size=n, p=[0.34, 0.23, 0.22, 0.21]
)
senior_citizen = np.random.choice([0, 1], size=n, p=[0.84, 0.16])
partner = np.random.choice(['Yes', 'No'], size=n, p=[0.48, 0.52])
dependents = np.random.choice(['Yes', 'No'], size=n, p=[0.30, 0.70])
# ── Numeric features ───────────────────────────────────────────────────────
tenure_months = np.random.exponential(scale=24, size=n).clip(1, 72).astype(int)
monthly_charges = np.random.normal(65, 30, size=n).clip(18, 120)
# Outliers in monthly_charges (~2% of data)
outlier_idx = np.random.choice(n, size=int(n * 0.02), replace=False)
monthly_charges[outlier_idx] = np.random.uniform(150, 300, size=len(outlier_idx))
total_charges = tenure_months * monthly_charges + np.random.normal(0, 50, size=n)
total_charges = total_charges.clip(0)
num_support_calls = np.random.poisson(lam=1.5, size=n).clip(0, 10)
num_products = np.random.choice([1, 2, 3, 4, 5], size=n, p=[0.15, 0.25, 0.35, 0.15, 0.10])
# ── Churn probability (domain-informed) ────────────────────────────────────
# Normalize the messy contract strings before computing the churn signal
contract_clean = np.array([
    'Month-to-Month' if 'month' in str(c).lower()
    else ('One Year' if '1' in str(c) or 'one' in str(c).lower()
          else 'Two Year')
    for c in contract_raw
])
contract_num = np.where(contract_clean == 'Month-to-Month', 1.0,
                        np.where(contract_clean == 'One Year', 0.3, 0.1))
logit = (
-2.5
+ 1.8 * contract_num
+ 0.015 * monthly_charges
- 0.03 * tenure_months
+ 0.25 * num_support_calls
- 0.2 * num_products
+ 0.5 * (internet_service == 'Fiber optic').astype(float)
+ 0.4 * senior_citizen
+ np.random.normal(0, 0.3, size=n)
)
churn_prob = 1 / (1 + np.exp(-logit))
churned = (np.random.uniform(size=n) < churn_prob).astype(int)
# ── Assemble DataFrame ─────────────────────────────────────────────────────
df = pd.DataFrame({
'tenure_months': tenure_months,
'monthly_charges': np.round(monthly_charges, 2),
'total_charges': np.round(total_charges, 2),
'contract_type': contract_raw,
'internet_service': internet_service,
'num_support_calls': num_support_calls,
'num_products': num_products,
'payment_method': payment_method,
'senior_citizen': senior_citizen,
'partner': partner,
'dependents': dependents,
'churned': churned,
})
# Introduce ~8% missing values in total_charges (MCAR)
missing_idx = np.random.choice(n, size=int(n * 0.08), replace=False)
df.loc[missing_idx, 'total_charges'] = np.nan
print(f"Dataset shape: {df.shape}")
print(f"Churn rate: {df['churned'].mean():.1%}")
print(f"Missing values in total_charges: {df['total_charges'].isna().sum()} ({df['total_charges'].isna().mean():.1%})")
df.head()
df.describe()
3. Exploratory Data Analysis
Before building models, understand the data. Every insight here informs feature engineering and model selection.
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
sns.set_theme(style='whitegrid', palette='muted')
plt.rcParams['figure.dpi'] = 120
# ── 1. Churn rate by contract type ─────────────────────────────────────────
# Normalize contract_type for display
contract_display = df['contract_type'].str.strip().str.lower()
contract_display = contract_display.map(lambda x:
'Month-to-Month' if 'month' in x
else ('One Year' if 'one' in x or '1' in x
else 'Two Year')
)
churn_by_contract = df.groupby(contract_display)['churned'].mean().sort_values(ascending=False)
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
# Bar chart
bars = axes[0].bar(churn_by_contract.index, churn_by_contract.values * 100,
color=['#e74c3c', '#e67e22', '#27ae60'], edgecolor='white', linewidth=1.5)
for bar, val in zip(bars, churn_by_contract.values):
axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
f'{val:.1%}', ha='center', va='bottom', fontweight='bold', fontsize=10)
axes[0].set_title('Churn Rate by Contract Type', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Churn Rate (%)')
axes[0].set_ylim(0, 55)
axes[0].tick_params(axis='x', rotation=15)
# ── 2. Monthly charges distribution (churned vs not) ───────────────────────
df_valid = df[df['monthly_charges'] <= 130] # remove extreme outliers for viz
axes[1].hist(df_valid[df_valid['churned']==0]['monthly_charges'], bins=40,
alpha=0.6, label='Stayed', color='#3498db', edgecolor='white')
axes[1].hist(df_valid[df_valid['churned']==1]['monthly_charges'], bins=40,
alpha=0.6, label='Churned', color='#e74c3c', edgecolor='white')
axes[1].set_title('Monthly Charges Distribution', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Monthly Charges ($)')
axes[1].set_ylabel('Count')
axes[1].legend()
# ── 3. Churn rate by num_support_calls ─────────────────────────────────────
churn_by_calls = df.groupby('num_support_calls')['churned'].mean()
axes[2].plot(churn_by_calls.index, churn_by_calls.values * 100,
marker='o', color='#9b59b6', linewidth=2, markersize=7)
axes[2].fill_between(churn_by_calls.index, churn_by_calls.values * 100, alpha=0.2, color='#9b59b6')
axes[2].set_title('Churn Rate vs Support Calls', fontsize=12, fontweight='bold')
axes[2].set_xlabel('Number of Support Calls')
axes[2].set_ylabel('Churn Rate (%)')
plt.tight_layout()
plt.show()
# ── Correlation heatmap (numeric features only) ────────────────────────────
numeric_cols = ['tenure_months', 'monthly_charges', 'total_charges',
'num_support_calls', 'num_products', 'senior_citizen', 'churned']
corr = df[numeric_cols].corr()
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', cmap='RdBu_r',
center=0, vmin=-1, vmax=1, ax=axes[0],
linewidths=0.5, cbar_kws={'shrink': 0.8})
axes[0].set_title('Correlation Heatmap (Numeric Features)', fontsize=12, fontweight='bold')
axes[0].tick_params(axis='x', rotation=45)
# ── Missing value heatmap ──────────────────────────────────────────────────
missing = df.isnull()
# Sample 200 rows for readability
sample_missing = missing.sample(200, random_state=42)
sns.heatmap(sample_missing, cbar=False, ax=axes[1],
cmap=['#2ecc71', '#e74c3c'], yticklabels=False)
axes[1].set_title('Missing Value Map (200-row sample)\nRed = Missing', fontsize=12, fontweight='bold')
axes[1].tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()
print("\nMissing value summary:")
print(df.isnull().sum()[df.isnull().sum() > 0])
EDA Key Insights
Contract type is the strongest signal. Month-to-month customers churn at ~42%, compared to ~12% for one-year and ~5% for two-year contracts. This single feature alone is highly predictive.
Monthly charges are higher for churners. Customers who leave tend to be on more expensive plans, likely Fiber optic internet, which also carries higher churn risk. This is a collinearity worth noting.
Support calls are a leading indicator. Churn rate climbs steadily with the number of support calls; 5+ calls nearly doubles churn probability. This suggests frustrated customers.
Missing data is isolated to total_charges. About 8% of rows are missing. The pattern appears random (MCAR), so median imputation is safe. If it were MNAR, we'd need a more careful approach.
No severe multicollinearity except tenure ↔ total_charges (expected: longer tenure = more total spend). We'll log-transform total_charges to reduce this effect.
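The MCAR claim can be sanity-checked: if missingness is unrelated to the target, churn rate should look similar in rows with and without the value. A sketch on toy data (the real notebook would group `df` the same way):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'total_charges': rng.uniform(20, 5000, size=2000),
    'churned': (rng.uniform(size=2000) < 0.27).astype(int),
})
# Knock out 8% of values completely at random (MCAR by construction)
toy.loc[rng.choice(2000, size=160, replace=False), 'total_charges'] = np.nan

# Churn rate conditional on missingness -- similar rates are consistent with MCAR
rates = toy.groupby(toy['total_charges'].isna())['churned'].mean()
print(rates)
```

A large gap between the two rates would suggest the missingness carries signal (MAR/MNAR) and that an indicator feature or a more careful imputation is warranted.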
4. Data Cleaning Pipeline
Systematic cleaning before splitting. Deterministic rules (string normalization, outlier removal) run here; statistical steps like imputation are deferred to the pipeline, which is fitted after the split to avoid leakage.
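The "fit on train only" rule for imputation, in miniature. The median used to fill test-set gaps must come from the training split; this toy sketch shows the pattern the pipeline in Section 5 automates:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
values = pd.Series(rng.normal(100, 20, size=500))
values[rng.choice(500, size=40, replace=False)] = np.nan

train, test = values.iloc[:400], values.iloc[400:]
train_median = train.median()            # statistic learned from train only
test_filled = test.fillna(train_median)  # applied to test -- no peeking
print(f"Train median used for imputation: {train_median:.1f}")
```

Computing the median over the full series before splitting would leak test-set information into the fill value.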
print(f"Rows before cleaning: {len(df)}")
print(f"\nUnique contract_type values (raw):")
print(df['contract_type'].value_counts())
# ── Step 1: Fix string inconsistencies in contract_type ────────────────────
def normalize_contract(s):
    s = str(s).strip().lower().replace('-', ' ').replace('_', ' ')
    if 'month' in s:  # covers 'monthly' and 'month to month'
        return 'Month-to-Month'
    elif 'one' in s or s == '1 year':
        return 'One Year'
    elif 'two' in s or s == '2 year':
        return 'Two Year'
    return s  # unknown value -- will surface later as an unexpected category
df['contract_type'] = df['contract_type'].apply(normalize_contract)
print(f"\nUnique contract_type values (cleaned):")
print(df['contract_type'].value_counts())
# ── Step 2: Remove outliers in monthly_charges (IQR method) ────────────────
Q1 = df['monthly_charges'].quantile(0.25)
Q3 = df['monthly_charges'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 3.0 * IQR # 3x IQR to be conservative
upper_bound = Q3 + 3.0 * IQR
outlier_mask = (df['monthly_charges'] < lower_bound) | (df['monthly_charges'] > upper_bound)
print(f"\nOutliers removed (monthly_charges outside [{lower_bound:.1f}, {upper_bound:.1f}]): {outlier_mask.sum()}")
df = df[~outlier_mask].copy()
print(f"\nRows after outlier removal: {len(df)}")
# ── Step 3: Handle missing values ──────────────────────────────────────────
# Numeric: median impute
# Categorical: mode impute
# We record fill values here β they'll be recomputed from TRAIN set only
print(f"\nMissing values before imputation:")
print(df.isnull().sum()[df.isnull().sum() > 0])
# Note: actual imputation happens in the pipeline (Section 5) to prevent leakage
# Here we just confirm the cleaning steps work
print(f"\nFinal dataset: {len(df)} rows, {df.shape[1]} columns")
print(f"Churn rate after cleaning: {df['churned'].mean():.1%}")
5. Feature Engineering
Good features beat fancy algorithms. We derive domain-informed features and build a scikit-learn Pipeline that prevents leakage.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
# ── Feature engineering function ───────────────────────────────────────────
def engineer_features(df_in, high_value_threshold=None):
    d = df_in.copy()
    # 1. charges_per_month: normalizes total_charges by tenure
    d['charges_per_month'] = d['total_charges'] / (d['tenure_months'] + 1)  # +1 avoids div-by-zero
    # 2. log1p of total_charges (reduces right skew, handles 0s)
    d['log_total_charges'] = np.log1p(d['total_charges'].fillna(0))
    # 3. Tenure bucket (ordinal lifecycle stage)
    d['tenure_bucket'] = pd.cut(
        d['tenure_months'],
        bins=[0, 12, 36, 72],
        labels=['new', 'mid', 'long']
    ).astype(str)
    # 4. High-value customer flag (top 25% by monthly charges).
    #    The cutoff comes from the data passed in unless given explicitly.
    #    When scoring one row at a time, pass the training-time threshold:
    #    a single row's quantile is its own value, so the flag would always be 1.
    if high_value_threshold is None:
        high_value_threshold = d['monthly_charges'].quantile(0.75)
    d['high_value'] = (d['monthly_charges'] >= high_value_threshold).astype(int)
    # 5. Engagement score (fewer calls, more products = more engaged)
    d['engagement_score'] = d['num_products'] - d['num_support_calls']
    return d
# Apply feature engineering
df_eng = engineer_features(df)
# ── Train/test split (stratified) ──────────────────────────────────────────
FEATURES = [
# Original numeric
'tenure_months', 'monthly_charges', 'total_charges',
'num_support_calls', 'num_products', 'senior_citizen',
# Engineered numeric
'charges_per_month', 'log_total_charges', 'high_value', 'engagement_score',
# Categorical
'contract_type', 'internet_service', 'payment_method',
'partner', 'dependents', 'tenure_bucket',
]
TARGET = 'churned'
X = df_eng[FEATURES]
y = df_eng[TARGET]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")
print(f"Train churn rate: {y_train.mean():.1%} | Test churn rate: {y_test.mean():.1%}")
# ── Build preprocessing pipeline ───────────────────────────────────────────
numeric_features = [
'tenure_months', 'monthly_charges', 'total_charges',
'num_support_calls', 'num_products', 'senior_citizen',
'charges_per_month', 'log_total_charges', 'high_value', 'engagement_score',
]
categorical_features = [
'contract_type', 'internet_service', 'payment_method',
'partner', 'dependents', 'tenure_bucket',
]
numeric_transformer = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler()),
])
categorical_transformer = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])
preprocessor = ColumnTransformer([
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features),
])
# Fit on train, transform both
preprocessor.fit(X_train)
X_train_proc = preprocessor.transform(X_train)
X_test_proc = preprocessor.transform(X_test)
# ── Show feature names ─────────────────────────────────────────────────────
cat_feature_names = preprocessor.named_transformers_['cat']['onehot'].get_feature_names_out(categorical_features)
all_feature_names = numeric_features + list(cat_feature_names)
print(f"\nTotal features after transformation: {len(all_feature_names)}")
print("\nFeature names:")
for i, name in enumerate(all_feature_names):
print(f" {i:2d}. {name}")
6. Model Selection
We compare four models using 5-fold stratified cross-validation. No hyperparameter tuning yet: this is a fair baseline comparison.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.metrics import make_scorer, f1_score, precision_score, recall_score, fbeta_score
# Try to import XGBoost
try:
from xgboost import XGBClassifier
xgb_available = True
print("XGBoost available")
except ImportError:
xgb_available = False
    print("XGBoost not installed -- skipping. Install with: pip install xgboost")
# ── Define scorers ─────────────────────────────────────────────────────────
scoring = {
'roc_auc': 'roc_auc',
'f1': make_scorer(f1_score),
'precision': make_scorer(precision_score),
'recall': make_scorer(recall_score),
'f2': make_scorer(fbeta_score, beta=2),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# ── Define models ──────────────────────────────────────────────────────────
models = {
'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
}
if xgb_available:
models['XGBoost'] = XGBClassifier(
n_estimators=100, random_state=42,
eval_metric='logloss', verbosity=0
)
# ── Cross-validate each model ──────────────────────────────────────────────
results = {}
for name, model in models.items():
print(f"Evaluating {name}...", end=' ')
cv_results = cross_validate(model, X_train_proc, y_train,
cv=cv, scoring=scoring, n_jobs=-1)
results[name] = cv_results
print(f"ROC-AUC={cv_results['test_roc_auc'].mean():.3f}")
# ── Build comparison table ─────────────────────────────────────────────────
rows = []
for name, res in results.items():
    rows.append({
        'Model': name,
        'ROC-AUC': f"{res['test_roc_auc'].mean():.3f} ± {res['test_roc_auc'].std():.3f}",
        'F1': f"{res['test_f1'].mean():.3f} ± {res['test_f1'].std():.3f}",
        'F2': f"{res['test_f2'].mean():.3f} ± {res['test_f2'].std():.3f}",
        'Precision': f"{res['test_precision'].mean():.3f} ± {res['test_precision'].std():.3f}",
        'Recall': f"{res['test_recall'].mean():.3f} ± {res['test_recall'].std():.3f}",
    })
results_df = pd.DataFrame(rows).set_index('Model')
print("\n" + "="*70)
print("5-Fold Cross-Validation Results (mean ± std)")
print("="*70)
print(results_df.to_string())
7. Hyperparameter Tuning
We tune the best-performing tree model (GradientBoosting or XGBoost) with RandomizedSearchCV, which is more efficient than GridSearchCV for high-dimensional parameter spaces.
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
# ── Select best base model ─────────────────────────────────────────────────
best_model_name = max(
results.keys(),
key=lambda k: results[k]['test_roc_auc'].mean()
)
print(f"Best base model: {best_model_name}")
# ── Define search space ────────────────────────────────────────────────────
if 'XGBoost' in best_model_name and xgb_available:
base_clf = XGBClassifier(random_state=42, eval_metric='logloss', verbosity=0)
param_dist = {
'n_estimators': randint(100, 400),
'max_depth': randint(3, 8),
'learning_rate': uniform(0.01, 0.2),
'subsample': uniform(0.6, 0.4),
'colsample_bytree': uniform(0.6, 0.4),
'min_child_weight': randint(1, 10),
}
elif 'Random Forest' in best_model_name:
base_clf = RandomForestClassifier(random_state=42, n_jobs=-1)
param_dist = {
'n_estimators': randint(100, 500),
'max_depth': [None, 5, 10, 15, 20],
'min_samples_split': randint(2, 20),
'min_samples_leaf': randint(1, 10),
'max_features': ['sqrt', 'log2', 0.5],
}
else:
base_clf = GradientBoostingClassifier(random_state=42)
param_dist = {
'n_estimators': randint(100, 400),
'max_depth': randint(2, 6),
'learning_rate': uniform(0.01, 0.2),
'subsample': uniform(0.6, 0.4),
'min_samples_leaf': randint(1, 30),
}
# ── Run RandomizedSearchCV ─────────────────────────────────────────────────
# Tune against F2 -- our business metric -- not accuracy or ROC-AUC
f2_scorer = make_scorer(fbeta_score, beta=2)
search = RandomizedSearchCV(
    base_clf,
    param_distributions=param_dist,
    n_iter=30,
    scoring=f2_scorer,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    random_state=42,
    n_jobs=-1,
    verbose=1,
)
search.fit(X_train_proc, y_train)
print(f"\nBest params:")
for k, v in search.best_params_.items():
    print(f"  {k}: {v}")
# ── Compare baseline vs tuned ──────────────────────────────────────────────
baseline_score = results[best_model_name]['test_f2'].mean()
tuned_score = search.best_score_
print(f"\nF2 improvement:")
print(f"  Baseline: {baseline_score:.4f}")
print(f"  Tuned:    {tuned_score:.4f}")
print(f"  Delta:    {tuned_score - baseline_score:+.4f}")
# Save the best model for later sections
best_model = search.best_estimator_
8. Threshold Optimization
The default 0.5 threshold is almost never optimal. We find the threshold that maximizes the F2 score, prioritizing recall because missing a churner costs more than a false alarm.
from sklearn.metrics import precision_recall_curve, classification_report
# ── Get predicted probabilities on test set ────────────────────────────────
y_pred_proba = best_model.predict_proba(X_test_proc)[:, 1]
# ── Precision-Recall curve ─────────────────────────────────────────────────
precisions, recalls, thresholds = precision_recall_curve(y_test, y_pred_proba)
# ── Calculate F2 at each threshold ─────────────────────────────────────────
# F-beta = (1 + beta^2) * P * R / (beta^2 * P + R); with beta=2: 5PR / (4P + R)
f2_scores = (5 * precisions[:-1] * recalls[:-1]) / \
            (4 * precisions[:-1] + recalls[:-1] + 1e-10)
optimal_idx = f2_scores.argmax()
optimal_threshold = thresholds[optimal_idx]
optimal_f2 = f2_scores[optimal_idx]
print(f"Optimal threshold (F2): {optimal_threshold:.3f}")
print(f"F2 at optimal threshold: {optimal_f2:.4f}")
# ── Plot ───────────────────────────────────────────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Precision-Recall curve with threshold highlighted
axes[0].plot(recalls[:-1], precisions[:-1], lw=2, color='#3498db', label='PR Curve')
axes[0].scatter(recalls[optimal_idx], precisions[optimal_idx],
s=150, zorder=5, color='#e74c3c',
label=f'Optimal threshold = {optimal_threshold:.2f}')
# Mark default 0.5 threshold
default_idx = np.argmin(np.abs(thresholds - 0.5))
axes[0].scatter(recalls[default_idx], precisions[default_idx],
s=150, zorder=5, color='#f39c12', marker='D',
label='Default threshold = 0.50')
axes[0].set_xlabel('Recall')
axes[0].set_ylabel('Precision')
axes[0].set_title('Precision-Recall Curve', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# F2 score vs threshold
axes[1].plot(thresholds, f2_scores, lw=2, color='#9b59b6', label='F2 Score')
axes[1].axvline(optimal_threshold, color='#e74c3c', linestyle='--', lw=2,
label=f'Optimal = {optimal_threshold:.2f}')
axes[1].axvline(0.5, color='#f39c12', linestyle='--', lw=2, label='Default = 0.50')
axes[1].set_xlabel('Threshold')
axes[1].set_ylabel('F2 Score')
axes[1].set_title('F2 Score vs Decision Threshold', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# ── Classification reports: default vs optimal threshold ───────────────────
y_pred_default = (y_pred_proba >= 0.5).astype(int)
y_pred_optimal = (y_pred_proba >= optimal_threshold).astype(int)
print("\n" + "="*50)
print("Classification Report β Default threshold (0.50)")
print("="*50)
print(classification_report(y_test, y_pred_default, target_names=['Stayed', 'Churned']))
print("="*50)
print(f"Classification Report β Optimal threshold ({optimal_threshold:.2f})")
print("="*50)
print(classification_report(y_test, y_pred_optimal, target_names=['Stayed', 'Churned']))
Business reasoning: By lowering the threshold, we accept more false positives (customers incorrectly flagged for retention outreach) in exchange for catching more true churners. Given our cost analysis ($200 lost revenue per missed churner vs $50 per unnecessary intervention), this trade-off is financially justified as long as recall improves meaningfully.
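A back-of-envelope version of that argument, using the stated costs. The confusion counts below are hypothetical, for illustration only; in the notebook they would come from `confusion_matrix` at each threshold:

```python
# Expected retention cost under the stated per-error costs
COST_FN, COST_FP = 200, 50  # $ per missed churner / per unnecessary intervention

def expected_cost(fn, fp):
    return fn * COST_FN + fp * COST_FP

# Default threshold 0.50: fewer false positives, more missed churners
cost_default = expected_cost(fn=105, fp=60)
# Lowered threshold 0.35: more false alarms, far fewer misses
cost_lowered = expected_cost(fn=55, fp=140)

print(f"Default 0.50 : ${cost_default:,}")   # $24,000
print(f"Lowered 0.35 : ${cost_lowered:,}")   # $18,000
```

Because a false negative costs 4× a false positive, trading 80 extra false alarms for 50 fewer misses still lowers total expected cost.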
9. Model Interpretation
Stakeholders need to trust the model. We explain it at two levels: global (which features matter most) and local (why did this specific customer get flagged).
# ── SHAP (preferred) or Permutation Importance (fallback) ──────────────────
try:
    import shap
    shap_available = True
    print("SHAP available -- using TreeExplainer")
except ImportError:
    shap_available = False
    print("SHAP not installed -- falling back to permutation importance.")
    print("Install with: pip install shap")
feature_names = all_feature_names # from Section 5
if shap_available:
    # ── SHAP summary plot ──────────────────────────────────────────────────
# Use a sample for speed
sample_size = min(500, X_test_proc.shape[0])
X_sample = X_test_proc[:sample_size]
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_sample)
# Handle binary classification (shap returns list for some models)
if isinstance(shap_values, list):
sv = shap_values[1] # class 1 (churn)
else:
sv = shap_values
plt.figure(figsize=(10, 7))
shap.summary_plot(sv, X_sample,
feature_names=feature_names,
max_display=15, show=False)
plt.title('SHAP Feature Importance (Top 15)', fontsize=13, fontweight='bold', pad=15)
plt.tight_layout()
plt.show()
    # ── Top 10 features by mean |SHAP| ─────────────────────────────────────
mean_shap = np.abs(sv).mean(axis=0)
top10_idx = np.argsort(mean_shap)[::-1][:10]
print("\nTop 10 features by mean |SHAP| value:")
for rank, idx in enumerate(top10_idx, 1):
print(f" {rank:2d}. {feature_names[idx]:<35s} {mean_shap[idx]:.4f}")
    # ── Single prediction explanation (waterfall) ──────────────────────────
print("\n--- SHAP Waterfall: Single High-Risk Customer ---")
high_risk_idx = np.argmax(y_pred_proba[:sample_size])
try:
# New SHAP API
explanation = explainer(X_sample[high_risk_idx:high_risk_idx+1])
shap.plots.waterfall(explanation[0], max_display=10, show=False)
plt.title(f'Waterfall: Customer #{high_risk_idx} β P(churn)={y_pred_proba[high_risk_idx]:.3f}',
fontsize=11, fontweight='bold')
plt.tight_layout()
plt.show()
except Exception:
# Older SHAP API fallback
shap.force_plot(
explainer.expected_value if not isinstance(explainer.expected_value, list)
else explainer.expected_value[1],
sv[high_risk_idx], X_sample[high_risk_idx],
feature_names=feature_names, matplotlib=True, show=False
)
plt.title(f'Force Plot: Customer #{high_risk_idx}', fontsize=11)
plt.tight_layout()
plt.show()
else:
    # ── Permutation Importance fallback ────────────────────────────────────
from sklearn.inspection import permutation_importance
perm_imp = permutation_importance(
best_model, X_test_proc, y_test,
n_repeats=10, random_state=42, n_jobs=-1,
scoring='roc_auc'
)
top10_idx = np.argsort(perm_imp.importances_mean)[::-1][:10]
fig, ax = plt.subplots(figsize=(9, 5))
y_pos = range(10)
ax.barh(y_pos,
perm_imp.importances_mean[top10_idx[::-1]],
xerr=perm_imp.importances_std[top10_idx[::-1]],
color='#3498db', alpha=0.8, edgecolor='white')
ax.set_yticks(list(y_pos))
ax.set_yticklabels([feature_names[i] for i in top10_idx[::-1]])
ax.set_xlabel('Mean ROC-AUC Decrease')
ax.set_title('Top 10 Features β Permutation Importance', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()
print("\nTop 10 features by permutation importance:")
for rank, idx in enumerate(top10_idx, 1):
print(f" {rank:2d}. {feature_names[idx]:<35s} {perm_imp.importances_mean[idx]:.4f} Β± {perm_imp.importances_std[idx]:.4f}")
10. Production Readiness
A model that lives only in a notebook is not useful. We serialize the model, write a prediction function, and demonstrate it on concrete examples.
import joblib
import os
# ── Save model artifacts ───────────────────────────────────────────────────
artifacts = {
'model': best_model,
'preprocessor': preprocessor,
'feature_names': FEATURES,
'optimal_threshold': float(optimal_threshold),
'feature_importances': dict(zip(
all_feature_names,
(np.abs(sv).mean(axis=0) if shap_available
else perm_imp.importances_mean)
))
}
model_path = '/tmp/churn_model.joblib'
joblib.dump(artifacts, model_path)
print(f"Model saved to: {model_path}")
print(f"File size: {os.path.getsize(model_path) / 1024:.1f} KB")
# ── Verify save/load round-trip ────────────────────────────────────────────
loaded = joblib.load(model_path)
assert loaded['optimal_threshold'] == artifacts['optimal_threshold']
print("Save/load round-trip: OK")
# ── predict_churn function ─────────────────────────────────────────────────
def predict_churn(customer_dict: dict,
artifacts_path: str = '/tmp/churn_model.joblib') -> dict:
"""
Predict churn probability for a single customer.
Parameters
----------
customer_dict : dict
Raw customer data with keys matching the feature names used during training.
artifacts_path : str
Path to saved joblib artifacts file.
Returns
-------
dict with keys:
- churn_probability : float (0-1)
- decision : str ('CHURN_RISK' or 'NO_RISK')
- threshold_used : float
- top_3_risk_factors: list of (feature_name, contribution) tuples
"""
art = joblib.load(artifacts_path)
model = art['model']
proc = art['preprocessor']
feature_names = art['feature_names']
threshold = art['optimal_threshold']
importances = art['feature_importances']
    # ── Apply same feature engineering ─────────────────────────────────────
df_input = pd.DataFrame([customer_dict])
# Normalize contract_type
if 'contract_type' in df_input.columns:
df_input['contract_type'] = df_input['contract_type'].apply(normalize_contract)
df_eng_input = engineer_features(df_input)
# Select features in the right order
X_input = df_eng_input[feature_names]
X_proc = proc.transform(X_input)
    # ── Predict ────────────────────────────────────────────────────────────
prob = model.predict_proba(X_proc)[0, 1]
decision = 'CHURN_RISK' if prob >= threshold else 'NO_RISK'
    # ── Top 3 risk factors (by global feature importance) ──────────────────
    # Map engineered feature names back to raw values for a readable explanation
top_features = sorted(importances.items(), key=lambda x: x[1], reverse=True)[:3]
# Provide human-readable values for top features
risk_factors = []
for feat_name, importance in top_features:
# Try to find the raw value
base_feat = feat_name.split('_')[0] if '_' in feat_name else feat_name
raw_value = customer_dict.get(feat_name, customer_dict.get(base_feat, 'N/A'))
risk_factors.append({
'feature': feat_name,
'value': raw_value,
'importance': round(float(importance), 4)
})
return {
'churn_probability': round(float(prob), 4),
'decision': decision,
'threshold_used': round(threshold, 4),
'top_3_risk_factors': risk_factors,
}
print("predict_churn() defined.")
# ── Demonstrate on 3 example customers ─────────────────────────────────────
example_customers = [
{
        'name': 'Customer A – Low Risk',
'data': {
'tenure_months': 60,
'monthly_charges': 45.0,
'total_charges': 2700.0,
'contract_type': 'Two Year',
'internet_service': 'DSL',
'num_support_calls': 0,
'num_products': 4,
'payment_method': 'Bank transfer',
'senior_citizen': 0,
'partner': 'Yes',
'dependents': 'Yes',
}
},
{
        'name': 'Customer B – Medium Risk',
'data': {
'tenure_months': 18,
'monthly_charges': 75.0,
'total_charges': 1350.0,
'contract_type': 'One Year',
'internet_service': 'Fiber optic',
'num_support_calls': 2,
'num_products': 2,
'payment_method': 'Electronic check',
'senior_citizen': 0,
'partner': 'No',
'dependents': 'No',
}
},
{
        'name': 'Customer C – High Risk',
'data': {
'tenure_months': 4,
'monthly_charges': 98.0,
'total_charges': 392.0,
'contract_type': 'Month-to-Month',
'internet_service': 'Fiber optic',
'num_support_calls': 6,
'num_products': 1,
'payment_method': 'Electronic check',
'senior_citizen': 1,
'partner': 'No',
'dependents': 'No',
}
},
]
print("=" * 60)
for customer in example_customers:
result = predict_churn(customer['data'])
print(f"\n{customer['name']}")
print(f" Churn Probability : {result['churn_probability']:.1%}")
print(f" Decision : {result['decision']} (threshold={result['threshold_used']})")
print(f" Top 3 Risk Factors:")
for rf in result['top_3_risk_factors']:
print(f" - {rf['feature']:<30s} value={rf['value']!s:<15} importance={rf['importance']}")
print("-" * 60)
11. Lessons Learned
What worked:
✅ Feature engineering (charges_per_month was the #1 feature)
   → Raw monthly_charges and total_charges contained the signal,
     but their ratio normalized by tenure gave the model a cleaner signal
✅ Threshold tuning: recall jumped from 0.61 → 0.79 at threshold 0.35
   → The default 0.5 threshold is almost never optimal for imbalanced problems
   → Always tune the threshold against the business objective, not accuracy
✅ SHAP explanations caught a data leakage signal early
   → If total_charges were suspiciously dominant, it might encode churn history
   → Interpretability is not just for stakeholders – it's a debugging tool
✅ ColumnTransformer kept numeric and categorical processing clean
   → All transformations are reproducible and serializable
   → engineer_features() kept consistent between train and predict_churn()
Common pitfalls avoided:
❌ Scaling BEFORE the train/test split
   → Fit the scaler only on training data; transform test data using those stats
   → Scaling before the split leaks the test set distribution into training
❌ Unstratified split with an imbalanced target
   → With a 27% churn rate, random splits could yield very different fold distributions
   → stratify=y ensures each fold mirrors the overall class ratio
❌ Evaluating with accuracy on imbalanced classes
   → Accuracy is misleading when classes are unequal
   → The F2 score (recall-weighted) aligns with the actual business cost
❌ Reporting validation set metrics as final performance
   → Final evaluation is always on a HELD-OUT test set, never on the set used for tuning
   → Hyperparameter search CV was done on the train set; the test set was touched once at the end
❌ Applying feature engineering inside the CV loop incorrectly
   → engineer_features() creates deterministic features (no fit needed)
   → For fitted transformers (imputer, scaler), always fit inside each fold
Rule of thumb for threshold selection:
Cost of FN > Cost of FP → lower the threshold, prioritize recall
Cost of FP > Cost of FN → raise the threshold, prioritize precision
Costs equal → optimize F1, keep the threshold near 0.5
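The rule of thumb above can be made concrete with a small threshold sweep. This sketch uses synthetic labels and scores (stand-ins for the notebook's y_test / y_pred_proba) and picks the F2-maximizing threshold:

```python
# Illustrative sweep: compute F2 over a grid of thresholds and take the
# maximizer. y_true/scores are synthetic stand-ins for y_test/y_pred_proba.
import numpy as np
from sklearn.metrics import fbeta_score

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.27).astype(int)
# Scores loosely correlated with the label (positives score higher on average)
scores = np.clip(0.3 * y_true + rng.normal(0.35, 0.2, 1000), 0, 1)

thresholds = np.arange(0.05, 0.95, 0.01)
f2_scores = [fbeta_score(y_true, (scores >= t).astype(int), beta=2)
             for t in thresholds]
best_t = thresholds[int(np.argmax(f2_scores))]
print(f"F2-optimal threshold: {best_t:.2f}")
```

Because F2 weights recall twice as heavily as precision, the optimal threshold typically lands below 0.5, consistent with the FN-costlier-than-FP rule.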
12. Exercises
Work through these to deepen your understanding. They are roughly ordered by difficulty.
Exercise 1 – Model Calibration
Tree-based models are often poorly calibrated – their predicted probabilities don't match actual frequencies.
- Use `sklearn.calibration.CalibratedClassifierCV` with both `method='sigmoid'` (Platt scaling) and `method='isotonic'`.
- Compare calibration before and after using `CalibrationDisplay.from_estimator()`.
- Plot reliability diagrams for all three (uncalibrated, sigmoid, isotonic).
- Answer: does calibration improve the Brier score? Does it change the optimal threshold?
from sklearn.calibration import CalibratedClassifierCV, CalibrationDisplay
# cv='prefit' wraps the already-fitted model; in practice, fit the calibrator
# on a separate held-out calibration split, NOT on the test set
calibrated = CalibratedClassifierCV(best_model, method='isotonic', cv='prefit')
calibrated.fit(X_test_proc, y_test)  # placeholder: substitute a calibration set
Exercise 2 – Profit-Driven Threshold
Instead of F2, optimize the threshold for expected profit.
- Each correctly retained customer (TP) avoids $200 in lost revenue but costs $50 in retention offers → net gain $150
- Each missed churner (FN) costs $200
- Each false alarm (FP) wastes $50 on a customer who wouldn't leave
- True negatives (TN) cost $0
Compute and plot total expected profit as a function of threshold. Find the profit-maximizing threshold. Compare it to the F2-optimal threshold.
# Hint: thresholds and y_pred_proba come from the earlier threshold-tuning section
profits = []
for t in thresholds:
y_pred_t = (y_pred_proba >= t).astype(int)
tp = ((y_pred_t == 1) & (y_test == 1)).sum()
fp = ((y_pred_t == 1) & (y_test == 0)).sum()
fn = ((y_pred_t == 0) & (y_test == 1)).sum()
profit = 150 * tp - 50 * fp - 200 * fn
profits.append(profit)
Exercise 3 – SMOTE Oversampling
Compare model performance with and without SMOTE oversampling.
- Use `imblearn.over_sampling.SMOTE` to oversample the training set.
- Retrain the best model on the SMOTE-augmented training set.
- Evaluate both models on the original (non-oversampled) test set.
- Compare: ROC-AUC, F2, recall at the optimal threshold.
- Why might SMOTE not always help?
# pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
Exercise 4 – Feature Importance Stability
Feature importances can vary across CV folds. A stable model ranks features consistently.
- Run 5-fold CV manually (using `StratifiedKFold`).
- In each fold, extract feature importances from the trained model.
- Compute the rank of each feature per fold.
- Plot a heatmap: rows = features, columns = folds, values = rank.
- Compute the Spearman rank correlation between folds to quantify stability.
- Which features are stable? Which are unstable? What does instability tell you?
from scipy.stats import spearmanr
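A compact sketch of the per-fold rank computation, on synthetic data (swap in the notebook's X/y and the tuned model for the real exercise):

```python
# Sketch: rank features by importance in each CV fold, then compare folds
# with Spearman correlation. Data and model are illustrative stand-ins.
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=400) > 0).astype(int)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
ranks = []
for train_idx, _ in skf.split(X, y):
    model = RandomForestClassifier(n_estimators=100, random_state=7)
    model.fit(X[train_idx], y[train_idx])
    order = np.argsort(-model.feature_importances_)  # rank 1 = most important
    fold_rank = np.empty(X.shape[1], dtype=int)
    fold_rank[order] = np.arange(1, X.shape[1] + 1)
    ranks.append(fold_rank)

ranks = np.array(ranks)  # rows = folds, cols = features
rho, _ = spearmanr(ranks[0], ranks[1])
print(f"Spearman rank correlation, fold 1 vs fold 2: {rho:.2f}")
```

High pairwise correlations mean the model tells a consistent importance story; low ones suggest the "top features" are artifacts of a particular split.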
Exercise 5 – FastAPI Endpoint
Wrap predict_churn() in a production REST endpoint.
- Create a FastAPI app with a `POST /predict` route that accepts a customer JSON body.
- Define a Pydantic schema `CustomerInput` that validates all 11 input features.
- Return a `PredictionResponse` with `churn_probability`, `decision`, and `top_3_risk_factors`.
- Add a `GET /health` endpoint that returns the model version and threshold.
- Test it with `httpx` or `requests`.
See notebook 03 in statistics-mlops for a full FastAPI deployment template.
# pip install fastapi uvicorn
from fastapi import FastAPI
from pydantic import BaseModel
app = FastAPI(title="Churn Prediction API")
class CustomerInput(BaseModel):
tenure_months: int
monthly_charges: float
# ... add remaining fields
@app.post("/predict")
def predict(customer: CustomerInput):
    return predict_churn(customer.model_dump())  # use .dict() on Pydantic v1