End-to-End ML Project: Customer Churn Prediction
This notebook walks through a complete machine learning project from raw data to deployed model, the way it's actually done in industry. No shortcuts. Every step explained.
1. Problem Framing
Business Context
A telecom company is losing customers to competitors. Customer churn (when a customer cancels their subscription) is expensive. Acquiring a new customer costs 5–7× more than retaining an existing one.
The ask: Build a model that identifies customers at risk of churning within the next 30 days, so the retention team can proactively reach out.
Success Metric: Why Not Accuracy?
The dataset has a ~27% churn rate. A model that predicts "no churn" for everyone achieves 73% accuracy and is completely useless. We need metrics that reflect the class imbalance.

| Metric | Definition | When to use |
|---|---|---|
| Precision | Of predicted churners, how many actually churned | When false positives are costly (e.g., expensive interventions) |
| Recall | Of actual churners, how many did we catch | When false negatives are costly (e.g., missing a churner = lost revenue) |
| F1 | Harmonic mean of precision and recall | When you want balance |
| F2 | Recall-weighted harmonic mean | Our choice: catching churners matters more |
| ROC-AUC | Ranking quality across all thresholds | Good for model comparison |
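To make the accuracy trap concrete, here is a minimal sketch on synthetic labels with roughly our churn rate. A majority-class "model" scores well on accuracy while catching zero churners:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, fbeta_score

rng = np.random.default_rng(0)
y_true = (rng.uniform(size=1000) < 0.27).astype(int)  # ~27% positives, like our data

# A "model" that always predicts the majority class (no churn)
y_majority = np.zeros_like(y_true)

print(f"Accuracy: {accuracy_score(y_true, y_majority):.2f}")  # ~0.73 -- looks fine
print(f"Recall:   {recall_score(y_true, y_majority, zero_division=0):.2f}")  # 0.00
print(f"F2:       {fbeta_score(y_true, y_majority, beta=2, zero_division=0):.2f}")  # 0.00
```

Accuracy alone would wave this model through; recall and F2 immediately expose it.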
Cost of Errors
False Negative (missed churner):
The customer churns. We lose ~$200 in annual revenue. No chance to intervene.
Cost: HIGH. This is what we're trying to prevent.
False Positive (flagged as churner, but stays):
We offer a retention discount unnecessarily. Cost: ~$50 in promotions.
Cost: MODERATE. Wasteful but not catastrophic.
Conclusion: We prioritize recall (catch as many churners as possible) while keeping precision reasonable. The F2 score reflects this: it weights recall twice as heavily as precision.
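A tiny worked example of that weighting, using two hypothetical models on the same ten labels: one that favors precision and one that favors recall. F1 ranks them close together; F2 strongly prefers the recall-heavy model:

```python
from sklearn.metrics import fbeta_score

y_true        = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_precise     = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # P=1.00, R=0.50
y_high_recall = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # P=0.67, R=1.00

for name, y_pred in [('precision-heavy', y_precise), ('recall-heavy', y_high_recall)]:
    f1 = fbeta_score(y_true, y_pred, beta=1)
    f2 = fbeta_score(y_true, y_pred, beta=2)
    print(f"{name}: F1={f1:.2f}  F2={f2:.2f}")
# precision-heavy: F1=0.67  F2=0.56
# recall-heavy:    F1=0.80  F2=0.91
```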
Definition of Done
F2 score ≥ 0.70 on held-out test set
Recall ≥ 0.75 (we catch at least 3 in 4 churners)
Model explainable to non-technical stakeholders
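The two quantitative bars can be encoded as a release gate. This is a hypothetical helper (not part of the notebook's pipeline); `metrics` would come from the final test-set evaluation:

```python
# Hypothetical release gate mirroring the Definition of Done above.
def meets_definition_of_done(metrics: dict) -> bool:
    """Return True only if the test-set metrics clear every DoD bar."""
    return metrics.get('f2', 0.0) >= 0.70 and metrics.get('recall', 0.0) >= 0.75

print(meets_definition_of_done({'f2': 0.74, 'recall': 0.79}))  # True
print(meets_definition_of_done({'f2': 0.74, 'recall': 0.61}))  # False
```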
2. Data Generation
We create a realistic synthetic telecom dataset (~5000 customers). Real-world messiness included: missing values, outliers, string inconsistencies.
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)
n = 5000
# ── Contract type (drives churn strongly) ──────────────────────────────────
contract_raw = np.random.choice(
    ['Month-to-Month', 'One Year', 'Two Year'],
    size=n, p=[0.55, 0.25, 0.20]
)
# Introduce string inconsistencies (real data is messy).
# Each noisy row gets its own randomly chosen messy variant.
noise_idx = np.random.choice(n, size=int(n * 0.06), replace=False)
noise_variants = {
    'Month-to-Month': ['month-to-month', 'Monthly', 'month to month'],
    'One Year': ['one year', 'One-Year', '1 year'],
    'Two Year': ['two year', 'Two-Year', '2 year'],
}
contract_raw = contract_raw.astype(object)
for i in noise_idx:
    contract_raw[i] = np.random.choice(noise_variants[contract_raw[i]])
# ── Other categorical features ─────────────────────────────────────────────
internet_service = np.random.choice(
['Fiber optic', 'DSL', 'No'],
size=n, p=[0.44, 0.34, 0.22]
)
payment_method = np.random.choice(
['Electronic check', 'Mailed check', 'Bank transfer', 'Credit card'],
size=n, p=[0.34, 0.23, 0.22, 0.21]
)
senior_citizen = np.random.choice([0, 1], size=n, p=[0.84, 0.16])
partner = np.random.choice(['Yes', 'No'], size=n, p=[0.48, 0.52])
dependents = np.random.choice(['Yes', 'No'], size=n, p=[0.30, 0.70])
# ── Numeric features ───────────────────────────────────────────────────────
tenure_months = np.random.exponential(scale=24, size=n).clip(1, 72).astype(int)
monthly_charges = np.random.normal(65, 30, size=n).clip(18, 120)
# Outliers in monthly_charges (~2% of data)
outlier_idx = np.random.choice(n, size=int(n * 0.02), replace=False)
monthly_charges[outlier_idx] = np.random.uniform(150, 300, size=len(outlier_idx))
total_charges = tenure_months * monthly_charges + np.random.normal(0, 50, size=n)
total_charges = total_charges.clip(0)
num_support_calls = np.random.poisson(lam=1.5, size=n).clip(0, 10)
num_products = np.random.choice([1, 2, 3, 4, 5], size=n, p=[0.15, 0.25, 0.35, 0.15, 0.10])
# ── Churn probability (domain-informed) ────────────────────────────────────
# Normalize the messy contract strings before computing the churn signal
contract_clean = np.array([
    'Month-to-Month' if 'month' in str(c).lower()
    else ('One Year' if '1' in str(c) or 'one' in str(c).lower()
          else 'Two Year')
    for c in contract_raw
])
contract_num = np.where(contract_clean == 'Month-to-Month', 1.0,
                        np.where(contract_clean == 'One Year', 0.3, 0.1))
logit = (
-2.5
+ 1.8 * contract_num
+ 0.015 * monthly_charges
- 0.03 * tenure_months
+ 0.25 * num_support_calls
- 0.2 * num_products
+ 0.5 * (internet_service == 'Fiber optic').astype(float)
+ 0.4 * senior_citizen
+ np.random.normal(0, 0.3, size=n)
)
churn_prob = 1 / (1 + np.exp(-logit))
churned = (np.random.uniform(size=n) < churn_prob).astype(int)
# ── Assemble DataFrame ─────────────────────────────────────────────────────
df = pd.DataFrame({
'tenure_months': tenure_months,
'monthly_charges': np.round(monthly_charges, 2),
'total_charges': np.round(total_charges, 2),
'contract_type': contract_raw,
'internet_service': internet_service,
'num_support_calls': num_support_calls,
'num_products': num_products,
'payment_method': payment_method,
'senior_citizen': senior_citizen,
'partner': partner,
'dependents': dependents,
'churned': churned,
})
# Introduce ~8% missing values in total_charges (MCAR)
missing_idx = np.random.choice(n, size=int(n * 0.08), replace=False)
df.loc[missing_idx, 'total_charges'] = np.nan
print(f"Dataset shape: {df.shape}")
print(f"Churn rate: {df['churned'].mean():.1%}")
print(f"Missing values in total_charges: {df['total_charges'].isna().sum()} ({df['total_charges'].isna().mean():.1%})")
df.head()
df.describe()
3. Exploratory Data Analysis
Before building models, understand the data. Every insight here informs feature engineering and model selection.
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
sns.set_theme(style='whitegrid', palette='muted')
plt.rcParams['figure.dpi'] = 120
# ── 1. Churn rate by contract type ─────────────────────────────────────────
# Normalize contract_type for display
contract_display = df['contract_type'].str.strip().str.lower()
contract_display = contract_display.map(lambda x:
'Month-to-Month' if 'month' in x
else ('One Year' if 'one' in x or '1' in x
else 'Two Year')
)
churn_by_contract = df.groupby(contract_display)['churned'].mean().sort_values(ascending=False)
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
# Bar chart
bars = axes[0].bar(churn_by_contract.index, churn_by_contract.values * 100,
color=['#e74c3c', '#e67e22', '#27ae60'], edgecolor='white', linewidth=1.5)
for bar, val in zip(bars, churn_by_contract.values):
axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
f'{val:.1%}', ha='center', va='bottom', fontweight='bold', fontsize=10)
axes[0].set_title('Churn Rate by Contract Type', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Churn Rate (%)')
axes[0].set_ylim(0, 55)
axes[0].tick_params(axis='x', rotation=15)
# ── 2. Monthly charges distribution (churned vs not) ───────────────────────
df_valid = df[df['monthly_charges'] <= 130] # remove extreme outliers for viz
axes[1].hist(df_valid[df_valid['churned']==0]['monthly_charges'], bins=40,
alpha=0.6, label='Stayed', color='#3498db', edgecolor='white')
axes[1].hist(df_valid[df_valid['churned']==1]['monthly_charges'], bins=40,
alpha=0.6, label='Churned', color='#e74c3c', edgecolor='white')
axes[1].set_title('Monthly Charges Distribution', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Monthly Charges ($)')
axes[1].set_ylabel('Count')
axes[1].legend()
# ── 3. Churn rate by num_support_calls ─────────────────────────────────────
churn_by_calls = df.groupby('num_support_calls')['churned'].mean()
axes[2].plot(churn_by_calls.index, churn_by_calls.values * 100,
marker='o', color='#9b59b6', linewidth=2, markersize=7)
axes[2].fill_between(churn_by_calls.index, churn_by_calls.values * 100, alpha=0.2, color='#9b59b6')
axes[2].set_title('Churn Rate vs Support Calls', fontsize=12, fontweight='bold')
axes[2].set_xlabel('Number of Support Calls')
axes[2].set_ylabel('Churn Rate (%)')
plt.tight_layout()
plt.show()
# ── Correlation heatmap (numeric features only) ────────────────────────────
numeric_cols = ['tenure_months', 'monthly_charges', 'total_charges',
'num_support_calls', 'num_products', 'senior_citizen', 'churned']
corr = df[numeric_cols].corr()
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', cmap='RdBu_r',
center=0, vmin=-1, vmax=1, ax=axes[0],
linewidths=0.5, cbar_kws={'shrink': 0.8})
axes[0].set_title('Correlation Heatmap (Numeric Features)', fontsize=12, fontweight='bold')
axes[0].tick_params(axis='x', rotation=45)
# ── Missing value heatmap ──────────────────────────────────────────────────
missing = df.isnull()
# Sample 200 rows for readability
sample_missing = missing.sample(200, random_state=42)
sns.heatmap(sample_missing, cbar=False, ax=axes[1],
cmap=['#2ecc71', '#e74c3c'], yticklabels=False)
axes[1].set_title('Missing Value Map (200-row sample)\nRed = Missing', fontsize=12, fontweight='bold')
axes[1].tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()
print("\nMissing value summary:")
print(df.isnull().sum()[df.isnull().sum() > 0])
EDA Key Insights
Contract type is the strongest signal. Month-to-month customers churn at ~42%, compared to ~12% for one-year and ~5% for two-year contracts. This single feature alone is highly predictive.
Monthly charges are higher for churners. Customers who leave tend to be on more expensive plans, likely Fiber optic internet, which also carries higher churn risk. This is a collinearity worth noting.
Support calls are a leading indicator. Churn rate climbs steadily with the number of support calls; 5+ calls nearly doubles churn probability. This suggests frustrated customers.
Missing data is isolated to total_charges. About 8% of rows are missing. The pattern appears random (MCAR), so median imputation is safe. If it were MNAR, we'd need a more careful approach.
No severe multicollinearity except tenure ↔ total_charges (expected: longer tenure = more total spend). We'll log-transform total_charges to reduce this effect.
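The MCAR claim can be sanity-checked: if missingness is unrelated to the target, churn rate should look similar in rows with and without the value. A sketch on toy data (the real notebook would group `df` the same way):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'total_charges': rng.uniform(20, 5000, size=2000),
    'churned': (rng.uniform(size=2000) < 0.27).astype(int),
})
# Knock out 8% of values completely at random (MCAR by construction)
toy.loc[rng.choice(2000, size=160, replace=False), 'total_charges'] = np.nan

# Churn rate conditional on missingness -- similar rates are consistent with MCAR
rates = toy.groupby(toy['total_charges'].isna())['churned'].mean()
print(rates)
```

A large gap between the two rates would suggest the missingness carries signal (MAR/MNAR) and that an indicator feature or a more careful imputation is warranted.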
4. Data Cleaning Pipeline
Systematic cleaning before splitting. Deterministic rules (string normalization, outlier removal) run here; statistical steps like imputation are deferred to the pipeline, which is fitted after the split to avoid leakage.
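The "fit on train only" rule for imputation, in miniature. The median used to fill test-set gaps must come from the training split; this toy sketch shows the pattern the pipeline in Section 5 automates:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
values = pd.Series(rng.normal(100, 20, size=500))
values[rng.choice(500, size=40, replace=False)] = np.nan

train, test = values.iloc[:400], values.iloc[400:]
train_median = train.median()            # statistic learned from train only
test_filled = test.fillna(train_median)  # applied to test -- no peeking
print(f"Train median used for imputation: {train_median:.1f}")
```

Computing the median over the full series before splitting would leak test-set information into the fill value.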
print(f"Rows before cleaning: {len(df)}")
print(f"\nUnique contract_type values (raw):")
print(df['contract_type'].value_counts())
# ── Step 1: Fix string inconsistencies in contract_type ────────────────────
def normalize_contract(s):
    s = str(s).strip().lower().replace('-', ' ').replace('_', ' ')
    if 'month' in s:  # covers 'monthly' and 'month to month'
        return 'Month-to-Month'
    elif 'one' in s or s == '1 year':
        return 'One Year'
    elif 'two' in s or s == '2 year':
        return 'Two Year'
    return s  # unknown value -- will surface later as an unexpected category
df['contract_type'] = df['contract_type'].apply(normalize_contract)
print(f"\nUnique contract_type values (cleaned):")
print(df['contract_type'].value_counts())
# ── Step 2: Remove outliers in monthly_charges (IQR method) ────────────────
Q1 = df['monthly_charges'].quantile(0.25)
Q3 = df['monthly_charges'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 3.0 * IQR # 3x IQR to be conservative
upper_bound = Q3 + 3.0 * IQR
outlier_mask = (df['monthly_charges'] < lower_bound) | (df['monthly_charges'] > upper_bound)
print(f"\nOutliers removed (monthly_charges outside [{lower_bound:.1f}, {upper_bound:.1f}]): {outlier_mask.sum()}")
df = df[~outlier_mask].copy()
print(f"\nRows after outlier removal: {len(df)}")
# ── Step 3: Handle missing values ──────────────────────────────────────────
# Numeric: median impute
# Categorical: mode impute
# We record fill values here β they'll be recomputed from TRAIN set only
print(f"\nMissing values before imputation:")
print(df.isnull().sum()[df.isnull().sum() > 0])
# Note: actual imputation happens in the pipeline (Section 5) to prevent leakage
# Here we just confirm the cleaning steps work
print(f"\nFinal dataset: {len(df)} rows, {df.shape[1]} columns")
print(f"Churn rate after cleaning: {df['churned'].mean():.1%}")
5. Feature Engineering
Good features beat fancy algorithms. We derive domain-informed features and build a scikit-learn Pipeline that prevents leakage.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
# ── Feature engineering function ───────────────────────────────────────────
def engineer_features(df_in, high_value_threshold=None):
    d = df_in.copy()
    # 1. charges_per_month: normalizes total_charges by tenure
    d['charges_per_month'] = d['total_charges'] / (d['tenure_months'] + 1)  # +1 avoids div-by-zero
    # 2. log1p of total_charges (reduces right skew, handles 0s)
    d['log_total_charges'] = np.log1p(d['total_charges'].fillna(0))
    # 3. Tenure bucket (ordinal lifecycle stage)
    d['tenure_bucket'] = pd.cut(
        d['tenure_months'],
        bins=[0, 12, 36, 72],
        labels=['new', 'mid', 'long']
    ).astype(str)
    # 4. High-value customer flag (top 25% by monthly charges).
    #    The cutoff comes from the data passed in unless given explicitly.
    #    When scoring one row at a time, pass the training-time threshold:
    #    a single row's quantile is its own value, so the flag would always be 1.
    if high_value_threshold is None:
        high_value_threshold = d['monthly_charges'].quantile(0.75)
    d['high_value'] = (d['monthly_charges'] >= high_value_threshold).astype(int)
    # 5. Engagement score (fewer calls, more products = more engaged)
    d['engagement_score'] = d['num_products'] - d['num_support_calls']
    return d
# Apply feature engineering
df_eng = engineer_features(df)
# ── Train/test split (stratified) ──────────────────────────────────────────
FEATURES = [
# Original numeric
'tenure_months', 'monthly_charges', 'total_charges',
'num_support_calls', 'num_products', 'senior_citizen',
# Engineered numeric
'charges_per_month', 'log_total_charges', 'high_value', 'engagement_score',
# Categorical
'contract_type', 'internet_service', 'payment_method',
'partner', 'dependents', 'tenure_bucket',
]
TARGET = 'churned'
X = df_eng[FEATURES]
y = df_eng[TARGET]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")
print(f"Train churn rate: {y_train.mean():.1%} | Test churn rate: {y_test.mean():.1%}")
# ── Build preprocessing pipeline ───────────────────────────────────────────
numeric_features = [
'tenure_months', 'monthly_charges', 'total_charges',
'num_support_calls', 'num_products', 'senior_citizen',
'charges_per_month', 'log_total_charges', 'high_value', 'engagement_score',
]
categorical_features = [
'contract_type', 'internet_service', 'payment_method',
'partner', 'dependents', 'tenure_bucket',
]
numeric_transformer = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler()),
])
categorical_transformer = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])
preprocessor = ColumnTransformer([
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features),
])
# Fit on train, transform both
preprocessor.fit(X_train)
X_train_proc = preprocessor.transform(X_train)
X_test_proc = preprocessor.transform(X_test)
# ── Show feature names ─────────────────────────────────────────────────────
cat_feature_names = preprocessor.named_transformers_['cat']['onehot'].get_feature_names_out(categorical_features)
all_feature_names = numeric_features + list(cat_feature_names)
print(f"\nTotal features after transformation: {len(all_feature_names)}")
print("\nFeature names:")
for i, name in enumerate(all_feature_names):
print(f" {i:2d}. {name}")
6. Model Selection
We compare four models using 5-fold stratified cross-validation. No hyperparameter tuning yet: this is a fair baseline comparison.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.metrics import make_scorer, f1_score, precision_score, recall_score, fbeta_score
# Try to import XGBoost
try:
from xgboost import XGBClassifier
xgb_available = True
print("XGBoost available")
except ImportError:
xgb_available = False
    print("XGBoost not installed -- skipping. Install with: pip install xgboost")
# ── Define scorers ─────────────────────────────────────────────────────────
scoring = {
'roc_auc': 'roc_auc',
'f1': make_scorer(f1_score),
'precision': make_scorer(precision_score),
'recall': make_scorer(recall_score),
'f2': make_scorer(fbeta_score, beta=2),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# ── Define models ──────────────────────────────────────────────────────────
models = {
'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
}
if xgb_available:
models['XGBoost'] = XGBClassifier(
n_estimators=100, random_state=42,
eval_metric='logloss', verbosity=0
)
# ── Cross-validate each model ──────────────────────────────────────────────
results = {}
for name, model in models.items():
print(f"Evaluating {name}...", end=' ')
cv_results = cross_validate(model, X_train_proc, y_train,
cv=cv, scoring=scoring, n_jobs=-1)
results[name] = cv_results
print(f"ROC-AUC={cv_results['test_roc_auc'].mean():.3f}")
# ── Build comparison table ─────────────────────────────────────────────────
rows = []
for name, res in results.items():
    rows.append({
        'Model': name,
        'ROC-AUC': f"{res['test_roc_auc'].mean():.3f} ± {res['test_roc_auc'].std():.3f}",
        'F1': f"{res['test_f1'].mean():.3f} ± {res['test_f1'].std():.3f}",
        'F2': f"{res['test_f2'].mean():.3f} ± {res['test_f2'].std():.3f}",
        'Precision': f"{res['test_precision'].mean():.3f} ± {res['test_precision'].std():.3f}",
        'Recall': f"{res['test_recall'].mean():.3f} ± {res['test_recall'].std():.3f}",
    })
results_df = pd.DataFrame(rows).set_index('Model')
print("\n" + "="*70)
print("5-Fold Cross-Validation Results (mean ± std)")
print("="*70)
print(results_df.to_string())
7. Hyperparameter Tuning
We tune the best-performing tree model (GradientBoosting or XGBoost) with RandomizedSearchCV, which is more efficient than GridSearchCV for high-dimensional parameter spaces.
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
# ── Select best base model ─────────────────────────────────────────────────
best_model_name = max(
results.keys(),
key=lambda k: results[k]['test_roc_auc'].mean()
)
print(f"Best base model: {best_model_name}")
# ── Define search space ────────────────────────────────────────────────────
if 'XGBoost' in best_model_name and xgb_available:
base_clf = XGBClassifier(random_state=42, eval_metric='logloss', verbosity=0)
param_dist = {
'n_estimators': randint(100, 400),
'max_depth': randint(3, 8),
'learning_rate': uniform(0.01, 0.2),
'subsample': uniform(0.6, 0.4),
'colsample_bytree': uniform(0.6, 0.4),
'min_child_weight': randint(1, 10),
}
elif 'Random Forest' in best_model_name:
base_clf = RandomForestClassifier(random_state=42, n_jobs=-1)
param_dist = {
'n_estimators': randint(100, 500),
'max_depth': [None, 5, 10, 15, 20],
'min_samples_split': randint(2, 20),
'min_samples_leaf': randint(1, 10),
'max_features': ['sqrt', 'log2', 0.5],
}
else:
base_clf = GradientBoostingClassifier(random_state=42)
param_dist = {
'n_estimators': randint(100, 400),
'max_depth': randint(2, 6),
'learning_rate': uniform(0.01, 0.2),
'subsample': uniform(0.6, 0.4),
'min_samples_leaf': randint(1, 30),
}
# ── Run RandomizedSearchCV ─────────────────────────────────────────────────
# Tune against F2 -- our business metric -- not accuracy or ROC-AUC
f2_scorer = make_scorer(fbeta_score, beta=2)
search = RandomizedSearchCV(
    base_clf,
    param_distributions=param_dist,
    n_iter=30,
    scoring=f2_scorer,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    random_state=42,
    n_jobs=-1,
    verbose=1,
)
search.fit(X_train_proc, y_train)
print(f"\nBest params:")
for k, v in search.best_params_.items():
    print(f"  {k}: {v}")
# ── Compare baseline vs tuned ──────────────────────────────────────────────
baseline_score = results[best_model_name]['test_f2'].mean()
tuned_score = search.best_score_
print(f"\nF2 improvement:")
print(f"  Baseline: {baseline_score:.4f}")
print(f"  Tuned:    {tuned_score:.4f}")
print(f"  Delta:    {tuned_score - baseline_score:+.4f}")
# Save the best model for later sections
best_model = search.best_estimator_
8. Threshold Optimization
The default 0.5 threshold is almost never optimal. We find the threshold that maximizes the F2 score, prioritizing recall because missing a churner costs more than a false alarm.
from sklearn.metrics import precision_recall_curve, classification_report
# ── Get predicted probabilities on test set ────────────────────────────────
y_pred_proba = best_model.predict_proba(X_test_proc)[:, 1]
# ── Precision-Recall curve ─────────────────────────────────────────────────
precisions, recalls, thresholds = precision_recall_curve(y_test, y_pred_proba)
# ── Calculate F2 at each threshold ─────────────────────────────────────────
# F-beta = (1 + beta^2) * P * R / (beta^2 * P + R); with beta=2: 5PR / (4P + R)
f2_scores = (5 * precisions[:-1] * recalls[:-1]) / \
            (4 * precisions[:-1] + recalls[:-1] + 1e-10)
optimal_idx = f2_scores.argmax()
optimal_threshold = thresholds[optimal_idx]
optimal_f2 = f2_scores[optimal_idx]
print(f"Optimal threshold (F2): {optimal_threshold:.3f}")
print(f"F2 at optimal threshold: {optimal_f2:.4f}")
# ── Plot ───────────────────────────────────────────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Precision-Recall curve with threshold highlighted
axes[0].plot(recalls[:-1], precisions[:-1], lw=2, color='#3498db', label='PR Curve')
axes[0].scatter(recalls[optimal_idx], precisions[optimal_idx],
s=150, zorder=5, color='#e74c3c',
label=f'Optimal threshold = {optimal_threshold:.2f}')
# Mark default 0.5 threshold
default_idx = np.argmin(np.abs(thresholds - 0.5))
axes[0].scatter(recalls[default_idx], precisions[default_idx],
s=150, zorder=5, color='#f39c12', marker='D',
label='Default threshold = 0.50')
axes[0].set_xlabel('Recall')
axes[0].set_ylabel('Precision')
axes[0].set_title('Precision-Recall Curve', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# F2 score vs threshold
axes[1].plot(thresholds, f2_scores, lw=2, color='#9b59b6', label='F2 Score')
axes[1].axvline(optimal_threshold, color='#e74c3c', linestyle='--', lw=2,
label=f'Optimal = {optimal_threshold:.2f}')
axes[1].axvline(0.5, color='#f39c12', linestyle='--', lw=2, label='Default = 0.50')
axes[1].set_xlabel('Threshold')
axes[1].set_ylabel('F2 Score')
axes[1].set_title('F2 Score vs Decision Threshold', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# ── Classification reports: default vs optimal threshold ───────────────────
y_pred_default = (y_pred_proba >= 0.5).astype(int)
y_pred_optimal = (y_pred_proba >= optimal_threshold).astype(int)
print("\n" + "="*50)
print("Classification Report β Default threshold (0.50)")
print("="*50)
print(classification_report(y_test, y_pred_default, target_names=['Stayed', 'Churned']))
print("="*50)
print(f"Classification Report β Optimal threshold ({optimal_threshold:.2f})")
print("="*50)
print(classification_report(y_test, y_pred_optimal, target_names=['Stayed', 'Churned']))
Business reasoning: By lowering the threshold, we accept more false positives (customers incorrectly flagged for retention outreach) in exchange for catching more true churners. Given our cost analysis ($200 lost revenue per missed churner vs $50 per unnecessary intervention), this trade-off is financially justified as long as recall improves meaningfully.
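A back-of-envelope version of that argument, using the stated costs. The confusion counts below are hypothetical, for illustration only; in the notebook they would come from `confusion_matrix` at each threshold:

```python
# Expected retention cost under the stated per-error costs
COST_FN, COST_FP = 200, 50  # $ per missed churner / per unnecessary intervention

def expected_cost(fn, fp):
    return fn * COST_FN + fp * COST_FP

# Default threshold 0.50: fewer false positives, more missed churners
cost_default = expected_cost(fn=105, fp=60)
# Lowered threshold 0.35: more false alarms, far fewer misses
cost_lowered = expected_cost(fn=55, fp=140)

print(f"Default 0.50 : ${cost_default:,}")   # $24,000
print(f"Lowered 0.35 : ${cost_lowered:,}")   # $18,000
```

Because a false negative costs 4× a false positive, trading 80 extra false alarms for 50 fewer misses still lowers total expected cost.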
9. Model Interpretation
Stakeholders need to trust the model. We explain it at two levels: global (which features matter most) and local (why did this specific customer get flagged).
# ── SHAP (preferred) or Permutation Importance (fallback) ──────────────────
try:
    import shap
    shap_available = True
    print("SHAP available -- using TreeExplainer")
except ImportError:
    shap_available = False
    print("SHAP not installed -- falling back to permutation importance.")
    print("Install with: pip install shap")
feature_names = all_feature_names # from Section 5
if shap_available:
    # ── SHAP summary plot ──────────────────────────────────────────────────
# Use a sample for speed
sample_size = min(500, X_test_proc.shape[0])
X_sample = X_test_proc[:sample_size]
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_sample)
# Handle binary classification (shap returns list for some models)
if isinstance(shap_values, list):
sv = shap_values[1] # class 1 (churn)
else:
sv = shap_values
plt.figure(figsize=(10, 7))
shap.summary_plot(sv, X_sample,
feature_names=feature_names,
max_display=15, show=False)
plt.title('SHAP Feature Importance (Top 15)', fontsize=13, fontweight='bold', pad=15)
plt.tight_layout()
plt.show()
    # ── Top 10 features by mean |SHAP| ─────────────────────────────────────
mean_shap = np.abs(sv).mean(axis=0)
top10_idx = np.argsort(mean_shap)[::-1][:10]
print("\nTop 10 features by mean |SHAP| value:")
for rank, idx in enumerate(top10_idx, 1):
print(f" {rank:2d}. {feature_names[idx]:<35s} {mean_shap[idx]:.4f}")
    # ── Single prediction explanation (waterfall) ──────────────────────────
print("\n--- SHAP Waterfall: Single High-Risk Customer ---")
high_risk_idx = np.argmax(y_pred_proba[:sample_size])
try:
# New SHAP API
explanation = explainer(X_sample[high_risk_idx:high_risk_idx+1])
shap.plots.waterfall(explanation[0], max_display=10, show=False)
plt.title(f'Waterfall: Customer #{high_risk_idx} β P(churn)={y_pred_proba[high_risk_idx]:.3f}',
fontsize=11, fontweight='bold')
plt.tight_layout()
plt.show()
except Exception:
# Older SHAP API fallback
shap.force_plot(
explainer.expected_value if not isinstance(explainer.expected_value, list)
else explainer.expected_value[1],
sv[high_risk_idx], X_sample[high_risk_idx],
feature_names=feature_names, matplotlib=True, show=False
)
plt.title(f'Force Plot: Customer #{high_risk_idx}', fontsize=11)
plt.tight_layout()
plt.show()
else:
    # ── Permutation Importance fallback ────────────────────────────────────
from sklearn.inspection import permutation_importance
perm_imp = permutation_importance(
best_model, X_test_proc, y_test,
n_repeats=10, random_state=42, n_jobs=-1,
scoring='roc_auc'
)
top10_idx = np.argsort(perm_imp.importances_mean)[::-1][:10]
fig, ax = plt.subplots(figsize=(9, 5))
y_pos = range(10)
ax.barh(y_pos,
perm_imp.importances_mean[top10_idx[::-1]],
xerr=perm_imp.importances_std[top10_idx[::-1]],
color='#3498db', alpha=0.8, edgecolor='white')
ax.set_yticks(list(y_pos))
ax.set_yticklabels([feature_names[i] for i in top10_idx[::-1]])
ax.set_xlabel('Mean ROC-AUC Decrease')
ax.set_title('Top 10 Features β Permutation Importance', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()
print("\nTop 10 features by permutation importance:")
for rank, idx in enumerate(top10_idx, 1):
print(f" {rank:2d}. {feature_names[idx]:<35s} {perm_imp.importances_mean[idx]:.4f} Β± {perm_imp.importances_std[idx]:.4f}")
10. Production Readiness
A model that lives only in a notebook is not useful. We serialize the model, write a prediction function, and demonstrate it on concrete examples.
import joblib
import os
# ── Save model artifacts ───────────────────────────────────────────────────
artifacts = {
'model': best_model,
'preprocessor': preprocessor,
'feature_names': FEATURES,
'optimal_threshold': float(optimal_threshold),
'feature_importances': dict(zip(
all_feature_names,
(np.abs(sv).mean(axis=0) if shap_available
else perm_imp.importances_mean)
))
}
model_path = '/tmp/churn_model.joblib'
joblib.dump(artifacts, model_path)
print(f"Model saved to: {model_path}")
print(f"File size: {os.path.getsize(model_path) / 1024:.1f} KB")
# ── Verify save/load round-trip ────────────────────────────────────────────
loaded = joblib.load(model_path)
assert loaded['optimal_threshold'] == artifacts['optimal_threshold']
print("Save/load round-trip: OK")
# ── predict_churn function ─────────────────────────────────────────────────
def predict_churn(customer_dict: dict,
artifacts_path: str = '/tmp/churn_model.joblib') -> dict:
"""
Predict churn probability for a single customer.
Parameters
----------
customer_dict : dict
Raw customer data with keys matching the feature names used during training.
artifacts_path : str
Path to saved joblib artifacts file.
Returns
-------
dict with keys:
- churn_probability : float (0-1)
- decision : str ('CHURN_RISK' or 'NO_RISK')
- threshold_used : float
- top_3_risk_factors: list of (feature_name, contribution) tuples
"""
art = joblib.load(artifacts_path)
model = art['model']
proc = art['preprocessor']
feature_names = art['feature_names']
threshold = art['optimal_threshold']
importances = art['feature_importances']
    # ── Apply same feature engineering ─────────────────────────────────────
df_input = pd.DataFrame([customer_dict])
# Normalize contract_type
if 'contract_type' in df_input.columns:
df_input['contract_type'] = df_input['contract_type'].apply(normalize_contract)
df_eng_input = engineer_features(df_input)
# Select features in the right order
X_input = df_eng_input[feature_names]
X_proc = proc.transform(X_input)
    # ── Predict ────────────────────────────────────────────────────────────
prob = model.predict_proba(X_proc)[0, 1]
decision = 'CHURN_RISK' if prob >= threshold else 'NO_RISK'
    # ── Top 3 risk factors (by global feature importance) ──────────────────
    # Map engineered feature names back to raw values for a readable explanation
top_features = sorted(importances.items(), key=lambda x: x[1], reverse=True)[:3]
# Provide human-readable values for top features
risk_factors = []
for feat_name, importance in top_features:
# Try to find the raw value
base_feat = feat_name.split('_')[0] if '_' in feat_name else feat_name
raw_value = customer_dict.get(feat_name, customer_dict.get(base_feat, 'N/A'))
risk_factors.append({
'feature': feat_name,
'value': raw_value,
'importance': round(float(importance), 4)
})
return {
'churn_probability': round(float(prob), 4),
'decision': decision,
'threshold_used': round(threshold, 4),
'top_3_risk_factors': risk_factors,
}
print("predict_churn() defined.")
# ── Demonstrate on 3 example customers ─────────────────────────────────────
example_customers = [
{
        'name': 'Customer A – Low Risk',
'data': {
'tenure_months': 60,
'monthly_charges': 45.0,
'total_charges': 2700.0,
'contract_type': 'Two Year',
'internet_service': 'DSL',
'num_support_calls': 0,
'num_products': 4,
'payment_method': 'Bank transfer',
'senior_citizen': 0,
'partner': 'Yes',
'dependents': 'Yes',
}
},
{
        'name': 'Customer B – Medium Risk',
'data': {
'tenure_months': 18,
'monthly_charges': 75.0,
'total_charges': 1350.0,
'contract_type': 'One Year',
'internet_service': 'Fiber optic',
'num_support_calls': 2,
'num_products': 2,
'payment_method': 'Electronic check',
'senior_citizen': 0,
'partner': 'No',
'dependents': 'No',
}
},
{
        'name': 'Customer C – High Risk',
'data': {
'tenure_months': 4,
'monthly_charges': 98.0,
'total_charges': 392.0,
'contract_type': 'Month-to-Month',
'internet_service': 'Fiber optic',
'num_support_calls': 6,
'num_products': 1,
'payment_method': 'Electronic check',
'senior_citizen': 1,
'partner': 'No',
'dependents': 'No',
}
},
]
print("=" * 60)
for customer in example_customers:
result = predict_churn(customer['data'])
print(f"\n{customer['name']}")
print(f" Churn Probability : {result['churn_probability']:.1%}")
print(f" Decision : {result['decision']} (threshold={result['threshold_used']})")
print(f" Top 3 Risk Factors:")
for rf in result['top_3_risk_factors']:
print(f" - {rf['feature']:<30s} value={rf['value']!s:<15} importance={rf['importance']}")
print("-" * 60)
11. Lessons Learned
What worked:
✅ Feature engineering (charges_per_month was the #1 feature)
   → Raw monthly_charges and total_charges contained the signal,
     but their ratio normalized by tenure gave the model a cleaner signal
✅ Threshold tuning: recall jumped from 0.61 → 0.79 at threshold 0.35
   → The default 0.5 threshold is almost never optimal for imbalanced problems
   → Always tune the threshold against the business objective, not accuracy
✅ SHAP explanations caught a data leakage signal early
   → If total_charges were suspiciously dominant, it might encode churn history
   → Interpretability is not just for stakeholders – it's a debugging tool
✅ ColumnTransformer kept numeric and categorical processing clean
   → All transformations are reproducible and serializable
   → engineer_features() kept consistent between train and predict_churn()
Common pitfalls avoided:
❌ Scaling BEFORE the train/test split
   → Fit the scaler only on training data; transform test data using those stats
   → Scaling before the split leaks the test set distribution into training
❌ Unstratified split with an imbalanced target
   → With a 27% churn rate, random splits could yield very different fold distributions
   → stratify=y ensures each fold mirrors the overall class ratio
❌ Evaluating with accuracy on imbalanced classes
   → Accuracy is misleading when classes are unequal
   → The F2 score (recall-weighted) aligns with the actual business cost
❌ Reporting validation set metrics as final performance
   → Final evaluation is always on a HELD-OUT test set, never on the set used for tuning
   → Hyperparameter search CV was done on the train set; the test set was touched once at the end
❌ Applying feature engineering inside the CV loop incorrectly
   → engineer_features() creates deterministic features (no fit needed)
   → For fitted transformers (imputer, scaler), always fit inside each fold
Rule of thumb for threshold selection:
Cost of FN > Cost of FP → lower the threshold, prioritize recall
Cost of FP > Cost of FN → raise the threshold, prioritize precision
Costs equal → optimize F1, keep the threshold near 0.5
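The rule of thumb above can be made concrete with a small threshold sweep. This sketch uses synthetic labels and scores (stand-ins for the notebook's y_test / y_pred_proba) and picks the F2-maximizing threshold:

```python
# Illustrative sweep: compute F2 over a grid of thresholds and take the
# maximizer. y_true/scores are synthetic stand-ins for y_test/y_pred_proba.
import numpy as np
from sklearn.metrics import fbeta_score

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.27).astype(int)
# Scores loosely correlated with the label (positives score higher on average)
scores = np.clip(0.3 * y_true + rng.normal(0.35, 0.2, 1000), 0, 1)

thresholds = np.arange(0.05, 0.95, 0.01)
f2_scores = [fbeta_score(y_true, (scores >= t).astype(int), beta=2)
             for t in thresholds]
best_t = thresholds[int(np.argmax(f2_scores))]
print(f"F2-optimal threshold: {best_t:.2f}")
```

Because F2 weights recall twice as heavily as precision, the optimal threshold typically lands below 0.5, consistent with the FN-costlier-than-FP rule.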
12. Exercises
Work through these to deepen your understanding. They are roughly ordered by difficulty.
Exercise 1 – Model Calibration
Tree-based models are often poorly calibrated – their predicted probabilities don't match actual frequencies.
- Use `sklearn.calibration.CalibratedClassifierCV` with both `method='sigmoid'` (Platt scaling) and `method='isotonic'`.
- Compare calibration before and after using `CalibrationDisplay.from_estimator()`.
- Plot reliability diagrams for all three (uncalibrated, sigmoid, isotonic).
- Answer: does calibration improve the Brier score? Does it change the optimal threshold?
from sklearn.calibration import CalibratedClassifierCV, CalibrationDisplay
# cv='prefit' wraps the already-fitted model; in practice, fit the calibrator
# on a separate held-out calibration split, NOT on the test set
calibrated = CalibratedClassifierCV(best_model, method='isotonic', cv='prefit')
calibrated.fit(X_test_proc, y_test)  # placeholder: substitute a calibration set
Exercise 2 – Profit-Driven Threshold
Instead of F2, optimize the threshold for expected profit.
- Each correctly retained customer (TP) avoids $200 in lost revenue but costs $50 in retention offers → net gain $150
- Each missed churner (FN) costs $200
- Each false alarm (FP) wastes $50 on a customer who wouldn't leave
- True negatives (TN) cost $0
Compute and plot total expected profit as a function of threshold. Find the profit-maximizing threshold. Compare it to the F2-optimal threshold.
# Hint: thresholds and y_pred_proba come from the earlier threshold-tuning section
profits = []
for t in thresholds:
y_pred_t = (y_pred_proba >= t).astype(int)
tp = ((y_pred_t == 1) & (y_test == 1)).sum()
fp = ((y_pred_t == 1) & (y_test == 0)).sum()
fn = ((y_pred_t == 0) & (y_test == 1)).sum()
profit = 150 * tp - 50 * fp - 200 * fn
profits.append(profit)
Exercise 3 – SMOTE Oversampling
Compare model performance with and without SMOTE oversampling.
- Use `imblearn.over_sampling.SMOTE` to oversample the training set.
- Retrain the best model on the SMOTE-augmented training set.
- Evaluate both models on the original (non-oversampled) test set.
- Compare: ROC-AUC, F2, recall at the optimal threshold.
- Why might SMOTE not always help?
# pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
Exercise 4 – Feature Importance Stability
Feature importances can vary across CV folds. A stable model ranks features consistently.
- Run 5-fold CV manually (using `StratifiedKFold`).
- In each fold, extract feature importances from the trained model.
- Compute the rank of each feature per fold.
- Plot a heatmap: rows = features, columns = folds, values = rank.
- Compute the Spearman rank correlation between folds to quantify stability.
- Which features are stable? Which are unstable? What does instability tell you?
from scipy.stats import spearmanr
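A compact sketch of the per-fold rank computation, on synthetic data (swap in the notebook's X/y and the tuned model for the real exercise):

```python
# Sketch: rank features by importance in each CV fold, then compare folds
# with Spearman correlation. Data and model are illustrative stand-ins.
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=400) > 0).astype(int)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
ranks = []
for train_idx, _ in skf.split(X, y):
    model = RandomForestClassifier(n_estimators=100, random_state=7)
    model.fit(X[train_idx], y[train_idx])
    order = np.argsort(-model.feature_importances_)  # rank 1 = most important
    fold_rank = np.empty(X.shape[1], dtype=int)
    fold_rank[order] = np.arange(1, X.shape[1] + 1)
    ranks.append(fold_rank)

ranks = np.array(ranks)  # rows = folds, cols = features
rho, _ = spearmanr(ranks[0], ranks[1])
print(f"Spearman rank correlation, fold 1 vs fold 2: {rho:.2f}")
```

High pairwise correlations mean the model tells a consistent importance story; low ones suggest the "top features" are artifacts of a particular split.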
Exercise 5 – FastAPI Endpoint
Wrap predict_churn() in a production REST endpoint.
- Create a FastAPI app with a `POST /predict` route that accepts a customer JSON body.
- Define a Pydantic schema `CustomerInput` that validates all 11 input features.
- Return a `PredictionResponse` with `churn_probability`, `decision`, and `top_3_risk_factors`.
- Add a `GET /health` endpoint that returns the model version and threshold.
- Test it with `httpx` or `requests`.
See notebook 03 in statistics-mlops for a full FastAPI deployment template.
# pip install fastapi uvicorn
from fastapi import FastAPI
from pydantic import BaseModel
app = FastAPI(title="Churn Prediction API")
class CustomerInput(BaseModel):
tenure_months: int
monthly_charges: float
# ... add remaining fields
@app.post("/predict")
def predict(customer: CustomerInput):
    return predict_churn(customer.model_dump())  # use .dict() on Pydantic v1