Causal Inference: Moving Beyond Correlation
Correlation tells you what. Causation tells you what to do. This notebook covers the Potential Outcomes framework, observational study pitfalls, propensity score matching, and instrumental variables: the statistical tools for making causal claims from non-experimental data.
Setup
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg') # non-interactive backend
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)
print('Libraries loaded.')
1. The Fundamental Problem of Causal Inference
Potential Outcomes Framework (Rubin):
For each unit \(i\), define:
\(Y_i(1)\) = outcome if unit receives treatment
\(Y_i(0)\) = outcome if unit does NOT receive treatment
The individual treatment effect: \(\tau_i = Y_i(1) - Y_i(0)\)
The fundamental problem: we only ever observe ONE of \(Y_i(1)\) or \(Y_i(0)\). The other is the counterfactual: it never happened.
We can estimate the Average Treatment Effect (ATE): \(\tau = \mathbb{E}[Y(1) - Y(0)]\) under certain assumptions.
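A four-unit toy table (hypothetical numbers, purely for illustration) makes the problem concrete: with both potential-outcome columns visible we can compute every individual effect, but real data only ever reveal the column matching each unit's actual treatment.

```python
import numpy as np
import pandas as pd

# Hypothetical potential-outcomes table: both columns visible
# (a luxury real data never grants)
toy = pd.DataFrame({
    'Y1': [1, 1, 0, 1],   # outcome if treated
    'Y0': [0, 1, 0, 0],   # outcome if untreated
    'T':  [1, 0, 1, 0],   # treatment actually received
})
toy['tau_i'] = toy['Y1'] - toy['Y0']                          # individual effects
toy['Y_obs'] = np.where(toy['T'] == 1, toy['Y1'], toy['Y0'])  # all we would observe

print(toy)
print('True ATE (needs both columns):', toy['tau_i'].mean())  # 0.5
naive = toy.loc[toy['T'] == 1, 'Y_obs'].mean() - toy.loc[toy['T'] == 0, 'Y_obs'].mean()
print('Naive observed difference    :', naive)                # 0.0, masking the true 0.5
```

In this toy table the naive treated-vs-control comparison comes out to exactly zero even though every assumption needed to compute the true ATE of 0.5 is sitting in plain sight; only access to both columns reveals it.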
# Synthetic dataset: does a marketing email CAUSE higher purchase rates?
# Confounder: user engagement level (highly engaged users open emails AND buy more)
N = 2000
rng = np.random.default_rng(42)
# Confounder: engagement score (0-10)
engagement = rng.uniform(0, 10, N)
# True causal effect of email: +0.05 purchase probability
TRUE_EFFECT = 0.05
# Treatment assignment: biased by engagement (not random!)
# Higher engagement -> more likely to receive email
p_treat = 1 / (1 + np.exp(-(0.5 * engagement - 2.5)))
treatment = rng.binomial(1, p_treat).astype(float)
# Potential outcomes
# Purchase probability depends on engagement (confounder) + treatment effect
p_purchase_0 = 1 / (1 + np.exp(-(0.4 * engagement - 2.0))) # Y(0)
p_purchase_1 = np.clip(p_purchase_0 + TRUE_EFFECT, 0, 1) # Y(1)
# Observed outcome: Y = Y(T)
purchase = np.where(treatment == 1,
rng.binomial(1, p_purchase_1),
rng.binomial(1, p_purchase_0)).astype(float)
df = pd.DataFrame({
'engagement' : engagement,
'treatment' : treatment,
'purchase' : purchase,
'p_Y0' : p_purchase_0,
'p_Y1' : p_purchase_1,
})
print(f'Samples: {N}')
print(f'Treatment rate: {treatment.mean():.2f}')
print(f'Purchase rate (treated): {purchase[treatment==1].mean():.3f}')
print(f'Purchase rate (control): {purchase[treatment==0].mean():.3f}')
print(f'Naive ATE (biased): {purchase[treatment==1].mean() - purchase[treatment==0].mean():.3f}')
print(f'TRUE causal effect (ATE): {TRUE_EFFECT:.3f}')
The naive estimate is inflated because treated users (who received the email) were already more engaged, so they would have bought more anyway. This is confounding.
2. Confounding and Simpson's Paradox
Simpson's Paradox: a trend that appears in several groups reverses when the groups are combined. This happens when a confounder determines both group membership and the outcome.
# Simpson's Paradox example
# Question: does Drug X cure patients?
# Confounder: disease severity (mild vs severe)
data_simpson = pd.DataFrame({
    'group'       : ['Drug', 'Drug', 'No Drug', 'No Drug'],
    'severity'    : ['Mild', 'Severe', 'Mild', 'Severe'],
    'n_recovered' : [81, 192, 234, 55],   # patients who recovered
    'n_total'     : [87, 263, 270, 80],
})
data_simpson['recovery_rate'] = data_simpson['n_recovered'] / data_simpson['n_total']
print('Recovery rates BY SEVERITY GROUP:')
print(data_simpson[['group', 'severity', 'n_recovered', 'n_total', 'recovery_rate']].to_string(index=False))
# Aggregate (ignoring severity)
drug_total = data_simpson[data_simpson['group'] == 'Drug'][['n_recovered', 'n_total']].sum()
nodrug_total = data_simpson[data_simpson['group'] == 'No Drug'][['n_recovered', 'n_total']].sum()
drug_rate = drug_total['n_recovered'] / drug_total['n_total']
nodrug_rate = nodrug_total['n_recovered'] / nodrug_total['n_total']
print(f'\nAGGREGATED (ignoring severity):')
print(f' Drug: {drug_rate:.3f}')
print(f' No Drug: {nodrug_rate:.3f}')
print(f' Naive conclusion: Drug is WORSE? {drug_rate < nodrug_rate}')
print('\nWHY: Drug was given to sicker patients (higher severity). Controlling for severity,')
print("     the drug HELPS in both groups: classic Simpson's Paradox!")
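The severity-specific rates can be combined into one adjusted comparison by direct standardization: weight each stratum's rate by the overall share of patients in that stratum, using the same weights for both arms. A sketch with the same counts (the column is named n_recovered here since it counts recoveries):

```python
import pandas as pd

d = pd.DataFrame({
    'group'      : ['Drug', 'Drug', 'No Drug', 'No Drug'],
    'severity'   : ['Mild', 'Severe', 'Mild', 'Severe'],
    'n_recovered': [81, 192, 234, 55],
    'n_total'    : [87, 263, 270, 80],
})
d['rate'] = d['n_recovered'] / d['n_total']

# Common weights: each severity stratum's share of ALL patients
w = d.groupby('severity')['n_total'].sum() / d['n_total'].sum()

# Severity-standardized recovery rate per arm: sum over strata of weight * rate
rates = d.pivot(index='severity', columns='group', values='rate')
adjusted = rates.mul(w, axis=0).sum()
print(adjusted)  # Drug ~0.83 beats No Drug ~0.78 once severity is held fixed
```

With a common severity mix, the aggregated reversal disappears: the drug wins, matching the within-stratum comparison.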
3. Directed Acyclic Graphs (DAGs): When to Control for a Variable
DAGs help us decide which variables to include in our model. Three key patterns:
FORK (confounder):       CHAIN (mediator):      COLLIDER (inverted fork):
  Z -> T,  Z -> Y          T -> Z -> Y            T -> Z <- Y
Control for Z? YES       Control for Z? NO      Control for Z? NO!
(blocks backdoor)        (blocks the effect)    (opens spurious path)
# Our DAG: Engagement -> Treatment -> Purchase
#          Engagement -> Purchase
# Engagement is a FORK / confounder -> must control for it
# Collider example: conditioning on a collider creates spurious correlation
rng2 = np.random.default_rng(10)
n = 1000
# T and Y are independent
T_collider = rng2.normal(0, 1, n) # treatment (cause 1)
Y_collider = rng2.normal(0, 1, n) # outcome (cause 2)
Z_collider = T_collider + Y_collider + rng2.normal(0, 0.5, n) # collider: caused by both
# Without conditioning on Z: T and Y are independent
corr_unconditional = np.corrcoef(T_collider, Y_collider)[0, 1]
# Conditioning on Z near 0 (selecting |Z| < 0.3) creates spurious correlation
mask_z = np.abs(Z_collider) < 0.3
corr_conditional = np.corrcoef(T_collider[mask_z], Y_collider[mask_z])[0, 1]
print('Collider bias demonstration:')
print(f' Corr(T, Y) without conditioning on Z : {corr_unconditional:.4f} (near zero: independent)')
print(f' Corr(T, Y) conditioning on Z near 0  : {corr_conditional:.4f} (spurious correlation!)')
print('\nLesson: NEVER condition on a collider; doing so opens a spurious non-causal path between its causes.')
print('\nOur marketing DAG:')
print(' Engagement (Z) --> Treatment (T) --> Purchase (Y)')
print(' Engagement (Z) -----------------> Purchase (Y)')
print(' Z is a confounder -> must include it in our model')
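The collider demo above has a mirror image worth seeing once: with a fork (Z causes both T and Y), T and Y are spuriously correlated until we condition on Z. A small sketch with a hypothetical linear model in which T has no effect on Y at all:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000
Z = rng.normal(0, 1, n)          # common cause (the fork)
T = Z + rng.normal(0, 1, n)      # Z -> T
Y = Z + rng.normal(0, 1, n)      # Z -> Y   (T does NOT cause Y)

corr_raw = np.corrcoef(T, Y)[0, 1]          # spurious, roughly 0.5

# Condition on Z by residualizing both variables on Z (linear adjustment)
T_res = T - np.polyval(np.polyfit(Z, T, 1), Z)
Y_res = Y - np.polyval(np.polyfit(Z, Y, 1), Z)
corr_adj = np.corrcoef(T_res, Y_res)[0, 1]  # near zero: backdoor blocked

print(f'Corr(T, Y) ignoring Z    : {corr_raw:.3f}')
print(f'Corr(T, Y) controlling Z : {corr_adj:.3f}')
```

This is exactly why engagement must go into the model, while a collider must stay out: adjustment removes a fork's spurious correlation and creates a collider's.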
4. Propensity Score Matching (PSM)
Propensity score: \(e(X) = P(T=1 | X)\), the probability of receiving treatment given covariates.
Idea: Match each treated unit with a control unit that has a similar propensity score. After matching, treatment assignment is "as-good-as-random" conditional on the score.
Steps:
Estimate propensity scores with logistic regression
Match treated to controls with similar scores (nearest-neighbor)
Compare outcomes in matched sample
from sklearn.neighbors import NearestNeighbors
# Step 1: Estimate propensity scores
X = df[['engagement']].values
T = df['treatment'].values
Y = df['purchase'].values
ps_model = LogisticRegression(random_state=42)
ps_model.fit(X, T)
propensity_scores = ps_model.predict_proba(X)[:, 1]
df['propensity_score'] = propensity_scores
# Step 2: Nearest-neighbor matching (1:1, with replacement: a control can be reused for several treated units)
treated_idx = np.where(T == 1)[0]
control_idx = np.where(T == 0)[0]
ps_control = propensity_scores[control_idx].reshape(-1, 1)
ps_treated = propensity_scores[treated_idx].reshape(-1, 1)
nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
nn.fit(ps_control)
distances, matched_control_pos = nn.kneighbors(ps_treated)
matched_control_idx = control_idx[matched_control_pos.flatten()]
# Step 3: Estimate ATE on matched sample
Y_treated = Y[treated_idx]
Y_control = Y[matched_control_idx]
psm_ate = (Y_treated - Y_control).mean()
# Covariate balance check
# (1:1 matching keeps every treated unit, so the treated mean is unchanged;
#  only the control pool changes after matching)
eng_treated_before = df.loc[treated_idx, 'engagement'].mean()
eng_control_before = df.loc[control_idx, 'engagement'].mean()
eng_treated_after = eng_treated_before  # identical by construction
eng_control_after = df.loc[matched_control_idx, 'engagement'].mean()
print('=== Propensity Score Matching Results ===')
print(f'Naive ATE (no adjustment) : {Y[T==1].mean() - Y[T==0].mean():.4f} (biased upward)')
print(f'PSM ATE : {psm_ate:.4f}')
print(f'TRUE causal effect : {TRUE_EFFECT:.4f}')
print(f'\nCovariate balance (engagement):')
print(f'  Before matching - treated: {eng_treated_before:.2f}, control: {eng_control_before:.2f}')
print(f'  After matching  - treated: {eng_treated_after:.2f}, control: {eng_control_after:.2f}')
print(f'  Max PS matching distance : {distances.max():.4f} (a caliper would cap this)')
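Comparing raw means works here because there is a single covariate; with many covariates the standard balance diagnostic is the standardized mean difference (SMD), the mean gap divided by the pooled standard deviation, with |SMD| < 0.1 as a common rule of thumb. A self-contained sketch on made-up numbers:

```python
import numpy as np

def smd(x_treated, x_control):
    """Standardized mean difference with pooled standard deviation."""
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2)
    return (x_treated.mean() - x_control.mean()) / pooled_sd

rng = np.random.default_rng(0)
# Hypothetical covariate, badly imbalanced before matching
x_control = rng.normal(4.0, 2.0, 800)
x_treated = rng.normal(6.0, 2.0, 500)
smd_before = smd(x_treated, x_control)
print(f'SMD before matching: {smd_before:.2f}')   # roughly 1.0

# A well-matched control pool should bring the SMD near zero
x_matched = rng.normal(6.0, 2.0, 500)
smd_after = smd(x_treated, x_matched)
print(f'SMD after matching : {smd_after:.2f}')
```

Because the SMD is scale-free, the same 0.1 threshold can be applied to every covariate in a balance table regardless of units.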
5. Inverse Probability Weighting (IPW)
Instead of discarding unmatched units, reweight every observation by the inverse probability of the treatment status it actually received:
Treated units with a low propensity score get upweighted (they were unlikely to be treated, so each one stands in for many similar untreated-looking users); symmetrically, control units with a high score get upweighted.
e = propensity_scores
# Clip propensity scores to avoid extreme weights
e_clipped = np.clip(e, 0.01, 0.99)
# IPW estimator
ipw_ate = np.mean(T * Y / e_clipped - (1 - T) * Y / (1 - e_clipped))
# Stabilized IPW (normalize weights)
w_treated = T / e_clipped
w_control = (1 - T) / (1 - e_clipped)
sipw_ate = (np.sum(w_treated * Y) / np.sum(w_treated) -
np.sum(w_control * Y) / np.sum(w_control))
print('=== Inverse Probability Weighting ===')
print(f'IPW ATE (unstabilized) : {ipw_ate:.4f}')
print(f'IPW ATE (stabilized) : {sipw_ate:.4f}')
print(f'TRUE causal effect : {TRUE_EFFECT:.4f}')
# Show effective sample sizes
ess_treated = np.sum(w_treated)**2 / np.sum(w_treated**2)
ess_control = np.sum(w_control)**2 / np.sum(w_control**2)
print(f'\nEffective sample sizes:')
print(f' Treated n={T.sum():.0f} ESS={ess_treated:.0f}')
print(f' Control n={(1-T).sum():.0f} ESS={ess_control:.0f}')
6. Double Machine Learning (DML)
DML (Chernozhukov et al. 2018) is a modern approach that:
Uses ML to partial out the effect of confounders from both outcome and treatment
Then regresses the residuals to get the causal effect
This is Robinson's (1988) partially linear model estimated with ML:
\(Y = \tau T + g(X) + U\), \(\quad T = e(X) + V\), with \(\mathbb{E}[U \mid X, T] = 0\) and \(\mathbb{E}[V \mid X] = 0\).
Procedure:
Fit \(\hat{g}(X) = E[Y | X]\) with ML, then compute \(\tilde{Y} = Y - \hat{g}(X)\)
Fit \(\hat{e}(X) = E[T | X]\) with ML, then compute \(\tilde{T} = T - \hat{e}(X)\)
Regress \(\tilde{Y}\) on \(\tilde{T}\): the coefficient is \(\hat{\tau}_{DML}\)
Use cross-fitting to avoid overfitting bias.
from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
def double_ml(X, T, Y, n_splits=5, random_state=42):
    """
    Double Machine Learning with cross-fitting.
    Returns (ATE estimate, bootstrap standard error).
    """
    n = len(Y)
    Y_res = np.zeros(n)  # residuals: Y - E[Y|X]
    T_res = np.zeros(n)  # residuals: T - E[T|X]
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    for train_idx, test_idx in kf.split(X):
        X_tr, X_te = X[train_idx], X[test_idx]
        Y_tr, T_tr = Y[train_idx], T[train_idx]
        # Model 1: predict Y from X (outcome model)
        m_y = GradientBoostingRegressor(n_estimators=100, random_state=random_state)
        m_y.fit(X_tr, Y_tr)
        Y_res[test_idx] = Y[test_idx] - m_y.predict(X_te)
        # Model 2: predict T from X (propensity model)
        m_t = GradientBoostingClassifier(n_estimators=100, random_state=random_state)
        m_t.fit(X_tr, T_tr)
        T_res[test_idx] = T[test_idx] - m_t.predict_proba(X_te)[:, 1]
    # Final OLS on residuals: the slope is the ATE
    dml_model = LinearRegression(fit_intercept=False)
    dml_model.fit(T_res.reshape(-1, 1), Y_res)
    ate = dml_model.coef_[0]
    # Bootstrap SE of the final stage (ignores nuisance-estimation uncertainty)
    rng_b = np.random.default_rng(0)
    boot_ates = []
    for _ in range(200):
        idx = rng_b.integers(0, n, n)
        lr = LinearRegression(fit_intercept=False)
        lr.fit(T_res[idx].reshape(-1, 1), Y_res[idx])
        boot_ates.append(lr.coef_[0])
    se = np.std(boot_ates)
    return ate, se
print('Running Double Machine Learning (5-fold cross-fitting)...')
dml_ate, dml_se = double_ml(X, T, Y)
print(f'\nDML ATE   : {dml_ate:.4f} ± {1.96*dml_se:.4f} (95% CI)')
print(f'True ATE : {TRUE_EFFECT:.4f}')
print(f'Naive ATE : {Y[T==1].mean() - Y[T==0].mean():.4f}')
print(f'PSM ATE : {psm_ate:.4f}')
7. Sensitivity Analysis: Robustness to Unmeasured Confounders
Question: How large would an unmeasured confounder need to be to explain away our estimated effect?
We use Rosenbaum's sensitivity analysis concept (simplified): vary the assumed odds ratio \(\Gamma\) for hidden bias. If results hold for large \(\Gamma\), the study is robust.
# Simplified sensitivity analysis: E-value
# E-value: minimum strength of association (on risk ratio scale) that an
# unmeasured confounder would need to have with BOTH treatment and outcome
# to fully explain away the observed effect.
def e_value(rr_observed):
    """
    E-value for a risk ratio.
    VanderWeele & Ding (2017).
    """
    if rr_observed < 1:  # flip protective ratios so RR >= 1
        rr_observed = 1 / rr_observed
    return rr_observed + np.sqrt(rr_observed * (rr_observed - 1))
# Convert ATE (probability difference) to risk ratio
baseline_risk = Y[T == 0].mean()
treated_risk = Y[T == 1].mean()
dml_risk = baseline_risk + dml_ate
rr_naive = treated_risk / baseline_risk
rr_dml = dml_risk / baseline_risk
ev_naive = e_value(rr_naive)
ev_dml = e_value(rr_dml)
print('=== Sensitivity Analysis (E-values) ===')
print(f'Baseline purchase rate (control): {baseline_risk:.3f}')
print(f'Observed risk ratio (naive): {rr_naive:.3f} -> E-value: {ev_naive:.3f}')
print(f'DML-adjusted risk ratio:     {rr_dml:.3f} -> E-value: {ev_dml:.3f}')
print()
print('Interpretation: an unmeasured confounder would need an association of')
print(f'{ev_dml:.2f}x with BOTH treatment and outcome (on RR scale) to explain away')
print('the DML estimate. The higher the E-value, the more robust our result.')
# Gamma-sensitivity (Rosenbaum bounds concept)
from scipy.stats import binom
print('\nRosenbaum Gamma-sensitivity (how much hidden bias can our matched study tolerate?):')
print('  Gamma=1.0 means no hidden bias (randomized-trial equivalent)')
print('  Gamma=1.5 means hidden bias could make one unit in a matched pair up to')
print('  1.5x more likely to be treated than its match')
gammas = [1.0, 1.2, 1.5, 2.0, 3.0]
diffs = Y_treated - Y_control
n_disc = (diffs != 0).sum()  # the sign test uses discordant pairs only (ties dropped)
n_pos = (diffs > 0).sum()
for g in gammas:
    # Under Gamma = g, the worst-case probability that the treated unit of a
    # discordant pair is the one with the higher outcome is g / (1 + g)
    p_upper = g / (1 + g)
    p_val = 1 - binom.cdf(n_pos - 1, n_disc, p_upper)
    sig = 'significant' if p_val < 0.05 else 'not significant'
    print(f'  Gamma={g:.1f}  max p-value={p_val:.4f}  -> {sig}')
8. Cheat Sheet + Exercises
Causal Inference Cheat Sheet
| Method | Assumption | When to Use | Code |
|---|---|---|---|
| Naive comparison | No confounders | RCT only | Y[T==1].mean() - Y[T==0].mean() |
| OLS regression | Linear effects, no unmeasured confounders | Baseline | sm.OLS(Y, [1, T, X]).fit() |
| PSM | No unmeasured confounders, overlap | Observational, low-dim X | LogReg -> NearestNeighbors |
| IPW | No unmeasured confounders, overlap | When you want all units | T*Y/e - (1-T)*Y/(1-e) |
| DML | Partially linear model, no unmeasured confounders | High-dim X, non-linear confounding | Cross-fit ML residuals -> OLS |
| IV | Exclusion restriction | When unmeasured confounders exist | 2SLS |
# DML in 5 lines (conceptually)
Y_res = Y - ML_model.predict(X) # residualize outcome
T_res = T - PS_model.predict_proba(X)[:,1] # residualize treatment
ate = LinearRegression().fit(T_res.reshape(-1,1), Y_res).coef_[0]
# E-value
ev = rr + sqrt(rr * (rr - 1)) # sensitivity to unmeasured confounders
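The IV / 2SLS row is the only one with no code anywhere above, so here is a minimal sketch on simulated data with hypothetical coefficients (exercise 4 asks for the binary-compliance version). The instrument Z moves T but touches Y only through T, so two stacked regressions recover the effect even though the confounder U is unmeasured:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 4000
U = rng.normal(0, 1, n)                     # unmeasured confounder
Z = rng.binomial(1, 0.5, n).astype(float)   # instrument: randomized, affects T only
T = 0.8 * Z + U + rng.normal(0, 1, n)       # first stage: Z shifts T; U confounds T
Y = 2.0 * T + 3.0 * U + rng.normal(0, 1, n) # true effect of T on Y is 2.0

# Stage 1: predict T from Z; Stage 2: regress Y on predicted T
# (point estimate only; naive second-stage standard errors are wrong)
T_hat = LinearRegression().fit(Z.reshape(-1, 1), T).predict(Z.reshape(-1, 1))
iv_effect = LinearRegression().fit(T_hat.reshape(-1, 1), Y).coef_[0]

ols_effect = LinearRegression().fit(T.reshape(-1, 1), Y).coef_[0]  # confounded
print(f'OLS (confounded): {ols_effect:.2f}   2SLS: {iv_effect:.2f}   true: 2.0')
```

Note what 2SLS buys that nothing above can: it works even when the confounder is unmeasured, at the price of the exclusion restriction (Z affects Y only through T), which is untestable.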
Exercises
1. Multiple confounders: Extend the synthetic dataset to have two confounders: engagement (as before) and account_age (older accounts are more likely to receive emails AND buy more). Re-run PSM and DML including both confounders. Does adding account_age change the estimated ATE?
2. Doubly robust estimator: Implement the Augmented IPW (AIPW) estimator, which is consistent if either the propensity model or the outcome model is correctly specified: AIPW = IPW_term + regression_adjustment. Compare its variance to plain IPW.
3. CATE (heterogeneous effects): Instead of an average effect, estimate the conditional ATE using the DML framework with sklearn's GradientBoostingRegressor as the final stage. Plot CATE vs engagement score: do high-engagement users respond more to the email?
4. Instrumental variable: Suppose A/B test assignment was randomized (coin flip Z) but compliance was imperfect (some users ignored the email despite being assigned). Implement 2SLS using Z as the instrument for T to estimate the LATE (Local Average Treatment Effect for compliers).
5. Overlap (positivity) check: Propensity score methods break down when treated and control groups have no overlap. Plot the distribution of propensity scores for treated and control groups. Trim the sample to only units with 0.1 < e(X) < 0.9 and re-estimate the ATE. Does trimming change the estimate?