Causal Inference: Moving Beyond Correlation

Correlation tells you what. Causation tells you what to do. This notebook covers the Potential Outcomes framework, observational study pitfalls, propensity score matching, and instrumental variables: the statistical tools for making causal claims from non-experimental data.

Setup

import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
print('Libraries loaded.')

1. The Fundamental Problem of Causal Inference

Potential Outcomes Framework (Rubin):

For each unit \(i\), define:

  • \(Y_i(1)\) = outcome if unit receives treatment

  • \(Y_i(0)\) = outcome if unit does NOT receive treatment

The individual treatment effect: \(\tau_i = Y_i(1) - Y_i(0)\)

The fundamental problem: we only ever observe ONE of \(Y_i(1)\) or \(Y_i(0)\). The other is counterfactual; it never happened.

We can estimate the Average Treatment Effect (ATE): \(\tau = \mathbb{E}[Y(1) - Y(0)]\), under assumptions such as unconfoundedness (treatment is as good as random given the observed covariates) and overlap (every unit has a positive probability of either treatment status).
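To see the missing-data structure concretely, here is a tiny self-contained sketch (toy numbers, unrelated to the dataset built below); for each unit the counterfactual column is NaN:

```python
import numpy as np
import pandas as pd

rng_toy = np.random.default_rng(0)
n_toy = 6
toy_Y0 = rng_toy.integers(0, 2, n_toy).astype(float)            # Y(0)
toy_Y1 = np.clip(toy_Y0 + rng_toy.integers(0, 2, n_toy), 0, 1)  # Y(1)
toy_T  = rng_toy.integers(0, 2, n_toy)                          # actual assignment

table = pd.DataFrame({'T': toy_T, 'Y(0)': toy_Y0, 'Y(1)': toy_Y1})
# Each unit reveals only Y(T); the other potential outcome is counterfactual
table.loc[table['T'] == 1, 'Y(0)'] = np.nan
table.loc[table['T'] == 0, 'Y(1)'] = np.nan
print(table)
print('Missing potential outcomes per unit:',
      table[['Y(0)', 'Y(1)']].isna().sum(axis=1).tolist())
```

Every row has exactly one NaN, which is why the individual effect \(\tau_i\) is never observable and we settle for averages.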

# Synthetic dataset: does a marketing email CAUSE higher purchase rates?
# Confounder: user engagement level (highly engaged users open emails AND buy more)

N = 2000
rng = np.random.default_rng(42)

# Confounder: engagement score (0-10)
engagement = rng.uniform(0, 10, N)

# True causal effect of email: +0.05 purchase probability
TRUE_EFFECT = 0.05

# Treatment assignment: biased by engagement (not random!)
# Higher engagement → more likely to receive email
p_treat = 1 / (1 + np.exp(-(0.5 * engagement - 2.5)))
treatment = rng.binomial(1, p_treat).astype(float)

# Potential outcomes
# Purchase probability depends on engagement (confounder) + treatment effect
p_purchase_0 = 1 / (1 + np.exp(-(0.4 * engagement - 2.0)))          # Y(0)
p_purchase_1 = np.clip(p_purchase_0 + TRUE_EFFECT, 0, 1)             # Y(1)

# Observed outcome: Y = Y(T)
purchase = np.where(treatment == 1,
                    rng.binomial(1, p_purchase_1),
                    rng.binomial(1, p_purchase_0)).astype(float)

df = pd.DataFrame({
    'engagement' : engagement,
    'treatment'  : treatment,
    'purchase'   : purchase,
    'p_Y0'       : p_purchase_0,
    'p_Y1'       : p_purchase_1,
})

print(f'Samples: {N}')
print(f'Treatment rate: {treatment.mean():.2f}')
print(f'Purchase rate (treated):   {purchase[treatment==1].mean():.3f}')
print(f'Purchase rate (control):   {purchase[treatment==0].mean():.3f}')
print(f'Naive ATE (biased):        {purchase[treatment==1].mean() - purchase[treatment==0].mean():.3f}')
print(f'TRUE causal effect (ATE):  {TRUE_EFFECT:.3f}')

The naive estimate is inflated because treated users (who received the email) were already more engaged; they would have bought more anyway. This is confounding.
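Since the confounder is observed here, even simple stratification repairs the naive estimate: compare treated vs control within narrow engagement bins, then average the within-bin differences. A self-contained sketch that re-simulates the same data-generating process (the `_s` suffixed names are mine, chosen to avoid clobbering the notebook's variables):

```python
import numpy as np
import pandas as pd

rng_s = np.random.default_rng(7)
N_s = 200_000                     # large sample so the stratified estimate is stable
eng_s = rng_s.uniform(0, 10, N_s)
T_s = rng_s.binomial(1, 1 / (1 + np.exp(-(0.5 * eng_s - 2.5))))
p0_s = 1 / (1 + np.exp(-(0.4 * eng_s - 2.0)))
Y_s = np.where(T_s == 1, rng_s.binomial(1, np.clip(p0_s + 0.05, 0, 1)),
                         rng_s.binomial(1, p0_s))

d_s = pd.DataFrame({'T': T_s, 'Y': Y_s, 'bin': pd.cut(eng_s, bins=20, labels=False)})
naive_s = d_s.loc[d_s['T'] == 1, 'Y'].mean() - d_s.loc[d_s['T'] == 0, 'Y'].mean()

# Within each narrow bin, treatment is nearly unconfounded; average the
# bin-level treated-vs-control contrasts, weighted by bin size
rates = d_s.groupby(['bin', 'T'])['Y'].mean().unstack()
weights = d_s['bin'].value_counts(normalize=True).sort_index()
stratified_s = ((rates[1] - rates[0]) * weights).sum()

print(f'Naive ATE      : {naive_s:.3f}')
print(f'Stratified ATE : {stratified_s:.3f}  (true effect: 0.050)')
```

This is the crudest form of "controlling for" a confounder; the sections below do the same job with more finesse.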

2. Confounding and Simpson's Paradox

Simpson's Paradox: a trend that appears in several groups reverses when the groups are combined. This happens when a confounder determines both group membership and the outcome.

# Simpson's Paradox example
# Question: does Drug X cure patients?
# Confounder: disease severity (mild vs severe)

data_simpson = pd.DataFrame({
    'group'    : ['Drug', 'Drug', 'No Drug', 'No Drug'],
    'severity' : ['Mild', 'Severe', 'Mild', 'Severe'],
    'n_treated': [81, 192, 234, 55],
    'n_total'  : [87, 263, 270, 80],
})
data_simpson['recovery_rate'] = data_simpson['n_treated'] / data_simpson['n_total']

print('Recovery rates BY SEVERITY GROUP:')
print(data_simpson[['group', 'severity', 'n_treated', 'n_total', 'recovery_rate']].to_string(index=False))

# Aggregate (ignoring severity)
drug_total   = data_simpson[data_simpson['group'] == 'Drug'][['n_treated','n_total']].sum()
nodrug_total = data_simpson[data_simpson['group'] == 'No Drug'][['n_treated','n_total']].sum()

drug_rate   = drug_total['n_treated']   / drug_total['n_total']
nodrug_rate = nodrug_total['n_treated'] / nodrug_total['n_total']

print(f'\nAGGREGATED (ignoring severity):')
print(f'  Drug:    {drug_rate:.3f}')
print(f'  No Drug: {nodrug_rate:.3f}')
print(f'  Naive conclusion: Drug is WORSE? {drug_rate < nodrug_rate}')
print('\nWHY: Drug was given to sicker patients (higher severity). Controlling for severity,')
print("     drug HELPS in both groups: classic Simpson's Paradox!")
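The paradox resolves via the back-door adjustment formula: average the severity-specific recovery rates over the population's marginal severity distribution, so both arms are standardized to the same case mix. A sketch using the counts above:

```python
# (recovered, total) counts from the table above
counts = {
    ('Drug',    'Mild'):   (81, 87),
    ('Drug',    'Severe'): (192, 263),
    ('No Drug', 'Mild'):   (234, 270),
    ('No Drug', 'Severe'): (55, 80),
}

# Marginal severity distribution across the whole study population
n_mild   = counts[('Drug', 'Mild')][1]   + counts[('No Drug', 'Mild')][1]
n_severe = counts[('Drug', 'Severe')][1] + counts[('No Drug', 'Severe')][1]
p_mild   = n_mild / (n_mild + n_severe)

def adjusted_rate(group):
    """Back-door formula: sum_z P(recover | group, z) * P(z)."""
    r_mild   = counts[(group, 'Mild')][0]   / counts[(group, 'Mild')][1]
    r_severe = counts[(group, 'Severe')][0] / counts[(group, 'Severe')][1]
    return r_mild * p_mild + r_severe * (1 - p_mild)

adj_drug, adj_nodrug = adjusted_rate('Drug'), adjusted_rate('No Drug')
print(f'Severity-adjusted recovery: Drug {adj_drug:.3f} vs No Drug {adj_nodrug:.3f}')
# Drug 0.833 vs No Drug 0.779: the adjusted comparison reverses the aggregate one
```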

3. Directed Acyclic Graphs (DAGs): When to Control for a Variable

DAGs help us decide which variables to include in our model. Three key patterns:

FORK (confounder):         CHAIN (mediator):          COLLIDER (inverted fork):
  T ← Z → Y                  T → Z → Y                  T → Z ← Y

Control for Z?    YES      Control for Z?    NO       Control for Z?   NO!
(blocks backdoor path)     (blocks the effect)        (opens spurious path)

# Our DAG: Engagement → Treatment → Purchase
#          Engagement → Purchase
# Engagement is a FORK / confounder: must control for it

# Collider example: conditioning on a collider creates spurious correlation
rng2 = np.random.default_rng(10)
n = 1000

# T and Y are independent
T_collider = rng2.normal(0, 1, n)     # treatment (cause 1)
Y_collider = rng2.normal(0, 1, n)     # outcome (cause 2)
Z_collider = T_collider + Y_collider + rng2.normal(0, 0.5, n)  # collider: caused by both

# Without conditioning on Z: T and Y are independent
corr_unconditional = np.corrcoef(T_collider, Y_collider)[0, 1]

# Conditioning on Z = 0 (collider): creates spurious correlation
mask_z = np.abs(Z_collider) < 0.3
corr_conditional = np.corrcoef(T_collider[mask_z], Y_collider[mask_z])[0, 1]

print('Collider bias demonstration:')
print(f'  Corr(T, Y) without conditioning on Z : {corr_unconditional:.4f}  ← near zero (independent)')
print(f'  Corr(T, Y) conditioning on Z ≈ 0     : {corr_conditional:.4f}  ← spurious correlation!')
print('\nLesson: NEVER condition on a collider: it opens a spurious (non-causal) path.')
print('\nOur marketing DAG:')
print('  Engagement (Z) --> Treatment (T) --> Purchase (Y)')
print('  Engagement (Z) -----------------> Purchase (Y)')
print('  Z is a confounder → must include it in our model')
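The chain (mediator) pattern can be demonstrated the same way as the collider above: conditioning on a mediator blocks exactly the effect we are trying to measure. A small sketch (the `_m` variable names are mine):

```python
import numpy as np

rng_m = np.random.default_rng(3)
n_m = 50_000

# Chain: T -> Z -> Y; the entire effect of T on Y flows through the mediator Z
T_m = rng_m.normal(0, 1, n_m)
Z_m = 2.0 * T_m + rng_m.normal(0, 1, n_m)   # mediator
Y_m = 1.5 * Z_m + rng_m.normal(0, 1, n_m)   # total effect of T on Y = 2.0 * 1.5 = 3.0

# Total effect: simple regression of Y on T
slope_total = np.polyfit(T_m, Y_m, 1)[0]

# "Controlling for" the mediator: regress Y on [T, Z] jointly
X_m = np.column_stack([T_m, Z_m, np.ones(n_m)])
coef_m, *_ = np.linalg.lstsq(X_m, Y_m, rcond=None)

print(f'Effect of T, ignoring Z       : {slope_total:.2f}  (true total effect: 3.0)')
print(f'Effect of T, controlling for Z: {coef_m[0]:.2f}  (direct effect: 0.0)')
```

Adding the mediator to the regression makes the treatment coefficient collapse toward zero, even though T genuinely causes Y.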

4. Propensity Score Matching (PSM)

Propensity score: \(e(X) = P(T=1 | X)\), the probability of receiving treatment given covariates.

Idea: Match each treated unit with a control unit that has a similar propensity score. After matching, treatment assignment is "as good as random" conditional on the score.

Steps:

  1. Estimate propensity scores with logistic regression

  2. Match treated to controls with similar scores (nearest-neighbor)

  3. Compare outcomes in matched sample

from sklearn.neighbors import NearestNeighbors

# Step 1: Estimate propensity scores
X = df[['engagement']].values
T = df['treatment'].values
Y = df['purchase'].values

ps_model = LogisticRegression(random_state=42)
ps_model.fit(X, T)
propensity_scores = ps_model.predict_proba(X)[:, 1]
df['propensity_score'] = propensity_scores

# Step 2: Nearest-neighbor matching (1:1, with replacement: a control can be reused)
treated_idx  = np.where(T == 1)[0]
control_idx  = np.where(T == 0)[0]

ps_control = propensity_scores[control_idx].reshape(-1, 1)
ps_treated = propensity_scores[treated_idx].reshape(-1, 1)

nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
nn.fit(ps_control)
distances, matched_control_pos = nn.kneighbors(ps_treated)
matched_control_idx = control_idx[matched_control_pos.flatten()]

# Step 3: Estimate ATE on matched sample
Y_treated  = Y[treated_idx]
Y_control  = Y[matched_control_idx]
psm_ate    = (Y_treated - Y_control).mean()

# Covariate balance check (1:1 matching keeps every treated unit, so the
# treated mean is unchanged; only the control group changes)
eng_treated_before = df.loc[treated_idx, 'engagement'].mean()
eng_control_before = df.loc[control_idx, 'engagement'].mean()
eng_treated_after  = eng_treated_before
eng_control_after  = df.loc[matched_control_idx, 'engagement'].mean()

print('=== Propensity Score Matching Results ===')
print(f'Naive ATE (no adjustment)   : {Y[T==1].mean() - Y[T==0].mean():.4f}  (biased upward)')
print(f'PSM ATE                     : {psm_ate:.4f}')
print(f'TRUE causal effect          : {TRUE_EFFECT:.4f}')
print(f'\nCovariate balance (engagement):')
print(f'  Before matching β€” treated: {eng_treated_before:.2f}, control: {eng_control_before:.2f}')
print(f'  After  matching β€” treated: {eng_treated_after:.2f}, control: {eng_control_after:.2f}')
print(f'  Max PS matching distance: {distances.max():.4f}  (no caliper applied)')
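Balance is usually quantified with the standardized mean difference (SMD): the gap in covariate means divided by the pooled standard deviation, with values under roughly 0.1 considered acceptable. A self-contained sketch on made-up engagement-like samples (the notebook's before/after groups could be passed to `smd` the same way):

```python
import numpy as np

def smd(x_a, x_b):
    """Standardized mean difference between two samples of one covariate."""
    pooled_sd = np.sqrt((x_a.var(ddof=1) + x_b.var(ddof=1)) / 2)
    return abs(x_a.mean() - x_b.mean()) / pooled_sd

rng_bal = np.random.default_rng(1)
treated_eng  = rng_bal.normal(6.5, 2.0, 20_000)   # treated skew high-engagement
control_raw  = rng_bal.normal(4.0, 2.3, 20_000)   # unmatched control pool
control_mtch = rng_bal.normal(6.4, 2.0, 20_000)   # matched controls resemble treated

print(f'SMD before matching: {smd(treated_eng, control_raw):.2f}')   # large imbalance
print(f'SMD after  matching: {smd(treated_eng, control_mtch):.2f}')  # under the 0.1 rule of thumb
```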

5. Inverse Probability Weighting (IPW)

Instead of discarding unmatched units, reweight observations by the inverse of their propensity score:

\[\hat{\tau}_{IPW} = \frac{1}{N} \sum_i \left( \frac{T_i Y_i}{e(X_i)} - \frac{(1-T_i) Y_i}{1 - e(X_i)} \right)\]

Treated units with a low propensity score get upweighted: they were unlikely to be treated, so each one stands in for many similar untreated-looking units (symmetrically, control units with high scores are upweighted).

e = propensity_scores

# Clip propensity scores to avoid extreme weights
e_clipped = np.clip(e, 0.01, 0.99)

# IPW estimator
ipw_ate = np.mean(T * Y / e_clipped - (1 - T) * Y / (1 - e_clipped))

# Stabilized IPW (normalize weights)
w_treated = T / e_clipped
w_control = (1 - T) / (1 - e_clipped)

sipw_ate = (np.sum(w_treated * Y) / np.sum(w_treated) -
            np.sum(w_control * Y) / np.sum(w_control))

print('=== Inverse Probability Weighting ===')
print(f'IPW ATE (unstabilized)      : {ipw_ate:.4f}')
print(f'IPW ATE (stabilized)        : {sipw_ate:.4f}')
print(f'TRUE causal effect          : {TRUE_EFFECT:.4f}')

# Show effective sample sizes
ess_treated = np.sum(w_treated)**2 / np.sum(w_treated**2)
ess_control = np.sum(w_control)**2 / np.sum(w_control**2)
print(f'\nEffective sample sizes:')
print(f'  Treated n={T.sum():.0f}  ESS={ess_treated:.0f}')
print(f'  Control n={(1-T).sum():.0f} ESS={ess_control:.0f}')

6. Double Machine Learning (DML)

DML (Chernozhukov et al. 2018) is a modern approach that:

  1. Uses ML to partial out the effect of confounders from both outcome and treatment

  2. Then regresses the residuals to get the causal effect

This is Robinson's (1988) partial linear model estimated with ML:

\[Y = \tau T + g(X) + \epsilon\]

Procedure:

  1. Fit \(\hat{g}(X) = E[Y | X]\) with ML → compute \(\tilde{Y} = Y - \hat{g}(X)\)

  2. Fit \(\hat{e}(X) = E[T | X]\) with ML → compute \(\tilde{T} = T - \hat{e}(X)\)

  3. Regress \(\tilde{Y}\) on \(\tilde{T}\): the coefficient is \(\hat{\tau}_{DML}\)

Use cross-fitting to avoid overfitting bias.

from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

def double_ml(X, T, Y, n_splits=5, random_state=42):
    """
    Double Machine Learning with cross-fitting.
    Returns estimated ATE.
    """
    n = len(Y)
    Y_res = np.zeros(n)  # residuals: Y - E[Y|X]
    T_res = np.zeros(n)  # residuals: T - E[T|X]

    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)

    for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
        X_tr, X_te = X[train_idx], X[test_idx]
        Y_tr, T_tr = Y[train_idx], T[train_idx]

        # Model 1: predict Y from X
        m_y = GradientBoostingRegressor(n_estimators=100, random_state=random_state)
        m_y.fit(X_tr, Y_tr)
        Y_res[test_idx] = Y[test_idx] - m_y.predict(X_te)

        # Model 2: predict T from X (propensity model)
        m_t = GradientBoostingClassifier(n_estimators=100, random_state=random_state)
        m_t.fit(X_tr, T_tr)
        T_res[test_idx] = T[test_idx] - m_t.predict_proba(X_te)[:, 1]

    # Final OLS on residuals
    T_res_2d = T_res.reshape(-1, 1)
    dml_model = LinearRegression(fit_intercept=False)
    dml_model.fit(T_res_2d, Y_res)
    ate = dml_model.coef_[0]

    # Bootstrap SE
    rng_b = np.random.default_rng(0)
    boot_ates = []
    for _ in range(200):
        idx = rng_b.integers(0, n, n)
        lr = LinearRegression(fit_intercept=False)
        lr.fit(T_res[idx].reshape(-1,1), Y_res[idx])
        boot_ates.append(lr.coef_[0])
    se = np.std(boot_ates)

    return ate, se

print('Running Double Machine Learning (5-fold cross-fitting)...')
dml_ate, dml_se = double_ml(X, T, Y)
print(f'\nDML ATE    : {dml_ate:.4f} ± {1.96*dml_se:.4f}  (95% CI)')
print(f'True ATE   : {TRUE_EFFECT:.4f}')
print(f'Naive ATE  : {Y[T==1].mean() - Y[T==0].mean():.4f}')
print(f'PSM ATE    : {psm_ate:.4f}')

7. Sensitivity Analysis: Robustness to Unmeasured Confounders

Question: How large would an unmeasured confounder need to be to explain away our estimated effect?

We use Rosenbaum's sensitivity analysis concept (simplified): vary the assumed odds ratio \(\Gamma\) for hidden bias. If results hold for large \(\Gamma\), the study is robust.

# Simplified sensitivity analysis: E-value
# E-value: minimum strength of association (on risk ratio scale) that an
# unmeasured confounder would need to have with BOTH treatment and outcome
# to fully explain away the observed effect.

def e_value(rr_observed):
    """
    E-value for a risk ratio.
    VanderWeele & Ding (2017).
    """
    if rr_observed < 1:
        rr_observed = 1 / rr_observed
    return rr_observed + np.sqrt(rr_observed * (rr_observed - 1))

# Convert ATE (probability difference) to risk ratio
baseline_risk = Y[T == 0].mean()
treated_risk  = Y[T == 1].mean()
dml_risk      = baseline_risk + dml_ate

rr_naive = treated_risk  / baseline_risk
rr_dml   = dml_risk      / baseline_risk

ev_naive = e_value(rr_naive)
ev_dml   = e_value(rr_dml)

print('=== Sensitivity Analysis (E-values) ===')
print(f'Baseline purchase rate (control):  {baseline_risk:.3f}')
print(f'Observed risk ratio (naive):       {rr_naive:.3f}  → E-value: {ev_naive:.3f}')
print(f'DML-adjusted risk ratio:           {rr_dml:.3f}  → E-value: {ev_dml:.3f}')
print()
print('Interpretation: an unmeasured confounder would need an association of')
print(f'{ev_dml:.2f}x with BOTH treatment and outcome (on RR scale) to explain away')
print('the DML estimate. The higher the E-value, the more robust our result.')

# Gamma-sensitivity (Rosenbaum bounds concept)
from scipy.stats import binom

print('\nRosenbaum Gamma-sensitivity (how much hidden bias can our matched study tolerate?):')
print('  Gamma=1.0 means no hidden bias (randomized-trial equivalent)')
print('  Gamma=1.5 means hidden bias could make one unit of a matched pair up to')
print('  1.5x more likely to be treated than its match')
gammas   = [1.0, 1.2, 1.5, 2.0, 3.0]
n_pairs  = len(treated_idx)
diffs    = Y_treated - Y_control
n_pos    = (diffs > 0).sum()
for g in gammas:
    # Worst-case success probability for a one-sided sign test under bias Gamma=g
    p_upper = g / (1 + g)
    p_val   = 1 - binom.cdf(n_pos - 1, n_pairs, p_upper)
    sig = '✓ significant' if p_val < 0.05 else '✗ not significant'
    print(f'  Gamma={g:.1f}  max p-value={p_val:.4f}  {sig}')

8. Cheat Sheet + Exercises

Causal Inference Cheat Sheet

| Method | Assumption | When to Use | Code |
|---|---|---|---|
| Naive comparison | No confounders | RCT only | `Y[T==1].mean() - Y[T==0].mean()` |
| OLS regression | Linear, no unmeasured confounders | Baseline | `sm.OLS(Y, sm.add_constant(np.c_[T, X]))` |
| PSM | No unmeasured confounders, overlap | Observational, low-dim X | LogReg → NearestNeighbors |
| IPW | No unmeasured confounders, overlap | When you want all units | `T*Y/e - (1-T)*Y/(1-e)` |
| DML | Partial linear model, no unmeasured confounders | High-dim X, non-linear confounding | Cross-fit ML residuals → OLS |
| IV | Exclusion restriction | When unmeasured confounders exist | 2SLS |

# DML in a nutshell (conceptual pseudocode: ML_model and PS_model are pre-fitted nuisance models)
Y_res = Y - ML_model.predict(X)                # residualize outcome
T_res = T - PS_model.predict_proba(X)[:, 1]    # residualize treatment
ate   = LinearRegression().fit(T_res.reshape(-1, 1), Y_res).coef_[0]

# E-value
ev = rr + np.sqrt(rr * (rr - 1))  # sensitivity to unmeasured confounders

Exercises

  1. Multiple confounders: Extend the synthetic dataset to have two confounders: engagement (as before) and account_age (older accounts are more likely to receive emails AND buy more). Re-run PSM and DML including both confounders. Does adding account_age change the estimated ATE?

  2. Doubly robust estimator: Implement the Augmented IPW (AIPW) estimator; it is consistent if either the propensity model or the outcome model is correctly specified: AIPW = IPW_term + regression_adjustment. Compare its variance to plain IPW.

  3. CATE (heterogeneous effects): Instead of an average effect, estimate the conditional ATE using the DML framework with sklearn's GradientBoostingRegressor as the final stage. Plot CATE vs engagement score: do high-engagement users respond more to the email?

  4. Instrumental Variable: Suppose A/B test assignment was randomized (coin flip Z) but compliance was imperfect (some users ignored the email despite being assigned). Implement 2SLS using Z as the instrument for T to estimate the LATE (Local Average Treatment Effect for compliers).

  5. Overlap (positivity) check: Propensity score methods break down when treated and control groups have no overlap. Plot the distribution of propensity scores for treated and control groups. Trim the sample to only units with 0.1 < e(X) < 0.9 and re-estimate the ATE; does trimming change the estimate?