Run this notebook: Open in Colab Open in Kaggle

Difference-in-Differences: Causal Inference from Natural Experiments¶

DiD is the most widely used causal inference method in economics and tech companies. When you can’t randomize, find a ‘natural experiment’ where some units were treated by accident. This notebook covers classic DiD, parallel trends, event studies, and synthetic control.

1. Setup — Synthetic Policy Evaluation¶

Scenario: 20 cities. 10 adopted a ride-sharing ban in month 13 (treatment group). 10 did not (control group). Monthly traffic accident counts for 24 months. Does the ban reduce accidents?

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

N_CITIES   = 20
N_MONTHS   = 24
TREAT_TIME = 13        # treatment starts in month 13 (months 13-24 are post)
N_TREATED  = 10
TRUE_EFFECT = -15.0    # ban reduces accidents by 15 per month

rng = np.random.default_rng(42)

cities = [f'city_{i:02d}' for i in range(N_CITIES)]
treated_cities = cities[:N_TREATED]
control_cities = cities[N_TREATED:]

# City-level baseline accident rates (fixed effects)
city_fe = rng.normal(100, 20, N_CITIES)   # mean 100 accidents/month

# Common time trend (same for all cities — parallel trends assumption)
time_trend = np.linspace(0, 10, N_MONTHS) + rng.normal(0, 1, N_MONTHS)

records = []
for c_idx, city in enumerate(cities):
    is_treated = city in treated_cities
    for t in range(1, N_MONTHS + 1):
        post = int(t >= TREAT_TIME)
        effect = TRUE_EFFECT if (is_treated and post) else 0.0
        accidents = (city_fe[c_idx] +
                     time_trend[t - 1] +
                     effect +
                     rng.normal(0, 5))
        records.append({
            'city'       : city,
            'month'      : t,
            'treated'    : int(is_treated),
            'post'       : post,
            'accidents'  : max(0.0, round(accidents, 2)),
        })

panel = pd.DataFrame(records)
panel['treat_post'] = panel['treated'] * panel['post']  # interaction term

print(f'Panel shape: {panel.shape}')
print(f'Treated cities : {N_TREATED}  |  Control cities: {N_CITIES - N_TREATED}')
print(f'Pre-treatment months: {TREAT_TIME - 1}  |  Post-treatment months: {N_MONTHS - TREAT_TIME + 1}')
print(f'True causal effect: {TRUE_EFFECT} accidents/month')
panel.head()

2. Classic DiD — The 2×2 Table¶

The DiD estimator uses a 2×2 table of group means:

	Pre-treatment	Post-treatment	Difference
Treated	$\bar{Y}_{T,pre}$	$\bar{Y}_{T,post}$	$\Delta_T$
Control	$\bar{Y}_{C,pre}$	$\bar{Y}_{C,post}$	$\Delta_C$
DiD			$\Delta_T - \Delta_C$

Key assumption: Parallel trends — in the absence of treatment, treated and control groups would have followed the same trend.

# 2×2 DiD table
def did_2x2(df):
    means = df.groupby(['treated', 'post'])['accidents'].mean().unstack()
    means.columns = ['Pre', 'Post']
    means.index   = ['Control', 'Treated']
    means['Diff (Post-Pre)'] = means['Post'] - means['Pre']
    return means

table = did_2x2(panel)
print('=== 2x2 DiD Table ===')
print(table.round(3))

delta_treated = table.loc['Treated', 'Diff (Post-Pre)']
delta_control = table.loc['Control', 'Diff (Post-Pre)']
did_estimate  = delta_treated - delta_control

print(f'\nΔ_treated (post - pre for treated group): {delta_treated:.3f}')
print(f'Δ_control (post - pre for control group): {delta_control:.3f}')
print(f'DiD = Δ_treated - Δ_control             : {did_estimate:.3f}')
print(f'True effect                              : {TRUE_EFFECT:.3f}')

3. Regression DiD¶

The regression formulation generalizes DiD to:

Control for covariates
Include unit and time fixed effects
Compute standard errors (including clustered SEs)

\[Y_{it} = \alpha + \beta_1 \text{Treated}_i + \beta_2 \text{Post}_t + \delta (\text{Treated}_i \times \text{Post}_t) + \epsilon_{it}\]

$\delta$ is the DiD estimate. With unit FE and time FE:

\[Y_{it} = \alpha_i + \gamma_t + \delta (\text{Treated}_i \times \text{Post}_t) + \epsilon_{it}\]

# --- Simple DiD regression (no fixed effects) ---
X_simple = sm.add_constant(panel[['treated', 'post', 'treat_post']])
model_simple = sm.OLS(panel['accidents'], X_simple).fit(
    cov_type='cluster',
    cov_kwds={'groups': panel['city']}
)

print('=== Simple DiD Regression (clustered SE by city) ===')
print(model_simple.summary().tables[1])
print(f'\nDiD coefficient (treat_post): {model_simple.params["treat_post"]:.3f}')
print(f'95% CI: [{model_simple.conf_int().loc["treat_post", 0]:.3f}, '
      f'{model_simple.conf_int().loc["treat_post", 1]:.3f}]')

# --- Two-way fixed effects DiD (unit FE + time FE) ---
# Add city and month dummies
panel_fe = panel.copy()
city_dummies  = pd.get_dummies(panel_fe['city'],  prefix='city',  drop_first=True)
month_dummies = pd.get_dummies(panel_fe['month'], prefix='month', drop_first=True)

X_fe = pd.concat([
    panel_fe[['treat_post']],
    city_dummies,
    month_dummies
], axis=1).astype(float)
X_fe = sm.add_constant(X_fe)

model_fe = sm.OLS(panel_fe['accidents'], X_fe).fit(
    cov_type='cluster',
    cov_kwds={'groups': panel_fe['city']}
)

twfe_coeff = model_fe.params['treat_post']
twfe_se    = model_fe.bse['treat_post']
twfe_ci    = model_fe.conf_int().loc['treat_post']

print('=== Two-Way Fixed Effects DiD ===')
print(f'DiD coefficient (treat_post): {twfe_coeff:.3f}  SE: {twfe_se:.3f}')
print(f'95% CI: [{twfe_ci[0]:.3f}, {twfe_ci[1]:.3f}]')
print(f'True effect: {TRUE_EFFECT:.3f}')
print(f'\nTWFE includes {city_dummies.shape[1]} city FE + {month_dummies.shape[1]} time FE')

4. Parallel Trends Assumption — Visualize and Test¶

The parallel trends assumption cannot be directly tested (we never observe the counterfactual post-treatment trend). But we can check it pre-treatment: treated and control groups should follow the same trend before treatment begins.

# Visual check: pre-treatment trends
pre_trends = panel[panel['post'] == 0].groupby(['month', 'treated'])['accidents'].mean().unstack()
pre_trends.columns = ['Control', 'Treated']

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Pre-treatment trends
ax = axes[0]
ax.plot(pre_trends.index, pre_trends['Control'], 'b-o', label='Control', ms=4)
ax.plot(pre_trends.index, pre_trends['Treated'], 'r-s', label='Treated', ms=4)
ax.set_title('Pre-treatment Trends (Parallel Trends Check)')
ax.set_xlabel('Month'); ax.set_ylabel('Mean Accidents')
ax.legend(); ax.grid(True, alpha=0.3)

# Full time series with treatment line
full_trends = panel.groupby(['month', 'treated'])['accidents'].mean().unstack()
full_trends.columns = ['Control', 'Treated']
ax2 = axes[1]
ax2.plot(full_trends.index, full_trends['Control'], 'b-o', label='Control', ms=4)
ax2.plot(full_trends.index, full_trends['Treated'], 'r-s', label='Treated', ms=4)
ax2.axvline(x=TREAT_TIME - 0.5, color='k', linestyle='--', label='Treatment starts')
ax2.set_title('Full Panel: DiD Visualization')
ax2.set_xlabel('Month'); ax2.set_ylabel('Mean Accidents')
ax2.legend(); ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('/tmp/did_trends.png', dpi=80, bbox_inches='tight')
plt.close()
print('Plot saved to /tmp/did_trends.png')

# Formal test: is the pre-trend slope different between groups?
pre_panel = panel[panel['post'] == 0].copy()
pre_panel['month_x_treated'] = pre_panel['month'] * pre_panel['treated']
X_pt = sm.add_constant(pre_panel[['month', 'treated', 'month_x_treated']])
model_pt = sm.OLS(pre_panel['accidents'], X_pt).fit(
    cov_type='cluster', cov_kwds={'groups': pre_panel['city']}
)
pval_interaction = model_pt.pvalues['month_x_treated']
print(f'\nFormal parallel trends test (pre-period):')
print(f'  H0: slope is the same for treated and control (parallel trends)')
print(f'  Coefficient on month × treated: {model_pt.params["month_x_treated"]:.4f}')
print(f'  p-value: {pval_interaction:.4f}  → {"FAIL to reject H0 ✓" if pval_interaction > 0.05 else "REJECT H0 — parallel trends violated!"}')

5. Event Study — Plot Coefficients for Each Time Period¶

An event study regression includes a separate interaction for each time period relative to treatment. If parallel trends holds, pre-treatment coefficients should be near zero and not statistically significant.

# Event study: interact treatment indicator with time dummies
# Reference period: month just before treatment (month TREAT_TIME - 1)
panel_es = panel.copy()
panel_es['event_time'] = panel_es['month'] - TREAT_TIME  # relative time: -12 to +11

# Create dummies for each event-time period, omit reference period (event_time = -1)
event_times = sorted(panel_es['event_time'].unique())
event_times_no_ref = [t for t in event_times if t != -1]

dummy_cols = {}
for t in event_times_no_ref:
    col = f'et_{t:+d}'.replace('+', 'p').replace('-', 'm')
    panel_es[col] = ((panel_es['event_time'] == t) & (panel_es['treated'] == 1)).astype(float)
    dummy_cols[t] = col

# Also add city and month FE
city_dummies_es  = pd.get_dummies(panel_es['city'],  prefix='c', drop_first=True)
month_dummies_es = pd.get_dummies(panel_es['month'], prefix='m', drop_first=True)

et_cols = list(dummy_cols.values())
X_es = pd.concat([
    panel_es[et_cols],
    city_dummies_es,
    month_dummies_es
], axis=1).astype(float)
X_es = sm.add_constant(X_es)

model_es = sm.OLS(panel_es['accidents'], X_es).fit(
    cov_type='cluster', cov_kwds={'groups': panel_es['city']}
)

# Extract event-study coefficients and CIs
es_results = []
for t, col in dummy_cols.items():
    coef = model_es.params[col]
    ci   = model_es.conf_int().loc[col]
    es_results.append({'event_time': t, 'coef': coef, 'ci_lo': ci[0], 'ci_hi': ci[1]})
# Add reference period (0 by definition)
es_results.append({'event_time': -1, 'coef': 0.0, 'ci_lo': 0.0, 'ci_hi': 0.0})
es_df = pd.DataFrame(es_results).sort_values('event_time').reset_index(drop=True)

# Plot
fig, ax = plt.subplots(figsize=(10, 4))
ax.errorbar(es_df['event_time'], es_df['coef'],
            yerr=[es_df['coef'] - es_df['ci_lo'], es_df['ci_hi'] - es_df['coef']],
            fmt='o', color='steelblue', capsize=4, linewidth=1.5)
ax.axhline(0, color='black', linewidth=0.8, linestyle='--')
ax.axvline(-0.5, color='red', linewidth=1.5, linestyle='--', label='Treatment start')
ax.fill_between(es_df['event_time'].values,
                es_df['ci_lo'].values, es_df['ci_hi'].values,
                alpha=0.15, color='steelblue')
ax.set_xlabel('Months relative to treatment'); ax.set_ylabel('Coefficient estimate')
ax.set_title('Event Study: Pre-treatment coefs ≈ 0 (parallel trends), post-treatment shows effect')
ax.legend(); ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('/tmp/event_study.png', dpi=80, bbox_inches='tight')
plt.close()

pre_coefs  = es_df[es_df['event_time'] < -1]['coef']
post_coefs = es_df[es_df['event_time'] >= 0]['coef']
print('Event study results:')
print(f'  Pre-treatment coefficients (should be ~0): mean={pre_coefs.mean():.3f}, max_abs={pre_coefs.abs().max():.3f}')
print(f'  Post-treatment coefficients (should be ~{TRUE_EFFECT}): mean={post_coefs.mean():.3f}')
print('Plot saved to /tmp/event_study.png')

6. Synthetic Control¶

When you have only one treated unit, DiD is unreliable. Synthetic control constructs a weighted combination of control units that best matches the treated unit’s pre-treatment outcomes.

The synthetic counterfactual: $\hat{Y}_{1t}(0) = \sum_{j=2}^{J} w_j^* Y_{jt}$

Weights $w^*$ are found by solving a constrained optimization: minimize distance between pre-treatment outcomes of treated unit and synthetic control.

from scipy.optimize import minimize

# Use city_00 as the single treated unit; all other 19 cities as donors
donor_cities   = cities[1:]   # 19 donor cities
treated_city   = cities[0]     # city_00

# Build matrices: (N_MONTHS, N_DONORS) for donor outcomes, (N_MONTHS,) for treated
pivot = panel.pivot(index='month', columns='city', values='accidents')
Y_treated_full = pivot[treated_city].values                    # (24,)
Y_donors_full  = pivot[donor_cities].values                    # (24, 19)

# Pre-treatment period
pre_idx  = np.arange(0, TREAT_TIME - 1)     # months 1-12 (indices 0-11)
post_idx = np.arange(TREAT_TIME - 1, N_MONTHS)   # months 13-24

Y_tr_pre   = Y_treated_full[pre_idx]
Y_don_pre  = Y_donors_full[pre_idx, :]     # (12, 19)

# Optimization: find weights w (sum to 1, non-negative) minimizing pre-treatment fit
n_donors = len(donor_cities)

def sc_loss(w):
    synth = Y_don_pre @ w
    return np.sum((Y_tr_pre - synth) ** 2)

w0 = np.ones(n_donors) / n_donors   # equal weights init
constraints = [{'type': 'eq', 'fun': lambda w: w.sum() - 1}]
bounds = [(0, 1)] * n_donors

result = minimize(sc_loss, w0, method='SLSQP', bounds=bounds, constraints=constraints,
                  options={'ftol': 1e-10, 'maxiter': 2000})
w_opt = result.x

# Synthetic control series
Y_synth = Y_donors_full @ w_opt    # (24,) — synthetic counterfactual

# Estimated effect: treated - synthetic in post period
effects = Y_treated_full[post_idx] - Y_synth[post_idx]
sc_ate  = effects.mean()

print('=== Synthetic Control ===')
print(f'Pre-treatment RMSPE: {np.sqrt(sc_loss(w_opt)):.4f}')
print(f'Avg estimated effect (post): {sc_ate:.3f}')
print(f'True effect: {TRUE_EFFECT:.3f}')
print(f'\nTop-5 donor cities by weight:')
top_donors = np.argsort(w_opt)[::-1][:5]
for i in top_donors:
    if w_opt[i] > 0.01:
        print(f'  {donor_cities[i]}: {w_opt[i]:.4f}')

# Plot
months = np.arange(1, N_MONTHS + 1)
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(months, Y_treated_full, 'r-o', label='Treated (city_00)', ms=4)
ax.plot(months, Y_synth, 'b--s',       label='Synthetic control', ms=4)
ax.axvline(TREAT_TIME - 0.5, color='k', linestyle='--', label='Treatment')
ax.fill_between(months[post_idx], Y_synth[post_idx], Y_treated_full[post_idx],
                alpha=0.2, color='red', label='Estimated effect')
ax.set_xlabel('Month'); ax.set_ylabel('Accidents')
ax.set_title('Synthetic Control: city_00 vs its synthetic counterfactual')
ax.legend(); ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('/tmp/synthetic_control.png', dpi=80, bbox_inches='tight')
plt.close()
print('\nPlot saved to /tmp/synthetic_control.png')

7. Staggered DiD — When Treatment is Adopted at Different Times¶

In practice, units often adopt treatment at different times (rollout). The classic TWFE estimator can be biased in staggered settings (Goodman-Bacon 2021). Modern estimators use “clean” comparisons only.

Callaway & Sant’Anna (2021) aggregates cohort-specific ATTs: $$ATT = \mathbb{E}[Y_{it}(g) - Y_{it}(\infty) | G_i = g, t \geq g]$$

# Generate staggered adoption dataset
N_CITIES_STAG = 30
N_MONTHS_STAG = 24
rng3 = np.random.default_rng(99)

# Each city gets a random treatment start (or never treated)
# 10 never treated, 10 treated early (month 7), 10 treated late (month 15)
city_cohorts = (['never'] * 10 +
                ['early'] * 10 +
                ['late']  * 10)
city_treat_time = {'never': None, 'early': 7, 'late': 15}
city_list_stag  = [f'scity_{i:02d}' for i in range(N_CITIES_STAG)]

city_fe_stag = rng3.normal(100, 20, N_CITIES_STAG)
time_trend_stag = np.linspace(0, 8, N_MONTHS_STAG)

stag_records = []
for c_idx, (city, cohort) in enumerate(zip(city_list_stag, city_cohorts)):
    treat_start = city_treat_time[cohort]
    for t in range(1, N_MONTHS_STAG + 1):
        is_treated_now = (treat_start is not None) and (t >= treat_start)
        effect = TRUE_EFFECT if is_treated_now else 0.0
        accidents = (city_fe_stag[c_idx] + time_trend_stag[t-1] + effect +
                     rng3.normal(0, 5))
        stag_records.append({
            'city': city, 'month': t, 'cohort': cohort,
            'treat_start': treat_start if treat_start else 0,
            'treated_now': int(is_treated_now),
            'accidents': max(0.0, round(accidents, 2))
        })

stag_panel = pd.DataFrame(stag_records)

# Naive TWFE on staggered panel
city_fe_s  = pd.get_dummies(stag_panel['city'],  prefix='c', drop_first=True)
month_fe_s = pd.get_dummies(stag_panel['month'], prefix='m', drop_first=True)
X_stag = pd.concat([stag_panel[['treated_now']], city_fe_s, month_fe_s], axis=1).astype(float)
X_stag = sm.add_constant(X_stag)
m_stag = sm.OLS(stag_panel['accidents'], X_stag).fit(
    cov_type='cluster', cov_kwds={'groups': stag_panel['city']}
)
naive_twfe_stag = m_stag.params['treated_now']

# Cohort-specific ATTs (clean DiD for each cohort vs never-treated)
cohort_atts = {}
never_treated = stag_panel[stag_panel['cohort'] == 'never']

for cohort_name in ['early', 'late']:
    treat_start_c = city_treat_time[cohort_name]
    cohort_data   = stag_panel[stag_panel['cohort'] == cohort_name]
    combined      = pd.concat([cohort_data, never_treated])
    combined      = combined.copy()
    combined['is_cohort'] = (combined['cohort'] == cohort_name).astype(int)
    combined['post']      = (combined['month'] >= treat_start_c).astype(int)
    combined['ip']        = combined['is_cohort'] * combined['post']
    X_c = sm.add_constant(combined[['is_cohort', 'post', 'ip']])
    m_c = sm.OLS(combined['accidents'], X_c).fit(
        cov_type='cluster', cov_kwds={'groups': combined['city']}
    )
    cohort_atts[cohort_name] = m_c.params['ip']

# Aggregate ATT (simple average of cohort ATTs)
agg_att = np.mean(list(cohort_atts.values()))

print('=== Staggered DiD ===')
print(f'Naive TWFE (can be biased with staggered adoption): {naive_twfe_stag:.3f}')
print(f"Cohort-specific ATT (early adopters, treat month {city_treat_time['early']}): {cohort_atts['early']:.3f}")
print(f"Cohort-specific ATT (late adopters,  treat month {city_treat_time['late']}):  {cohort_atts['late']:.3f}")
print(f'Aggregated ATT (Callaway & Sant\'Anna style)           : {agg_att:.3f}')
print(f'True effect                                            : {TRUE_EFFECT:.3f}')

8. Cheat Sheet + Exercises¶

Difference-in-Differences Cheat Sheet¶

Method	Formula	When to Use
Classic 2×2 DiD	$(\bar{Y}_{T,post} - \bar{Y}_{T,pre}) - (\bar{Y}_{C,post} - \bar{Y}_{C,pre})$	Two groups, two periods
Regression DiD	`Y ~ treated + post + treated:post` (OLS)	More flexibility, controls
TWFE	`Y ~ treated:post + C(city) + C(month)`	Panel data, multiple periods
Event Study	`Y ~ Σ_t (treated × 1[time=t]) + FE`	Visualize + test parallel trends
Synthetic Control	`Y_synth = W* Y_donors` (constrained optim)	Single treated unit
Staggered DiD	Cohort-specific ATTs + aggregation	Staggered rollouts

# TWFE regression (statsmodels)
import statsmodels.formula.api as smf

# Simple DiD
model = smf.ols('Y ~ treated * post', data=df).fit(
    cov_type='cluster', cov_kwds={'groups': df['unit']}
)
did_effect = model.params['treated:post']

# TWFE (with entity and time fixed effects via dummies or absorb)
# Alternatively with linearmodels:
# from linearmodels.panel import PanelOLS
# mod = PanelOLS(Y, X, entity_effects=True, time_effects=True)

Key assumptions:

Parallel trends — pre-treatment trends are the same
No anticipation — units don’t change behavior before treatment starts
SUTVA — treatment of one unit doesn’t affect others (no spillovers)

Exercises¶

Placebo test: To test parallel trends, pretend treatment happened 6 months earlier (in month 7 instead of 13) and re-run the DiD using only pre-treatment data (months 1-12). The DiD coefficient should be near zero. Report the coefficient and p-value.
Robustness to control set: Re-run the TWFE regression including a city-level covariate (e.g., simulated city population). Does adding the covariate change the DiD estimate? How large is the change, and why?
Permutation test for synthetic control: Estimate synthetic control separately for each of the 10 never-treated cities (as if they were the treated unit). Plot the distribution of placebo effects. Is the true treated unit’s effect an outlier? This is the standard significance test for synthetic control.
Goodman-Bacon decomposition: In the staggered dataset, the naive TWFE estimate is a weighted average of all 2×2 DiD comparisons (early vs never-treated, late vs never-treated, and early vs late). Implement the Goodman-Bacon decomposition manually: compute each 2×2 estimate and its weight, then verify the weighted sum equals the TWFE estimate.
Interrupted time series (ITS): When there is no control group, DiD reduces to ITS: fit a linear trend before and after treatment for the treated units only. Implement ITS on treated cities, compare the estimated effect to the DiD estimate, and discuss what additional assumptions ITS requires.

	Pre-treatment	Post-treatment	Difference
Treated	\(\bar{Y}_{T,pre}\)	\(\bar{Y}_{T,post}\)	\(\Delta_T\)
Control	\(\bar{Y}_{C,pre}\)	\(\bar{Y}_{C,post}\)	\(\Delta_C\)
DiD			\(\Delta_T - \Delta_C\)

Method	Formula	When to Use
Classic 2×2 DiD	\((\bar{Y}_{T,post} - \bar{Y}_{T,pre}) - (\bar{Y}_{C,post} - \bar{Y}_{C,pre})\)	Two groups, two periods
Regression DiD	`Y ~ treated + post + treated:post` (OLS)	More flexibility, controls
TWFE	`Y ~ treated:post + C(city) + C(month)`	Panel data, multiple periods
Event Study	`Y ~ Σ_t (treated × 1[time=t]) + FE`	Visualize + test parallel trends
Synthetic Control	`Y_synth = W* Y_donors` (constrained optim)	Single treated unit
Staggered DiD	Cohort-specific ATTs + aggregation	Staggered rollouts