03: Experimental Design
"To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of." - Ronald Fisher
Welcome to the gold standard of causal inference! This notebook explores experimental design - the most reliable way to establish causal relationships. We'll cover Randomized Controlled Trials (RCTs), power analysis, and the principles that make experiments trustworthy.
Learning Objectives
By the end of this notebook, you'll be able to:
Design and analyze Randomized Controlled Trials (RCTs)
Conduct power analysis for sample size determination
Apply blocking and stratification to reduce variance
Handle practical challenges in experimental design
Understand ethical considerations in experimentation
Evaluate experimental validity and potential biases
Randomized Controlled Trials (RCTs)
Randomized Controlled Trial (RCT): The gold standard for causal inference where:
Treatment assignment is random
Control group receives no treatment (or placebo)
Outcomes are measured for both groups
Analysis compares treated vs control
Why RCTs Work
Balance: Randomization balances both observed and unobserved confounders
Unbiased: No systematic differences between groups
Causal: Treatment precedes outcome, assignment is exogenous
RCT Components
Participants: Who is eligible?
Randomization: How are subjects assigned?
Intervention: What treatment is given?
Measurement: How are outcomes assessed?
Analysis: How are results interpreted?
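The five components above can be compressed into a toy end-to-end sketch. Everything here (the helper name, the potential-outcome inputs, and the +8 effect) is illustrative, not part of the study simulated below:

```python
import numpy as np

def run_minimal_rct(outcomes_if_control, outcomes_if_treated, seed=0):
    """Toy RCT: randomize, observe one potential outcome per subject, compare means."""
    rng = np.random.default_rng(seed)
    n = len(outcomes_if_control)
    # Randomization: each eligible participant is assigned with probability 0.5
    assigned = rng.binomial(1, 0.5, n)
    # Measurement: only the outcome under the assigned condition is observed
    observed = np.where(assigned == 1, outcomes_if_treated, outcomes_if_control)
    # Analysis: difference in group means estimates the average treatment effect
    return observed[assigned == 1].mean() - observed[assigned == 0].mean()

# Hypothetical illustration: a constant +8-point effect on noisy scores
rng = np.random.default_rng(42)
y0 = rng.normal(75, 10, 10_000)
effect = run_minimal_rct(y0, y0 + 8)
print(f"Estimated effect: {effect:.2f}")  # close to the true value of 8
```

In a real trial only one potential outcome per subject ever exists; passing both arrays is what makes this a simulation rather than an analysis recipe.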
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from statsmodels.stats.power import TTestIndPower
import warnings
warnings.filterwarnings('ignore')
# Set random seeds
np.random.seed(42)
# Set up plotting
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = [12, 8]
plt.rcParams['font.size'] = 12
print("Experimental Design: Randomized Controlled Trials")
print("=================================================")
def simulate_rct():
"""Simulate a complete RCT from design to analysis"""
print("=== Simulating a Complete RCT ===\n")
# Study: Effect of online tutoring on student test scores
print("Study: Effect of Online Tutoring on Student Test Scores")
print("-" * 55)
n_students = 200
# Generate student characteristics (potential confounders)
baseline_score = np.random.normal(75, 10, n_students) # Pre-test scores
socioeconomic_status = np.random.normal(0, 1, n_students)
study_hours = np.random.normal(5, 2, n_students)
motivation = np.random.normal(0, 1, n_students)
# True causal effect of tutoring
true_tutoring_effect = 8 # 8 point improvement
# Random assignment to treatment (50% chance)
treatment_assignment = np.random.binomial(1, 0.5, n_students)
# Generate post-test scores
# All students improve based on their baseline and characteristics
natural_improvement = (0.7 * baseline_score +
2 * socioeconomic_status +
1.5 * study_hours +
3 * motivation)
# Add treatment effect
treatment_effect = true_tutoring_effect * treatment_assignment
# Add random noise
noise = np.random.normal(0, 5, n_students)
post_test_score = natural_improvement + treatment_effect + noise
# Create DataFrame
rct_data = pd.DataFrame({
'Student_ID': range(n_students),
'Treatment': treatment_assignment,
'Baseline_Score': baseline_score,
'SES': socioeconomic_status,
'Study_Hours': study_hours,
'Motivation': motivation,
'Post_Score': post_test_score,
'Improvement': post_test_score - baseline_score
})
print(f"RCT Design Summary:")
print(f"- Total students: {n_students}")
print(f"- Treatment group: {treatment_assignment.sum()}")
print(f"- Control group: {n_students - treatment_assignment.sum()}")
print(f"- True treatment effect: {true_tutoring_effect} points")
print()
# Check randomization balance
print("Randomization Balance Check:")
balance_vars = ['Baseline_Score', 'SES', 'Study_Hours', 'Motivation']
for var in balance_vars:
treated_mean = rct_data[rct_data['Treatment'] == 1][var].mean()
control_mean = rct_data[rct_data['Treatment'] == 0][var].mean()
diff = treated_mean - control_mean
# T-test for difference
t_stat, p_val = stats.ttest_ind(
rct_data[rct_data['Treatment'] == 1][var],
rct_data[rct_data['Treatment'] == 0][var]
)
print(f"{var:15}: Treated={treated_mean:.2f}, Control={control_mean:.2f}, "
f"Diff={diff:.2f}, p={p_val:.3f}")
print()
# Analyze results
treated_improvement = rct_data[rct_data['Treatment'] == 1]['Improvement'].mean()
control_improvement = rct_data[rct_data['Treatment'] == 0]['Improvement'].mean()
estimated_effect = treated_improvement - control_improvement
# Statistical test
t_stat, p_val = stats.ttest_ind(
rct_data[rct_data['Treatment'] == 1]['Improvement'],
rct_data[rct_data['Treatment'] == 0]['Improvement']
)
print("RCT Results:")
print(f"- Treated group improvement: {treated_improvement:.2f} points")
print(f"- Control group improvement: {control_improvement:.2f} points")
print(f"- Estimated treatment effect: {estimated_effect:.2f} points")
print(f"- True treatment effect: {true_tutoring_effect:.0f} points")
print(f"- T-statistic: {t_stat:.3f}, p-value: {p_val:.4f}")
print(f"- Statistically significant: {'Yes' if p_val < 0.05 else 'No'}")
print()
# Visualize results
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))
# Distribution of baseline scores by group
sns.histplot(data=rct_data, x='Baseline_Score', hue='Treatment',
alpha=0.7, ax=ax1)
ax1.set_title('Baseline Score Distribution\n(Randomization Balance)')
ax1.set_xlabel('Baseline Score')
ax1.legend(['Control', 'Treated'])
# Improvement by group
sns.boxplot(data=rct_data, x='Treatment', y='Improvement', ax=ax2)
ax2.set_title('Test Score Improvement by Group')
ax2.set_xticklabels(['Control', 'Treated'])
ax2.set_ylabel('Improvement (points)')
# Scatter plot: baseline vs improvement
sns.scatterplot(data=rct_data, x='Baseline_Score', y='Improvement',
hue='Treatment', alpha=0.7, ax=ax3)
ax3.set_title('Baseline Score vs Improvement')
ax3.set_xlabel('Baseline Score')
ax3.set_ylabel('Improvement (points)')
# Effect size over time (simulated multiple measurements)
# Simulate repeated measures
time_points = ['Pre-test', 'Mid-test', 'Post-test']
treated_scores = [75, 78, 83] # With tutoring effect accumulating
control_scores = [75, 76, 78] # Natural improvement only
ax4.plot(time_points, treated_scores, 'o-', linewidth=3, label='Treated', markersize=8)
ax4.plot(time_points, control_scores, 's-', linewidth=3, label='Control', markersize=8)
ax4.set_title('Learning Trajectories')
ax4.set_ylabel('Test Score')
ax4.legend()
ax4.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("RCT Key Insights:")
print("1. Randomization creates balanced groups")
print("2. Simple comparison gives unbiased causal estimate")
print("3. Effect is precisely estimated and statistically significant")
print("4. Results are interpretable: tutoring improves scores by ~8 points")
print("5. Confidence intervals quantify uncertainty")
return rct_data
# Simulate a complete RCT
rct_results = simulate_rct()
Power Analysis & Sample Size
Power Analysis: Determines the sample size needed to detect an effect of a given size with sufficient statistical power.
Key Concepts
Effect Size: Standardized measure of treatment impact
Power (1-β): Probability of detecting a true effect
Significance Level (α): Probability of a false positive
Sample Size (n): Number of subjects needed
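Cohen's d, the standardized effect size used throughout this section, is just the mean difference divided by the pooled standard deviation. A minimal helper (the function name is ours, not a library API):

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d for two independent samples: mean difference over pooled SD."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

# Example: two groups whose means differ by half a standard deviation
rng = np.random.default_rng(0)
treated = rng.normal(0.5, 1, 5_000)
control = rng.normal(0.0, 1, 5_000)
print(f"Cohen's d: {cohens_d(treated, control):.2f}")  # roughly 0.5, a "medium" effect
```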
Power Analysis Formula
For a two-sample comparison of means (two-sided test), the required sample size per group is:
\(n = \dfrac{2(z_{1-\alpha/2} + z_{1-\beta})^2 \sigma^2}{(d\sigma)^2} = \dfrac{2(z_{1-\alpha/2} + z_{1-\beta})^2}{d^2}\)
Where:
\(z_{1-\alpha/2}\): Critical value for the significance level
\(z_{1-\beta}\): Critical value for power
\(\sigma\): Standard deviation of the outcome
\(d\): Effect size (Cohen's d)
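Plugging standard normal critical values into this formula closely reproduces what statsmodels' TTestIndPower computes; the small discrepancy comes from the normal approximation to the t distribution. A quick sketch of the check (the helper name is ours):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

def n_per_group_normal(d, alpha=0.05, power=0.8):
    """Closed-form per-group sample size from the normal approximation."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # two-sided critical value
    z_beta = stats.norm.ppf(power)
    return 2 * (z_alpha + z_beta) ** 2 / d ** 2

for d in (0.2, 0.5, 0.8):
    approx = n_per_group_normal(d)
    exact = TTestIndPower().solve_power(effect_size=d, power=0.8, alpha=0.05)
    print(f"d={d}: normal approx n={approx:.1f}, exact t-test n={exact:.1f}")
```

For d = 0.5 the closed form gives about 63 per group, which matches the "properly powered" study size used later in this notebook.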
def power_analysis_demo():
"""Demonstrate power analysis for sample size determination"""
print("=== Power Analysis & Sample Size Determination ===\n")
# Example: Testing a new drug for blood pressure reduction
print("Example: Drug Trial for Blood Pressure Reduction")
print("-" * 45)
# Study parameters
effect_size = 0.5 # Cohen's d (moderate effect)
alpha = 0.05 # Significance level
power = 0.8 # Desired power (80%)
sigma = 10 # Standard deviation of blood pressure
# Calculate required sample size
z_alpha = stats.norm.ppf(1 - alpha/2) # Two-tailed test
z_beta = stats.norm.ppf(power)
# Per-group n for a two-sample test: 2 * (z_alpha + z_beta)^2 * sigma^2 / (d * sigma)^2
n_per_group = 2 * (z_alpha + z_beta) ** 2 / effect_size ** 2
n_per_group = int(np.ceil(n_per_group))
total_n = 2 * n_per_group
print(f"Study Parameters:")
print(f"- Effect size (Cohen's d): {effect_size}")
print(f"- Significance level (α): {alpha}")
print(f"- Desired power (1-β): {power}")
print(f"- Outcome SD (σ): {sigma} mmHg")
print()
print(f"Sample Size Calculation:")
print(f"- Subjects per group: {n_per_group}")
print(f"- Total subjects needed: {total_n}")
print()
# Demonstrate power curves
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))
# Power vs sample size for different effect sizes
effect_sizes = [0.2, 0.5, 0.8] # Small, medium, large
sample_sizes = np.arange(10, 201, 10)
for es in effect_sizes:
powers = []
for n in sample_sizes:
analysis = TTestIndPower()
power_val = analysis.solve_power(effect_size=es, nobs1=n, alpha=alpha)
powers.append(power_val)
ax1.plot(sample_sizes, powers, 'o-', label=f'Effect size = {es}', markersize=4)
ax1.axhline(y=0.8, color='red', linestyle='--', alpha=0.7, label='Target power (80%)')
ax1.set_xlabel('Sample Size per Group')
ax1.set_ylabel('Statistical Power')
ax1.set_title('Power vs Sample Size\n(Different Effect Sizes)')
ax1.legend()
ax1.grid(True, alpha=0.3)
# Power vs effect size for fixed sample size
effect_sizes_range = np.linspace(0.1, 1.0, 50)
fixed_n = 50
powers_fixed_n = []
for es in effect_sizes_range:
analysis = TTestIndPower()
power_val = analysis.solve_power(effect_size=es, nobs1=fixed_n, alpha=alpha)
powers_fixed_n.append(power_val)
ax2.plot(effect_sizes_range, powers_fixed_n, 'b-', linewidth=2)
ax2.axhline(y=0.8, color='red', linestyle='--', alpha=0.7)
ax2.set_xlabel('Effect Size (Cohen\'s d)')
ax2.set_ylabel('Statistical Power')
ax2.set_title(f'Power vs Effect Size\n(n = {fixed_n} per group)')
ax2.grid(True, alpha=0.3)
# Required sample size vs effect size
required_n = []
for es in effect_sizes_range:
analysis = TTestIndPower()
n = analysis.solve_power(effect_size=es, power=power, alpha=alpha)
required_n.append(n)
ax3.plot(effect_sizes_range, required_n, 'g-', linewidth=2)
ax3.set_xlabel('Effect Size (Cohen\'s d)')
ax3.set_ylabel('Required Sample Size per Group')
ax3.set_title('Sample Size Requirements\n(Power = 80%, α = 0.05)')
ax3.grid(True, alpha=0.3)
# Simulate underpowered vs properly powered study
np.random.seed(42)
# Underpowered study (n=20 per group)
n_small = 20
control_small = np.random.normal(120, 10, n_small)
treated_small = np.random.normal(120 - effect_size * 10, 10, n_small) # 5 mmHg effect
# Properly powered study (n=64 per group)
n_large = 64
control_large = np.random.normal(120, 10, n_large)
treated_large = np.random.normal(120 - effect_size * 10, 10, n_large)
# Effect estimates
effect_small = np.mean(control_small) - np.mean(treated_small)
effect_large = np.mean(control_large) - np.mean(treated_large)
# Confidence intervals
se_small = np.sqrt(np.var(control_small)/n_small + np.var(treated_small)/n_small)
se_large = np.sqrt(np.var(control_large)/n_large + np.var(treated_large)/n_large)
ci_small = (effect_small - 1.96*se_small, effect_small + 1.96*se_small)
ci_large = (effect_large - 1.96*se_large, effect_large + 1.96*se_large)
# Plot
studies = ['Underpowered\n(n=20)', 'Properly Powered\n(n=64)']
effects = [effect_small, effect_large]
errors = [1.96*se_small, 1.96*se_large]
ax4.errorbar(range(len(studies)), effects, yerr=errors, fmt='o',
capsize=5, markersize=8, linewidth=2)
ax4.axhline(y=effect_size * 10, color='red', linestyle='--',
label=f'True effect ({effect_size*10} mmHg)')
ax4.set_xticks(range(len(studies)))
ax4.set_xticklabels(studies)
ax4.set_ylabel('Blood Pressure Reduction (mmHg)')
ax4.set_title('Effect Estimation\n(Confidence Intervals)')
ax4.legend()
ax4.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("Power Analysis Insights:")
print("1. Larger effect sizes require smaller samples")
print("2. Small effects (d=0.2) need ~400 subjects per group")
print("3. Underpowered studies have wide confidence intervals")
print("4. Properly powered studies give precise estimates")
print("5. Power analysis prevents wasted resources")
print()
# Interactive power calculator
print("Power Calculator Examples:")
print("-" * 25)
scenarios = [
{'effect': 0.2, 'power': 0.8, 'alpha': 0.05, 'desc': 'Small effect, standard parameters'},
{'effect': 0.5, 'power': 0.9, 'alpha': 0.01, 'desc': 'Medium effect, conservative test'},
{'effect': 0.8, 'power': 0.95, 'alpha': 0.05, 'desc': 'Large effect, high power'}
]
for scenario in scenarios:
analysis = TTestIndPower()
n = analysis.solve_power(
effect_size=scenario['effect'],
power=scenario['power'],
alpha=scenario['alpha']
)
print(f"{scenario['desc']}:")
print(f"  Effect size: {scenario['effect']}, Power: {scenario['power']}, α: {scenario['alpha']}")
print(f" Required n per group: {int(np.ceil(n))}")
print()
return {
'required_n': n_per_group,
'effect_sizes': effect_sizes,
'sample_sizes': sample_sizes
}
# Demonstrate power analysis
power_results = power_analysis_demo()
Blocking & Stratification
Blocking: Grouping similar experimental units together before randomization. Stratification: Ensuring equal representation across important subgroups.
Why Block/Stratify?
Reduce variance: More precise effect estimates
Balance covariates: Ensure groups are comparable
Increase power: Better control of confounding
Subgroup analysis: Effects in different populations
Blocking vs Covariate Adjustment
Blocking: Done before randomization
Covariate adjustment: Done during analysis
Both valid: Blocking often more effective
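Within-block randomization can be sketched in a few lines of pandas: shuffle each block, then send the first half to treatment. The helper name and toy data are illustrative:

```python
import numpy as np
import pandas as pd

def block_randomize(df, block_col, seed=0):
    """Assign treatment within each block so every block is (near) half treated."""
    rng = np.random.default_rng(seed)
    treatment = pd.Series(0, index=df.index)
    for _, idx in df.groupby(block_col).groups.items():
        idx = rng.permutation(np.asarray(idx))  # shuffle subjects within the block
        treatment[idx[: len(idx) // 2]] = 1     # first half of the shuffle -> treatment
    return treatment

# Toy data: 12 subjects in three blocks of unequal size
df = pd.DataFrame({"block": ["A"] * 6 + ["B"] * 4 + ["C"] * 2})
df["treatment"] = block_randomize(df, "block")
print(df.groupby("block")["treatment"].agg(["count", "sum"]))
```

Because the split happens inside each block, treatment counts are balanced per block by construction, not just in expectation.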
def blocking_stratification_demo():
"""Demonstrate blocking and stratification in experimental design"""
print("=== Blocking & Stratification in Experimental Design ===\n")
# Example: Educational intervention study
print("Example: Educational Intervention Study")
print("-" * 38)
n_students = 240
# Create student characteristics for blocking
school_type = np.random.choice(['Urban', 'Suburban', 'Rural'], n_students,
p=[0.4, 0.4, 0.2])
prior_achievement = np.random.choice(['Low', 'Medium', 'High'], n_students,
p=[0.3, 0.4, 0.3])
# Create blocks: combination of school type and achievement
blocks = [f"{school}_{achiev}" for school, achiev in zip(school_type, prior_achievement)]
# Count students per block
block_counts = pd.Series(blocks).value_counts().sort_index()
print("Students per Block:")
for block, count in block_counts.items():
print(f" {block}: {count} students")
print()
# Randomize within blocks
treatment_assignment = np.zeros(n_students, dtype=int)
for block in block_counts.index:
block_mask = np.array(blocks) == block
block_size = block_counts[block]
# Assign half to treatment within each block
treated_in_block = np.random.choice(block_size, block_size // 2, replace=False)
treatment_assignment[block_mask] = 0
block_indices = np.where(block_mask)[0]
treatment_assignment[block_indices[treated_in_block]] = 1
# Generate outcomes
# Base outcome depends on characteristics
base_score = (
(school_type == 'Urban') * 70 +
(school_type == 'Suburban') * 75 +
(school_type == 'Rural') * 72 +
(prior_achievement == 'Low') * 0 +
(prior_achievement == 'Medium') * 5 +
(prior_achievement == 'High') * 10
)
# Treatment effect
treatment_effect = 8 * treatment_assignment
# Final score
final_score = base_score + treatment_effect + np.random.normal(0, 5, n_students)
# Create DataFrame
blocked_data = pd.DataFrame({
'Student_ID': range(n_students),
'Block': blocks,
'School_Type': school_type,
'Prior_Achievement': prior_achievement,
'Treatment': treatment_assignment,
'Final_Score': final_score
})
print(f"Blocked RCT Results:")
print(f"- Total students: {n_students}")
print(f"- Treatment group: {treatment_assignment.sum()}")
print(f"- Control group: {n_students - treatment_assignment.sum()}")
print()
# Check balance within blocks
print("Balance Check by Block:")
balance_check = blocked_data.groupby('Block').agg({
'Treatment': ['count', 'sum', 'mean']
}).round(3)
balance_check.columns = ['Total', 'Treated', 'Treatment_Rate']
print(balance_check)
print()
# Compare blocked vs unblocked design
# Simulate unblocked design (simple randomization)
unblocked_treatment = np.random.binomial(1, 0.5, n_students)
unblocked_scores = base_score + 8 * unblocked_treatment + np.random.normal(0, 5, n_students)
# Calculate effects
blocked_effect = (blocked_data[blocked_data['Treatment'] == 1]['Final_Score'].mean() -
blocked_data[blocked_data['Treatment'] == 0]['Final_Score'].mean())
unblocked_effect = (unblocked_scores[unblocked_treatment == 1].mean() -
unblocked_scores[unblocked_treatment == 0].mean())
# Calculate standard errors
blocked_se = np.sqrt(
blocked_data[blocked_data['Treatment'] == 1]['Final_Score'].var() / treatment_assignment.sum() +
blocked_data[blocked_data['Treatment'] == 0]['Final_Score'].var() / (n_students - treatment_assignment.sum())
)
unblocked_se = np.sqrt(
unblocked_scores[unblocked_treatment == 1].var() / unblocked_treatment.sum() +
unblocked_scores[unblocked_treatment == 0].var() / (n_students - unblocked_treatment.sum())
)
print("Effect Estimation Comparison:")
print(f"Blocked design: {blocked_effect:.2f} Β± {1.96*blocked_se:.2f} (SE = {blocked_se:.2f})")
print(f"Unblocked design: {unblocked_effect:.2f} Β± {1.96*unblocked_se:.2f} (SE = {unblocked_se:.2f})")
print(f"True effect: 8.00")
print()
# Visualize blocking benefits
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))
# Distribution by school type
school_balance = blocked_data.groupby(['School_Type', 'Treatment']).size().unstack()
school_balance.plot(kind='bar', ax=ax1)
ax1.set_title('Treatment Assignment by School Type\n(Blocked Design)')
ax1.set_ylabel('Number of Students')
ax1.legend(['Control', 'Treated'])
# Distribution by achievement level
achievement_balance = blocked_data.groupby(['Prior_Achievement', 'Treatment']).size().unstack()
achievement_balance.plot(kind='bar', ax=ax2)
ax2.set_title('Treatment Assignment by Achievement Level\n(Blocked Design)')
ax2.set_ylabel('Number of Students')
# Effect by subgroup
subgroup_effects = []
for school in ['Urban', 'Suburban', 'Rural']:
school_data = blocked_data[blocked_data['School_Type'] == school]
effect = (school_data[school_data['Treatment'] == 1]['Final_Score'].mean() -
school_data[school_data['Treatment'] == 0]['Final_Score'].mean())
subgroup_effects.append(effect)
ax3.bar(['Urban', 'Suburban', 'Rural'], subgroup_effects)
ax3.axhline(y=8, color='red', linestyle='--', label='True effect')
ax3.set_title('Treatment Effect by School Type')
ax3.set_ylabel('Effect Size')
ax3.legend()
# Precision comparison
designs = ['Unblocked', 'Blocked']
effects = [unblocked_effect, blocked_effect]
errors = [1.96*unblocked_se, 1.96*blocked_se]
ax4.errorbar(range(len(designs)), effects, yerr=errors, fmt='o',
capsize=8, markersize=10, linewidth=3)
ax4.axhline(y=8, color='red', linestyle='--', label='True effect')
ax4.set_xticks(range(len(designs)))
ax4.set_xticklabels(designs)
ax4.set_ylabel('Estimated Effect')
ax4.set_title('Precision Comparison\n(Blocked vs Unblocked)')
ax4.legend()
ax4.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("Blocking & Stratification Benefits:")
print("1. Ensures balance across important subgroups")
print("2. Reduces variance and increases precision")
print("3. Allows subgroup-specific effect estimation")
print("4. More efficient than covariate adjustment alone")
print("5. Protects against chance imbalances")
print()
# Stratification example
print("Stratification Example:")
print("-" * 22)
# Ensure equal representation in each stratum
strata_counts = blocked_data.groupby(['School_Type', 'Treatment']).size()
print("Stratified counts:")
print(strata_counts)
print()
print("Stratification ensures:")
print("- Equal treatment allocation within each stratum")
print("- Representative sample across all subgroups")
print("- Ability to estimate effects within subgroups")
print("- Better generalizability of results")
return blocked_data
# Demonstrate blocking and stratification
blocked_results = blocking_stratification_demo()
Practical Challenges & Ethical Considerations
Common Experimental Challenges:
Non-compliance: Subjects don't follow the treatment protocol
Attrition: Subjects drop out of the study
Contamination: Control group gets exposed to treatment
Measurement error: Outcomes not measured accurately
External events: Confounding events during study
Ethical Considerations:
Informed consent: Subjects understand risks and benefits
Risk minimization: Treatments should be safe
Equitable access: Fair treatment assignment
Right to withdraw: Subjects can leave anytime
Data privacy: Protect participant information
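The non-compliance challenge, and why intent-to-treat (ITT) analysis handles it, can be illustrated with a toy simulation. The 70% compliance rate and 10-point effect are assumptions for this sketch, not the values used in the demo below:

```python
import numpy as np

rng = np.random.default_rng(1)
n, true_effect, compliance = 100_000, 10.0, 0.7

assigned = rng.binomial(1, 0.5, n)                # randomized assignment
took = assigned * rng.binomial(1, compliance, n)  # only compliers actually treated
outcome = 50 + true_effect * took + rng.normal(0, 5, n)

# Intent-to-treat: compare by assignment; estimates compliance * true_effect,
# a diluted but unbiased-by-design contrast
itt = outcome[assigned == 1].mean() - outcome[assigned == 0].mean()
# Comparison by treatment received (can be biased when compliance is selective)
as_treated = outcome[took == 1].mean() - outcome[took == 0].mean()
print(f"ITT estimate: {itt:.2f} (expect ~{compliance * true_effect:.1f})")
print(f"As-treated estimate: {as_treated:.2f}")
```

Here compliance is random, so the as-treated contrast happens to be fine; the danger arises when who complies depends on their potential outcomes, which ITT is immune to.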
def experimental_challenges_demo():
"""Demonstrate common experimental challenges and solutions"""
print("=== Practical Challenges in Experimental Design ===\n")
# Example: Online education platform A/B test
print("Example: Online Education Platform A/B Test")
print("-" * 42)
n_users = 10000
# Random assignment
treatment = np.random.binomial(1, 0.5, n_users)
# Simulate various challenges
# 1. Non-compliance: Some treated users don't use the feature
compliance_rate = 0.7 # 70% of treated users actually use it
actual_treatment = treatment * np.random.binomial(1, compliance_rate, n_users)
# 2. Attrition: Some users drop out
attrition_rate = 0.1 # 10% drop out
dropout = np.random.binomial(1, attrition_rate, n_users)
observed = 1 - dropout
# 3. Contamination: Some control users find the feature
contamination_rate = 0.05 # 5% of controls get contaminated
contamination = (1 - treatment) * np.random.binomial(1, contamination_rate, n_users)
actual_treatment = np.maximum(actual_treatment, contamination)
# 4. Measurement error
true_engagement = 50 + 15 * actual_treatment + np.random.normal(0, 10, n_users)
measured_engagement = true_engagement + np.random.normal(0, 5, n_users) # Measurement error
# 5. External events (holidays affect engagement)
holiday_effect = np.random.choice([0, 10], n_users, p=[0.9, 0.1]) # 10% during holidays
final_engagement = measured_engagement + holiday_effect
# Create DataFrame
challenge_data = pd.DataFrame({
'User_ID': range(n_users),
'Assigned_Treatment': treatment,
'Actual_Treatment': actual_treatment,
'Observed': observed,
'Contamination': contamination,
'True_Engagement': true_engagement,
'Measured_Engagement': measured_engagement,
'Final_Engagement': final_engagement,
'Holiday_Effect': holiday_effect
})
# Only analyze observed users
observed_data = challenge_data[challenge_data['Observed'] == 1]
print(f"Study Summary:")
print(f"- Total users: {n_users}")
print(f"- Assigned to treatment: {treatment.sum()}")
print(f"- Actually received treatment: {actual_treatment.sum()}")
print(f"- Observed (no attrition): {observed.sum()}")
print(f"- Contaminated controls: {contamination.sum()}")
print()
# Different analysis approaches
# Intent-to-Treat (ITT): Analyze as assigned
itt_effect = (observed_data[observed_data['Assigned_Treatment'] == 1]['Final_Engagement'].mean() -
observed_data[observed_data['Assigned_Treatment'] == 0]['Final_Engagement'].mean())
# Per-Protocol: Analyze actual treatment received
pp_effect = (observed_data[observed_data['Actual_Treatment'] == 1]['Final_Engagement'].mean() -
observed_data[observed_data['Actual_Treatment'] == 0]['Final_Engagement'].mean())
# As-Treated: Include contamination
at_effect = (observed_data[observed_data['Actual_Treatment'] == 1]['Final_Engagement'].mean() -
observed_data[(observed_data['Actual_Treatment'] == 0) & (observed_data['Contamination'] == 0)]['Final_Engagement'].mean())
print("Effect Estimates Under Different Analysis Approaches:")
print(f"Intent-to-Treat (ITT): {itt_effect:.2f} (conservative, includes non-compliance)")
print(f"Per-Protocol (PP): {pp_effect:.2f} (excludes non-compliant, may be biased)")
print(f"As-Treated (AT): {at_effect:.2f} (includes contamination, realistic)")
print(f"True effect: 15.00")
print()
# Statistical tests
treated_itt = observed_data[observed_data['Assigned_Treatment'] == 1]['Final_Engagement']
control_itt = observed_data[observed_data['Assigned_Treatment'] == 0]['Final_Engagement']
t_stat, p_val = stats.ttest_ind(treated_itt, control_itt)
print(f"Statistical Test (ITT):")
print(f"- T-statistic: {t_stat:.3f}")
print(f"- P-value: {p_val:.4f}")
print(f"- Significant: {'Yes' if p_val < 0.05 else 'No'}")
print()
# Visualize challenges
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))
# Non-compliance
compliance_by_assignment = challenge_data.groupby('Assigned_Treatment')['Actual_Treatment'].mean()
compliance_by_assignment.plot(kind='bar', ax=ax1, color=['skyblue', 'lightcoral'])
ax1.set_title('Non-Compliance\n(Assigned vs Actual Treatment)')
ax1.set_xlabel('Assigned Treatment')
ax1.set_ylabel('Actual Treatment Rate')
ax1.set_xticklabels(['Control', 'Treatment'], rotation=0)
# Attrition
attrition_by_assignment = 1 - challenge_data.groupby('Assigned_Treatment')['Observed'].mean()
attrition_by_assignment.plot(kind='bar', ax=ax2, color=['skyblue', 'lightcoral'])
ax2.set_title('Attrition Rates\n(Differential Dropout)')
ax2.set_xlabel('Assigned Treatment')
ax2.set_ylabel('Attrition Rate')
ax2.set_xticklabels(['Control', 'Treatment'], rotation=0)
# Contamination
contamination_rate = challenge_data[challenge_data['Assigned_Treatment'] == 0]['Contamination'].mean()
ax3.bar(['Control Group'], [contamination_rate], color='orange')
ax3.set_title('Contamination\n(Control Group Exposure)')
ax3.set_ylabel('Contamination Rate')
ax3.set_ylim(0, 0.1)
# Effect estimates comparison
methods = ['ITT', 'Per-Protocol', 'As-Treated']
effects = [itt_effect, pp_effect, at_effect]
ax4.bar(methods, effects, color=['blue', 'green', 'red'])
ax4.axhline(y=15, color='black', linestyle='--', label='True effect')
ax4.set_title('Effect Estimates\n(Different Analysis Methods)')
ax4.set_ylabel('Estimated Effect')
ax4.legend()
plt.tight_layout()
plt.show()
print("Solutions to Experimental Challenges:")
print("1. Intent-to-Treat: Analyze as randomized (gold standard)")
print("2. Monitor compliance: Track actual treatment receipt")
print("3. Minimize attrition: Keep study engaging and worthwhile")
print("4. Prevent contamination: Use separate systems/platforms")
print("5. Robust measurement: Use validated, reliable measures")
print("6. Control external factors: Block on time, use covariates")
print()
# Ethical considerations
print("Ethical Considerations:")
print("-" * 23)
ethical_points = [
"Informed Consent: Clear explanation of study purpose and risks",
"Risk-Benefit Analysis: Benefits must outweigh potential harms",
"Equitable Selection: Fair inclusion criteria, no discrimination",
"Right to Withdraw: Participants can leave at any time",
"Data Privacy: Secure storage and anonymized analysis",
"Transparency: Clear reporting of methods and results",
"Independent Review: Ethics committee approval required"
]
for i, point in enumerate(ethical_points, 1):
print(f"{i}. {point}")
return challenge_data
# Demonstrate experimental challenges
challenge_results = experimental_challenges_demo()
Key Takeaways
1. RCTs are the Gold Standard
Randomization balances both observed and unobserved confounders
Simple analysis gives unbiased causal estimates
Results are directly interpretable
2. Power Analysis is Essential
Determines sample size needed for reliable results
Prevents underpowered (wasteful) or overpowered studies
Consider effect size, significance level, and desired power
3. Blocking & Stratification Improve Precision
Ensure balance across important subgroups
Reduce variance and increase statistical power
Enable subgroup-specific analyses
4. Practical Challenges are Common
Non-compliance, attrition, contamination affect results
Intent-to-Treat analysis maintains randomization benefits
Careful monitoring and robust design mitigate issues
5. Ethics are Paramount
Informed consent and risk minimization required
Equitable treatment and data privacy essential
Independent ethical review protects participants
Critical Thinking Questions
When would you choose an RCT over observational methods?
How does non-compliance affect causal interpretation?
What's the difference between statistical and practical significance?
How can you minimize attrition in a long-term study?
What ethical issues arise in A/B testing of user interfaces?
Next Steps
Now that you understand experimental design, you're ready for:
Matching Methods: Propensity score matching for observational data
Instrumental Variables: Natural experiments and exogenous variation
Regression Discontinuity: Quasi-experimental designs
Difference-in-Differences: Before-after comparisons
Remember: Good experimental design is the foundation of trustworthy causal inference. Always prioritize randomization, power, and ethics!
"The most important thing in an experiment is to know what you are trying to find out." - Ludwig Wittgenstein