Chapter 13: Multiple Testing¶
The Multiple Testing Problem¶
Single Test¶
Conduct 1 hypothesis test
Significance level: α = 0.05
False positive rate: 5%
Acceptable!
Multiple Tests¶
Conduct m = 1000 tests
Each at α = 0.05
Expected false positives: 1000 × 0.05 = 50!
Problem!
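The blow-up is easy to verify numerically: for m independent tests each at α = 0.05, the chance of at least one false positive is 1 − (1 − α)^m. A quick sketch, assuming independence:

```python
# FWER for m independent tests, each run at alpha = 0.05 (assumes independence)
alpha = 0.05
for m in [1, 10, 50, 100]:
    fwer = 1 - (1 - alpha) ** m
    print(f"m = {m:>3}: P(at least one false positive) = {fwer:.3f}")
```

Already at m = 100 the family-wise error rate is near certainty (about 0.994), which is why corrections are needed.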
Real-World Examples¶
Genomics:
Test 20,000 genes for association
At α = 0.05: expect 1,000 false positives
Medical Screening:
Test 100 biomarkers
At α = 0.05: expect 5 false positives
A/B Testing:
Test 50 website variants
At α = 0.05: expect 2-3 false positives
Key Concepts¶
Type I vs Type II Errors¶

| Truth \ Decision | Reject H₀ | Don't Reject H₀ |
|---|---|---|
| H₀ true | Type I Error (α) | Correct |
| H₀ false | Correct (Power) | Type II Error (β) |
Multiple Testing Terminology¶
Given m hypothesis tests:

| | H₀ True | H₀ False | Total |
|---|---|---|---|
| Reject H₀ | V (false positives) | S (true positives) | R |
| Don't Reject | U (true negatives) | T (false negatives) | m - R |
| Total | m₀ | m - m₀ | m |
Error Rates¶
Family-Wise Error Rate (FWER): \(\text{FWER} = P(V \geq 1)\), the probability of at least one false positive.
False Discovery Rate (FDR): \(\text{FDR} = E\left[\frac{V}{R}\right]\) (with \(V/R\) defined as 0 when \(R = 0\)), the expected proportion of false positives among rejections.
Methods Covered¶
Bonferroni Correction - Controls FWER (conservative)
Holm's Method - Controls FWER (less conservative)
Benjamini-Hochberg - Controls FDR (popular)
Benjamini-Yekutieli - Controls FDR (handles dependencies)
13.1 Family-Wise Error Rate (FWER)¶
Bonferroni Correction¶
Method: Test each hypothesis at level \(\alpha/m\)
Guarantees: \(\text{FWER} \leq \alpha\)
Proof (union bound): \(\text{FWER} = P(V \geq 1) \leq \sum_{i=1}^m P(\text{Type I error}_i) = m \cdot \frac{\alpha}{m} = \alpha\)
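In code, the correction is a one-liner: compare each p-value to α/m. A minimal sketch with hypothetical p-values:

```python
# Bonferroni: reject only if p <= alpha / m (p-values are made up for illustration)
p_values = [0.001, 0.012, 0.03, 0.04, 0.2]
m = len(p_values)
alpha = 0.05
reject = [p <= alpha / m for p in p_values]  # threshold = 0.05 / 5 = 0.01
print(reject)  # only the smallest p-value survives
```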
Properties¶
✅ Simple to implement
✅ Guarantees FWER control
✅ Works for any dependency structure
❌ Very conservative (low power)
❌ Loses power as m increases
When to Use¶
Small number of tests (m < 20)
Need strong FWER control
Can't afford any false positives
Holm's Step-Down Method¶
Less conservative than Bonferroni:
Order p-values: \(p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}\)
For \(i = 1, 2, \ldots, m\):
If \(p_{(i)} \leq \frac{\alpha}{m - i + 1}\), reject and continue
Otherwise, stop and don't reject the remaining hypotheses
Uniformly more powerful than Bonferroni while maintaining FWER ≤ α
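The step-down rule can be sketched in a few lines (a minimal illustration; in practice `multipletests(..., method='holm')` does this for you):

```python
import numpy as np

def holm_reject(p_values, alpha=0.05):
    """Holm step-down: compare the i-th smallest p-value to alpha/(m - i + 1)."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)               # indices from smallest to largest p
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):  # rank 0 corresponds to i = 1
        if p[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break                       # first failure: stop, keep the rest
    return reject

# Hypothetical p-values: Holm rejects 0.012 (threshold 0.0125),
# which Bonferroni's fixed alpha/m = 0.01 would miss
print(holm_reject([0.001, 0.012, 0.03, 0.04, 0.2]))
```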
# Demonstration of the Multiple Testing Problem
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.stats.multitest import multipletests

np.random.seed(42)

# Simulate m tests where ALL null hypotheses are true
m = 100
n = 50  # sample size per test
alpha = 0.05

# Generate data: all from standard normal (H0 is true)
p_values = []
for i in range(m):
    # Two groups from the same distribution
    group1 = np.random.randn(n)
    group2 = np.random.randn(n)
    t_stat, p_val = stats.ttest_ind(group1, group2)
    p_values.append(p_val)
p_values = np.array(p_values)
# Apply different corrections
reject_uncorrected = p_values < alpha
reject_bonferroni = p_values < (alpha / m)
reject_holm, _, _, _ = multipletests(p_values, alpha=alpha, method='holm')
# Results
print("📊 Multiple Testing Demonstration (m=100 tests, ALL H₀ TRUE)")
print("="*70)
print(f"\n1. Uncorrected (α = {alpha}):")
print(f" False positives: {reject_uncorrected.sum()}")
print(f" Rate: {reject_uncorrected.sum()/m*100:.1f}% (expected ~5%)")
print(f"\n2. Bonferroni (α/m = {alpha/m:.6f}):")
print(f" False positives: {reject_bonferroni.sum()}")
print(f" Rate: {reject_bonferroni.sum()/m*100:.1f}%")
print(f"\n3. Holm's Method:")
print(f" False positives: {reject_holm.sum()}")
print(f" Rate: {reject_holm.sum()/m*100:.1f}%")
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# P-value histogram
axes[0].hist(p_values, bins=20, alpha=0.7, edgecolor='black')
axes[0].axhline(m/20, color='red', linestyle='--', label='Expected (uniform)')
axes[0].axvline(alpha, color='green', linestyle='--', label=f'α = {alpha}')
axes[0].axvline(alpha/m, color='orange', linestyle='--', label='Bonferroni: α/m')
axes[0].set_xlabel('P-value')
axes[0].set_ylabel('Frequency')
axes[0].set_title('P-value Distribution (All H₀ True)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Comparison
methods = ['Uncorrected', 'Bonferroni', 'Holm']
false_pos = [reject_uncorrected.sum(), reject_bonferroni.sum(), reject_holm.sum()]
colors = ['red', 'green', 'blue']
axes[1].bar(methods, false_pos, color=colors, alpha=0.7, edgecolor='black')
axes[1].axhline(alpha * m, color='red', linestyle='--',
label=f'Expected (uncorrected): {alpha*m:.0f}')
axes[1].set_ylabel('Number of False Positives')
axes[1].set_title('False Positives by Method')
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
print("\n💡 Key Insight:")
print(" Without correction: ~5% false positive rate (as expected)")
print(" With Bonferroni: Strong control, but conservative")
print(" With Holm: Better power than Bonferroni, same FWER guarantee")
13.2 False Discovery Rate (FDR)¶
Motivation¶
FWER control is too conservative when:
m is large (genomics: m = 20,000)
Can tolerate some false positives
Care about proportion, not absolute number
FDR Definition¶
\(\text{FDR} = E\left[\frac{V}{R}\right]\), with \(V/R\) defined as 0 when \(R = 0\).
Interpretation: Among all discoveries, what fraction are false?
Benjamini-Hochberg (BH) Procedure¶
Controls FDR at level q:
Order p-values: \(p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}\)
Find the largest \(i\) such that: \(p_{(i)} \leq \frac{i \cdot q}{m}\)
Reject hypotheses \((1), (2), \ldots, (i)\)
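The step-up search can be sketched directly (a minimal illustration; `multipletests(..., method='fdr_bh')` is the production route):

```python
import numpy as np

def bh_reject(p_values, q=0.05):
    """Benjamini-Hochberg step-up: find the largest i with p_(i) <= i*q/m."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    passed = p[order] <= np.arange(1, m + 1) * q / m  # compare to the rising line
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()   # largest rank below the line
        reject[order[:k + 1]] = True      # reject everything at or below it
    return reject

# Hypothetical p-values: the rising threshold lets four of five through
print(bh_reject([0.001, 0.012, 0.03, 0.04, 0.2]))
```

Note that the BH rule rejects all hypotheses up to the largest passing rank, even if some intermediate p-value sits above its own threshold.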
BH vs Bonferroni¶
Bonferroni threshold: \(\alpha/m\) (constant)
BH threshold: \(i \cdot q/m\) (increases with rank)
More lenient for lower p-values
Higher power
Properties¶
✅ More powerful than FWER methods
✅ Scalable to large m
✅ Intuitive interpretation
✅ Default in genomics/biostatistics
⚠️ Assumes independence or positive dependence
Benjamini-Yekutieli (BY)¶
More conservative, works under any dependency: \(p_{(i)} \leq \frac{i \cdot q}{m \cdot c(m)}\)
where \(c(m) = \sum_{i=1}^m \frac{1}{i} \approx \log(m)\)
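The penalty factor c(m) grows slowly but is not negligible; a quick check for m = 1000 (illustrative only):

```python
import numpy as np

m = 1000
c_m = (1.0 / np.arange(1, m + 1)).sum()  # harmonic number H_m
print(f"c({m}) = {c_m:.3f}")             # close to log(m) + 0.577 (Euler-Mascheroni)
```

So for m = 1000 the BY thresholds are roughly 7.5 times stricter than the corresponding BH thresholds.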
# FDR Demonstration with Mix of True and False Nulls
np.random.seed(42)
m = 1000
m0 = 900 # 900 true nulls
m1 = 100 # 100 false nulls (true effects)
n = 30
effect_size = 0.8 # Cohen's d
p_values = []
true_nulls = []
# True nulls (no effect)
for i in range(m0):
    group1 = np.random.randn(n)
    group2 = np.random.randn(n)
    _, p_val = stats.ttest_ind(group1, group2)
    p_values.append(p_val)
    true_nulls.append(True)
# False nulls (true effect)
for i in range(m1):
    group1 = np.random.randn(n)
    group2 = np.random.randn(n) + effect_size
    _, p_val = stats.ttest_ind(group1, group2)
    p_values.append(p_val)
    true_nulls.append(False)
p_values = np.array(p_values)
true_nulls = np.array(true_nulls)
# Apply different methods
alpha = 0.05
reject_bonf, _, _, _ = multipletests(p_values, alpha=alpha, method='bonferroni')
reject_holm, _, _, _ = multipletests(p_values, alpha=alpha, method='holm')
reject_bh, _, _, _ = multipletests(p_values, alpha=alpha, method='fdr_bh')
reject_by, _, _, _ = multipletests(p_values, alpha=alpha, method='fdr_by')
# Calculate metrics
def calc_metrics(reject, true_nulls):
    tp = np.sum(reject & ~true_nulls)   # True positives
    fp = np.sum(reject & true_nulls)    # False positives
    fn = np.sum(~reject & ~true_nulls)  # False negatives
    tn = np.sum(~reject & true_nulls)   # True negatives
    total_discoveries = reject.sum()
    fdr = fp / max(total_discoveries, 1)
    power = tp / (tp + fn) if (tp + fn) > 0 else 0
    return {
        'discoveries': total_discoveries,
        'tp': tp,
        'fp': fp,
        'fdr': fdr,
        'power': power
    }
metrics = {
'Bonferroni': calc_metrics(reject_bonf, true_nulls),
'Holm': calc_metrics(reject_holm, true_nulls),
'BH (FDR)': calc_metrics(reject_bh, true_nulls),
'BY (FDR)': calc_metrics(reject_by, true_nulls)
}
# Print results
print("📊 Multiple Testing: Mixed Scenario")
print(f" Total tests: {m}")
print(f" True nulls: {m0} (90%)")
print(f" False nulls (true effects): {m1} (10%)")
print(f" Effect size: {effect_size} (Cohen's d)")
print("\n" + "="*80)
print(f"{'Method':<15} {'Discoveries':<12} {'TP':<8} {'FP':<8} {'FDR':<10} {'Power'}")
print("="*80)
for method, res in metrics.items():  # avoid shadowing m (the number of tests)
    print(f"{method:<15} {res['discoveries']:<12} {res['tp']:<8} {res['fp']:<8} "
          f"{res['fdr']:<10.3f} {res['power']:.3f}")
# Visualize
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# P-value histogram
axes[0, 0].hist(p_values[true_nulls], bins=50, alpha=0.5, label='True nulls', edgecolor='black')
axes[0, 0].hist(p_values[~true_nulls], bins=50, alpha=0.5, label='False nulls', edgecolor='black')
axes[0, 0].axvline(alpha, color='red', linestyle='--', label=f'α = {alpha}')
axes[0, 0].set_xlabel('P-value')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('P-value Distribution by True Status')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
# Discoveries
methods_list = list(metrics.keys())
discoveries = [metrics[m]['discoveries'] for m in methods_list]
axes[0, 1].bar(methods_list, discoveries, alpha=0.7, edgecolor='black')
axes[0, 1].axhline(m1, color='red', linestyle='--', label=f'True positives available: {m1}')
axes[0, 1].set_ylabel('Number of Discoveries')
axes[0, 1].set_title('Total Discoveries by Method')
axes[0, 1].set_xticklabels(methods_list, rotation=15, ha='right')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3, axis='y')
# FDR
fdrs = [metrics[m]['fdr'] for m in methods_list]
axes[1, 0].bar(methods_list, fdrs, alpha=0.7, color='orange', edgecolor='black')
axes[1, 0].axhline(alpha, color='red', linestyle='--', label=f'Target: {alpha}')
axes[1, 0].set_ylabel('False Discovery Rate')
axes[1, 0].set_title('Observed FDR')
axes[1, 0].set_xticklabels(methods_list, rotation=15, ha='right')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3, axis='y')
# Power
powers = [metrics[m]['power'] for m in methods_list]
axes[1, 1].bar(methods_list, powers, alpha=0.7, color='green', edgecolor='black')
axes[1, 1].set_ylabel('Power')
axes[1, 1].set_title('Statistical Power (Sensitivity)')
axes[1, 1].set_ylim([0, 1])
axes[1, 1].set_xticklabels(methods_list, rotation=15, ha='right')
axes[1, 1].grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
print("\n💡 Key Observations:")
print("   • Bonferroni/Holm: Conservative, low FDR but also low power")
print("   • BH (FDR): More discoveries, higher power, FDR controlled")
print("   • BY: More conservative than BH, handles dependencies")
print("   • Trade-off: FWER (few false positives) vs FDR (more power)")
13.3 Method Comparison and Selection¶
Visual Comparison of Thresholds¶
For m tests with ordered p-values \(p_{(1)} \leq \cdots \leq p_{(m)}\):
Bonferroni: Reject if \(p_{(i)} \leq \alpha/m\)
Holm: Reject if \(p_{(i)} \leq \alpha/(m-i+1)\)
BH (FDR): Reject if \(p_{(i)} \leq i \cdot q/m\)
# Visualize Different Thresholds
m = 100
alpha = 0.05
ranks = np.arange(1, m+1)
# Calculate thresholds
bonf_threshold = np.full(m, alpha/m)
holm_threshold = alpha / (m - ranks + 1)
bh_threshold = ranks * alpha / m
plt.figure(figsize=(12, 6))
plt.plot(ranks, bonf_threshold, 'r-', linewidth=2, label='Bonferroni')
plt.plot(ranks, holm_threshold, 'g-', linewidth=2, label='Holm')
plt.plot(ranks, bh_threshold, 'b-', linewidth=2, label='Benjamini-Hochberg')
plt.xlabel('Rank (i)')
plt.ylabel('Threshold')
plt.title(f'Rejection Thresholds by Method (m={m}, α={alpha})')
plt.legend()
plt.grid(True, alpha=0.3)
plt.ylim([0, max(bh_threshold) * 1.1])
plt.show()
print("📊 Threshold Comparison:")
print(f"   Bonferroni: Constant at {alpha/m:.6f}")
print(f"   Holm: Increases from {alpha/m:.6f} to {alpha:.6f}")
print(f"   BH: Increases from {alpha/m:.6f} to {alpha:.6f}")
print("\n💡 BH allows more discoveries (increasing threshold)")
print("   Holm is more powerful than Bonferroni (adaptive threshold)")
13.4 Practical Guidelines¶
When to Use Each Method¶
| Scenario | Method | Reason |
|---|---|---|
| Few tests (m < 20) | Bonferroni | Simple, conservative OK |
| Need no false positives | Bonferroni/Holm | Strong FWER control |
| Genomics (large m) | BH (FDR) | Power matters, some FP OK |
| Exploratory analysis | BH (FDR) | Maximize discoveries |
| Confirmatory study | Bonferroni/Holm | Strong control |
| Unknown dependencies | BY (FDR) | Safe under any structure |
| Clinical trials | Bonferroni | Regulatory requirement |
Choosing FWER vs FDR¶
Use FWER (Bonferroni/Holm) if:
Can't tolerate ANY false positives
Small number of tests
Confirmatory/regulatory setting
Follow-up is expensive
Use FDR (BH/BY) if:
Large number of tests (m > 100)
Exploratory analysis
Some false positives acceptable
Want to maximize true discoveries
Follow-up validation planned
Common Mistakes¶
❌ Not correcting at all when m is large
❌ Using Bonferroni for large m (too conservative)
❌ Confusing FWER and FDR
❌ Cherry-picking significant results
❌ Ignoring multiple testing in exploratory work
❌ Over-correcting (correcting twice)
Best Practices¶
✅ Pre-specify the correction method
✅ Report both raw and adjusted p-values
✅ State clearly what is being controlled (FWER or FDR)
✅ Use FDR for large-scale screening
✅ Use FWER for confirmatory tests
✅ Consider context (exploratory vs confirmatory)
Reporting¶
Good reporting includes:
Number of tests conducted (m)
Correction method used
Target error rate (Ξ± or q)
Both raw and adjusted p-values
Number of discoveries
Estimated FDR or FWER
# Comprehensive Simulation Study
# Vary: m (number of tests) and π₀ (proportion of true nulls)
def simulate_multiple_testing(m, m0, n=30, effect_size=0.8, n_sim=100):
    """Simulate multiple testing scenario"""
    results = {'bonf': [], 'holm': [], 'bh': [], 'by': []}
    for _ in range(n_sim):
        p_values = []
        true_nulls = []
        # True nulls
        for i in range(m0):
            g1 = np.random.randn(n)
            g2 = np.random.randn(n)
            _, p = stats.ttest_ind(g1, g2)
            p_values.append(p)
            true_nulls.append(True)
        # False nulls
        for i in range(m - m0):
            g1 = np.random.randn(n)
            g2 = np.random.randn(n) + effect_size
            _, p = stats.ttest_ind(g1, g2)
            p_values.append(p)
            true_nulls.append(False)
        p_values = np.array(p_values)
        true_nulls = np.array(true_nulls)
        # Apply corrections
        reject_bonf, _, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')
        reject_holm, _, _, _ = multipletests(p_values, alpha=0.05, method='holm')
        reject_bh, _, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')
        reject_by, _, _, _ = multipletests(p_values, alpha=0.05, method='fdr_by')
        # Store results
        for method, reject in [('bonf', reject_bonf), ('holm', reject_holm),
                               ('bh', reject_bh), ('by', reject_by)]:
            metrics = calc_metrics(reject, true_nulls)
            results[method].append(metrics)
    # Average across simulations
    avg_results = {}
    for method in results:
        avg_results[method] = {
            'power': np.mean([r['power'] for r in results[method]]),
            'fdr': np.mean([r['fdr'] for r in results[method]]),
            'discoveries': np.mean([r['discoveries'] for r in results[method]])
        }
    return avg_results
# Run simulations
print("Running simulations (this may take a minute)...\n")
scenarios = [
{'m': 100, 'm0': 90, 'label': 'm=100, 90% null'},
{'m': 100, 'm0': 50, 'label': 'm=100, 50% null'},
{'m': 1000, 'm0': 900, 'label': 'm=1000, 90% null'},
{'m': 1000, 'm0': 500, 'label': 'm=1000, 50% null'}
]
all_results = []
for scenario in scenarios:
    result = simulate_multiple_testing(scenario['m'], scenario['m0'], n_sim=50)
    result['scenario'] = scenario['label']
    all_results.append(result)
# Plot results
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
methods = ['bonf', 'holm', 'bh', 'by']
method_labels = ['Bonferroni', 'Holm', 'BH', 'BY']
x = np.arange(len(scenarios))
width = 0.2
for i, method in enumerate(methods):
    # Power
    powers = [r[method]['power'] for r in all_results]
    axes[0].bar(x + i*width, powers, width, label=method_labels[i], alpha=0.8)
    # FDR
    fdrs = [r[method]['fdr'] for r in all_results]
    axes[1].bar(x + i*width, fdrs, width, label=method_labels[i], alpha=0.8)
    # Discoveries
    disc = [r[method]['discoveries'] for r in all_results]
    axes[2].bar(x + i*width, disc, width, label=method_labels[i], alpha=0.8)
axes[0].set_ylabel('Power')
axes[0].set_title('Statistical Power')
axes[0].set_xticks(x + 1.5*width)
axes[0].set_xticklabels([s['label'] for s in scenarios], rotation=15, ha='right', fontsize=9)
axes[0].legend()
axes[0].grid(True, alpha=0.3, axis='y')
axes[1].set_ylabel('FDR')
axes[1].set_title('False Discovery Rate')
axes[1].axhline(0.05, color='red', linestyle='--', alpha=0.5, label='Target (0.05)')
axes[1].set_xticks(x + 1.5*width)
axes[1].set_xticklabels([s['label'] for s in scenarios], rotation=15, ha='right', fontsize=9)
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis='y')
axes[2].set_ylabel('Discoveries')
axes[2].set_title('Number of Discoveries')
axes[2].set_xticks(x + 1.5*width)
axes[2].set_xticklabels([s['label'] for s in scenarios], rotation=15, ha='right', fontsize=9)
axes[2].legend()
axes[2].grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
print("\n💡 Simulation Insights:")
print("   • FDR methods (BH, BY) have higher power")
print("   • Bonferroni becomes very conservative as m increases")
print("   • BH makes more discoveries while controlling FDR")
print("   • Trade-off between power and false positive control")
Key Takeaways¶
The Multiple Testing Problem¶
Core Issue:
Testing at α = 0.05 is fine for one test
With m tests, expect α × m false positives
Must adjust for multiple comparisons
Error Rate Control¶
FWER (Family-Wise Error Rate):
Probability of ≥ 1 false positive
Strong control
Conservative (low power)
FDR (False Discovery Rate):
Expected proportion of false positives
Less conservative
Higher power
Method Quick Reference¶
Bonferroni:
Test at α/m
Controls FWER
Very conservative
Use: Small m, confirmatory
Holm:
Step-down procedure
Controls FWER
More powerful than Bonferroni
Use: Small m, prefer over Bonferroni
Benjamini-Hochberg (BH):
Step-up procedure
Controls FDR
Good power
Use: Large m, exploratory, default choice
Benjamini-Yekutieli (BY):
Like BH but more conservative
Controls FDR under any dependency
Use: When dependencies unknown
Decision Tree¶
Multiple tests?
├─ No → Standard hypothesis test
└─ Yes → How many?
   ├─ Few (< 20) → Can tolerate FP?
   │  ├─ No → Bonferroni or Holm
   │  └─ Yes → BH (FDR)
   └─ Many (≥ 20) → Context?
      ├─ Confirmatory → Holm
      └─ Exploratory → BH (FDR) ← Most common
Python ImplementationΒΆ
from statsmodels.stats.multitest import multipletests
# Apply correction
reject, pvals_corrected, alphacSidak, alphacBonf = multipletests(
pvals,
alpha=0.05,
method='fdr_bh' # or 'bonferroni', 'holm', 'fdr_by'
)
# reject: boolean array of rejections
# pvals_corrected: adjusted p-values
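A quick run on five hypothetical p-values (made up for illustration) shows how the corrections diverge:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.001, 0.012, 0.03, 0.04, 0.2])  # hypothetical p-values

n_reject = {}
for method in ['bonferroni', 'holm', 'fdr_bh']:
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    n_reject[method] = int(reject.sum())
    print(f"{method:<11} rejects {int(reject.sum())}, adjusted p: {np.round(p_adj, 3)}")
```

On these values Bonferroni rejects 1, Holm 2, and BH 4: the same ordering of power the chapter demonstrates at scale.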
Reporting Example¶
Good:
"We conducted 1,000 hypothesis tests. To control the false discovery rate at 5%, we applied the Benjamini-Hochberg procedure. This yielded 47 significant associations (adjusted p < 0.05)."
Bad:
"We found 200 significant results (p < 0.05)." ← No mention of correction!
Common Applications¶
Genomics:
Test 20,000 genes
Use BH (FDR)
Typical q = 0.05 or 0.10
Neuroimaging:
Test 100,000+ voxels
Use cluster-based corrections or FDR
A/B Testing:
Multiple metrics
Multiple segments
Use BH or Bonferroni depending on context
Clinical Trials:
Multiple endpoints
Multiple comparisons
Often use Bonferroni (regulatory)
Series Complete! 🎉¶
Chapters 2-13 of ISLP covered:
Statistical Learning
Linear Models
Classification
Resampling
Model Selection
Non-linearity
Tree Methods
Support Vector Machines
Deep Learning
Survival Analysis
Unsupervised Learning
Multiple Testing ✓
You now have a comprehensive foundation in statistical learning!
Practice Exercises¶
Exercise 1: Power Analysis¶
Simulate scenario with m = 100 tests:
Vary effect size: 0.2, 0.5, 0.8
Apply Bonferroni, Holm, BH
Plot power vs effect size for each method
Find minimum effect size for 80% power
Exercise 2: FDR Validation¶
Verify BH controls FDR:
Simulate 1000 experiments, each with m = 100 tests
90% true nulls, 10% false nulls
Apply BH at q = 0.05
Calculate observed FDR for each experiment
Show that the mean FDR ≤ 0.05
Exercise 3: Real Data¶
Gene expression dataset:
Test all genes for differential expression
Apply no correction, Bonferroni, BH
Create volcano plot
Compare number of discoveries
Report findings properly
Exercise 4: Sample Size¶
For fixed m = 50 tests:
Vary n (sample size): 10, 30, 50, 100
Fixed effect size: 0.5
Apply BH procedure
Plot power vs sample size
Determine required n for 80% power
Exercise 5: Method Comparison¶
Comprehensive comparison:
Create a 2×2 grid: m ∈ {100, 1000}, π₀ ∈ {0.5, 0.9}
For each scenario, compare all 4 methods
Report: power, FDR, discoveries
Recommend best method for each scenario
Exercise 6: Dependent Tests¶
Simulate correlated p-values:
Generate multivariate normal data with correlation
Test multiple correlated hypotheses
Compare BH vs BY
Verify BY is more conservative
Assess impact of correlation on FDR control