Chapter 13: Multiple Testing¶
The Multiple Testing Problem¶
Single Test¶
Conduct 1 hypothesis test
Significance level: α = 0.05
False positive rate: 5%
Acceptable!
Multiple Tests¶
Conduct m = 1000 tests
Each at α = 0.05
Expected false positives: 1000 × 0.05 = 50!
Problem!
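The blow-up is easy to verify numerically: for m independent tests each at α = 0.05, the chance of at least one false positive is 1 − (1 − α)^m. A quick sketch, assuming independence:

```python
# FWER for m independent tests, each run at alpha = 0.05 (assumes independence)
alpha = 0.05
for m in [1, 10, 50, 100]:
    fwer = 1 - (1 - alpha) ** m
    print(f"m = {m:>3}: P(at least one false positive) = {fwer:.3f}")
```

Already at m = 100 the family-wise error rate is near certainty (about 0.994), which is why corrections are needed.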
Real-World Examples¶
Genomics:
Test 20,000 genes for association
At α = 0.05: expect 1,000 false positives
Medical Screening:
Test 100 biomarkers
At α = 0.05: expect 5 false positives
A/B Testing:
Test 50 website variants
At α = 0.05: expect 2-3 false positives
Key Concepts¶
Type I vs Type II Errors¶

| Truth \ Decision | Reject H₀ | Don't Reject H₀ |
|---|---|---|
| H₀ true | Type I Error (α) | Correct |
| H₀ false | Correct (Power) | Type II Error (β) |
Multiple Testing Terminology¶
Given m hypothesis tests:

| | H₀ True | H₀ False | Total |
|---|---|---|---|
| Reject H₀ | V (false positives) | S (true positives) | R |
| Don't Reject | U (true negatives) | T (false negatives) | m - R |
| Total | m₀ | m - m₀ | m |
Error Rates¶
Family-Wise Error Rate (FWER): \(\text{FWER} = P(V \geq 1)\), the probability of at least one false positive.
False Discovery Rate (FDR): \(\text{FDR} = E\left[\frac{V}{R}\right]\) (with \(V/R\) defined as 0 when \(R = 0\)), the expected proportion of false positives among rejections.
Methods Covered¶
Bonferroni Correction - Controls FWER (conservative)
Holm's Method - Controls FWER (less conservative)
Benjamini-Hochberg - Controls FDR (popular)
Benjamini-Yekutieli - Controls FDR (handles dependencies)
13.1 Family-Wise Error Rate (FWER)¶
Bonferroni Correction¶
Method: Test each hypothesis at level \(\alpha/m\)
Guarantees: \(\text{FWER} \leq \alpha\)
Proof (union bound): \(\text{FWER} = P(V \geq 1) \leq \sum_{i=1}^m P(\text{Type I error}_i) = m \cdot \frac{\alpha}{m} = \alpha\)
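In code, the correction is a one-liner: compare each p-value to α/m. A minimal sketch with hypothetical p-values:

```python
# Bonferroni: reject only if p <= alpha / m (p-values are made up for illustration)
p_values = [0.001, 0.012, 0.03, 0.04, 0.2]
m = len(p_values)
alpha = 0.05
reject = [p <= alpha / m for p in p_values]  # threshold = 0.05 / 5 = 0.01
print(reject)  # only the smallest p-value survives
```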
Properties¶
✅ Simple to implement
✅ Guarantees FWER control
✅ Works for any dependency structure
❌ Very conservative (low power)
❌ Loses power as m increases
When to Use¶
Small number of tests (m < 20)
Need strong FWER control
Can't afford any false positives
Holm's Step-Down Method¶
Less conservative than Bonferroni:
Order p-values: \(p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}\)
For \(i = 1, 2, \ldots, m\):
If \(p_{(i)} \leq \frac{\alpha}{m - i + 1}\), reject and continue
Otherwise, stop and don't reject the remaining hypotheses
Uniformly more powerful than Bonferroni while maintaining FWER ≤ α
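The step-down rule can be sketched in a few lines (a minimal illustration; in practice `multipletests(..., method='holm')` does this for you):

```python
import numpy as np

def holm_reject(p_values, alpha=0.05):
    """Holm step-down: compare the i-th smallest p-value to alpha/(m - i + 1)."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)               # indices from smallest to largest p
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):  # rank 0 corresponds to i = 1
        if p[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break                       # first failure: stop, keep the rest
    return reject

# Hypothetical p-values: Holm rejects 0.012 (threshold 0.0125),
# which Bonferroni's fixed alpha/m = 0.01 would miss
print(holm_reject([0.001, 0.012, 0.03, 0.04, 0.2]))
```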
# Demonstration of the Multiple Testing Problem
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.stats.multitest import multipletests

np.random.seed(42)

# Simulate m tests where ALL null hypotheses are true
m = 100
n = 50  # sample size per test
alpha = 0.05

# Generate data: all from standard normal (H0 is true)
p_values = []
for i in range(m):
    # Two groups from the same distribution
    group1 = np.random.randn(n)
    group2 = np.random.randn(n)
    t_stat, p_val = stats.ttest_ind(group1, group2)
    p_values.append(p_val)
p_values = np.array(p_values)
# Apply different corrections
reject_uncorrected = p_values < alpha
reject_bonferroni = p_values < (alpha / m)
reject_holm, _, _, _ = multipletests(p_values, alpha=alpha, method='holm')
# Results
print("📊 Multiple Testing Demonstration (m=100 tests, ALL H₀ TRUE)")
print("="*70)
print(f"\n1. Uncorrected (α = {alpha}):")
print(f" False positives: {reject_uncorrected.sum()}")
print(f" Rate: {reject_uncorrected.sum()/m*100:.1f}% (expected ~5%)")
print(f"\n2. Bonferroni (α/m = {alpha/m:.6f}):")
print(f" False positives: {reject_bonferroni.sum()}")
print(f" Rate: {reject_bonferroni.sum()/m*100:.1f}%")
print(f"\n3. Holm's Method:")
print(f" False positives: {reject_holm.sum()}")
print(f" Rate: {reject_holm.sum()/m*100:.1f}%")
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# P-value histogram
axes[0].hist(p_values, bins=20, alpha=0.7, edgecolor='black')
axes[0].axhline(m/20, color='red', linestyle='--', label='Expected (uniform)')
axes[0].axvline(alpha, color='green', linestyle='--', label=f'α = {alpha}')
axes[0].axvline(alpha/m, color='orange', linestyle='--', label='Bonferroni: α/m')
axes[0].set_xlabel('P-value')
axes[0].set_ylabel('Frequency')
axes[0].set_title('P-value Distribution (All H₀ True)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Comparison
methods = ['Uncorrected', 'Bonferroni', 'Holm']
false_pos = [reject_uncorrected.sum(), reject_bonferroni.sum(), reject_holm.sum()]
colors = ['red', 'green', 'blue']
axes[1].bar(methods, false_pos, color=colors, alpha=0.7, edgecolor='black')
axes[1].axhline(alpha * m, color='red', linestyle='--',
label=f'Expected (uncorrected): {alpha*m:.0f}')
axes[1].set_ylabel('Number of False Positives')
axes[1].set_title('False Positives by Method')
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
print("\n💡 Key Insight:")
print(" Without correction: ~5% false positive rate (as expected)")
print(" With Bonferroni: Strong control, but conservative")
print(" With Holm: Better power than Bonferroni, same FWER guarantee")
13.2 False Discovery Rate (FDR)¶
Motivation¶
FWER control is too conservative when:
m is large (genomics: m = 20,000)
Can tolerate some false positives
Care about proportion, not absolute number
FDR Definition¶
\(\text{FDR} = E\left[\frac{V}{R}\right]\), with \(V/R\) defined as 0 when \(R = 0\).
Interpretation: Among all discoveries, what fraction are false?
Benjamini-Hochberg (BH) Procedure¶
Controls FDR at level q:
Order p-values: \(p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}\)
Find the largest \(i\) such that: \(p_{(i)} \leq \frac{i \cdot q}{m}\)
Reject hypotheses \((1), (2), \ldots, (i)\)
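The step-up search can be sketched directly (a minimal illustration; `multipletests(..., method='fdr_bh')` is the production route):

```python
import numpy as np

def bh_reject(p_values, q=0.05):
    """Benjamini-Hochberg step-up: find the largest i with p_(i) <= i*q/m."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    passed = p[order] <= np.arange(1, m + 1) * q / m  # compare to the rising line
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()   # largest rank below the line
        reject[order[:k + 1]] = True      # reject everything at or below it
    return reject

# Hypothetical p-values: the rising threshold lets four of five through
print(bh_reject([0.001, 0.012, 0.03, 0.04, 0.2]))
```

Note that the BH rule rejects all hypotheses up to the largest passing rank, even if some intermediate p-value sits above its own threshold.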
BH vs Bonferroni¶
Bonferroni threshold: \(\alpha/m\) (constant)
BH threshold: \(i \cdot q/m\) (increases with rank)
More lenient for lower p-values
Higher power
Properties¶
✅ More powerful than FWER methods
✅ Scalable to large m
✅ Intuitive interpretation
✅ Default in genomics/biostatistics
⚠️ Assumes independence or positive dependence
Benjamini-Yekutieli (BY)¶
More conservative, works under any dependency: \(p_{(i)} \leq \frac{i \cdot q}{m \cdot c(m)}\)
where \(c(m) = \sum_{i=1}^m \frac{1}{i} \approx \log(m)\)
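The penalty factor c(m) grows slowly but is not negligible; a quick check for m = 1000 (illustrative only):

```python
import numpy as np

m = 1000
c_m = (1.0 / np.arange(1, m + 1)).sum()  # harmonic number H_m
print(f"c({m}) = {c_m:.3f}")             # close to log(m) + 0.577 (Euler-Mascheroni)
```

So for m = 1000 the BY thresholds are roughly 7.5 times stricter than the corresponding BH thresholds.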
# FDR Demonstration with Mix of True and False Nulls
np.random.seed(42)
m = 1000
m0 = 900 # 900 true nulls
m1 = 100 # 100 false nulls (true effects)
n = 30
effect_size = 0.8 # Cohen's d
p_values = []
true_nulls = []
# True nulls (no effect)
for i in range(m0):
    group1 = np.random.randn(n)
    group2 = np.random.randn(n)
    _, p_val = stats.ttest_ind(group1, group2)
    p_values.append(p_val)
    true_nulls.append(True)
# False nulls (true effect)
for i in range(m1):
    group1 = np.random.randn(n)
    group2 = np.random.randn(n) + effect_size
    _, p_val = stats.ttest_ind(group1, group2)
    p_values.append(p_val)
    true_nulls.append(False)
p_values = np.array(p_values)
true_nulls = np.array(true_nulls)
# Apply different methods
alpha = 0.05
reject_bonf, _, _, _ = multipletests(p_values, alpha=alpha, method='bonferroni')
reject_holm, _, _, _ = multipletests(p_values, alpha=alpha, method='holm')
reject_bh, _, _, _ = multipletests(p_values, alpha=alpha, method='fdr_bh')
reject_by, _, _, _ = multipletests(p_values, alpha=alpha, method='fdr_by')
# Calculate metrics
def calc_metrics(reject, true_nulls):
    tp = np.sum(reject & ~true_nulls)   # True positives
    fp = np.sum(reject & true_nulls)    # False positives
    fn = np.sum(~reject & ~true_nulls)  # False negatives
    tn = np.sum(~reject & true_nulls)   # True negatives
    total_discoveries = reject.sum()
    fdr = fp / max(total_discoveries, 1)
    power = tp / (tp + fn) if (tp + fn) > 0 else 0
    return {
        'discoveries': total_discoveries,
        'tp': tp,
        'fp': fp,
        'fdr': fdr,
        'power': power
    }
metrics = {
'Bonferroni': calc_metrics(reject_bonf, true_nulls),
'Holm': calc_metrics(reject_holm, true_nulls),
'BH (FDR)': calc_metrics(reject_bh, true_nulls),
'BY (FDR)': calc_metrics(reject_by, true_nulls)
}
# Print results
print("📊 Multiple Testing: Mixed Scenario")
print(f" Total tests: {m}")
print(f" True nulls: {m0} (90%)")
print(f" False nulls (true effects): {m1} (10%)")
print(f" Effect size: {effect_size} (Cohen's d)")
print("\n" + "="*80)
print(f"{'Method':<15} {'Discoveries':<12} {'TP':<8} {'FP':<8} {'FDR':<10} {'Power'}")
print("="*80)
for method, res in metrics.items():  # avoid shadowing m (the number of tests)
    print(f"{method:<15} {res['discoveries']:<12} {res['tp']:<8} {res['fp']:<8} "
          f"{res['fdr']:<10.3f} {res['power']:.3f}")
# Visualize
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# P-value histogram
axes[0, 0].hist(p_values[true_nulls], bins=50, alpha=0.5, label='True nulls', edgecolor='black')
axes[0, 0].hist(p_values[~true_nulls], bins=50, alpha=0.5, label='False nulls', edgecolor='black')
axes[0, 0].axvline(alpha, color='red', linestyle='--', label=f'α = {alpha}')
axes[0, 0].set_xlabel('P-value')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('P-value Distribution by True Status')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
# Discoveries
methods_list = list(metrics.keys())
discoveries = [metrics[m]['discoveries'] for m in methods_list]
axes[0, 1].bar(methods_list, discoveries, alpha=0.7, edgecolor='black')
axes[0, 1].axhline(m1, color='red', linestyle='--', label=f'True positives available: {m1}')
axes[0, 1].set_ylabel('Number of Discoveries')
axes[0, 1].set_title('Total Discoveries by Method')
axes[0, 1].set_xticklabels(methods_list, rotation=15, ha='right')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3, axis='y')
# FDR
fdrs = [metrics[m]['fdr'] for m in methods_list]
axes[1, 0].bar(methods_list, fdrs, alpha=0.7, color='orange', edgecolor='black')
axes[1, 0].axhline(alpha, color='red', linestyle='--', label=f'Target: {alpha}')
axes[1, 0].set_ylabel('False Discovery Rate')
axes[1, 0].set_title('Observed FDR')
axes[1, 0].set_xticklabels(methods_list, rotation=15, ha='right')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3, axis='y')
# Power
powers = [metrics[m]['power'] for m in methods_list]
axes[1, 1].bar(methods_list, powers, alpha=0.7, color='green', edgecolor='black')
axes[1, 1].set_ylabel('Power')
axes[1, 1].set_title('Statistical Power (Sensitivity)')
axes[1, 1].set_ylim([0, 1])
axes[1, 1].set_xticklabels(methods_list, rotation=15, ha='right')
axes[1, 1].grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
print("\n💡 Key Observations:")
print("   • Bonferroni/Holm: Conservative, low FDR but also low power")
print("   • BH (FDR): More discoveries, higher power, FDR controlled")
print("   • BY: More conservative than BH, handles dependencies")
print("   • Trade-off: FWER (few false positives) vs FDR (more power)")
13.3 Method Comparison and Selection¶
Visual Comparison of Thresholds¶
For m tests with ordered p-values \(p_{(1)} \leq \cdots \leq p_{(m)}\):
Bonferroni: Reject if \(p_{(i)} \leq \alpha/m\)
Holm: Reject if \(p_{(i)} \leq \alpha/(m-i+1)\)
BH (FDR): Reject if \(p_{(i)} \leq i \cdot q/m\)
# Visualize Different Thresholds
m = 100
alpha = 0.05
ranks = np.arange(1, m+1)
# Calculate thresholds
bonf_threshold = np.full(m, alpha/m)
holm_threshold = alpha / (m - ranks + 1)
bh_threshold = ranks * alpha / m
plt.figure(figsize=(12, 6))
plt.plot(ranks, bonf_threshold, 'r-', linewidth=2, label='Bonferroni')
plt.plot(ranks, holm_threshold, 'g-', linewidth=2, label='Holm')
plt.plot(ranks, bh_threshold, 'b-', linewidth=2, label='Benjamini-Hochberg')
plt.xlabel('Rank (i)')
plt.ylabel('Threshold')
plt.title(f'Rejection Thresholds by Method (m={m}, α={alpha})')
plt.legend()
plt.grid(True, alpha=0.3)
plt.ylim([0, max(bh_threshold) * 1.1])
plt.show()
print("📊 Threshold Comparison:")
print(f"   Bonferroni: Constant at {alpha/m:.6f}")
print(f"   Holm: Increases from {alpha/m:.6f} to {alpha:.6f}")
print(f"   BH: Increases from {alpha/m:.6f} to {alpha:.6f}")
print("\n💡 BH allows more discoveries (increasing threshold)")
print("   Holm is more powerful than Bonferroni (adaptive threshold)")
13.4 Practical Guidelines¶
When to Use Each Method¶
| Scenario | Method | Reason |
|---|---|---|
| Few tests (m < 20) | Bonferroni | Simple, conservative OK |
| Need no false positives | Bonferroni/Holm | Strong FWER control |
| Genomics (large m) | BH (FDR) | Power matters, some FP OK |
| Exploratory analysis | BH (FDR) | Maximize discoveries |
| Confirmatory study | Bonferroni/Holm | Strong control |
| Unknown dependencies | BY (FDR) | Safe under any structure |
| Clinical trials | Bonferroni | Regulatory requirement |
Choosing FWER vs FDR¶
Use FWER (Bonferroni/Holm) if:
Can't tolerate ANY false positives
Small number of tests
Confirmatory/regulatory setting
Follow-up is expensive
Use FDR (BH/BY) if:
Large number of tests (m > 100)
Exploratory analysis
Some false positives acceptable
Want to maximize true discoveries
Follow-up validation planned
Common Mistakes¶
❌ Not correcting at all when m is large
❌ Using Bonferroni for large m (too conservative)
❌ Confusing FWER and FDR
❌ Cherry-picking significant results
❌ Ignoring multiple testing in exploratory work
❌ Over-correcting (correcting twice)
Best Practices¶
✅ Pre-specify the correction method
✅ Report both raw and adjusted p-values
✅ State clearly what is being controlled (FWER or FDR)
✅ Use FDR for large-scale screening
✅ Use FWER for confirmatory tests
✅ Consider context (exploratory vs confirmatory)
Reporting¶
Good reporting includes:
Number of tests conducted (m)
Correction method used
Target error rate (Ξ± or q)
Both raw and adjusted p-values
Number of discoveries
Estimated FDR or FWER
# Comprehensive Simulation Study
# Vary: m (number of tests) and π₀ (proportion of true nulls)
def simulate_multiple_testing(m, m0, n=30, effect_size=0.8, n_sim=100):
    """Simulate multiple testing scenario"""
    results = {'bonf': [], 'holm': [], 'bh': [], 'by': []}
    for _ in range(n_sim):
        p_values = []
        true_nulls = []
        # True nulls
        for i in range(m0):
            g1 = np.random.randn(n)
            g2 = np.random.randn(n)
            _, p = stats.ttest_ind(g1, g2)
            p_values.append(p)
            true_nulls.append(True)
        # False nulls
        for i in range(m - m0):
            g1 = np.random.randn(n)
            g2 = np.random.randn(n) + effect_size
            _, p = stats.ttest_ind(g1, g2)
            p_values.append(p)
            true_nulls.append(False)
        p_values = np.array(p_values)
        true_nulls = np.array(true_nulls)
        # Apply corrections
        reject_bonf, _, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')
        reject_holm, _, _, _ = multipletests(p_values, alpha=0.05, method='holm')
        reject_bh, _, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')
        reject_by, _, _, _ = multipletests(p_values, alpha=0.05, method='fdr_by')
        # Store results
        for method, reject in [('bonf', reject_bonf), ('holm', reject_holm),
                               ('bh', reject_bh), ('by', reject_by)]:
            metrics = calc_metrics(reject, true_nulls)
            results[method].append(metrics)
    # Average across simulations
    avg_results = {}
    for method in results:
        avg_results[method] = {
            'power': np.mean([r['power'] for r in results[method]]),
            'fdr': np.mean([r['fdr'] for r in results[method]]),
            'discoveries': np.mean([r['discoveries'] for r in results[method]])
        }
    return avg_results
# Run simulations
print("Running simulations (this may take a minute)...\n")
scenarios = [
{'m': 100, 'm0': 90, 'label': 'm=100, 90% null'},
{'m': 100, 'm0': 50, 'label': 'm=100, 50% null'},
{'m': 1000, 'm0': 900, 'label': 'm=1000, 90% null'},
{'m': 1000, 'm0': 500, 'label': 'm=1000, 50% null'}
]
all_results = []
for scenario in scenarios:
    result = simulate_multiple_testing(scenario['m'], scenario['m0'], n_sim=50)
    result['scenario'] = scenario['label']
    all_results.append(result)
# Plot results
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
methods = ['bonf', 'holm', 'bh', 'by']
method_labels = ['Bonferroni', 'Holm', 'BH', 'BY']
x = np.arange(len(scenarios))
width = 0.2
for i, method in enumerate(methods):
    # Power
    powers = [r[method]['power'] for r in all_results]
    axes[0].bar(x + i*width, powers, width, label=method_labels[i], alpha=0.8)
    # FDR
    fdrs = [r[method]['fdr'] for r in all_results]
    axes[1].bar(x + i*width, fdrs, width, label=method_labels[i], alpha=0.8)
    # Discoveries
    disc = [r[method]['discoveries'] for r in all_results]
    axes[2].bar(x + i*width, disc, width, label=method_labels[i], alpha=0.8)
axes[0].set_ylabel('Power')
axes[0].set_title('Statistical Power')
axes[0].set_xticks(x + 1.5*width)
axes[0].set_xticklabels([s['label'] for s in scenarios], rotation=15, ha='right', fontsize=9)
axes[0].legend()
axes[0].grid(True, alpha=0.3, axis='y')
axes[1].set_ylabel('FDR')
axes[1].set_title('False Discovery Rate')
axes[1].axhline(0.05, color='red', linestyle='--', alpha=0.5, label='Target (0.05)')
axes[1].set_xticks(x + 1.5*width)
axes[1].set_xticklabels([s['label'] for s in scenarios], rotation=15, ha='right', fontsize=9)
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis='y')
axes[2].set_ylabel('Discoveries')
axes[2].set_title('Number of Discoveries')
axes[2].set_xticks(x + 1.5*width)
axes[2].set_xticklabels([s['label'] for s in scenarios], rotation=15, ha='right', fontsize=9)
axes[2].legend()
axes[2].grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
print("\n💡 Simulation Insights:")
print("   • FDR methods (BH, BY) have higher power")
print("   • Bonferroni becomes very conservative as m increases")
print("   • BH makes more discoveries while controlling FDR")
print("   • Trade-off between power and false positive control")
Key Takeaways¶
The Multiple Testing Problem¶
Core Issue:
Testing at α = 0.05 is fine for one test
With m tests, expect α × m false positives
Must adjust for multiple comparisons
Error Rate Control¶
FWER (Family-Wise Error Rate):
Probability of ≥ 1 false positive
Strong control
Conservative (low power)
FDR (False Discovery Rate):
Expected proportion of false positives
Less conservative
Higher power
Method Quick Reference¶
Bonferroni:
Test at α/m
Controls FWER
Very conservative
Use: Small m, confirmatory
Holm:
Step-down procedure
Controls FWER
More powerful than Bonferroni
Use: Small m, prefer over Bonferroni
Benjamini-Hochberg (BH):
Step-up procedure
Controls FDR
Good power
Use: Large m, exploratory, default choice
Benjamini-Yekutieli (BY):
Like BH but more conservative
Controls FDR under any dependency
Use: When dependencies unknown
Decision Tree¶
Multiple tests?
├─ No → Standard hypothesis test
└─ Yes → How many?
   ├─ Few (< 20) → Can tolerate FP?
   │  ├─ No → Bonferroni or Holm
   │  └─ Yes → BH (FDR)
   └─ Many (≥ 20) → Context?
      ├─ Confirmatory → Holm
      └─ Exploratory → BH (FDR) ← Most common
Python ImplementationΒΆ
from statsmodels.stats.multitest import multipletests
# Apply correction
reject, pvals_corrected, alphacSidak, alphacBonf = multipletests(
pvals,
alpha=0.05,
method='fdr_bh' # or 'bonferroni', 'holm', 'fdr_by'
)
# reject: boolean array of rejections
# pvals_corrected: adjusted p-values
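A quick run on five hypothetical p-values (made up for illustration) shows how the corrections diverge:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.001, 0.012, 0.03, 0.04, 0.2])  # hypothetical p-values

n_reject = {}
for method in ['bonferroni', 'holm', 'fdr_bh']:
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    n_reject[method] = int(reject.sum())
    print(f"{method:<11} rejects {int(reject.sum())}, adjusted p: {np.round(p_adj, 3)}")
```

On these values Bonferroni rejects 1, Holm 2, and BH 4: the same ordering of power the chapter demonstrates at scale.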
Reporting Example¶
Good:
"We conducted 1,000 hypothesis tests. To control the false discovery rate at 5%, we applied the Benjamini-Hochberg procedure. This yielded 47 significant associations (adjusted p < 0.05)."
Bad:
"We found 200 significant results (p < 0.05)." ← No mention of correction!
Common Applications¶
Genomics:
Test 20,000 genes
Use BH (FDR)
Typical q = 0.05 or 0.10
Neuroimaging:
Test 100,000+ voxels
Use cluster-based corrections or FDR
A/B Testing:
Multiple metrics
Multiple segments
Use BH or Bonferroni depending on context
Clinical Trials:
Multiple endpoints
Multiple comparisons
Often use Bonferroni (regulatory)
Series Complete! 🎉¶
Chapters 2-13 of ISLP covered:
Statistical Learning
Linear Models
Classification
Resampling
Model Selection
Non-linearity
Tree Methods
Support Vector Machines
Deep Learning
Survival Analysis
Unsupervised Learning
Multiple Testing ✓
You now have a comprehensive foundation in statistical learning!
Practice Exercises¶
Exercise 1: Power Analysis¶
Simulate scenario with m = 100 tests:
Vary effect size: 0.2, 0.5, 0.8
Apply Bonferroni, Holm, BH
Plot power vs effect size for each method
Find minimum effect size for 80% power
Exercise 2: FDR Validation¶
Verify BH controls FDR:
Simulate 1000 experiments, each with m = 100 tests
90% true nulls, 10% false nulls
Apply BH at q = 0.05
Calculate observed FDR for each experiment
Show that the mean FDR ≤ 0.05
Exercise 3: Real Data¶
Gene expression dataset:
Test all genes for differential expression
Apply no correction, Bonferroni, BH
Create volcano plot
Compare number of discoveries
Report findings properly
Exercise 4: Sample Size¶
For fixed m = 50 tests:
Vary n (sample size): 10, 30, 50, 100
Fixed effect size: 0.5
Apply BH procedure
Plot power vs sample size
Determine required n for 80% power
Exercise 5: Method Comparison¶
Comprehensive comparison:
Create a 2×2 grid: m ∈ {100, 1000}, π₀ ∈ {0.5, 0.9}
For each scenario, compare all 4 methods
Report: power, FDR, discoveries
Recommend best method for each scenario
Exercise 6: Dependent Tests¶
Simulate correlated p-values:
Generate multivariate normal data with correlation
Test multiple correlated hypotheses
Compare BH vs BY
Verify BY is more conservative
Assess impact of correlation on FDR control