Statistical Inference¶
Confidence intervals, hypothesis testing, p-values, and Bayesian inference for data-driven decisions.
Statistical inference is the mathematical framework for drawing conclusions about a population from a limited sample of data. In machine learning, inference underpins every decision we make: Is model A genuinely better than model B, or did we just get lucky on this test set? How confident should we be in a prediction? When we report an accuracy of 92%, what is the range of plausible true accuracies?
The two major schools of thought, frequentist and Bayesian, offer complementary perspectives. Frequentist methods (confidence intervals, hypothesis tests) reason about the long-run frequency of events, while Bayesian methods treat probability as a degree of belief and update it as data arrives. Understanding both is essential for rigorous ML experimentation and for uncertainty quantification in deployed models.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import pandas as pd
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
np.random.seed(42)
1. Population vs Sample¶
The fundamental problem of statistics¶
In almost every real-world scenario, we cannot observe an entire population (all possible data points). Instead, we collect a sample, a manageable subset, and use it to estimate the population's characteristics. The central question of statistical inference is: how reliably can sample statistics approximate population parameters?
Population mean \(\mu\) is estimated by the sample mean \(\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i\)
Population standard deviation \(\sigma\) is estimated by the sample standard deviation \(s = \sqrt{\frac{1}{n-1}\sum(x_i - \bar{x})^2}\)
The denominator \(n-1\) (Bessel's correction) compensates for the fact that a sample tends to underestimate variability. As \(n\) grows, sample statistics converge to population parameters, a result formalized by the Law of Large Numbers. The Central Limit Theorem further guarantees that the distribution of sample means is approximately normal with standard error \(\sigma / \sqrt{n}\), regardless of the population's shape. In ML, this principle governs how cross-validation scores, training loss estimates, and evaluation metrics behave across different random splits.
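As a quick sanity check on Bessel's correction (a standalone sketch, not part of the notebook's main code; the seed and sample size are arbitrary), averaging both variance estimators over many small samples from the same \(N(100, 15)\) population shows the bias that the \(n-1\) denominator removes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 100_000

# Many small samples from the N(100, 15) population used in this notebook
samples = rng.normal(100, 15, size=(trials, n))
biased = samples.var(axis=1, ddof=0).mean()    # divide by n: biased low
unbiased = samples.var(axis=1, ddof=1).mean()  # divide by n-1 (Bessel)

print(f"True variance:          {15.0**2:.1f}")
print(f"Mean estimate (ddof=0): {biased:.1f}  (about (n-1)/n of the truth)")
print(f"Mean estimate (ddof=1): {unbiased:.1f}  (approximately unbiased)")
```

With `n=5`, the uncorrected estimator averages roughly 4/5 of the true variance, while the corrected one centers on it.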
# Simulate a population
population_mean = 100
population_std = 15
population = np.random.normal(population_mean, population_std, 100000)
print("=== Population (True Parameters) ===")
print(f"Mean: {np.mean(population):.2f}")
print(f"Std: {np.std(population):.2f}")
# Take different sample sizes
sample_sizes = [10, 50, 200, 1000]
print("\n=== Samples ===")
for n in sample_sizes:
    sample = np.random.choice(population, size=n, replace=False)
    print(f"n={n:4d}: mean={np.mean(sample):6.2f}, std={np.std(sample, ddof=1):5.2f}")
print("\nLarger samples → closer to population parameters!")
# Sampling distribution of the mean
sample_size = 30
n_samples = 1000
sample_means = []
for _ in range(n_samples):
    sample = np.random.choice(population, size=sample_size, replace=True)
    sample_means.append(np.mean(sample))
sample_means = np.array(sample_means)
# Plot
plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
plt.hist(population, bins=50, density=True, alpha=0.7, label='Population')
plt.axvline(population_mean, color='r', linestyle='--', linewidth=2, label=f'μ = {population_mean}')
plt.xlabel('Value', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.title('Population Distribution', fontsize=14)
plt.legend(fontsize=11)
plt.subplot(1, 2, 2)
plt.hist(sample_means, bins=50, density=True, alpha=0.7, label='Sample means')
plt.axvline(np.mean(sample_means), color='r', linestyle='--', linewidth=2,
            label=f'Mean of means = {np.mean(sample_means):.2f}')
plt.xlabel('Sample Mean', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.title(f'Sampling Distribution (n={sample_size})', fontsize=14)
plt.legend(fontsize=11)
plt.tight_layout()
plt.show()
print(f"Standard error (theoretical): {population_std / np.sqrt(sample_size):.2f}")
print(f"Standard deviation of sample means: {np.std(sample_means):.2f}")
print("\nSampling distribution is narrower and approximately normal!")
2. Confidence Intervals¶
Confidence interval: Range likely to contain the true parameter
For a mean with known population std: \(\bar{x} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}\)
For unknown std (the common case), use the t-distribution: \(\bar{x} \pm t_{\alpha/2,\,n-1} \cdot \frac{s}{\sqrt{n}}\)
A 95% CI means: if we repeated the sampling many times, about 95% of the resulting intervals would contain the true mean.
def confidence_interval(data, confidence=0.95):
    """Calculate a confidence interval for the mean."""
    n = len(data)
    mean = np.mean(data)
    se = stats.sem(data)  # standard error of the mean
    margin = se * stats.t.ppf((1 + confidence) / 2, n - 1)
    return mean - margin, mean + margin
# Take a sample
sample = np.random.choice(population, size=50)
ci_lower, ci_upper = confidence_interval(sample)
print(f"Sample mean: {np.mean(sample):.2f}")
print(f"95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")
print(f"True population mean: {population_mean}")
print(f"\nDoes CI contain true mean? {ci_lower <= population_mean <= ci_upper}")
# Visualize multiple confidence intervals
n_intervals = 50
sample_size = 30
confidence = 0.95
intervals = []
contains_true_mean = []
for i in range(n_intervals):
    sample = np.random.choice(population, size=sample_size)
    ci = confidence_interval(sample, confidence)
    intervals.append(ci)
    contains_true_mean.append(ci[0] <= population_mean <= ci[1])
# Plot
plt.figure(figsize=(12, 10))
for i, (ci, contains) in enumerate(zip(intervals, contains_true_mean)):
    color = 'green' if contains else 'red'
    plt.plot([ci[0], ci[1]], [i, i], color=color, linewidth=2, alpha=0.6)
    plt.plot(np.mean(ci), i, 'o', color=color, markersize=4)
plt.axvline(population_mean, color='blue', linestyle='--', linewidth=2,
            label=f'True mean = {population_mean}')
plt.xlabel('Value', fontsize=12)
plt.ylabel('Sample Number', fontsize=12)
plt.title(f'{int(confidence*100)}% Confidence Intervals (n={sample_size})', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3, axis='x')
plt.show()
coverage = np.mean(contains_true_mean) * 100
print(f"Percentage of CIs containing true mean: {coverage:.1f}%")
print(f"Expected: {confidence*100:.0f}%")
print("\nGreen = contains true mean, Red = doesn't contain true mean")
3. Hypothesis Testing¶
Framework for making decisions from data
Null hypothesis (H₀): Default assumption (e.g., "no effect")
Alternative hypothesis (H₁): The claim we seek evidence for
Test statistic: Calculated from the data
p-value: Probability of data at least this extreme if H₀ is true
Decision: Reject H₀ if p < α (typically 0.05)
Example: Is a new ML model better than baseline?
# One-sample t-test
# H0: population mean = 100
# H1: population mean ≠ 100
# Take a sample
sample = np.random.choice(population, size=50)
# Perform t-test
t_stat, p_value = stats.ttest_1samp(sample, 100)
print("=== One-Sample t-test ===")
print(f"Sample mean: {np.mean(sample):.2f}")
print("Null hypothesis: μ = 100")
print(f"\nt-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")
print("\nConclusion (α=0.05): ", end="")
if p_value < 0.05:
    print("Reject H0 - mean is significantly different from 100")
else:
    print("Fail to reject H0 - insufficient evidence")
# Two-sample t-test: Compare two groups
# Example: Compare baseline model vs new model accuracy
# Baseline model accuracies (from cross-validation)
baseline_acc = np.array([0.82, 0.85, 0.83, 0.84, 0.85, 0.82, 0.84, 0.83, 0.85, 0.84])
# New model accuracies
new_model_acc = np.array([0.86, 0.88, 0.87, 0.89, 0.87, 0.88, 0.86, 0.88, 0.89, 0.87])
# Perform t-test
t_stat, p_value = stats.ttest_ind(baseline_acc, new_model_acc)
print("=== Two-Sample t-test (Model Comparison) ===")
print(f"Baseline mean accuracy: {np.mean(baseline_acc):.4f} ± {np.std(baseline_acc, ddof=1):.4f}")
print(f"New model mean accuracy: {np.mean(new_model_acc):.4f} ± {np.std(new_model_acc, ddof=1):.4f}")
print(f"\nt-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.6f}")
print("\nConclusion (α=0.05): ", end="")
if p_value < 0.05:
    print("New model is significantly better!")
else:
    print("No significant difference")
# Visualize
plt.figure(figsize=(10, 6))
data = pd.DataFrame({
    'Accuracy': np.concatenate([baseline_acc, new_model_acc]),
    'Model': ['Baseline']*len(baseline_acc) + ['New Model']*len(new_model_acc)
})
sns.boxplot(x='Model', y='Accuracy', data=data)
sns.swarmplot(x='Model', y='Accuracy', data=data, color='black', alpha=0.5)
plt.title('Model Accuracy Comparison', fontsize=14)
plt.ylabel('Accuracy', fontsize=12)
plt.grid(True, alpha=0.3, axis='y')
plt.show()
4. Understanding p-values¶
What a p-value actually means (and what it does not)¶
The p-value is one of the most used (and most misunderstood) concepts in statistics. Formally, it is the probability of observing a test statistic as extreme as (or more extreme than) the one computed from the data, assuming the null hypothesis is true; for a two-sided test: \(p = P(|T| \geq |t_{\text{obs}}| \mid H_0)\).
Common misinterpretations to avoid:
A p-value is not the probability that \(H_0\) is true. That would require Bayesian reasoning with a prior.
A p-value is not the probability that the result happened "by chance." It conditions on \(H_0\) being true; it does not assign a probability to \(H_0\) itself.
A small p-value means the data would be surprising under \(H_0\). Either the null is wrong, or a rare event occurred.
Why this matters for ML: When running hyperparameter searches or testing multiple model architectures, each comparison generates a p-value. Testing \(k\) hypotheses at \(\alpha = 0.05\) without correction yields an expected \(0.05k\) false positives; this is the multiple comparisons problem. Techniques like Bonferroni correction (\(\alpha / k\)) or False Discovery Rate (FDR) control are essential when comparing many models or features simultaneously.
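A small simulation makes the multiple comparisons problem concrete (the number of comparisons and the seed are arbitrary choices for illustration): fifty comparisons where the null is true in every case produce spurious "significant" results at roughly the rate \(\alpha k\) predicts, while the Bonferroni threshold suppresses them:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
k, n, alpha = 50, 30, 0.05

# k "model comparisons" where H0 is true in every case (no real difference)
p_values = np.array([
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue
    for _ in range(k)
])

naive_hits = np.sum(p_values < alpha)           # expect about alpha * k = 2.5
bonferroni_hits = np.sum(p_values < alpha / k)  # expect about alpha total

print(f"Uncorrected 'significant' results: {naive_hits} of {k}")
print(f"Bonferroni-corrected:              {bonferroni_hits} of {k}")
```

Bonferroni controls the family-wise error rate at the cost of power; FDR methods (e.g. Benjamini-Hochberg) are a less conservative middle ground.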
# Visualize p-value
# Null hypothesis: mean = 0, std = 1
x = np.linspace(-4, 4, 1000)
null_dist = stats.norm(0, 1)
pdf = null_dist.pdf(x)
# Observed test statistic
observed_stat = 2.5
# Calculate p-value (two-tailed)
p_value = 2 * (1 - null_dist.cdf(abs(observed_stat)))
plt.figure(figsize=(12, 6))
plt.plot(x, pdf, linewidth=2, label='Null distribution')
# Shade rejection regions
critical_value = stats.norm.ppf(0.975)  # for α=0.05, two-tailed
x_left = x[x <= -critical_value]
x_right = x[x >= critical_value]
plt.fill_between(x_left, null_dist.pdf(x_left), alpha=0.3, color='red',
                 label='Rejection region (α=0.05)')
plt.fill_between(x_right, null_dist.pdf(x_right), alpha=0.3, color='red')
# Shade p-value area
x_pvalue_right = x[x >= observed_stat]
x_pvalue_left = x[x <= -observed_stat]
plt.fill_between(x_pvalue_right, null_dist.pdf(x_pvalue_right), alpha=0.5,
                 color='orange', label='p-value area')
plt.fill_between(x_pvalue_left, null_dist.pdf(x_pvalue_left), alpha=0.5, color='orange')
plt.axvline(observed_stat, color='green', linestyle='--', linewidth=2,
            label=f'Observed statistic = {observed_stat}')
plt.axvline(-observed_stat, color='green', linestyle='--', linewidth=2)
plt.xlabel('Test Statistic', fontsize=12)
plt.ylabel('Probability Density', fontsize=12)
plt.title(f'p-value Visualization (p = {p_value:.4f})', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.show()
print(f"Observed statistic: {observed_stat}")
print(f"p-value: {p_value:.4f}")
print("\nSince p < 0.05, we reject H₀")
5. Type I and Type II Errors¶

| Decision \ Reality → | H₀ is True | H₀ is False |
|---|---|---|
| Reject H₀ | Type I Error (α) | Correct ✓ |
| Fail to Reject H₀ | Correct ✓ | Type II Error (β) |

Type I Error (α): Reject a true null (false alarm)
Type II Error (β): Fail to reject a false null (missed detection)
Power (1−β): Probability of detecting a true effect
# Simulate Type I and Type II errors
n_simulations = 10000
sample_size = 30
alpha = 0.05
# Type I error: H0 is actually true
type1_errors = 0
for _ in range(n_simulations):
    # Sample from the null distribution (mean=0)
    sample = np.random.normal(0, 1, sample_size)
    _, p_value = stats.ttest_1samp(sample, 0)
    if p_value < alpha:
        type1_errors += 1
type1_rate = type1_errors / n_simulations
print("=== Type I Error Rate ===")
print(f"Theoretical α: {alpha}")
print(f"Observed rate: {type1_rate:.4f}")
print(f"Close to α? {abs(type1_rate - alpha) < 0.01}")
# Type II error: H0 is false (true mean = 0.5)
true_effect = 0.5
type2_errors = 0
for _ in range(n_simulations):
    # Sample from the alternative distribution (mean=0.5)
    sample = np.random.normal(true_effect, 1, sample_size)
    _, p_value = stats.ttest_1samp(sample, 0)
    if p_value >= alpha:  # failed to reject
        type2_errors += 1
type2_rate = type2_errors / n_simulations
power = 1 - type2_rate
print("\n=== Type II Error Rate (true effect = 0.5) ===")
print(f"Type II error rate (β): {type2_rate:.4f}")
print(f"Statistical power (1-β): {power:.4f}")
print(f"\nPower = {power:.1%} chance of detecting true effect")
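The simulation above estimates power for one sample size. Under a normal approximation, power can also be computed analytically, which makes it easy to see how power grows with \(n\). The helper below is an illustrative sketch (the name `approx_power` is ours); because it ignores the t-distribution's heavier tails, it runs slightly optimistic relative to the t-test simulation:

```python
import numpy as np
from scipy import stats

def approx_power(effect, n, alpha=0.05):
    """Normal-approximation power of a two-sided one-sample test.

    effect = (true mean - null mean) / population std, i.e. a standardized
    effect size. Returns P(reject H0) when the true effect is `effect`.
    """
    z_crit = stats.norm.ppf(1 - alpha / 2)
    shift = effect * np.sqrt(n)
    return (1 - stats.norm.cdf(z_crit - shift)) + stats.norm.cdf(-z_crit - shift)

# Power rises steeply with sample size for a fixed effect of 0.5 std
for n in [10, 30, 50, 100]:
    print(f"n={n:3d}: power = {approx_power(0.5, n):.3f}")
```

At `n=30` this gives roughly 0.78, close to the simulated power above; inverting this relationship (solving for \(n\) given a target power) is the basis of sample-size planning.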
6. Frequentist vs Bayesian¶
Two philosophical frameworks for reasoning under uncertainty¶
The distinction between frequentist and Bayesian approaches is not just academic: it shapes how we build, evaluate, and deploy ML systems.
Frequentist approach treats model parameters as fixed but unknown constants. Probability is defined as the long-run frequency of events: if you repeat an experiment infinitely, the fraction of times event \(A\) occurs converges to \(P(A)\). Tools include confidence intervals and p-values. The advantage is objectivity: no subjective prior is needed. The limitation is that statements like "there is a 95% probability the true parameter lies in this interval" are technically invalid under frequentist logic (the parameter is fixed, not random).
Bayesian approach treats parameters as random variables with probability distributions that encode our degree of belief. Given observed data \(D\), we update a prior belief \(P(\theta)\) into a posterior \(P(\theta \mid D)\) using Bayes' theorem: \(P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}\).
A Bayesian credible interval does mean "there is a 95% probability the parameter lies here," which is often the more natural interpretation. In modern ML, Bayesian methods power uncertainty estimation in neural networks (Bayesian deep learning), probabilistic programming (PyMC, Stan), and Gaussian processes for active learning and Bayesian optimization of hyperparameters.
# Bayesian estimation example: coin flip
# Prior: Beta(2, 2) - slightly informed prior
# Data: 7 heads out of 10 flips
from scipy.stats import beta
# Prior
alpha_prior, beta_prior = 2, 2
# Data
heads = 7
tails = 3
# Posterior (Beta is conjugate prior for Binomial)
alpha_post = alpha_prior + heads
beta_post = beta_prior + tails
# Plot
p = np.linspace(0, 1, 1000)
prior_dist = beta(alpha_prior, beta_prior)
post_dist = beta(alpha_post, beta_post)
plt.figure(figsize=(12, 6))
plt.plot(p, prior_dist.pdf(p), label='Prior: Beta(2,2)', linewidth=2)
plt.axvline(heads/(heads+tails), color='red', linestyle='--',
            label=f'MLE = {heads/(heads+tails):.2f}', linewidth=2)
plt.plot(p, post_dist.pdf(p), label=f'Posterior: Beta({alpha_post},{beta_post})', linewidth=2)
plt.fill_between(p, post_dist.pdf(p), alpha=0.3)
plt.xlabel('Probability of Heads', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.title('Bayesian Update: Prior → Posterior', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.show()
# Posterior mean and credible interval
post_mean = alpha_post / (alpha_post + beta_post)
credible_interval = post_dist.ppf([0.025, 0.975])
print("=== Bayesian Results ===")
print(f"Posterior mean: {post_mean:.4f}")
print(f"95% Credible interval: [{credible_interval[0]:.4f}, {credible_interval[1]:.4f}]")
print("\nInterpretation: 95% probability that true p is in this interval")
Summary¶
✓ Population vs Sample: infer population parameters from samples
✓ Confidence Intervals: a range likely to contain the true parameter
✓ Hypothesis Testing: a framework for making decisions from data
✓ p-values: probability of data at least this extreme under H₀
✓ Type I/II Errors: false positives vs false negatives
✓ Frequentist vs Bayesian: two philosophical approaches
Key Insights for ML:¶
Model comparison: Use hypothesis tests to compare model performance
A/B testing: Test if new feature improves metrics
Confidence intervals: Report uncertainty in predictions
Multiple testing: Beware of p-hacking when testing many hypotheses
Bayesian ML: Uncertainty quantification, prior knowledge
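As one concrete example of the A/B-testing point above, the same Beta-Binomial conjugacy from section 6 answers the question practitioners usually care about: the probability that one variant is genuinely better. The conversion counts below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical A/B test: conversions out of visitors (illustrative numbers)
conv_a, n_a = 120, 1000   # variant A
conv_b, n_b = 150, 1000   # variant B

# Beta(1, 1) uniform prior; Beta posterior by conjugacy, sampled directly
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=100_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=100_000)

prob_b_better = np.mean(post_b > post_a)
print(f"P(variant B converts better than A | data) = {prob_b_better:.3f}")
```

Unlike a p-value, this is a direct probability statement about the hypothesis of interest, which makes it easy to feed into a decision rule (e.g. ship B if the probability exceeds 0.95).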
Practical Tips:¶
Always report confidence intervals, not just point estimates
Use cross-validation for model comparison
Consider effect size, not just statistical significance
Larger samples give more power
Bayesian methods useful when you have prior knowledge
Next Steps:¶
Study bootstrap methods for non-parametric inference
Learn about multiple testing corrections (Bonferroni, FDR)
Explore Bayesian inference with MCMC
Understand statistical power analysis
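As a starting point for the first item, a percentile bootstrap CI needs only resampling, with no distributional assumptions. This minimal sketch uses simulated data mirroring this notebook's population (the seed and sample are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)

# A single observed sample (simulated here for illustration)
data = rng.normal(100, 15, size=50)

# Resample with replacement many times; each resample's mean is one draw
# from an approximation of the sampling distribution of the mean
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"Sample mean: {data.mean():.2f}")
print(f"95% bootstrap CI for the mean: [{lo:.2f}, {hi:.2f}]")
```

The same recipe works for medians, accuracy scores, or any statistic where the t-based formula from section 2 does not apply.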