Statistical Inference¶
Confidence intervals, hypothesis testing, p-values, and Bayesian inference for data-driven decisions.
Statistical inference is the mathematical framework for drawing conclusions about a population from a limited sample of data. In machine learning, inference underpins every decision we make: Is model A genuinely better than model B, or did we just get lucky on this test set? How confident should we be in a prediction? When we report an accuracy of 92%, what is the range of plausible true accuracies?
The two major schools of thought, frequentist and Bayesian, offer complementary perspectives. Frequentist methods (confidence intervals, hypothesis tests) reason about the long-run frequency of events, while Bayesian methods treat probability as a degree of belief and update it as data arrives. Understanding both is essential for rigorous ML experimentation and for uncertainty quantification in deployed models.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import pandas as pd
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
np.random.seed(42)
1. Population vs Sample¶
The fundamental problem of statistics¶
In almost every real-world scenario, we cannot observe an entire population (all possible data points). Instead, we collect a sample, a manageable subset, and use it to estimate the population's characteristics. The central question of statistical inference is: how reliably can sample statistics approximate population parameters?
Population mean \(\mu\) is estimated by the sample mean \(\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i\)
Population standard deviation \(\sigma\) is estimated by the sample standard deviation \(s = \sqrt{\frac{1}{n-1}\sum(x_i - \bar{x})^2}\)
The denominator \(n-1\) (Bessel's correction) compensates for the fact that a sample tends to underestimate variability. As \(n\) grows, sample statistics converge to population parameters, a result formalized by the Law of Large Numbers. The Central Limit Theorem further guarantees that the distribution of sample means is approximately normal with standard error \(\sigma / \sqrt{n}\), regardless of the population's shape. In ML, this principle governs how cross-validation scores, training loss estimates, and evaluation metrics behave across different random splits.
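As a quick sanity check on Bessel's correction (a standalone sketch, not part of the notebook's main code; the seed and sample size are arbitrary), averaging both variance estimators over many small samples from the same \(N(100, 15)\) population shows the bias that the \(n-1\) denominator removes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 100_000

# Many small samples from the N(100, 15) population used in this notebook
samples = rng.normal(100, 15, size=(trials, n))
biased = samples.var(axis=1, ddof=0).mean()    # divide by n: biased low
unbiased = samples.var(axis=1, ddof=1).mean()  # divide by n-1 (Bessel)

print(f"True variance:          {15.0**2:.1f}")
print(f"Mean estimate (ddof=0): {biased:.1f}  (about (n-1)/n of the truth)")
print(f"Mean estimate (ddof=1): {unbiased:.1f}  (approximately unbiased)")
```

With `n=5`, the uncorrected estimator averages roughly 4/5 of the true variance, while the corrected one centers on it.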
# Simulate a population
population_mean = 100
population_std = 15
population = np.random.normal(population_mean, population_std, 100000)
print("=== Population (True Parameters) ===")
print(f"Mean: {np.mean(population):.2f}")
print(f"Std: {np.std(population):.2f}")
# Take different sample sizes
sample_sizes = [10, 50, 200, 1000]
print("\n=== Samples ===")
for n in sample_sizes:
    sample = np.random.choice(population, size=n, replace=False)
    print(f"n={n:4d}: mean={np.mean(sample):6.2f}, std={np.std(sample, ddof=1):5.2f}")
print("\nLarger samples → closer to population parameters!")
# Sampling distribution of the mean
sample_size = 30
n_samples = 1000
sample_means = []
for _ in range(n_samples):
    sample = np.random.choice(population, size=sample_size, replace=True)
    sample_means.append(np.mean(sample))
sample_means = np.array(sample_means)
# Plot
plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
plt.hist(population, bins=50, density=True, alpha=0.7, label='Population')
plt.axvline(population_mean, color='r', linestyle='--', linewidth=2, label=f'μ = {population_mean}')
plt.xlabel('Value', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.title('Population Distribution', fontsize=14)
plt.legend(fontsize=11)
plt.subplot(1, 2, 2)
plt.hist(sample_means, bins=50, density=True, alpha=0.7, label='Sample means')
plt.axvline(np.mean(sample_means), color='r', linestyle='--', linewidth=2,
            label=f'Mean of means = {np.mean(sample_means):.2f}')
plt.xlabel('Sample Mean', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.title(f'Sampling Distribution (n={sample_size})', fontsize=14)
plt.legend(fontsize=11)
plt.tight_layout()
plt.show()
print(f"Standard error (theoretical): {population_std / np.sqrt(sample_size):.2f}")
print(f"Standard deviation of sample means: {np.std(sample_means):.2f}")
print("\nSampling distribution is narrower and approximately normal!")
2. Confidence Intervals¶
Confidence interval: Range likely to contain the true parameter
For a mean with known population std: \(\bar{x} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}\)
For unknown std (the common case), use the t-distribution: \(\bar{x} \pm t_{\alpha/2,\,n-1} \cdot \frac{s}{\sqrt{n}}\)
A 95% CI means: if we repeated the sampling many times, about 95% of the resulting intervals would contain the true mean.
def confidence_interval(data, confidence=0.95):
    """Calculate a confidence interval for the mean."""
    n = len(data)
    mean = np.mean(data)
    se = stats.sem(data)  # standard error of the mean
    margin = se * stats.t.ppf((1 + confidence) / 2, n - 1)
    return mean - margin, mean + margin
# Take a sample
sample = np.random.choice(population, size=50)
ci_lower, ci_upper = confidence_interval(sample)
print(f"Sample mean: {np.mean(sample):.2f}")
print(f"95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")
print(f"True population mean: {population_mean}")
print(f"\nDoes CI contain true mean? {ci_lower <= population_mean <= ci_upper}")
# Visualize multiple confidence intervals
n_intervals = 50
sample_size = 30
confidence = 0.95
intervals = []
contains_true_mean = []
for i in range(n_intervals):
    sample = np.random.choice(population, size=sample_size)
    ci = confidence_interval(sample, confidence)
    intervals.append(ci)
    contains_true_mean.append(ci[0] <= population_mean <= ci[1])
# Plot
plt.figure(figsize=(12, 10))
for i, (ci, contains) in enumerate(zip(intervals, contains_true_mean)):
    color = 'green' if contains else 'red'
    plt.plot([ci[0], ci[1]], [i, i], color=color, linewidth=2, alpha=0.6)
    plt.plot(np.mean(ci), i, 'o', color=color, markersize=4)
plt.axvline(population_mean, color='blue', linestyle='--', linewidth=2,
            label=f'True mean = {population_mean}')
plt.xlabel('Value', fontsize=12)
plt.ylabel('Sample Number', fontsize=12)
plt.title(f'{int(confidence*100)}% Confidence Intervals (n={sample_size})', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3, axis='x')
plt.show()
coverage = np.mean(contains_true_mean) * 100
print(f"Percentage of CIs containing true mean: {coverage:.1f}%")
print(f"Expected: {confidence*100:.0f}%")
print("\nGreen = contains true mean, Red = doesn't contain true mean")
3. Hypothesis Testing¶
Framework for making decisions from data
Null hypothesis (H₀): Default assumption (e.g., "no effect")
Alternative hypothesis (H₁): The claim we seek evidence for
Test statistic: Calculated from the data
p-value: Probability of data at least this extreme if H₀ is true
Decision: Reject H₀ if p < α (typically 0.05)
Example: Is a new ML model better than baseline?
# One-sample t-test
# H0: population mean = 100
# H1: population mean ≠ 100
# Take a sample
sample = np.random.choice(population, size=50)
# Perform t-test
t_stat, p_value = stats.ttest_1samp(sample, 100)
print("=== One-Sample t-test ===")
print(f"Sample mean: {np.mean(sample):.2f}")
print("Null hypothesis: μ = 100")
print(f"\nt-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")
print("\nConclusion (α=0.05): ", end="")
if p_value < 0.05:
    print("Reject H0 - mean is significantly different from 100")
else:
    print("Fail to reject H0 - insufficient evidence")
# Two-sample t-test: Compare two groups
# Example: Compare baseline model vs new model accuracy
# Baseline model accuracies (from cross-validation)
baseline_acc = np.array([0.82, 0.85, 0.83, 0.84, 0.85, 0.82, 0.84, 0.83, 0.85, 0.84])
# New model accuracies
new_model_acc = np.array([0.86, 0.88, 0.87, 0.89, 0.87, 0.88, 0.86, 0.88, 0.89, 0.87])
# Perform t-test
t_stat, p_value = stats.ttest_ind(baseline_acc, new_model_acc)
print("=== Two-Sample t-test (Model Comparison) ===")
print(f"Baseline mean accuracy: {np.mean(baseline_acc):.4f} ± {np.std(baseline_acc, ddof=1):.4f}")
print(f"New model mean accuracy: {np.mean(new_model_acc):.4f} ± {np.std(new_model_acc, ddof=1):.4f}")
print(f"\nt-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.6f}")
print("\nConclusion (α=0.05): ", end="")
if p_value < 0.05:
    print("New model is significantly better!")
else:
    print("No significant difference")
# Visualize
plt.figure(figsize=(10, 6))
data = pd.DataFrame({
    'Accuracy': np.concatenate([baseline_acc, new_model_acc]),
    'Model': ['Baseline']*len(baseline_acc) + ['New Model']*len(new_model_acc)
})
sns.boxplot(x='Model', y='Accuracy', data=data)
sns.swarmplot(x='Model', y='Accuracy', data=data, color='black', alpha=0.5)
plt.title('Model Accuracy Comparison', fontsize=14)
plt.ylabel('Accuracy', fontsize=12)
plt.grid(True, alpha=0.3, axis='y')
plt.show()
4. Understanding p-values¶
What a p-value actually means (and what it does not)¶
The p-value is one of the most used (and most misunderstood) concepts in statistics. Formally, it is the probability of observing a test statistic as extreme as (or more extreme than) the one computed from the data, assuming the null hypothesis is true; for a two-sided test: \(p = P(|T| \geq |t_{\text{obs}}| \mid H_0)\).
Common misinterpretations to avoid:
A p-value is not the probability that \(H_0\) is true. That would require Bayesian reasoning with a prior.
A p-value is not the probability that the result happened "by chance." It conditions on \(H_0\) being true; it does not assign a probability to \(H_0\) itself.
A small p-value means the data would be surprising under \(H_0\). Either the null is wrong, or a rare event occurred.
Why this matters for ML: When running hyperparameter searches or testing multiple model architectures, each comparison generates a p-value. Testing \(k\) hypotheses at \(\alpha = 0.05\) without correction yields an expected \(0.05k\) false positives; this is the multiple comparisons problem. Techniques like Bonferroni correction (\(\alpha / k\)) or False Discovery Rate (FDR) control are essential when comparing many models or features simultaneously.
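A small simulation makes the multiple comparisons problem concrete (the number of comparisons and the seed are arbitrary choices for illustration): fifty comparisons where the null is true in every case produce spurious "significant" results at roughly the rate \(\alpha k\) predicts, while the Bonferroni threshold suppresses them:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
k, n, alpha = 50, 30, 0.05

# k "model comparisons" where H0 is true in every case (no real difference)
p_values = np.array([
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue
    for _ in range(k)
])

naive_hits = np.sum(p_values < alpha)           # expect about alpha * k = 2.5
bonferroni_hits = np.sum(p_values < alpha / k)  # expect about alpha total

print(f"Uncorrected 'significant' results: {naive_hits} of {k}")
print(f"Bonferroni-corrected:              {bonferroni_hits} of {k}")
```

Bonferroni controls the family-wise error rate at the cost of power; FDR methods (e.g. Benjamini-Hochberg) are a less conservative middle ground.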
# Visualize p-value
# Null hypothesis: mean = 0, std = 1
x = np.linspace(-4, 4, 1000)
null_dist = stats.norm(0, 1)
pdf = null_dist.pdf(x)
# Observed test statistic
observed_stat = 2.5
# Calculate p-value (two-tailed)
p_value = 2 * (1 - null_dist.cdf(abs(observed_stat)))
plt.figure(figsize=(12, 6))
plt.plot(x, pdf, linewidth=2, label='Null distribution')
# Shade rejection regions
critical_value = stats.norm.ppf(0.975)  # for α=0.05, two-tailed
x_left = x[x <= -critical_value]
x_right = x[x >= critical_value]
plt.fill_between(x_left, null_dist.pdf(x_left), alpha=0.3, color='red',
                 label='Rejection region (α=0.05)')
plt.fill_between(x_right, null_dist.pdf(x_right), alpha=0.3, color='red')
# Shade p-value area
x_pvalue_right = x[x >= observed_stat]
x_pvalue_left = x[x <= -observed_stat]
plt.fill_between(x_pvalue_right, null_dist.pdf(x_pvalue_right), alpha=0.5,
                 color='orange', label='p-value area')
plt.fill_between(x_pvalue_left, null_dist.pdf(x_pvalue_left), alpha=0.5, color='orange')
plt.axvline(observed_stat, color='green', linestyle='--', linewidth=2,
            label=f'Observed statistic = {observed_stat}')
plt.axvline(-observed_stat, color='green', linestyle='--', linewidth=2)
plt.xlabel('Test Statistic', fontsize=12)
plt.ylabel('Probability Density', fontsize=12)
plt.title(f'p-value Visualization (p = {p_value:.4f})', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.show()
print(f"Observed statistic: {observed_stat}")
print(f"p-value: {p_value:.4f}")
print("\nSince p < 0.05, we reject H₀")
5. Type I and Type II Errors¶

| Decision \ Reality → | H₀ is True | H₀ is False |
|---|---|---|
| Reject H₀ | Type I Error (α) | Correct ✓ |
| Fail to Reject H₀ | Correct ✓ | Type II Error (β) |

Type I Error (α): Reject a true null (false alarm)
Type II Error (β): Fail to reject a false null (missed detection)
Power (1−β): Probability of detecting a true effect
# Simulate Type I and Type II errors
n_simulations = 10000
sample_size = 30
alpha = 0.05
# Type I error: H0 is actually true
type1_errors = 0
for _ in range(n_simulations):
    # Sample from the null distribution (mean=0)
    sample = np.random.normal(0, 1, sample_size)
    _, p_value = stats.ttest_1samp(sample, 0)
    if p_value < alpha:
        type1_errors += 1
type1_rate = type1_errors / n_simulations
print("=== Type I Error Rate ===")
print(f"Theoretical α: {alpha}")
print(f"Observed rate: {type1_rate:.4f}")
print(f"Close to α? {abs(type1_rate - alpha) < 0.01}")
# Type II error: H0 is false (true mean = 0.5)
true_effect = 0.5
type2_errors = 0
for _ in range(n_simulations):
    # Sample from the alternative distribution (mean=0.5)
    sample = np.random.normal(true_effect, 1, sample_size)
    _, p_value = stats.ttest_1samp(sample, 0)
    if p_value >= alpha:  # failed to reject
        type2_errors += 1
type2_rate = type2_errors / n_simulations
power = 1 - type2_rate
print("\n=== Type II Error Rate (true effect = 0.5) ===")
print(f"Type II error rate (β): {type2_rate:.4f}")
print(f"Statistical power (1-β): {power:.4f}")
print(f"\nPower = {power:.1%} chance of detecting true effect")
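The simulation above estimates power for one sample size. Under a normal approximation, power can also be computed analytically, which makes it easy to see how power grows with \(n\). The helper below is an illustrative sketch (the name `approx_power` is ours); because it ignores the t-distribution's heavier tails, it runs slightly optimistic relative to the t-test simulation:

```python
import numpy as np
from scipy import stats

def approx_power(effect, n, alpha=0.05):
    """Normal-approximation power of a two-sided one-sample test.

    effect = (true mean - null mean) / population std, i.e. a standardized
    effect size. Returns P(reject H0) when the true effect is `effect`.
    """
    z_crit = stats.norm.ppf(1 - alpha / 2)
    shift = effect * np.sqrt(n)
    return (1 - stats.norm.cdf(z_crit - shift)) + stats.norm.cdf(-z_crit - shift)

# Power rises steeply with sample size for a fixed effect of 0.5 std
for n in [10, 30, 50, 100]:
    print(f"n={n:3d}: power = {approx_power(0.5, n):.3f}")
```

At `n=30` this gives roughly 0.78, close to the simulated power above; inverting this relationship (solving for \(n\) given a target power) is the basis of sample-size planning.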
6. Frequentist vs Bayesian¶
Two philosophical frameworks for reasoning under uncertainty¶
The distinction between frequentist and Bayesian approaches is not just academic: it shapes how we build, evaluate, and deploy ML systems.
Frequentist approach treats model parameters as fixed but unknown constants. Probability is defined as the long-run frequency of events: if you repeat an experiment infinitely, the fraction of times event \(A\) occurs converges to \(P(A)\). Tools include confidence intervals and p-values. The advantage is objectivity: no subjective prior is needed. The limitation is that statements like "there is a 95% probability the true parameter lies in this interval" are technically invalid under frequentist logic (the parameter is fixed, not random).
Bayesian approach treats parameters as random variables with probability distributions that encode our degree of belief. Given observed data \(D\), we update a prior belief \(P(\theta)\) into a posterior \(P(\theta \mid D)\) using Bayes' theorem: \(P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}\).
A Bayesian credible interval does mean "there is a 95% probability the parameter lies here," which is often the more natural interpretation. In modern ML, Bayesian methods power uncertainty estimation in neural networks (Bayesian deep learning), probabilistic programming (PyMC, Stan), and Gaussian processes for active learning and Bayesian optimization of hyperparameters.
# Bayesian estimation example: coin flip
# Prior: Beta(2, 2) - slightly informed prior
# Data: 7 heads out of 10 flips
from scipy.stats import beta
# Prior
alpha_prior, beta_prior = 2, 2
# Data
heads = 7
tails = 3
# Posterior (Beta is conjugate prior for Binomial)
alpha_post = alpha_prior + heads
beta_post = beta_prior + tails
# Plot
p = np.linspace(0, 1, 1000)
prior_dist = beta(alpha_prior, beta_prior)
post_dist = beta(alpha_post, beta_post)
plt.figure(figsize=(12, 6))
plt.plot(p, prior_dist.pdf(p), label='Prior: Beta(2,2)', linewidth=2)
plt.axvline(heads/(heads+tails), color='red', linestyle='--',
            label=f'MLE = {heads/(heads+tails):.2f}', linewidth=2)
plt.plot(p, post_dist.pdf(p), label=f'Posterior: Beta({alpha_post},{beta_post})', linewidth=2)
plt.fill_between(p, post_dist.pdf(p), alpha=0.3)
plt.xlabel('Probability of Heads', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.title('Bayesian Update: Prior → Posterior', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.show()
# Posterior mean and credible interval
post_mean = alpha_post / (alpha_post + beta_post)
credible_interval = post_dist.ppf([0.025, 0.975])
print("=== Bayesian Results ===")
print(f"Posterior mean: {post_mean:.4f}")
print(f"95% Credible interval: [{credible_interval[0]:.4f}, {credible_interval[1]:.4f}]")
print("\nInterpretation: 95% probability that true p is in this interval")
Summary¶
✓ Population vs Sample: infer population parameters from samples
✓ Confidence Intervals: a range likely to contain the true parameter
✓ Hypothesis Testing: a framework for making decisions from data
✓ p-values: probability of data at least this extreme under H₀
✓ Type I/II Errors: false positives vs false negatives
✓ Frequentist vs Bayesian: two philosophical approaches
Key Insights for ML:¶
Model comparison: Use hypothesis tests to compare model performance
A/B testing: Test if new feature improves metrics
Confidence intervals: Report uncertainty in predictions
Multiple testing: Beware of p-hacking when testing many hypotheses
Bayesian ML: Uncertainty quantification, prior knowledge
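As one concrete example of the A/B-testing point above, the same Beta-Binomial conjugacy from section 6 answers the question practitioners usually care about: the probability that one variant is genuinely better. The conversion counts below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical A/B test: conversions out of visitors (illustrative numbers)
conv_a, n_a = 120, 1000   # variant A
conv_b, n_b = 150, 1000   # variant B

# Beta(1, 1) uniform prior; Beta posterior by conjugacy, sampled directly
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=100_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=100_000)

prob_b_better = np.mean(post_b > post_a)
print(f"P(variant B converts better than A | data) = {prob_b_better:.3f}")
```

Unlike a p-value, this is a direct probability statement about the hypothesis of interest, which makes it easy to feed into a decision rule (e.g. ship B if the probability exceeds 0.95).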
Practical Tips:¶
Always report confidence intervals, not just point estimates
Use cross-validation for model comparison
Consider effect size, not just statistical significance
Larger samples give more power
Bayesian methods useful when you have prior knowledge
Next Steps:¶
Study bootstrap methods for non-parametric inference
Learn about multiple testing corrections (Bonferroni, FDR)
Explore Bayesian inference with MCMC
Understand statistical power analysis
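As a starting point for the first item, a percentile bootstrap CI needs only resampling, with no distributional assumptions. This minimal sketch uses simulated data mirroring this notebook's population (the seed and sample are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)

# A single observed sample (simulated here for illustration)
data = rng.normal(100, 15, size=50)

# Resample with replacement many times; each resample's mean is one draw
# from an approximation of the sampling distribution of the mean
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"Sample mean: {data.mean():.2f}")
print(f"95% bootstrap CI for the mean: [{lo:.2f}, {hi:.2f}]")
```

The same recipe works for medians, accuracy scores, or any statistic where the t-based formula from section 2 does not apply.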