Statistics & Probability for AI/MLΒΆ
Essential statistical concepts every AI practitioner must know
π Table of ContentsΒΆ
IntroductionΒΆ
Statistics is the foundation of machine learning. Understanding probability, distributions, and statistical inference is crucial for:
Model Evaluation: Understanding accuracy, precision, recall
A/B Testing: Comparing model versions
Feature Selection: Identifying important variables
Uncertainty Quantification: Knowing model confidence
Hyperparameter Tuning: Making informed choices
This notebook covers the essential statistics you need for AI/ML work, with practical examples using real datasets.
Probability FundamentalsΒΆ
What is Probability?ΒΆ
Probability is a number between 0 and 1 that expresses how likely an event is to occur.
Example: Rolling a DieΒΆ
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import pandas as pd
# Simulate rolling a fair die 1000 times
np.random.seed(42)
die_rolls = np.random.randint(1, 7, size=1000)
# Calculate probabilities
unique, counts = np.unique(die_rolls, return_counts=True)
probabilities = counts / len(die_rolls)
# Visualize
plt.figure(figsize=(10, 5))
plt.bar(unique, probabilities, alpha=0.7, color='steelblue', edgecolor='black')
plt.axhline(y=1/6, color='red', linestyle='--', label='Theoretical (1/6)')
plt.xlabel('Die Face', fontsize=12)
plt.ylabel('Probability', fontsize=12)
plt.title('Empirical Probability Distribution of Die Rolls', fontsize=14)
plt.legend()
plt.grid(axis='y', alpha=0.3)
plt.show()
print("Observed probabilities:")
for face, prob in zip(unique, probabilities):
print(f"Face {face}: {prob:.3f} (expected: {1/6:.3f})")
β Knowledge Check:
Whatβs the probability of rolling an even number on a fair die?
If you roll two dice, whatβs the probability of getting a sum of 7?
Why do empirical probabilities differ from theoretical ones?
Click for answers
Answer: 3/6 = 0.5 (faces 2, 4, 6 are even)
Answer: 6/36 = 1/6 (combinations: 1+6, 2+5, 3+4, 4+3, 5+2, 6+1)
Answer: Random sampling variation. With infinite rolls, empirical β theoretical
Random VariablesΒΆ
Random Variable: A variable whose value is determined by a random process.
Discrete: Countable values (e.g., number of heads in coin flips)
Continuous: Any value in a range (e.g., height, weight, time)
Sample SpaceΒΆ
The sample space is the set of all possible outcomes.
# Example: Flipping 3 coins
from itertools import product
outcomes = list(product(['H', 'T'], repeat=3))
print(f"Sample space (n={len(outcomes)}):")
for outcome in outcomes:
print(''.join(outcome))
# Random variable: Number of heads
num_heads = [sum([1 for flip in outcome if flip == 'H']) for outcome in outcomes]
print(f"\nNumber of heads: {set(num_heads)}")
# Probability distribution
unique_heads, counts = np.unique(num_heads, return_counts=True)
prob_dist = counts / len(outcomes)
plt.figure(figsize=(8, 5))
plt.bar(unique_heads, prob_dist, alpha=0.7, color='coral', edgecolor='black')
plt.xlabel('Number of Heads', fontsize=12)
plt.ylabel('Probability', fontsize=12)
plt.title('Probability Distribution: Number of Heads in 3 Coin Flips', fontsize=14)
plt.xticks(unique_heads)
plt.grid(axis='y', alpha=0.3)
plt.show()
print("\nProbability distribution:")
for heads, prob in zip(unique_heads, prob_dist):
print(f"P(X = {heads}) = {prob:.3f} = {counts[heads]}/8")
Descriptive StatisticsΒΆ
Measures of Central TendencyΒΆ
Letβs analyze a real dataset: MLB baseball player weights and heights.
# Load baseball player data
# (Using simulated data - in practice, load from data/mlb_height_weight.csv)
np.random.seed(42)
n_players = 1034 # Real dataset size
# Generate realistic player data (based on actual MLB statistics)
heights = np.random.normal(loc=73.7, scale=2.3, size=n_players) # inches
weights = np.random.normal(loc=201.7, scale=21.0, size=n_players) # pounds
mlb_data = pd.DataFrame({
'height': heights,
'weight': weights,
'position': np.random.choice(['1B', '2B', 'SS', '3B', 'C', 'OF', 'P'], n_players)
})
# First 20 weights (like in the reference)
sample_weights = weights[:20]
print("First 20 player weights (lbs):")
print(sample_weights)
print(f"\nMean: {np.mean(sample_weights):.2f}")
print(f"Median: {np.median(sample_weights):.2f}")
Mean, Median, ModeΒΆ
# Calculate statistics
mean_weight = mlb_data['weight'].mean()
median_weight = mlb_data['weight'].median()
mode_weight = mlb_data['weight'].mode()[0] if len(mlb_data['weight'].mode()) > 0 else None
std_weight = mlb_data['weight'].std()
var_weight = mlb_data['weight'].var()
print(f"Weight Statistics:")
print(f" Mean: {mean_weight:.2f} lbs")
print(f" Median: {median_weight:.2f} lbs")
print(f" Std Dev: {std_weight:.2f} lbs")
print(f" Variance: {var_weight:.2f}")
print(f"\nHeight Statistics:")
print(f" Mean: {mlb_data['height'].mean():.2f} inches ({mlb_data['height'].mean()/12:.2f} feet)")
print(f" Median: {mlb_data['height'].median():.2f} inches")
print(f" Std Dev: {mlb_data['height'].std():.2f} inches")
# Visualize distributions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Weight distribution
axes[0].hist(mlb_data['weight'], bins=30, alpha=0.7, color='steelblue', edgecolor='black')
axes[0].axvline(mean_weight, color='red', linestyle='--', label=f'Mean: {mean_weight:.1f}')
axes[0].axvline(median_weight, color='green', linestyle='--', label=f'Median: {median_weight:.1f}')
axes[0].set_xlabel('Weight (lbs)', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Distribution of Player Weights', fontsize=14)
axes[0].legend()
axes[0].grid(axis='y', alpha=0.3)
# Height distribution
axes[1].hist(mlb_data['height'], bins=30, alpha=0.7, color='coral', edgecolor='black')
axes[1].axvline(mlb_data['height'].mean(), color='red', linestyle='--', label=f'Mean: {mlb_data["height"].mean():.1f}"')
axes[1].axvline(mlb_data['height'].median(), color='green', linestyle='--', label=f'Median: {mlb_data["height"].median():.1f}"')
axes[1].set_xlabel('Height (inches)', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title('Distribution of Player Heights', fontsize=14)
axes[1].legend()
axes[1].grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
β Knowledge Check:
When is the median more useful than the mean?
What does a large standard deviation tell you?
Why might mode not be useful for continuous data?
Click for answers
Answer: When there are outliers or skewed distributions. Median is robust to extreme values.
Answer: Data is spread out widely from the mean. High variability.
Answer: Continuous data rarely has exact repeats. Better for categorical data.
Probability DistributionsΒΆ
Normal (Gaussian) DistributionΒΆ
The most important distribution in statistics!
# Fit normal distribution to height data
mu, sigma = mlb_data['height'].mean(), mlb_data['height'].std()
# Generate theoretical distribution
x = np.linspace(mlb_data['height'].min(), mlb_data['height'].max(), 100)
theoretical_dist = stats.norm.pdf(x, mu, sigma)
plt.figure(figsize=(12, 6))
plt.hist(mlb_data['height'], bins=30, density=True, alpha=0.6,
color='steelblue', edgecolor='black', label='Observed')
plt.plot(x, theoretical_dist, 'r-', linewidth=2, label=f'Normal(ΞΌ={mu:.1f}, Ο={sigma:.1f})')
plt.xlabel('Height (inches)', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.title('Player Heights vs Normal Distribution', fontsize=14)
plt.legend()
plt.grid(alpha=0.3)
plt.show()
# 68-95-99.7 Rule
print("68-95-99.7 Rule (Empirical Rule):")
print(f" 68% of data within 1Ο: [{mu-sigma:.1f}, {mu+sigma:.1f}]")
print(f" 95% of data within 2Ο: [{mu-2*sigma:.1f}, {mu+2*sigma:.1f}]")
print(f" 99.7% of data within 3Ο: [{mu-3*sigma:.1f}, {mu+3*sigma:.1f}]")
# Verify
within_1sigma = ((mlb_data['height'] >= mu-sigma) & (mlb_data['height'] <= mu+sigma)).mean()
within_2sigma = ((mlb_data['height'] >= mu-2*sigma) & (mlb_data['height'] <= mu+2*sigma)).mean()
within_3sigma = ((mlb_data['height'] >= mu-3*sigma) & (mlb_data['height'] <= mu+3*sigma)).mean()
print(f"\nActual data:")
print(f" Within 1Ο: {within_1sigma:.1%} (expected: 68%)")
print(f" Within 2Ο: {within_2sigma:.1%} (expected: 95%)")
print(f" Within 3Ο: {within_3sigma:.1%} (expected: 99.7%)")
Other Important DistributionsΒΆ
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# 1. Binomial Distribution
n, p = 10, 0.5
x_binom = np.arange(0, n+1)
pmf_binom = stats.binom.pmf(x_binom, n, p)
axes[0, 0].bar(x_binom, pmf_binom, alpha=0.7, color='steelblue', edgecolor='black')
axes[0, 0].set_title(f'Binomial Distribution (n={n}, p={p})', fontsize=12)
axes[0, 0].set_xlabel('Number of successes')
axes[0, 0].set_ylabel('Probability')
axes[0, 0].grid(axis='y', alpha=0.3)
# 2. Poisson Distribution
lambda_param = 3
x_poisson = np.arange(0, 15)
pmf_poisson = stats.poisson.pmf(x_poisson, lambda_param)
axes[0, 1].bar(x_poisson, pmf_poisson, alpha=0.7, color='coral', edgecolor='black')
axes[0, 1].set_title(f'Poisson Distribution (Ξ»={lambda_param})', fontsize=12)
axes[0, 1].set_xlabel('Number of events')
axes[0, 1].set_ylabel('Probability')
axes[0, 1].grid(axis='y', alpha=0.3)
# 3. Exponential Distribution
lambda_exp = 1.5
x_exp = np.linspace(0, 5, 100)
pdf_exp = stats.expon.pdf(x_exp, scale=1/lambda_exp)
axes[1, 0].plot(x_exp, pdf_exp, linewidth=2, color='green')
axes[1, 0].fill_between(x_exp, pdf_exp, alpha=0.3, color='green')
axes[1, 0].set_title(f'Exponential Distribution (Ξ»={lambda_exp})', fontsize=12)
axes[1, 0].set_xlabel('Time between events')
axes[1, 0].set_ylabel('Density')
axes[1, 0].grid(alpha=0.3)
# 4. Uniform Distribution
a, b = 0, 10
x_uniform = np.linspace(a-1, b+1, 100)
pdf_uniform = stats.uniform.pdf(x_uniform, a, b-a)
axes[1, 1].plot(x_uniform, pdf_uniform, linewidth=2, color='purple')
axes[1, 1].fill_between(x_uniform, pdf_uniform, alpha=0.3, color='purple')
axes[1, 1].set_title(f'Uniform Distribution U({a}, {b})', fontsize=12)
axes[1, 1].set_xlabel('Value')
axes[1, 1].set_ylabel('Density')
axes[1, 1].grid(alpha=0.3)
plt.tight_layout()
plt.show()
Statistical InferenceΒΆ
Confidence IntervalsΒΆ
A confidence interval gives a range where we expect the true population parameter to lie.
# Calculate 95% confidence interval for player heights
confidence = 0.95
sample_mean = mlb_data['height'].mean()
sample_std = mlb_data['height'].std()
sample_size = len(mlb_data)
# Standard error
se = sample_std / np.sqrt(sample_size)
# Confidence interval using t-distribution
ci = stats.t.interval(confidence,
df=sample_size-1,
loc=sample_mean,
scale=se)
print(f"95% Confidence Interval for Player Height:")
print(f" Point estimate (mean): {sample_mean:.2f} inches")
print(f" Standard error: {se:.4f}")
print(f" Confidence interval: [{ci[0]:.2f}, {ci[1]:.2f}] inches")
print(f"\nInterpretation: We are 95% confident that the true average height")
print(f"of all MLB players is between {ci[0]:.2f} and {ci[1]:.2f} inches.")
# Visualize
plt.figure(figsize=(10, 6))
x = np.linspace(sample_mean - 4*se, sample_mean + 4*se, 1000)
y = stats.t.pdf(x, df=sample_size-1, loc=sample_mean, scale=se)
plt.plot(x, y, linewidth=2, color='steelblue', label='Sampling Distribution')
plt.axvline(sample_mean, color='red', linestyle='--', linewidth=2, label='Sample Mean')
plt.axvline(ci[0], color='green', linestyle='--', label='95% CI Bounds')
plt.axvline(ci[1], color='green', linestyle='--')
plt.fill_between(x, 0, y, where=(x >= ci[0]) & (x <= ci[1]), alpha=0.3, color='green')
plt.xlabel('Height (inches)', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.title('Sampling Distribution of Mean Height with 95% CI', fontsize=14)
plt.legend()
plt.grid(alpha=0.3)
plt.show()
Hypothesis TestingΒΆ
Comparing Two GroupsΒΆ
Question: Are first basemen (1B) significantly taller than second basemen (2B)?
# Extract heights for each position
heights_1b = mlb_data[mlb_data['position'] == '1B']['height']
heights_2b = mlb_data[mlb_data['position'] == '2B']['height']
print(f"First Basemen (1B): n={len(heights_1b)}, mean={heights_1b.mean():.2f}")
print(f"Second Basemen (2B): n={len(heights_2b)}, mean={heights_2b.mean():.2f}")
print(f"Difference in means: {heights_1b.mean() - heights_2b.mean():.2f} inches")
# Hypothesis Test
# H0 (null): ΞΌ_1B = ΞΌ_2B (no difference)
# H1 (alternative): ΞΌ_1B β ΞΌ_2B (there is a difference)
t_stat, p_value = stats.ttest_ind(heights_1b, heights_2b)
print(f"\nTwo-Sample t-test:")
print(f" t-statistic: {t_stat:.4f}")
print(f" p-value: {p_value:.4f}")
alpha = 0.05
if p_value < alpha:
print(f"\nβ
Result: REJECT null hypothesis (p < {alpha})")
print(f" First basemen are significantly taller than second basemen.")
else:
print(f"\nβ Result: FAIL TO REJECT null hypothesis (p >= {alpha})")
print(f" No significant difference in heights.")
# Visualize
plt.figure(figsize=(12, 6))
plt.hist(heights_1b, bins=20, alpha=0.5, label=f'1B (n={len(heights_1b)})',
color='steelblue', edgecolor='black')
plt.hist(heights_2b, bins=20, alpha=0.5, label=f'2B (n={len(heights_2b)})',
color='coral', edgecolor='black')
plt.axvline(heights_1b.mean(), color='blue', linestyle='--', linewidth=2)
plt.axvline(heights_2b.mean(), color='red', linestyle='--', linewidth=2)
plt.xlabel('Height (inches)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title(f'Height Distribution: 1B vs 2B (p={p_value:.4f})', fontsize=14)
plt.legend()
plt.grid(axis='y', alpha=0.3)
plt.show()
β Knowledge Check:
What does a p-value tell you?
Whatβs the difference between statistical and practical significance?
Why do we use Ξ± = 0.05?
Correlation & CausationΒΆ
Measuring CorrelationΒΆ
# Calculate correlation between height and weight
correlation = mlb_data['height'].corr(mlb_data['weight'])
print(f"Pearson correlation (height, weight): {correlation:.4f}")
# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(mlb_data['height'], mlb_data['weight'], alpha=0.5, s=20, color='steelblue')
plt.xlabel('Height (inches)', fontsize=12)
plt.ylabel('Weight (lbs)', fontsize=12)
plt.title(f'Height vs Weight (r = {correlation:.3f})', fontsize=14)
plt.grid(alpha=0.3)
# Add regression line
from scipy.stats import linregress
slope, intercept, r_value, p_value, std_err = linregress(mlb_data['height'], mlb_data['weight'])
line_x = np.array([mlb_data['height'].min(), mlb_data['height'].max()])
line_y = slope * line_x + intercept
plt.plot(line_x, line_y, 'r-', linewidth=2, label=f'y = {slope:.2f}x + {intercept:.2f}')
plt.legend()
plt.show()
print(f"\nRegression Results:")
print(f" Slope: {slope:.2f} lbs/inch")
print(f" RΒ²: {r_value**2:.4f}")
print(f" Interpretation: For every 1 inch increase in height,")
print(f" weight increases by {slope:.2f} lbs on average.")
β οΈ Correlation β Causation!ΒΆ
print("Remember:")
print("β
Correlation measures LINEAR relationship strength")
print("β
Range: -1 (perfect negative) to +1 (perfect positive)")
print("β Correlation does NOT imply causation!")
print("β Confounding variables may be the true cause")
print("β Correlation can be spurious (random)")
Practical Applications in MLΒΆ
1. Model Evaluation with Confidence IntervalsΒΆ
# Simulate model accuracy over multiple runs
np.random.seed(42)
n_runs = 30
accuracies = np.random.normal(loc=0.85, scale=0.03, size=n_runs)
mean_acc = accuracies.mean()
ci_acc = stats.t.interval(0.95, len(accuracies)-1,
loc=mean_acc,
scale=stats.sem(accuracies))
print(f"Model Performance (30 runs):")
print(f" Mean accuracy: {mean_acc:.4f}")
print(f" 95% CI: [{ci_acc[0]:.4f}, {ci_acc[1]:.4f}]")
print(f" Std Dev: {accuracies.std():.4f}")
2. A/B Testing ModelsΒΆ
# Compare two model versions
model_a_acc = np.random.normal(0.84, 0.03, 25)
model_b_acc = np.random.normal(0.87, 0.03, 25)
t_stat, p_val = stats.ttest_ind(model_a_acc, model_b_acc)
print(f"\nA/B Test: Model A vs Model B")
print(f" Model A: {model_a_acc.mean():.4f} Β± {model_a_acc.std():.4f}")
print(f" Model B: {model_b_acc.mean():.4f} Β± {model_b_acc.std():.4f}")
print(f" p-value: {p_val:.4f}")
if p_val < 0.05:
print(f" β
Model B is significantly better!")
else:
print(f" β No significant difference")
π― SummaryΒΆ
Key TakeawaysΒΆ
Probability: Foundation for understanding uncertainty
Distributions: Normal distribution is everywhere in ML
Statistics: Mean, median, std describe data
Inference: Confidence intervals quantify uncertainty
Hypothesis Testing: Compare models/features rigorously
Correlation: Measures relationships, NOT causation
Applications: Essential for model evaluation, A/B testing
Next StepsΒΆ
β Practice with real datasets
β Apply hypothesis testing to your ML experiments
β Use confidence intervals in model evaluation
β Be cautious with correlation claims
β Always visualize your data!
π ResourcesΒΆ
Remember: Statistics is a tool for making informed decisions under uncertainty - crucial for AI/ML! π