Probability & Statistics

Probability distributions, Bayes' theorem, hypothesis testing, and statistical inference for AI.

# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import norm, binom, bernoulli, poisson, uniform, expon
import pandas as pd

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
np.random.seed(42)

1. Probability Basics

Probability measures the likelihood of an event occurring.

Key properties:

  • \(0 \leq P(A) \leq 1\) for any event A

  • \(P(\text{certain event}) = 1\)

  • \(P(\text{impossible event}) = 0\)

Basic rules:

  • Complement: \(P(A^c) = 1 - P(A)\)

  • Addition: \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)

  • Multiplication (independent): \(P(A \cap B) = P(A) \cdot P(B)\)
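The rules above can be checked numerically. A minimal sketch, using two illustrative events on a fair die (A = "even roll", B = "roll greater than 3" — both choices are just for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)

# Illustrative events on a fair die
A = rolls % 2 == 0   # even roll: P(A) = 3/6
B = rolls > 3        # roll > 3:  P(B) = 3/6

P_A = A.mean()
P_B = B.mean()
P_A_and_B = (A & B).mean()  # outcomes {4, 6}: 2/6
P_A_or_B = (A | B).mean()   # outcomes {2, 4, 5, 6}: 4/6

# Complement rule: P(A^c) = 1 - P(A)
print(f"P(not A) = {(~A).mean():.3f}  vs  1 - P(A) = {1 - P_A:.3f}")
# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
print(f"P(A or B) = {P_A_or_B:.3f}  vs  {P_A + P_B - P_A_and_B:.3f}")
```

Because the events are represented as boolean arrays, both identities hold exactly at the counting level, not just approximately.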

# Simulate coin flips
def simulate_coin_flips(n_flips=1000, p_heads=0.5):
    """
    Simulate coin flips and track running probability
    """
    flips = np.random.random(n_flips) < p_heads
    running_prob = np.cumsum(flips) / np.arange(1, n_flips + 1)
    return flips, running_prob

# Run simulation
flips, running_prob = simulate_coin_flips(1000, p_heads=0.5)

# Plot
plt.figure(figsize=(12, 5))
plt.plot(running_prob, linewidth=2)
plt.axhline(y=0.5, color='r', linestyle='--', linewidth=2, label='True probability (0.5)')
plt.xlabel('Number of flips', fontsize=12)
plt.ylabel('Observed probability of heads', fontsize=12)
plt.title('Law of Large Numbers: Convergence to True Probability', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.show()

print(f"After {len(flips)} flips:")
print(f"Observed probability: {running_prob[-1]:.4f}")
print(f"True probability: 0.5000")
print(f"Difference: {abs(running_prob[-1] - 0.5):.4f}")

2. Random Variables and Distributions

A random variable is a variable whose value is determined by chance.

Types:

  • Discrete: Takes countable values (e.g., dice roll, number of customers)

  • Continuous: Takes any value in a range (e.g., height, temperature)

Probability Distribution: Describes how probabilities are distributed over values.

Discrete Distribution Example: Dice Roll

A fair six-sided die is the simplest example of a discrete uniform distribution – each outcome has equal probability \(P(X = k) = \frac{1}{6}\). Simulating many rolls and comparing observed frequencies to the theoretical probability illustrates the Law of Large Numbers: as the number of trials grows, the empirical distribution converges to the true distribution. This same principle underpins why training on more data generally improves ML model performance – larger samples give more reliable estimates of the underlying patterns.

# Simulate dice rolls
n_rolls = 10000
dice_rolls = np.random.randint(1, 7, size=n_rolls)

# Count frequencies
unique, counts = np.unique(dice_rolls, return_counts=True)
probabilities = counts / n_rolls

# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
ax1.bar(unique, counts, alpha=0.7, color='skyblue', edgecolor='black')
ax1.set_xlabel('Dice value', fontsize=12)
ax1.set_ylabel('Frequency', fontsize=12)
ax1.set_title(f'Frequency of {n_rolls} Dice Rolls', fontsize=14)
ax1.set_xticks(range(1, 7))
ax1.grid(True, alpha=0.3, axis='y')

# Probability distribution
theoretical_prob = 1/6
ax2.bar(unique, probabilities, alpha=0.7, color='salmon', edgecolor='black', label='Observed')
ax2.axhline(y=theoretical_prob, color='red', linestyle='--', linewidth=2, label='Theoretical (1/6)')
ax2.set_xlabel('Dice value', fontsize=12)
ax2.set_ylabel('Probability', fontsize=12)
ax2.set_title('Probability Distribution', fontsize=14)
ax2.set_xticks(range(1, 7))
ax2.legend(fontsize=11)
ax2.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("Observed probabilities:")
for val, prob in zip(unique, probabilities):
    print(f"  P(X = {val}) = {prob:.4f} (theoretical: {theoretical_prob:.4f})")

3. Expected Value and Variance

Expected Value (Mean): Average value over many trials:

\[E[X] = \mu = \sum_{i} x_i \cdot P(X = x_i)\]

Variance: Measure of spread:

\[\text{Var}(X) = \sigma^2 = E[(X - \mu)^2] = E[X^2] - (E[X])^2\]

Standard Deviation: \(\sigma = \sqrt{\text{Var}(X)}\)

# Expected value and variance for dice roll
# Theoretical values
values = np.array([1, 2, 3, 4, 5, 6])
probs = np.array([1/6] * 6)

# Expected value
E_X = np.sum(values * probs)

# Variance
E_X2 = np.sum(values**2 * probs)
Var_X = E_X2 - E_X**2
Std_X = np.sqrt(Var_X)

print("=== Theoretical Values (Fair Dice) ===")
print(f"Expected Value E[X] = {E_X:.4f}")
print(f"Variance Var(X) = {Var_X:.4f}")
print(f"Standard Deviation σ = {Std_X:.4f}")

# Compare with simulation
print("\n=== Simulated Values ===")
print(f"Sample Mean = {np.mean(dice_rolls):.4f}")
print(f"Sample Variance = {np.var(dice_rolls):.4f}")  # np.var defaults to the population formula (ddof=0)
print(f"Sample Std Dev = {np.std(dice_rolls):.4f}")

4. Common Probability Distributions

4.1 Bernoulli Distribution

Use case: Binary outcome (success/failure, yes/no)

Parameters: \(p\) (probability of success)

PMF: \(P(X = 1) = p\), \(P(X = 0) = 1-p\)

Examples: Coin flip, spam/not spam classification

# Bernoulli distribution
p = 0.7  # Probability of success
samples = bernoulli.rvs(p, size=1000)

plt.figure(figsize=(10, 5))
unique, counts = np.unique(samples, return_counts=True)
plt.bar(unique, counts/len(samples), alpha=0.7, color='teal', edgecolor='black', label='Observed')
plt.bar([0, 1], [1-p, p], alpha=0.3, color='red', edgecolor='red', linewidth=2, label='Theoretical')
plt.xlabel('Outcome', fontsize=12)
plt.ylabel('Probability', fontsize=12)
plt.title(f'Bernoulli Distribution (p={p})', fontsize=14)
plt.xticks([0, 1], ['Failure (0)', 'Success (1)'])
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3, axis='y')
plt.show()

print(f"Expected Value: {bernoulli.mean(p)}")
print(f"Variance: {bernoulli.var(p)}")

4.2 Binomial Distribution

Use case: Number of successes in n independent Bernoulli trials

Parameters: \(n\) (number of trials), \(p\) (probability of success)

PMF: \(P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}\)

Examples: Number of heads in 10 coin flips, number of conversions in 100 website visits

# Binomial distribution
n = 20  # Number of trials
p = 0.3  # Probability of success

# Generate data
x = np.arange(0, n+1)
pmf = binom.pmf(x, n, p)
samples = binom.rvs(n, p, size=10000)

# Plot
plt.figure(figsize=(12, 5))
plt.hist(samples, bins=np.arange(0, n+2)-0.5, density=True, alpha=0.6, 
         color='skyblue', edgecolor='black', label='Simulated')
plt.plot(x, pmf, 'ro-', linewidth=2, markersize=8, label='Theoretical PMF')
plt.xlabel('Number of successes', fontsize=12)
plt.ylabel('Probability', fontsize=12)
plt.title(f'Binomial Distribution (n={n}, p={p})', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.show()

print(f"Expected Value: {binom.mean(n, p):.2f}")
print(f"Variance: {binom.var(n, p):.2f}")
print(f"Standard Deviation: {binom.std(n, p):.2f}")

4.3 Normal (Gaussian) Distribution

Most important distribution in ML!

Parameters: \(\mu\) (mean), \(\sigma\) (standard deviation)

PDF: \(f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}\)

Properties:

  • Bell-shaped, symmetric around mean

  • 68% of data within 1σ, 95% within 2σ, 99.7% within 3σ

Examples: Heights, measurement errors, many natural phenomena

# Normal distribution
mu = 0
sigma = 1

# Generate data
x = np.linspace(-4, 4, 1000)
pdf = norm.pdf(x, mu, sigma)
samples = norm.rvs(mu, sigma, size=10000)

# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# PDF
ax1.plot(x, pdf, 'b-', linewidth=2, label='PDF')
ax1.fill_between(x, pdf, alpha=0.3)

# Mark standard deviations
for i in range(1, 4):
    ax1.axvline(x=mu + i*sigma, color='r', linestyle='--', alpha=0.5)
    ax1.axvline(x=mu - i*sigma, color='r', linestyle='--', alpha=0.5)

ax1.set_xlabel('x', fontsize=12)
ax1.set_ylabel('Probability Density', fontsize=12)
ax1.set_title(f'Normal Distribution (μ={mu}, σ={sigma})', fontsize=14)
ax1.legend(fontsize=11)
ax1.grid(True, alpha=0.3)

# Histogram of samples
ax2.hist(samples, bins=50, density=True, alpha=0.6, color='skyblue', edgecolor='black')
ax2.plot(x, pdf, 'r-', linewidth=2, label='Theoretical PDF')
ax2.set_xlabel('x', fontsize=12)
ax2.set_ylabel('Probability Density', fontsize=12)
ax2.set_title('Histogram of Samples vs Theoretical PDF', fontsize=14)
ax2.legend(fontsize=11)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# 68-95-99.7 rule
print("=== Empirical Rule (68-95-99.7) ===")
within_1_sigma = np.sum((samples >= mu - sigma) & (samples <= mu + sigma)) / len(samples)
within_2_sigma = np.sum((samples >= mu - 2*sigma) & (samples <= mu + 2*sigma)) / len(samples)
within_3_sigma = np.sum((samples >= mu - 3*sigma) & (samples <= mu + 3*sigma)) / len(samples)

print(f"Within 1σ: {within_1_sigma:.2%} (theoretical: 68.27%)")
print(f"Within 2σ: {within_2_sigma:.2%} (theoretical: 95.45%)")
print(f"Within 3σ: {within_3_sigma:.2%} (theoretical: 99.73%)")
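The empirical-rule percentages come from the normal CDF, which scipy exposes directly. A short sketch for the standard normal (μ=0, σ=1):

```python
from scipy.stats import norm

# P(X <= 1): cumulative distribution function
print(norm.cdf(1))                 # ≈ 0.8413
# P(X > 2): the survival function is more accurate than 1 - cdf far in the tails
print(norm.sf(2))                  # ≈ 0.0228
# P(-1 <= X <= 1) recovers the 68% of the empirical rule
print(norm.cdf(1) - norm.cdf(-1))  # ≈ 0.6827
```

For a general normal, pass `loc=mu, scale=sigma` to the same functions.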

Comparing Multiple Normal Distributions

Understanding how the parameters \(\mu\) (mean) and \(\sigma\) (standard deviation) reshape the Gaussian bell curve is essential for ML. Shifting \(\mu\) translates the curve left or right – this corresponds to changing the central tendency of your data (e.g., different classes having different average feature values). Changing \(\sigma\) controls the spread: a small \(\sigma\) produces a tall, narrow peak (high confidence), while a large \(\sigma\) yields a flat, wide curve (high uncertainty). In Bayesian ML and generative models like VAEs, learning these two parameters per dimension is how the model captures both the typical value and the uncertainty of each feature.

# Compare normal distributions with different parameters
x = np.linspace(-10, 10, 1000)

plt.figure(figsize=(14, 6))

# Different means, same variance
plt.subplot(1, 2, 1)
for mu in [-2, 0, 2]:
    pdf = norm.pdf(x, mu, 1)
    plt.plot(x, pdf, linewidth=2, label=f'μ={mu}, σ=1')
plt.xlabel('x', fontsize=12)
plt.ylabel('Probability Density', fontsize=12)
plt.title('Different Means, Same Variance', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)

# Same mean, different variances
plt.subplot(1, 2, 2)
for sigma in [0.5, 1, 2]:
    pdf = norm.pdf(x, 0, sigma)
    plt.plot(x, pdf, linewidth=2, label=f'μ=0, σ={sigma}')
plt.xlabel('x', fontsize=12)
plt.ylabel('Probability Density', fontsize=12)
plt.title('Same Mean, Different Variances', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

5. Bayes’ Theorem

Foundation of probabilistic ML!

\[P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}\]

In ML context:

\[P(\text{hypothesis}|\text{data}) = \frac{P(\text{data}|\text{hypothesis}) \cdot P(\text{hypothesis})}{P(\text{data})}\]

  • Prior: \(P(\text{hypothesis})\) - What we believe before seeing data

  • Likelihood: \(P(\text{data}|\text{hypothesis})\) - How likely is the data given hypothesis

  • Posterior: \(P(\text{hypothesis}|\text{data})\) - Updated belief after seeing data

Example: Medical Test

This classic example reveals one of the most counter-intuitive results in probability – the base rate fallacy. Even a highly accurate test can produce misleading results when the condition being tested for is rare. The scenario below uses a disease prevalence of 1%, a test sensitivity of 95%, and a false positive rate of 10%. Applying Bayes’ theorem shows that a positive result only corresponds to roughly a 9% chance of actually having the disease, because the large number of healthy people generating false positives overwhelms the small number of true positives. In ML, this is directly analogous to class imbalance: a model that predicts the majority class 99% of the time can appear accurate while being useless for detecting the minority class.

# Bayes' Theorem: Medical Test Example

# Given probabilities
P_disease = 0.01              # Prior: 1% have disease
P_positive_given_disease = 0.95  # Sensitivity: Test detects 95% of cases
P_positive_given_healthy = 0.10  # False positive rate: 10%

# Calculate P(positive)
P_healthy = 1 - P_disease
P_positive = (P_positive_given_disease * P_disease + 
              P_positive_given_healthy * P_healthy)

# Apply Bayes' Theorem
P_disease_given_positive = (P_positive_given_disease * P_disease) / P_positive

print("=== Medical Test Bayes' Theorem ===")
print(f"Prior probability of disease: {P_disease:.2%}")
print(f"Probability of positive test: {P_positive:.2%}")
print(f"\nPosterior probability of disease given positive test: {P_disease_given_positive:.2%}")
print("\nSurprising result! Even with a positive test, only ~8.8% chance of having disease.")
print("This is due to the low base rate (1%) and false positives.")

# Visualize
fig, ax = plt.subplots(figsize=(10, 6))

categories = ['Prior\n(Before Test)', 'Posterior\n(After Positive Test)']
probabilities = [P_disease * 100, P_disease_given_positive * 100]

bars = ax.bar(categories, probabilities, color=['skyblue', 'salmon'], 
              edgecolor='black', linewidth=2, alpha=0.7)
ax.set_ylabel('Probability of Disease (%)', fontsize=12)
ax.set_title('Bayes\' Theorem: Updating Beliefs with Evidence', fontsize=14)
ax.grid(True, alpha=0.3, axis='y')

# Add value labels on bars
for bar, prob in zip(bars, probabilities):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{prob:.2f}%', ha='center', va='bottom', fontsize=12, fontweight='bold')

plt.show()
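One way to see the posterior as an "updated prior" is to imagine running the same test a second time, feeding the first posterior back in as the prior. A sketch (assuming the two results are independent given disease status, which real repeat tests may not be):

```python
def bayes_update(prior, sensitivity, false_positive_rate):
    """Posterior P(disease | positive test) from a prior and the test characteristics."""
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

posterior_1 = bayes_update(0.01, 0.95, 0.10)          # after the first positive test
posterior_2 = bayes_update(posterior_1, 0.95, 0.10)   # second positive: posterior becomes the prior

print(f"After one positive test:  {posterior_1:.2%}")
print(f"After two positive tests: {posterior_2:.2%}")
```

A second positive test raises the posterior to roughly 48%: evidence compounds, but even two positives do not make the disease certain when the base rate is this low.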

6. Central Limit Theorem (CLT)

One of the most important theorems in statistics!

Statement: The sampling distribution of the sample mean approaches a normal distribution as sample size increases, regardless of the original distribution.

Implications for ML:

  • Justifies many statistical methods

  • Explains why normal distribution is so common

  • Foundation for confidence intervals and hypothesis testing

# Demonstrate Central Limit Theorem

# Start with a NON-NORMAL distribution (uniform)
def sample_means(population_dist, sample_size, n_samples=10000):
    """
    Draw many samples and compute their means
    """
    means = []
    for _ in range(n_samples):
        sample = np.random.choice(population_dist, size=sample_size)
        means.append(np.mean(sample))
    return np.array(means)

# Create a uniform distribution (very non-normal!)
population = np.random.uniform(0, 10, size=100000)

# Sample with different sample sizes
sample_sizes = [2, 5, 10, 30]

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for idx, n in enumerate(sample_sizes):
    means = sample_means(population, n)
    
    # Plot histogram
    axes[idx].hist(means, bins=50, density=True, alpha=0.7, 
                   color='skyblue', edgecolor='black')
    
    # Overlay normal distribution
    mu_means = np.mean(means)
    sigma_means = np.std(means)
    x = np.linspace(means.min(), means.max(), 100)
    axes[idx].plot(x, norm.pdf(x, mu_means, sigma_means), 
                   'r-', linewidth=2, label='Normal fit')
    
    axes[idx].set_title(f'Sample Size n = {n}', fontsize=12)
    axes[idx].set_xlabel('Sample Mean', fontsize=11)
    axes[idx].set_ylabel('Density', fontsize=11)
    axes[idx].legend(fontsize=10)
    axes[idx].grid(True, alpha=0.3)

fig.suptitle('Central Limit Theorem: Distribution of Sample Means\n' + 
             '(Original distribution is uniform, not normal!)', 
             fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("Notice how the distribution becomes more normal as sample size increases!")
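The CLT also makes a quantitative prediction: the spread of the sample means shrinks like σ/√n (the standard error). A self-contained check against a fresh uniform population (its own RNG, so numbers differ slightly from the plots above):

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.uniform(0, 10, size=100_000)
sigma_pop = population.std()  # for Uniform(0, 10), theory says 10/sqrt(12) ≈ 2.887

for n in [2, 5, 10, 30]:
    # 10,000 sample means, each computed from a sample of size n
    means = rng.choice(population, size=(10_000, n)).mean(axis=1)
    print(f"n={n:>2}: std of sample means = {means.std():.4f}, "
          f"sigma/sqrt(n) = {sigma_pop / np.sqrt(n):.4f}")
```

The two columns track each other closely at every sample size, which is why averaging n measurements reduces noise by a factor of √n.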

7. ML Application: Maximum Likelihood Estimation

Goal: Find parameters that maximize the probability of observing the data.

Likelihood function: \(L(\theta | data) = P(data | \theta)\)

Example: Estimate the probability \(p\) of a biased coin from observed flips.
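Working in log space makes the coin MLE explicit. With \(k\) heads in \(n\) flips, the log-likelihood is

\[\log L(p) = k \log p + (n - k) \log(1 - p)\]

and setting its derivative to zero gives

\[\frac{d}{dp} \log L(p) = \frac{k}{p} - \frac{n - k}{1 - p} = 0 \quad \Rightarrow \quad \hat{p} = \frac{k}{n}\]

so the MLE is just the observed proportion of heads.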

# Maximum Likelihood Estimation for coin flips

# True (unknown) probability
true_p = 0.7

# Generate observed data
n_flips = 100
observed_flips = np.random.random(n_flips) < true_p
n_heads = np.sum(observed_flips)

print(f"Observed: {n_heads} heads out of {n_flips} flips")

# Compute likelihood for different values of p
p_values = np.linspace(0, 1, 1000)
likelihoods = []

for p in p_values:
    # Binomial likelihood: P(k heads in n flips | p) ∝ p^k * (1-p)^(n-k)
    likelihood = p**n_heads * (1-p)**(n_flips - n_heads)
    likelihoods.append(likelihood)

likelihoods = np.array(likelihoods)

# MLE estimate
mle_p = n_heads / n_flips  # For binomial, MLE is simply the sample proportion

# Plot
plt.figure(figsize=(12, 6))
plt.plot(p_values, likelihoods, 'b-', linewidth=2, label='Likelihood')
plt.axvline(x=mle_p, color='r', linestyle='--', linewidth=2, 
            label=f'MLE: p = {mle_p:.3f}')
plt.axvline(x=true_p, color='g', linestyle='--', linewidth=2, 
            label=f'True p = {true_p}')
plt.xlabel('Probability of heads (p)', fontsize=12)
plt.ylabel('Likelihood', fontsize=12)
plt.title(f'Maximum Likelihood Estimation\n({n_heads} heads in {n_flips} flips)', 
          fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.show()

print(f"\nMLE estimate: p = {mle_p:.3f}")
print(f"True value: p = {true_p}")
print(f"Error: {abs(mle_p - true_p):.3f}")

8. ML Application: Naive Bayes Classifier

A simple but powerful classification algorithm using Bayes’ theorem.

Assumption: Features are conditionally independent given the class (β€œnaive”).

\[P(y|x_1, x_2, ..., x_n) \propto P(y) \prod_{i=1}^{n} P(x_i|y)\]

# Simple Naive Bayes example: Email spam classification
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Generate synthetic email data
np.random.seed(42)
n_samples = 1000

# Features: [word_frequency_1, word_frequency_2]
# Class 0: Not spam (ham)
# Class 1: Spam

# Ham emails: low frequency of spam words
ham_features = np.random.normal(loc=2, scale=1, size=(n_samples//2, 2))
ham_labels = np.zeros(n_samples//2)

# Spam emails: high frequency of spam words
spam_features = np.random.normal(loc=6, scale=1.5, size=(n_samples//2, 2))
spam_labels = np.ones(n_samples//2)

# Combine data
X = np.vstack([ham_features, spam_features])
y = np.hstack([ham_labels, spam_labels])

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Naive Bayes classifier
nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)

# Predictions
y_pred = nb_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Visualize
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Plot 1: Data points
ax1.scatter(X_train[y_train==0, 0], X_train[y_train==0, 1], 
           c='blue', alpha=0.6, label='Ham (training)', s=50)
ax1.scatter(X_train[y_train==1, 0], X_train[y_train==1, 1], 
           c='red', alpha=0.6, label='Spam (training)', s=50)
ax1.scatter(X_test[y_test==0, 0], X_test[y_test==0, 1], 
           c='blue', marker='s', alpha=0.8, label='Ham (test)', s=100, edgecolors='black')
ax1.scatter(X_test[y_test==1, 0], X_test[y_test==1, 1], 
           c='red', marker='s', alpha=0.8, label='Spam (test)', s=100, edgecolors='black')
ax1.set_xlabel('Feature 1 (e.g., "free" frequency)', fontsize=12)
ax1.set_ylabel('Feature 2 (e.g., "win" frequency)', fontsize=12)
ax1.set_title('Email Classification Data', fontsize=14)
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)

# Plot 2: Confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax2, 
            xticklabels=['Ham', 'Spam'], yticklabels=['Ham', 'Spam'])
ax2.set_xlabel('Predicted', fontsize=12)
ax2.set_ylabel('Actual', fontsize=12)
ax2.set_title(f'Confusion Matrix\nAccuracy: {accuracy:.2%}', fontsize=14)

plt.tight_layout()
plt.show()

print(f"Naive Bayes Classifier Accuracy: {accuracy:.2%}")
print(f"\nClass priors (learned from training data):")
print(f"  P(Ham) = {nb_classifier.class_prior_[0]:.3f}")
print(f"  P(Spam) = {nb_classifier.class_prior_[1]:.3f}")

9. Practice Exercises

These exercises build fluency with the core probability tools covered in this notebook. Exercise 1 applies the complement rule – often the fastest path to a solution when β€œat least one” appears in a problem. Exercise 2 uses the cumulative distribution function (CDF) of the normal distribution, a computation you will perform routinely when setting confidence thresholds or computing p-values. Exercise 3 asks you to implement variance from scratch, reinforcing the formula \(\text{Var}(X) = E[(X - \mu)^2]\) that underlies every normalization and standardization step in data preprocessing.

# Exercise 1: Calculate probability of rolling at least one 6 in 4 dice rolls
# Hint: Use complement rule - easier to calculate P(no sixes)

# Your solution:
# P_no_six_single = ?
# P_no_six_four_rolls = ?
# P_at_least_one_six = ?

# Exercise 2: Given normal distribution with μ=100, σ=15
# Calculate: P(X > 120)

mu = 100
sigma = 15

# Your solution:
# Use norm.cdf() or norm.sf()
# probability = ?

# Exercise 3: Implement variance calculation from scratch
def calculate_variance(data):
    """
    Calculate variance without using np.var()
    Var(X) = E[(X - μ)^2]
    """
    # Your code here
    pass

# Test
# test_data = np.array([1, 2, 3, 4, 5])
# your_variance = calculate_variance(test_data)
# numpy_variance = np.var(test_data)
# print(f"Your result: {your_variance}")
# print(f"NumPy result: {numpy_variance}")

Summary

You’ve learned:

✅ Probability basics: Rules, complement, independence

✅ Random variables: Discrete and continuous

✅ Expected value and variance: Measures of center and spread

✅ Key distributions: Bernoulli, Binomial, Normal (Gaussian)

✅ Bayes’ theorem: Foundation of probabilistic ML

✅ Central Limit Theorem: Why the normal distribution is so important

✅ Maximum Likelihood Estimation: Finding best parameters

✅ Naive Bayes: Practical probabilistic classifier

Key Takeaways for ML:

  1. Probability quantifies uncertainty in predictions

  2. Normal distribution appears everywhere due to CLT

  3. Bayes’ theorem allows updating beliefs with evidence

  4. MLE is a fundamental parameter estimation method

  5. Understanding distributions helps choose appropriate models

Next Steps:

  • Study Maximum A Posteriori (MAP) estimation

  • Learn about multivariate distributions

  • Explore Bayesian inference and MCMC methods

  • Study information theory (entropy, KL divergence)

  • Practice with real ML datasets