Data Science Interview Prep: The 30 Questions That Actually Come Up
These questions appear in 80% of data science interviews at top tech companies. Each answer includes runnable code, the key intuition, and what interviewers are really testing.
Part 1 of 2 — Q1 to Q15: Statistics, Machine Learning concepts
Part 2 of 2 — Q16 to Q30: See 00_INTERVIEW_PREP_2.ipynb
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier
from sklearn.tree import export_text
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, KFold, TimeSeriesSplit, learning_curve
from sklearn.metrics import (
    confusion_matrix, accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, average_precision_score, log_loss,
    brier_score_loss
)
from sklearn.calibration import calibration_curve  # lives in sklearn.calibration, not sklearn.metrics
from sklearn.datasets import make_classification, make_regression
np.random.seed(42)
plt.rcParams['figure.figsize'] = (10, 4)
plt.rcParams['axes.spines.top'] = False
plt.rcParams['axes.spines.right'] = False
print('Setup complete.')
Q1: Explain p-value to a non-technical stakeholder
Key answer: The p-value is P(data this extreme | H0 true) — the probability of seeing results at least this surprising if the null hypothesis were true.
Common mistake: "p-value is the probability that H0 is true." This is wrong — it reverses the conditional.
Plain English version: "If the coin were fair, how often would we see this many heads just by chance? If that chance is very small (<5%), we have evidence the coin is unfair."
What interviewers want: Can you explain a technical concept without jargon? Do you understand the direction of the conditional probability?
# Simulate coin flip permutation test
observed_heads = 65 # out of 100 flips
n_flips = 100
n_simulations = 10_000  # more simulations -> finer p-value resolution
# Under H0 (fair coin), simulate null distribution
null_distribution = np.random.binomial(n_flips, 0.5, n_simulations)
# Compute p-value (two-tailed)
p_value = np.mean(np.abs(null_distribution - 50) >= np.abs(observed_heads - 50))
fig, ax = plt.subplots()
ax.hist(null_distribution, bins=30, color='steelblue', alpha=0.7, label='Null distribution (fair coin)')
ax.axvline(observed_heads, color='crimson', linewidth=2, label=f'Observed: {observed_heads} heads')
ax.axvline(100 - observed_heads, color='crimson', linewidth=2, linestyle='--', label=f'Mirror: {100-observed_heads} heads')
ax.fill_betweenx([0, 80], 0, 100 - observed_heads, alpha=0.2, color='crimson')
ax.fill_betweenx([0, 80], observed_heads, 100, alpha=0.2, color='crimson')
ax.set_xlabel('Number of heads (out of 100)')
ax.set_ylabel('Count')
ax.set_title('Null Distribution: Coin Flip Permutation Test')
ax.legend()
plt.tight_layout()
plt.show()
print(f"Observed: {observed_heads} heads")
print(f"Simulated p-value: {p_value:.3f}")
print()
print("Plain-English interpretation:")
print(f" 'If the coin were truly fair, we would see {p_value*100:.1f}% of experiments")
print(f" produce results this extreme just by chance.'")
print(f" Since p={p_value:.3f} < 0.05, we have evidence the coin is NOT fair.")
Q2: Type I vs Type II Error — Which is Worse?

| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positive | False Negative (Type II) |
| Actually Negative | False Positive (Type I) | True Negative |

Type I error (α): False positive — you reject H0 when it's actually true.
Type II error (β): False negative — you fail to reject H0 when it's actually false.
"Which is worse?" depends on context:
- Medical test for cancer: Type II is worse (miss a real cancer → patient untreated)
- Spam filter: Type I is worse (block a real email → user frustrated)
- Fraud detection: Type II is worse (miss fraud → financial loss)
What interviewers want: Domain awareness. Knowing the tradeoff is controlled by the classification threshold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
clf = LogisticRegression(random_state=42).fit(X, y)
probs = clf.predict_proba(X)[:, 1]
thresholds = [0.3, 0.5, 0.7]
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
for ax, thresh in zip(axes, thresholds):
preds = (probs >= thresh).astype(int)
cm = confusion_matrix(y, preds)
fp, fn = cm[0, 1], cm[1, 0]
im = ax.imshow(cm, cmap='Blues')
for i in range(2):
for j in range(2):
ax.text(j, i, cm[i, j], ha='center', va='center', fontsize=14, fontweight='bold')
ax.set_xticks([0, 1]); ax.set_yticks([0, 1])
ax.set_xticklabels(['Pred Neg', 'Pred Pos'])
ax.set_yticklabels(['Actual Neg', 'Actual Pos'])
ax.set_title(f'Threshold = {thresh}\nFP={fp} (Type I), FN={fn} (Type II)')
plt.tight_layout()
plt.show()
print("In fraud detection: prefer LOW Type II (catch fraud). Lower threshold.")
print("In spam filter: prefer LOW Type I (don't block real emails). Higher threshold.")
Q3: Explain the Central Limit Theorem — Why Does It Matter for ML?
CLT: The sampling distribution of the mean of any distribution with finite variance approaches Normal as sample size n → ∞, regardless of the population's shape.
Formula: If X has mean μ and variance σ², then \(\bar{X}_n \approx N(\mu, \sigma^2/n)\) for large n.
Why it matters for ML:
Confidence intervals for model performance metrics are valid
Hypothesis tests (t-tests on model outputs) are justified
Bootstrapping works because resampled means converge to Normal
Coefficient estimates in linear regression are approximately Normal in large samples even if the residuals aren't
What interviewers want: Know itβs about means, not individual observations. Know the practical ML implications.
population = np.random.exponential(scale=2, size=100000) # Highly right-skewed
sample_sizes = [5, 30, 100]
n_samples = 1000
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
axes[0].hist(population, bins=60, color='coral', density=True)
axes[0].set_title('Population\n(Exponential → skewed)')
axes[0].set_xlabel('Value')
for ax, n in zip(axes[1:], sample_sizes):
sample_means = [np.mean(np.random.choice(population, n)) for _ in range(n_samples)]
ax.hist(sample_means, bins=40, density=True, color='steelblue', alpha=0.7)
mu, sigma = np.mean(sample_means), np.std(sample_means)
x = np.linspace(mu - 4*sigma, mu + 4*sigma, 200)
ax.plot(x, stats.norm.pdf(x, mu, sigma), 'r-', linewidth=2, label='Normal fit')
    ax.set_title(f'Sample Means (n={n})\nμ={mu:.2f}, σ={sigma:.3f}')
ax.set_xlabel('Sample mean')
ax.legend(fontsize=8)
plt.suptitle('Central Limit Theorem: Exponential Population → Normal Sample Means', y=1.02)
plt.tight_layout()
plt.show()
print("As n increases, sample means converge to Normal — even from a skewed population.")
Q4: A/B Test Gives p=0.04. What Do You Do?
NOT just "ship it". The full checklist:
Sample Ratio Mismatch (SRM): Is traffic split actually 50/50? Imbalance signals instrumentation bugs.
Practical significance: Is the effect size meaningful? A 0.001% lift with p=0.04 on 10M users is statistically significant but operationally irrelevant.
Peeking: Did you stop early because p crossed 0.05? That inflates Type I error.
Multiple metrics: Did you test 20 metrics and only report the one that passed?
Novelty effects: Will the lift persist or fade after users adjust?
Segment analysis: Does the treatment harm any subgroup?
What interviewers want: You don't treat p-values as a binary pass/fail. You think about the full decision.
from scipy import stats as scipy_stats
# Statistically significant but meaningless result
n_per_group = 500_000
control_mean, treatment_mean = 10.000, 10.030  # 0.3% lift — tiny, but detectable at this n
std = 5.0
control = np.random.normal(control_mean, std, n_per_group)
treatment = np.random.normal(treatment_mean, std, n_per_group)
t_stat, p_val = scipy_stats.ttest_ind(control, treatment)
# Cohen's d (effect size)
pooled_std = np.sqrt((np.std(control)**2 + np.std(treatment)**2) / 2)
cohens_d = (np.mean(treatment) - np.mean(control)) / pooled_std
# Visualization
fig, ax = plt.subplots()
sem = std / np.sqrt(n_per_group)
x = np.linspace(control_mean - 4*sem, treatment_mean + 4*sem, 300)
ax.plot(x, scipy_stats.norm.pdf(x, control_mean, sem), label='Control', color='steelblue')
ax.plot(x, scipy_stats.norm.pdf(x, treatment_mean, sem), label='Treatment', color='coral')
ax.set_xlabel('Mean value'); ax.set_ylabel('Density')
ax.set_title('Distributions of sample means (n=500,000 per group)')
ax.legend()
plt.tight_layout()
plt.show()
sig = "statistically significant (p < 0.05)" if p_val < 0.05 else "not statistically significant"
print(f"p-value: {p_val:.4f} → {sig}")
print(f"Cohen's d: {cohens_d:.4f} → negligible effect (|d| < 0.2 is small)")
print(f"Lift: {(treatment_mean - control_mean) / control_mean * 100:.2f}%")
print()
print(f"Conclusion: effect size d={cohens_d:.4f} is negligible.")
print("   Statistical significance ≠ practical significance.")
Q5: Bayesian vs Frequentist Inference

| | Frequentist | Bayesian |
|---|---|---|
| Parameters | Fixed, unknown constants | Random variables with distributions |
| Probability | Long-run frequency | Degree of belief |
| Output | Confidence interval, p-value | Posterior distribution |
| Prior info | Ignored | Explicitly incorporated |

Frequentist: P(data | θ) — likelihood of the data given a fixed parameter.
Bayesian: P(θ | data) ∝ P(data | θ) × P(θ) — posterior ∝ likelihood × prior.
Key insight: With enough data, both approaches converge. Bayesian shines when data is scarce and prior knowledge is valuable (A/B testing with short history, rare disease prevalence).
What interviewers want: Understand the philosophical difference AND when each is practical.
from scipy.stats import beta as beta_dist
def plot_bayes_vs_freq(heads, total, ax, prior_a=1, prior_b=1):
    p_hat = heads / total
    # Frequentist: normal-approximation (Wald) 95% CI
    se = np.sqrt(p_hat * (1 - p_hat) / total)
    ci_lo, ci_hi = p_hat - 1.96 * se, p_hat + 1.96 * se
# Bayesian: Beta posterior
post_a, post_b = prior_a + heads, prior_b + (total - heads)
x = np.linspace(0, 1, 300)
ax.plot(x, beta_dist.pdf(x, post_a, post_b), color='coral', label=f'Bayesian posterior Beta({post_a},{post_b})')
ax.axvline(p_hat, color='steelblue', linestyle='--', linewidth=2, label=f'Freq MLE: {p_hat:.2f}')
ax.axvspan(ci_lo, ci_hi, alpha=0.2, color='steelblue', label=f'95% CI [{ci_lo:.2f}, {ci_hi:.2f}]')
ax.set_title(f'n={total}, {heads} heads')
ax.legend(fontsize=7)
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (h, t) in zip(axes, [(7, 10), (70, 100), (700, 1000)]):
plot_bayes_vs_freq(h, t, ax)
plt.suptitle('Bayesian vs Frequentist — Coin Flip (70% heads rate)', y=1.02)
plt.tight_layout()
plt.show()
print("As n grows, Bayesian posterior and frequentist CI converge to the same answer.")
print("With small n, the prior (Beta(1,1)=uniform) regularizes the Bayesian estimate.")
Q6: Multiple Comparisons Problem
Problem: If you run 20 tests at α=0.05, the probability of at least one false positive is \(P(\text{at least 1 FP}) = 1 - (1-0.05)^{20} \approx 64\%\).
Solutions:
- Bonferroni: α_adjusted = α / m (very conservative, controls FWER)
- Benjamini-Hochberg (BH-FDR): Controls the false discovery rate — less conservative, better power
When it matters in DS: A/B testing multiple metrics, feature selection via univariate tests, genomics, running many experiments.
What interviewers want: Awareness of the problem AND knowledge of at least one correction method.
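The BH step-up procedure is short enough to sketch by hand — an illustrative implementation (in practice, `multipletests(..., method='fdr_bh')` from statsmodels, used in the cell below, is the right tool):

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of hypotheses rejected under BH-FDR."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                            # rank p-values ascending
    thresholds = alpha * np.arange(1, m + 1) / m     # BH critical values: (i/m) * alpha
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])             # largest rank i with p_(i) <= (i/m)*alpha
        reject[order[:k + 1]] = True                 # reject everything up to that rank
    return reject

# Example: two small p-values among noise
pvals = [0.001, 0.008, 0.039, 0.041, 0.28, 0.61, 0.9]
print(benjamini_hochberg(pvals))  # rejects only the two smallest
```

Note the "step-up" logic: once the largest qualifying rank k is found, all smaller p-values are rejected too, even if some individually miss their own threshold.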
from statsmodels.stats.multitest import multipletests
n_tests = 20
alpha = 0.05
n_experiments = 1000
# Simulate p-values under H0 (all null)
false_positive_counts = []
for _ in range(n_experiments):
p_vals = np.random.uniform(0, 1, n_tests) # Under H0, p-values are uniform
false_positive_counts.append(np.sum(p_vals < alpha))
pct_any_fp = np.mean(np.array(false_positive_counts) > 0) * 100
# Show correction on one example
example_pvals = np.random.uniform(0, 1, n_tests)
_, bonf_pvals, _, _ = multipletests(example_pvals, alpha=alpha, method='bonferroni')
_, bh_pvals, _, _ = multipletests(example_pvals, alpha=alpha, method='fdr_bh')
print(f"Running {n_tests} tests at alpha={alpha} — all under H0 (no true effects)")
print(f"Expected false positives per experiment: {n_tests * alpha:.1f}")
print(f"% of experiments with >= 1 false positive: {pct_any_fp:.1f}%")
print(f"(Theoretical: {(1-(1-alpha)**n_tests)*100:.1f}%)")
print()
print(f"{'Method':<25} {'Alpha per test':<20} {'FP in example':<20}")
print("-" * 65)
print(f"{'No correction':<25} {alpha:<20.3f} {np.sum(example_pvals < alpha):<20}")
print(f"{'Bonferroni':<25} {alpha/n_tests:<20.4f} {np.sum(bonf_pvals < alpha):<20}")
print(f"{'BH-FDR':<25} {'variable':<20} {np.sum(bh_pvals < alpha):<20}")
Q7: Explain the Bias-Variance Tradeoff
Decomposition of expected prediction error: \(\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}\)
High Bias (underfitting): Model too simple → misses patterns in both train and test data.
High Variance (overfitting): Model too complex → memorizes training noise, fails on test data.
Sweet spot: Model complexity balanced to minimize total test error.

| | Train Error | Test Error | Remedy |
|---|---|---|---|
| High Bias | High | High | Add features, increase complexity |
| High Variance | Low | High | Regularize, get more data, reduce features |
| Good fit | Low | ~Low | Balanced |
What interviewers want: Know the equation. Diagnose from train/test error. Know the remedy for each.
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
np.random.seed(42)
X = np.linspace(0, 4 * np.pi, 200).reshape(-1, 1)
y = np.sin(X.ravel()) + np.random.normal(0, 0.3, 200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
degrees = [1, 5, 15]
colors = ['coral', 'steelblue', 'green']
labels = ['Degree 1 (High Bias)', 'Degree 5 (Sweet Spot)', 'Degree 15 (High Variance)']
X_plot = np.linspace(0, 4 * np.pi, 300).reshape(-1, 1)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].scatter(X_train, y_train, alpha=0.4, s=15, color='gray', label='Train data')
train_errs, test_errs = [], []
for deg, col, lab in zip(degrees, colors, labels):
model = make_pipeline(PolynomialFeatures(deg), LinearRegression())
model.fit(X_train, y_train)
axes[0].plot(X_plot, model.predict(X_plot), color=col, linewidth=2, label=lab)
train_errs.append(mean_squared_error(y_train, model.predict(X_train)))
test_errs.append(mean_squared_error(y_test, model.predict(X_test)))
axes[0].set_title('Model Fits by Polynomial Degree'); axes[0].legend(fontsize=8)
axes[1].plot(degrees, train_errs, 'o-', color='steelblue', label='Train MSE')
axes[1].plot(degrees, test_errs, 'o-', color='coral', label='Test MSE')
axes[1].annotate('Underfitting', xy=(1, test_errs[0]), xytext=(1.3, test_errs[0]+0.02))
axes[1].annotate('Sweet spot', xy=(5, test_errs[1]), xytext=(5.5, test_errs[1]+0.02))
axes[1].annotate('Overfitting', xy=(15, test_errs[2]), xytext=(12, test_errs[2]+0.02))
axes[1].set_xlabel('Polynomial degree'); axes[1].set_ylabel('MSE')
axes[1].set_title('Bias-Variance Tradeoff: Train vs Test Error'); axes[1].legend()
plt.tight_layout(); plt.show()
Q8: How Does Gradient Descent Work? What Are the Variants?
Core idea: Iteratively move in the direction of steepest descent of the loss function: \(\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)\)

| Variant | Data per step | Pros | Cons |
|---|---|---|---|
| Batch GD (BGD) | Full dataset | Stable, exact gradient | Slow, memory-heavy |
| Stochastic GD (SGD) | 1 sample | Fast updates, escapes local minima | Noisy, unstable |
| Mini-batch GD | 32–512 samples | Best of both | Hyperparameter (batch size) |
| Adam | Mini-batch | Adaptive per-param LR, momentum | More memory |
Learning rate sensitivity: Too high → diverge. Too low → slow convergence. Adam adapts per-parameter automatically.
What interviewers want: Understand why SGD is noisy (estimates gradient from 1 point). Know Adam uses first and second moment estimates.
def f(x): return x**4 - 3*x**3 + 2
def df(x): return 4*x**3 - 9*x**2
def run_gd(lr, n_steps=50, noisy=False, start=3.5):
x, path = start, [start]
for _ in range(n_steps):
noise = np.random.normal(0, 0.5) if noisy else 0
x = x - lr * (df(x) + noise)
x = np.clip(x, -1, 5)
path.append(x)
return path
def run_adam(lr=0.1, n_steps=50, start=3.5, beta1=0.9, beta2=0.999, eps=1e-8):
x, m, v, path = start, 0, 0, [start]
for t in range(1, n_steps + 1):
g = df(x)
m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * g**2
m_hat, v_hat = m / (1 - beta1**t), v / (1 - beta2**t)
x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
path.append(x)
return path
np.random.seed(0)
bgd = run_gd(lr=0.02, noisy=False)
sgd = run_gd(lr=0.02, noisy=True)
adam = run_adam(lr=0.1)
diverge = run_gd(lr=0.5, noisy=False)
x_range = np.linspace(-0.5, 4, 300)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].plot(x_range, f(x_range), 'k-', linewidth=2, label='f(x)')
for path, col, lab in [(bgd,'steelblue','BGD'), (sgd,'coral','SGD (noisy)'), (adam,'green','Adam')]:
axes[0].plot([p for p in path], [f(p) for p in path], 'o-', color=col, alpha=0.6, markersize=3, label=lab)
axes[0].set_title('Convergence Paths on f(x) = x⁴ - 3x³ + 2'); axes[0].legend(); axes[0].set_xlabel('x'); axes[0].set_ylabel('f(x)')
steps = range(len(bgd))
axes[1].plot(steps, [f(p) for p in bgd], label='BGD lr=0.02', color='steelblue')
axes[1].plot(steps, [f(p) for p in sgd], label='SGD lr=0.02 (noisy)', color='coral', alpha=0.7)
axes[1].plot(steps, [f(p) for p in adam], label='Adam lr=0.1', color='green')
axes[1].set_title('Loss vs Iterations'); axes[1].set_xlabel('Step'); axes[1].set_ylabel('f(x)'); axes[1].legend()
plt.tight_layout(); plt.show()
# The lr=0.5 run never settles — it overshoots and bounces between the clip bounds
print(f"lr=0.5 (too high): final f(x) = {f(diverge[-1]):.1f} vs Adam's {f(adam[-1]):.3f}")
Q9: L1 vs L2 Regularization
Both add a penalty to the loss function to prevent overfitting:

| | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Penalty | \(\lambda \sum_j \lvert\beta_j\rvert\) | \(\lambda \sum_j \beta_j^2\) |
| Effect | Sparse — exact zeros | Shrinks all coefficients smoothly |
| Use when | Feature selection needed | All features relevant |
| Geometry | Diamond constraint — corners | Sphere constraint — no corners |

Key intuition: The L1 constraint region has corners on the axes, so solutions often land on a corner (some coefficients exactly 0). The L2 sphere has no corners, so coefficients shrink toward zero but rarely hit it exactly.
Elastic Net: Combines both — a good default when unsure.
What interviewers want: Geometric intuition for why L1 gives sparsity. Know when to use each.
from sklearn.preprocessing import StandardScaler
# Dataset: 5 relevant features, 15 noise features
np.random.seed(42)
n, p_true, p_noise = 200, 5, 15
X_feat = np.random.randn(n, p_true + p_noise)
true_coef = np.array([3, -2, 1.5, -1, 0.5] + [0]*p_noise)
y_feat = X_feat @ true_coef + np.random.randn(n) * 0.5
X_feat = StandardScaler().fit_transform(X_feat)
alphas = np.logspace(-3, 1, 100)
lasso_coefs = [Lasso(alpha=a, max_iter=5000).fit(X_feat, y_feat).coef_ for a in alphas]
ridge_coefs = [Ridge(alpha=a).fit(X_feat, y_feat).coef_ for a in alphas]
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for coef_path, ax, title in [(lasso_coefs, axes[0], 'Lasso (L1) β Coefficient Paths'),
(ridge_coefs, axes[1], 'Ridge (L2) β Coefficient Paths')]:
coef_arr = np.array(coef_path)
for j in range(p_true):
ax.plot(np.log10(alphas), coef_arr[:, j], linewidth=2, label=f'Relevant feat {j+1}')
for j in range(p_true, p_true + p_noise):
ax.plot(np.log10(alphas), coef_arr[:, j], 'gray', alpha=0.3, linewidth=0.8)
ax.axhline(0, color='black', linestyle='--', linewidth=0.8)
ax.set_xlabel('log10(alpha)'); ax.set_ylabel('Coefficient value')
ax.set_title(title); ax.legend(fontsize=7)
ax.invert_xaxis()
plt.tight_layout(); plt.show()
print("Lasso: noise feature coefficients hit exactly 0 as alpha increases (sparsity).")
print("Ridge: all coefficients shrink toward 0 but never reach exactly 0.")
Q10: How Does Random Forest Work?
Two sources of randomness → decorrelated trees:
Bootstrap sampling: Each tree trained on a random sample with replacement (~63% unique samples).
Feature subsampling: At each split, only a random subset of features is considered (typically √p for classification).
Why averaging decorrelated trees reduces variance:
- For n identically distributed trees with variance σ² and pairwise correlation ρ: \(\text{Var(mean)} = \rho\sigma^2 + \frac{1-\rho}{n}\sigma^2\)
- Feature subsampling drives ρ down, pushing the variance toward σ²/n
Out-of-bag (OOB) error: The ~37% of samples not used to train each tree serve as a free validation set. No need for separate cross-validation.
Feature importance: Mean decrease in impurity (Gini/entropy) across all trees and splits — watch out for bias toward high-cardinality features.
What interviewers want: Both sources of randomness. Why it reduces variance not bias. OOB as built-in CV.
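The variance-reduction claim can be checked numerically. For n identically distributed predictions with variance σ² and pairwise correlation ρ, Var(mean) = ρσ² + (1−ρ)σ²/n; a minimal simulation with made-up numbers (not tied to any real forest):

```python
import numpy as np

def var_of_mean(rho, n_trees=50, sigma2=1.0, n_sims=20000, seed=0):
    """Empirical Var(mean) of n_trees correlated 'tree predictions'."""
    rng = np.random.default_rng(seed)
    cov = np.full((n_trees, n_trees), rho * sigma2)  # pairwise covariance rho*sigma^2
    np.fill_diagonal(cov, sigma2)                    # unit variance on the diagonal
    sims = rng.multivariate_normal(np.zeros(n_trees), cov, size=n_sims)
    return sims.mean(axis=1).var()

for rho in [0.0, 0.3, 0.8]:
    theory = rho + (1 - rho) / 50  # rho*sigma^2 + (1-rho)*sigma^2/n, with sigma^2 = 1
    print(f"rho={rho:.1f}: empirical Var(mean)={var_of_mean(rho):.4f}, theory={theory:.4f}")
```

With ρ=0 the variance collapses to σ²/n = 0.02; with ρ=0.8 it barely drops below 0.8 no matter how many trees you add — which is exactly why decorrelation matters more than tree count.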
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import cross_val_score
X_rf, y_rf = make_classification(n_samples=1000, n_features=10, n_informative=5,
n_redundant=2, random_state=42)
feature_names = [f'feat_{i}' for i in range(10)]
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf.fit(X_rf, y_rf)
cv_scores = cross_val_score(rf, X_rf, y_rf, cv=5)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Feature importance
importances = rf.feature_importances_
sorted_idx = np.argsort(importances)[::-1]
axes[0].bar(range(10), importances[sorted_idx], color='steelblue')
axes[0].set_xticks(range(10))
axes[0].set_xticklabels([feature_names[i] for i in sorted_idx], rotation=45)
axes[0].set_title('Random Forest Feature Importances'); axes[0].set_ylabel('Mean Decrease in Impurity')
# OOB vs CV
axes[1].bar(['OOB Score', '5-Fold CV Mean'], [rf.oob_score_, cv_scores.mean()],
yerr=[0, cv_scores.std()], color=['coral', 'steelblue'], capsize=5)
axes[1].set_ylim(0.8, 1.0); axes[1].set_ylabel('Accuracy')
axes[1].set_title('OOB Score ≈ Cross-Validation Score')
for i, v in enumerate([rf.oob_score_, cv_scores.mean()]):
axes[1].text(i, v + 0.002, f'{v:.3f}', ha='center', fontweight='bold')
plt.tight_layout(); plt.show()
print(f"OOB Score: {rf.oob_score_:.4f}")
print(f"5-Fold CV: {cv_scores.mean():.4f} Β± {cv_scores.std():.4f}")
print("Both are similar — OOB is a free and reliable validation estimate.")
Q11: Gradient Boosting vs Bagging

| | Bagging (Random Forest) | Boosting (GBM/XGBoost) |
|---|---|---|
| Tree construction | Parallel, independent | Sequential, each corrects the prior |
| Error reduced | Variance | Bias |
| Overfitting risk | Lower | Higher (needs early stopping) |
| Speed | Faster (parallelizable) | Slower (sequential) |
| Interpretability | Similar | Similar |
Gradient boosting mechanics:
Fit a weak learner to the data → get residuals
Fit next learner to residuals (= negative gradient of loss)
Add scaled prediction to ensemble
Repeat M times
Key hyperparameters: n_estimators, learning_rate, max_depth (shallower β less variance), subsample.
What interviewers want: Sequential vs parallel. Boosting reduces bias. Trade-off: needs careful tuning.
from sklearn.tree import DecisionTreeRegressor
# Manual 3-round gradient boosting on regression
np.random.seed(42)
X_gb = np.linspace(0, 2*np.pi, 100).reshape(-1, 1)
y_gb = np.sin(X_gb.ravel()) + np.random.randn(100) * 0.2
lr_gb = 0.5
ensemble_pred = np.zeros(100)
fig, axes = plt.subplots(1, 4, figsize=(18, 4))
for i, ax in enumerate(axes):
residuals = y_gb - ensemble_pred
tree = DecisionTreeRegressor(max_depth=3)
tree.fit(X_gb, residuals)
ensemble_pred += lr_gb * tree.predict(X_gb)
ax.scatter(X_gb, y_gb, s=15, alpha=0.5, label='Data')
ax.plot(X_gb, ensemble_pred, 'r-', linewidth=2, label=f'After {i+1} rounds')
mse = np.mean((y_gb - ensemble_pred)**2)
ax.set_title(f'Round {i+1}, MSE={mse:.3f}'); ax.legend(fontsize=7)
plt.suptitle('Manual Gradient Boosting: Fitting Residuals Sequentially', y=1.02)
plt.tight_layout(); plt.show()
# GBM vs RF learning curves
X_lc, y_lc = make_classification(n_samples=1000, n_features=20, random_state=42)
n_est_range = [10, 25, 50, 75, 100]
rf_scores = [cross_val_score(RandomForestClassifier(n_estimators=n, random_state=42), X_lc, y_lc, cv=3).mean() for n in n_est_range]
gbm_scores = [cross_val_score(GradientBoostingClassifier(n_estimators=n, random_state=42), X_lc, y_lc, cv=3).mean() for n in n_est_range]
print('n_estimators:', n_est_range)
print('RF CV scores: ', [f'{s:.3f}' for s in rf_scores])
print('GBM CV scores:', [f'{s:.3f}' for s in gbm_scores])
Q12: Classification Metrics — Which to Use When?

| Metric | Use when |
|---|---|
| Accuracy | Balanced classes, equal error costs |
| Precision | FP is costly (spam filter, irrelevant ads) |
| Recall | FN is costly (fraud, cancer detection) |
| F1 | Balance precision/recall, imbalanced data |
| ROC-AUC | Ranking quality, insensitive to threshold |
| PR-AUC | Imbalanced data, care about positive class |
| Log loss | Probabilistic model evaluation |
Accuracy paradox: On a 95/5 imbalanced dataset, always predicting the majority class gives 95% accuracy — but the model is useless for finding the minority class.
ROC-AUC vs PR-AUC: ROC-AUC can look optimistic on imbalanced data (the large TN count inflates it). PR-AUC is more informative when the positive class is rare.
What interviewers want: Know when accuracy is misleading. Know at least one imbalance-aware metric. Understand threshold-based vs threshold-free metrics.
# 95/5 imbalanced dataset
X_imb, y_imb = make_classification(n_samples=2000, weights=[0.95, 0.05],
n_features=10, random_state=42)
from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(X_imb, y_imb, test_size=0.3, random_state=42)
clf_imb = LogisticRegression(random_state=42).fit(X_tr, y_tr)
y_pred = clf_imb.predict(X_te)
y_prob = clf_imb.predict_proba(X_te)[:, 1]
y_majority = np.zeros_like(y_te) # Always predict majority class
print(f"Class distribution: {np.bincount(y_te)} (neg/pos)")
print()
print(f"{'Metric':<15} {'LR Model':>12} {'Always Negative':>18}")
print("-" * 48)
print(f"{'Accuracy':<15} {accuracy_score(y_te, y_pred):>12.3f} {accuracy_score(y_te, y_majority):>18.3f}")
print(f"{'Precision':<15} {precision_score(y_te, y_pred, zero_division=0):>12.3f} {precision_score(y_te, y_majority, zero_division=0):>18.3f}")
print(f"{'Recall':<15} {recall_score(y_te, y_pred, zero_division=0):>12.3f} {recall_score(y_te, y_majority, zero_division=0):>18.3f}")
print(f"{'F1':<15} {f1_score(y_te, y_pred, zero_division=0):>12.3f} {f1_score(y_te, y_majority, zero_division=0):>18.3f}")
print(f"{'ROC-AUC':<15} {roc_auc_score(y_te, y_prob):>12.3f} {'0.500':>18}")
print(f"{'PR-AUC':<15} {average_precision_score(y_te, y_prob):>12.3f} {'0.050 (base)':>18}")
print()
print("Accuracy paradox: 'Always negative' model is 95% accurate but catches 0 fraud cases.")
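To see the ROC-AUC vs PR-AUC point directly, we can hold the generative process fixed and make the positive class progressively rarer — a sketch with illustrative parameter choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

scores = {}
for pos_rate in [0.5, 0.1, 0.02]:
    # Same feature structure, increasingly rare positive class, 5% label noise
    X, y = make_classification(n_samples=4000, n_features=10,
                               weights=[1 - pos_rate, pos_rate],
                               flip_y=0.05, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=0)
    prob = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    scores[pos_rate] = (roc_auc_score(y_te, prob),
                        average_precision_score(y_te, prob))
    print(f"positives={pos_rate:4.0%}  ROC-AUC={scores[pos_rate][0]:.3f}  "
          f"PR-AUC={scores[pos_rate][1]:.3f}")
```

The expected pattern: ROC-AUC stays in roughly the same range across the three runs, while PR-AUC degrades as positives get rare — because PR-AUC's baseline is the positive rate itself, not 0.5.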
Q13: Class Imbalance — How Do You Handle It?
Strategies (in order of preference):
1. Change the metric first — don't optimize accuracy. Use PR-AUC, F1, recall.
2. Class weights (`class_weight='balanced'`) — penalize misclassifying the minority class more.
3. Threshold tuning — move the decision boundary below 0.5 to increase recall.
4. Resampling:
   - SMOTE: Synthesize new minority samples via interpolation
   - Undersampling: Remove majority samples (risk losing information)
5. Algorithm choice — tree-based methods with `class_weight` often outperform resampling.

Accuracy paradox: A model that always predicts the majority class gets 97% accuracy on a 97/3 split — but has zero predictive value.
What interviewers want: Know multiple strategies. Know SMOTE. Know that resampling is often less effective than people think. Always change the metric first.
try:
from imblearn.over_sampling import SMOTE
smote_available = True
except ImportError:
smote_available = False
X_ib, y_ib = make_classification(n_samples=3000, weights=[0.97, 0.03],
n_features=10, n_informative=5, random_state=42)
X_ibtr, X_ibte, y_ibtr, y_ibte = train_test_split(X_ib, y_ib, test_size=0.3, random_state=42)
results = {}
# No fix
m = LogisticRegression(random_state=42).fit(X_ibtr, y_ibtr)
results['No fix'] = average_precision_score(y_ibte, m.predict_proba(X_ibte)[:, 1])
# Class weight
m = LogisticRegression(class_weight='balanced', random_state=42).fit(X_ibtr, y_ibtr)
results['class_weight=balanced'] = average_precision_score(y_ibte, m.predict_proba(X_ibte)[:, 1])
# Threshold tuning changes threshold metrics (F1, recall) but NOT PR-AUC,
# which is threshold-free — so report F1 at two thresholds instead
m = LogisticRegression(random_state=42).fit(X_ibtr, y_ibtr)
prob = m.predict_proba(X_ibte)[:, 1]
f1_default = f1_score(y_ibte, (prob >= 0.5).astype(int))
f1_tuned = f1_score(y_ibte, (prob >= 0.2).astype(int))
# SMOTE
if smote_available:
    X_sm, y_sm = SMOTE(random_state=42).fit_resample(X_ibtr, y_ibtr)
    m = LogisticRegression(random_state=42).fit(X_sm, y_sm)
    results['SMOTE'] = average_precision_score(y_ibte, m.predict_proba(X_ibte)[:, 1])
else:
    results['SMOTE (not installed)'] = float('nan')
print(f"Dataset: {np.bincount(y_ibtr)} train samples (neg/pos) — 97/3 split")
print(f"Accuracy paradox: always predict negative = {np.mean(y_ibte==0)*100:.1f}% accuracy")
print()
print(f"{'Method':<30} {'PR-AUC':>10}")
print("-" * 42)
for method, score in results.items():
    print(f"{method:<30} {score:>10.4f}")
print()
print(f"Threshold tuning: F1 = {f1_default:.3f} at 0.5 vs {f1_tuned:.3f} at 0.2")
Q14: Cross-Validation — When NOT to Use k-Fold?
Standard k-fold assumes: Observations are i.i.d. (independent and identically distributed).
When this breaks:
- Time series data: Future data leaks into training folds. Use `TimeSeriesSplit` — always train on the past, validate on the future.
- Grouped data: Multiple rows from the same entity (patient, user). Use `GroupKFold` — keep all rows from one entity in the same fold.
- Very small datasets: With n=50 and 5-fold, each fold has 10 samples — noisy estimates. Use LOOCV instead.
- Spatial data: Nearby observations are correlated. Use spatial cross-validation.

What interviewers want: Know the i.i.d. assumption. Immediately flag time series as a case requiring special handling. Demonstrate awareness of data leakage.
# Simulate time series with temporal pattern
np.random.seed(42)
n_ts = 300
t = np.arange(n_ts)
trend = 0.05 * t
seasonal = 10 * np.sin(2 * np.pi * t / 30)
noise = np.random.randn(n_ts) * 2
y_ts = trend + seasonal + noise
# Features: lagged values (but we'll use trivial features to show leakage)
X_ts = np.column_stack([t, t**2, np.sin(2*np.pi*t/30), np.cos(2*np.pi*t/30)])
model_ts = Ridge(alpha=1.0)
# Regular k-fold (wrong for time series β causes data leakage)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
kf_scores = cross_val_score(model_ts, X_ts, y_ts, cv=kf, scoring='r2')
# TimeSeriesSplit (correct)
tss = TimeSeriesSplit(n_splits=5)
tss_scores = cross_val_score(model_ts, X_ts, y_ts, cv=tss, scoring='r2')
fig, axes = plt.subplots(1, 2, figsize=(14, 4))
axes[0].boxplot([kf_scores, tss_scores], labels=['K-Fold (leaky)', 'TimeSeriesSplit (correct)'])
axes[0].set_ylabel('R² score'); axes[0].set_title('CV R² Score Distribution')
axes[0].axhline(0, color='red', linestyle='--', alpha=0.5)
axes[1].plot(t, y_ts, alpha=0.5, label='Time series')
axes[1].set_title('Sample Time Series Data'); axes[1].set_xlabel('Time step'); axes[1].legend()
plt.tight_layout(); plt.show()
print(f"K-Fold CV R²: {kf_scores.mean():.3f} ± {kf_scores.std():.3f} → leakage inflates the score")
print(f"TimeSeriesSplit R²: {tss_scores.mean():.3f} ± {tss_scores.std():.3f} → honest estimate")
print()
print("When to avoid standard k-fold:")
print(" - Time series data (use TimeSeriesSplit)")
print(" - Grouped data (use GroupKFold)")
print(" - Very small datasets (use LOOCV)")
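`GroupKFold` deserves a quick sketch too: its guarantee is that no group (user, patient, ...) ever appears in both train and validation folds. A minimal example on synthetic, hypothetical user data:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(42)
n_users, rows_per_user = 20, 5
groups = np.repeat(np.arange(n_users), rows_per_user)  # user id for each row
X = rng.normal(size=(len(groups), 3))
y = rng.integers(0, 2, size=len(groups))

# GroupKFold keeps every row of a given user in exactly one fold
gkf = GroupKFold(n_splits=5)
for fold, (tr, te) in enumerate(gkf.split(X, y, groups=groups)):
    overlap = set(groups[tr]) & set(groups[te])
    print(f"Fold {fold}: {len(overlap)} users appear in both train and test")
```

Every fold should report 0 overlapping users — whereas a shuffled `KFold` on the same data would routinely split one user's rows across train and test, leaking entity-level signal.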
Q15: SQL Window Functions — ROW_NUMBER and LAG
Window functions perform calculations across a set of table rows related to the current row — without collapsing rows the way GROUP BY does.
function() OVER (
PARTITION BY column -- "restart" for each group
ORDER BY column -- defines row order within window
)
Common window functions:

| Function | Purpose |
|---|---|
| `ROW_NUMBER()` | Sequential rank within partition (no ties) |
| `RANK()` | Rank with gaps for ties |
| `DENSE_RANK()` | Rank without gaps |
| `LAG(col, n)` | Value from n rows before in window |
| `LEAD(col, n)` | Value from n rows after in window |
| `SUM(col) OVER (...)` | Running/cumulative sum |
What interviewers want: Know PARTITION BY vs ORDER BY. Be able to write a LAG for day-over-day or order-over-order comparisons. Know that window functions don't reduce row count.
import sqlite3
# Create synthetic orders table
conn = sqlite3.connect(':memory:')
conn.execute('''
CREATE TABLE orders (
order_id INTEGER, customer_id INTEGER,
order_date TEXT, amount REAL
)
''')
orders_data = [
(1, 101, '2024-01-05', 120.0), (2, 101, '2024-01-15', 85.0),
(3, 101, '2024-02-03', 200.0), (4, 102, '2024-01-08', 55.0),
(5, 102, '2024-01-20', 310.0), (6, 102, '2024-02-14', 90.0),
(7, 103, '2024-01-12', 175.0), (8, 103, '2024-02-01', 220.0),
]
conn.executemany('INSERT INTO orders VALUES (?, ?, ?, ?)', orders_data)
conn.commit()
# ROW_NUMBER + LAG window functions
query = '''
SELECT
order_id,
customer_id,
order_date,
amount,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY order_date
) AS order_rank,
LAG(amount) OVER (
PARTITION BY customer_id
ORDER BY order_date
) AS prev_amount,
ROUND(amount - LAG(amount) OVER (
PARTITION BY customer_id ORDER BY order_date
), 2) AS amount_change
FROM orders
ORDER BY customer_id, order_date
'''
result = pd.read_sql_query(query, conn)
conn.close()
print(result.to_string(index=False))
print()
print("ROW_NUMBER: restarts from 1 for each customer_id partition.")
print("LAG: NULL for first row per customer (no previous order).")
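The same in-memory SQLite approach also shows the RANK vs DENSE_RANK tie-handling difference from the table above — a small illustrative addendum with made-up scores:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE scores (student TEXT, score INTEGER)')
conn.executemany('INSERT INTO scores VALUES (?, ?)',
                 [('a', 90), ('b', 90), ('c', 85), ('d', 80)])  # note the tie at 90
query = '''
SELECT student, score,
       ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,
       RANK()       OVER (ORDER BY score DESC) AS rnk,
       DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk
FROM scores
'''
print(pd.read_sql_query(query, conn).to_string(index=False))
conn.close()
# RANK → 1, 1, 3, 4 (gap after the tie); DENSE_RANK → 1, 1, 2, 3 (no gap)
```

ROW_NUMBER breaks the tie arbitrarily (1, 2, 3, 4), which is why it's the one to use for "pick exactly one row per group" dedup queries.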