Data Science Interview Prep: The 30 Questions That Actually Come Up
These questions appear in 80% of data science interviews at top tech companies. Each answer includes runnable code, the key intuition, and what interviewers are really testing.
Part 1 of 2 — Q1 to Q15: Statistics, Machine Learning concepts
Part 2 of 2 — Q16 to Q30: See 00_INTERVIEW_PREP_2.ipynb
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier
from sklearn.tree import export_text
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, KFold, TimeSeriesSplit, learning_curve
from sklearn.metrics import (
    confusion_matrix, accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, average_precision_score, log_loss,
    brier_score_loss
)
from sklearn.calibration import calibration_curve  # lives in sklearn.calibration, not sklearn.metrics
from sklearn.datasets import make_classification, make_regression
np.random.seed(42)
plt.rcParams['figure.figsize'] = (10, 4)
plt.rcParams['axes.spines.top'] = False
plt.rcParams['axes.spines.right'] = False
print('Setup complete.')
Q1: Explain p-value to a non-technical stakeholder
Key answer: The p-value is P(data this extreme | H0 true) — the probability of seeing results at least this surprising if the null hypothesis were true.
Common mistake: "p-value is the probability that H0 is true." This is wrong — it reverses the conditional.
Plain English version: "If the coin were fair, how often would we see this many heads just by chance? If that chance is very small (<5%), we have evidence the coin is unfair."
What interviewers want: Can you explain a technical concept without jargon? Do you understand the direction of the conditional probability?
# Simulate coin flip permutation test
observed_heads = 65 # out of 100 flips
n_flips = 100
n_simulations = 10_000  # more simulations -> finer p-value resolution
# Under H0 (fair coin), simulate null distribution
null_distribution = np.random.binomial(n_flips, 0.5, n_simulations)
# Compute p-value (two-tailed)
p_value = np.mean(np.abs(null_distribution - 50) >= np.abs(observed_heads - 50))
fig, ax = plt.subplots()
ax.hist(null_distribution, bins=30, color='steelblue', alpha=0.7, label='Null distribution (fair coin)')
ax.axvline(observed_heads, color='crimson', linewidth=2, label=f'Observed: {observed_heads} heads')
ax.axvline(100 - observed_heads, color='crimson', linewidth=2, linestyle='--', label=f'Mirror: {100-observed_heads} heads')
ax.fill_betweenx([0, 80], 0, 100 - observed_heads, alpha=0.2, color='crimson')
ax.fill_betweenx([0, 80], observed_heads, 100, alpha=0.2, color='crimson')
ax.set_xlabel('Number of heads (out of 100)')
ax.set_ylabel('Count')
ax.set_title('Null Distribution: Coin Flip Permutation Test')
ax.legend()
plt.tight_layout()
plt.show()
print(f"Observed: {observed_heads} heads")
print(f"Simulated p-value: {p_value:.3f}")
print()
print("Plain-English interpretation:")
print(f" 'If the coin were truly fair, we would see {p_value*100:.1f}% of experiments")
print(f" produce results this extreme just by chance.'")
print(f" Since p={p_value:.3f} < 0.05, we have evidence the coin is NOT fair.")
Q2: Type I vs Type II Error — Which is Worse?

| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positive | False Negative (Type II) |
| Actually Negative | False Positive (Type I) | True Negative |

Type I error (α): False positive — you reject H0 when it's actually true.
Type II error (β): False negative — you fail to reject H0 when it's actually false.
"Which is worse?" depends on context:
- Medical test for cancer: Type II is worse (miss a real cancer → patient untreated)
- Spam filter: Type I is worse (block a real email → user frustrated)
- Fraud detection: Type II is worse (miss fraud → financial loss)
What interviewers want: Domain awareness. Knowing the tradeoff is controlled by the classification threshold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
clf = LogisticRegression(random_state=42).fit(X, y)
probs = clf.predict_proba(X)[:, 1]
thresholds = [0.3, 0.5, 0.7]
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
for ax, thresh in zip(axes, thresholds):
preds = (probs >= thresh).astype(int)
cm = confusion_matrix(y, preds)
fp, fn = cm[0, 1], cm[1, 0]
im = ax.imshow(cm, cmap='Blues')
for i in range(2):
for j in range(2):
ax.text(j, i, cm[i, j], ha='center', va='center', fontsize=14, fontweight='bold')
ax.set_xticks([0, 1]); ax.set_yticks([0, 1])
ax.set_xticklabels(['Pred Neg', 'Pred Pos'])
ax.set_yticklabels(['Actual Neg', 'Actual Pos'])
ax.set_title(f'Threshold = {thresh}\nFP={fp} (Type I), FN={fn} (Type II)')
plt.tight_layout()
plt.show()
print("In fraud detection: prefer LOW Type II (catch fraud). Lower threshold.")
print("In spam filter: prefer LOW Type I (don't block real emails). Higher threshold.")
Q3: Explain the Central Limit Theorem — Why Does It Matter for ML?
CLT: The sampling distribution of the mean of any distribution with finite variance approaches Normal as sample size n → ∞, regardless of the population's shape.
Formula: If X has mean μ and variance σ², then \(\bar{X}_n \approx N(\mu, \sigma^2/n)\) for large n.
Why it matters for ML:
Confidence intervals for model performance metrics are valid
Hypothesis tests (t-tests on model outputs) are justified
Bootstrapping works because resampled means converge to Normal
Coefficient estimates in linear regression are approximately Normal in large samples even if the residuals aren't
What interviewers want: Know itβs about means, not individual observations. Know the practical ML implications.
population = np.random.exponential(scale=2, size=100000) # Highly right-skewed
sample_sizes = [5, 30, 100]
n_samples = 1000
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
axes[0].hist(population, bins=60, color='coral', density=True)
axes[0].set_title('Population\n(Exponential → skewed)')
axes[0].set_xlabel('Value')
for ax, n in zip(axes[1:], sample_sizes):
sample_means = [np.mean(np.random.choice(population, n)) for _ in range(n_samples)]
ax.hist(sample_means, bins=40, density=True, color='steelblue', alpha=0.7)
mu, sigma = np.mean(sample_means), np.std(sample_means)
x = np.linspace(mu - 4*sigma, mu + 4*sigma, 200)
ax.plot(x, stats.norm.pdf(x, mu, sigma), 'r-', linewidth=2, label='Normal fit')
    ax.set_title(f'Sample Means (n={n})\nμ={mu:.2f}, σ={sigma:.3f}')
ax.set_xlabel('Sample mean')
ax.legend(fontsize=8)
plt.suptitle('Central Limit Theorem: Exponential Population → Normal Sample Means', y=1.02)
plt.tight_layout()
plt.show()
print("As n increases, sample means converge to Normal — even from a skewed population.")
Q4: A/B Test Gives p=0.04. What Do You Do?
NOT just "ship it". The full checklist:
Sample Ratio Mismatch (SRM): Is traffic split actually 50/50? Imbalance signals instrumentation bugs.
Practical significance: Is the effect size meaningful? A 0.001% lift with p=0.04 on 10M users is statistically significant but operationally irrelevant.
Peeking: Did you stop early because p crossed 0.05? That inflates Type I error.
Multiple metrics: Did you test 20 metrics and only report the one that passed?
Novelty effects: Will the lift persist or fade after users adjust?
Segment analysis: Does the treatment harm any subgroup?
What interviewers want: You don't treat p-values as a binary pass/fail. You think about the full decision.
from scipy import stats as scipy_stats
# Statistically significant but meaningless result
n_per_group = 500_000
control_mean, treatment_mean = 10.000, 10.030  # 0.3% lift — tiny, but detectable at this n
std = 5.0
control = np.random.normal(control_mean, std, n_per_group)
treatment = np.random.normal(treatment_mean, std, n_per_group)
t_stat, p_val = scipy_stats.ttest_ind(control, treatment)
# Cohen's d (effect size)
pooled_std = np.sqrt((np.std(control)**2 + np.std(treatment)**2) / 2)
cohens_d = (np.mean(treatment) - np.mean(control)) / pooled_std
# Visualization
fig, ax = plt.subplots()
sem = std / np.sqrt(n_per_group)
x = np.linspace(control_mean - 4*sem, treatment_mean + 4*sem, 300)
ax.plot(x, scipy_stats.norm.pdf(x, control_mean, sem), label='Control', color='steelblue')
ax.plot(x, scipy_stats.norm.pdf(x, treatment_mean, sem), label='Treatment', color='coral')
ax.set_xlabel('Mean value'); ax.set_ylabel('Density')
ax.set_title('Distributions of sample means (n=500,000 per group)')
ax.legend()
plt.tight_layout()
plt.show()
sig = "statistically significant (p < 0.05)" if p_val < 0.05 else "not statistically significant"
print(f"p-value: {p_val:.4f} → {sig}")
print(f"Cohen's d: {cohens_d:.4f} → negligible effect (|d| < 0.2 is small)")
print(f"Lift: {(treatment_mean - control_mean) / control_mean * 100:.2f}%")
print()
print(f"Conclusion: effect size d={cohens_d:.4f} is negligible.")
print("   Statistical significance ≠ practical significance.")
Q5: Bayesian vs Frequentist Inference

| | Frequentist | Bayesian |
|---|---|---|
| Parameters | Fixed, unknown constants | Random variables with distributions |
| Probability | Long-run frequency | Degree of belief |
| Output | Confidence interval, p-value | Posterior distribution |
| Prior info | Ignored | Explicitly incorporated |

Frequentist: P(data | θ) — likelihood of the data given a fixed parameter.
Bayesian: P(θ | data) ∝ P(data | θ) × P(θ) — posterior ∝ likelihood × prior.
Key insight: With enough data, both approaches converge. Bayesian shines when data is scarce and prior knowledge is valuable (A/B testing with short history, rare disease prevalence).
What interviewers want: Understand the philosophical difference AND when each is practical.
from scipy.stats import beta as beta_dist
def plot_bayes_vs_freq(heads, total, ax, prior_a=1, prior_b=1):
    p_hat = heads / total
    # Frequentist: normal-approximation (Wald) 95% CI
    se = np.sqrt(p_hat * (1 - p_hat) / total)
    ci_lo, ci_hi = p_hat - 1.96 * se, p_hat + 1.96 * se
# Bayesian: Beta posterior
post_a, post_b = prior_a + heads, prior_b + (total - heads)
x = np.linspace(0, 1, 300)
ax.plot(x, beta_dist.pdf(x, post_a, post_b), color='coral', label=f'Bayesian posterior Beta({post_a},{post_b})')
ax.axvline(p_hat, color='steelblue', linestyle='--', linewidth=2, label=f'Freq MLE: {p_hat:.2f}')
ax.axvspan(ci_lo, ci_hi, alpha=0.2, color='steelblue', label=f'95% CI [{ci_lo:.2f}, {ci_hi:.2f}]')
ax.set_title(f'n={total}, {heads} heads')
ax.legend(fontsize=7)
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (h, t) in zip(axes, [(7, 10), (70, 100), (700, 1000)]):
plot_bayes_vs_freq(h, t, ax)
plt.suptitle('Bayesian vs Frequentist — Coin Flip (70% heads rate)', y=1.02)
plt.tight_layout()
plt.show()
print("As n grows, Bayesian posterior and frequentist CI converge to the same answer.")
print("With small n, the prior (Beta(1,1)=uniform) regularizes the Bayesian estimate.")
Q6: Multiple Comparisons Problem
Problem: If you run 20 tests at α=0.05, the probability of at least one false positive is \(P(\text{at least 1 FP}) = 1 - (1-0.05)^{20} \approx 64\%\).
Solutions:
- Bonferroni: α_adjusted = α / m (very conservative, controls FWER)
- Benjamini-Hochberg (BH-FDR): Controls the false discovery rate — less conservative, better power
When it matters in DS: A/B testing multiple metrics, feature selection via univariate tests, genomics, running many experiments.
What interviewers want: Awareness of the problem AND knowledge of at least one correction method.
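The BH step-up procedure is short enough to sketch by hand — an illustrative implementation (in practice, `multipletests(..., method='fdr_bh')` from statsmodels, used in the cell below, is the right tool):

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of hypotheses rejected under BH-FDR."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                            # rank p-values ascending
    thresholds = alpha * np.arange(1, m + 1) / m     # BH critical values: (i/m) * alpha
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])             # largest rank i with p_(i) <= (i/m)*alpha
        reject[order[:k + 1]] = True                 # reject everything up to that rank
    return reject

# Example: two small p-values among noise
pvals = [0.001, 0.008, 0.039, 0.041, 0.28, 0.61, 0.9]
print(benjamini_hochberg(pvals))  # rejects only the two smallest
```

Note the "step-up" logic: once the largest qualifying rank k is found, all smaller p-values are rejected too, even if some individually miss their own threshold.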
from statsmodels.stats.multitest import multipletests
n_tests = 20
alpha = 0.05
n_experiments = 1000
# Simulate p-values under H0 (all null)
false_positive_counts = []
for _ in range(n_experiments):
p_vals = np.random.uniform(0, 1, n_tests) # Under H0, p-values are uniform
false_positive_counts.append(np.sum(p_vals < alpha))
pct_any_fp = np.mean(np.array(false_positive_counts) > 0) * 100
# Show correction on one example
example_pvals = np.random.uniform(0, 1, n_tests)
_, bonf_pvals, _, _ = multipletests(example_pvals, alpha=alpha, method='bonferroni')
_, bh_pvals, _, _ = multipletests(example_pvals, alpha=alpha, method='fdr_bh')
print(f"Running {n_tests} tests at alpha={alpha} — all under H0 (no true effects)")
print(f"Expected false positives per experiment: {n_tests * alpha:.1f}")
print(f"% of experiments with >= 1 false positive: {pct_any_fp:.1f}%")
print(f"(Theoretical: {(1-(1-alpha)**n_tests)*100:.1f}%)")
print()
print(f"{'Method':<25} {'Alpha per test':<20} {'FP in example':<20}")
print("-" * 65)
print(f"{'No correction':<25} {alpha:<20.3f} {np.sum(example_pvals < alpha):<20}")
print(f"{'Bonferroni':<25} {alpha/n_tests:<20.4f} {np.sum(bonf_pvals < alpha):<20}")
print(f"{'BH-FDR':<25} {'variable':<20} {np.sum(bh_pvals < alpha):<20}")
Q7: Explain the Bias-Variance Tradeoff
Decomposition of expected prediction error: \(\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}\)
High Bias (underfitting): Model too simple → misses patterns in both train and test data.
High Variance (overfitting): Model too complex → memorizes training noise, fails on test data.
Sweet spot: Model complexity balanced to minimize total test error.

| | Train Error | Test Error | Remedy |
|---|---|---|---|
| High Bias | High | High | Add features, increase complexity |
| High Variance | Low | High | Regularize, get more data, reduce features |
| Good fit | Low | ~Low | Balanced |
What interviewers want: Know the equation. Diagnose from train/test error. Know the remedy for each.
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
np.random.seed(42)
X = np.linspace(0, 4 * np.pi, 200).reshape(-1, 1)
y = np.sin(X.ravel()) + np.random.normal(0, 0.3, 200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
degrees = [1, 5, 15]
colors = ['coral', 'steelblue', 'green']
labels = ['Degree 1 (High Bias)', 'Degree 5 (Sweet Spot)', 'Degree 15 (High Variance)']
X_plot = np.linspace(0, 4 * np.pi, 300).reshape(-1, 1)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].scatter(X_train, y_train, alpha=0.4, s=15, color='gray', label='Train data')
train_errs, test_errs = [], []
for deg, col, lab in zip(degrees, colors, labels):
model = make_pipeline(PolynomialFeatures(deg), LinearRegression())
model.fit(X_train, y_train)
axes[0].plot(X_plot, model.predict(X_plot), color=col, linewidth=2, label=lab)
train_errs.append(mean_squared_error(y_train, model.predict(X_train)))
test_errs.append(mean_squared_error(y_test, model.predict(X_test)))
axes[0].set_title('Model Fits by Polynomial Degree'); axes[0].legend(fontsize=8)
axes[1].plot(degrees, train_errs, 'o-', color='steelblue', label='Train MSE')
axes[1].plot(degrees, test_errs, 'o-', color='coral', label='Test MSE')
axes[1].annotate('Underfitting', xy=(1, test_errs[0]), xytext=(1.3, test_errs[0]+0.02))
axes[1].annotate('Sweet spot', xy=(5, test_errs[1]), xytext=(5.5, test_errs[1]+0.02))
axes[1].annotate('Overfitting', xy=(15, test_errs[2]), xytext=(12, test_errs[2]+0.02))
axes[1].set_xlabel('Polynomial degree'); axes[1].set_ylabel('MSE')
axes[1].set_title('Bias-Variance Tradeoff: Train vs Test Error'); axes[1].legend()
plt.tight_layout(); plt.show()
Q8: How Does Gradient Descent Work? What Are the Variants?
Core idea: Iteratively move in the direction of steepest descent of the loss function: \(\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)\)

| Variant | Data per step | Pros | Cons |
|---|---|---|---|
| Batch GD (BGD) | Full dataset | Stable, exact gradient | Slow, memory-heavy |
| Stochastic GD (SGD) | 1 sample | Fast updates, escapes local minima | Noisy, unstable |
| Mini-batch GD | 32–512 samples | Best of both | Hyperparameter (batch size) |
| Adam | Mini-batch | Adaptive per-param LR, momentum | More memory |
Learning rate sensitivity: Too high → diverge. Too low → slow convergence. Adam adapts per-parameter automatically.
What interviewers want: Understand why SGD is noisy (estimates gradient from 1 point). Know Adam uses first and second moment estimates.
def f(x): return x**4 - 3*x**3 + 2
def df(x): return 4*x**3 - 9*x**2
def run_gd(lr, n_steps=50, noisy=False, start=3.5):
x, path = start, [start]
for _ in range(n_steps):
noise = np.random.normal(0, 0.5) if noisy else 0
x = x - lr * (df(x) + noise)
x = np.clip(x, -1, 5)
path.append(x)
return path
def run_adam(lr=0.1, n_steps=50, start=3.5, beta1=0.9, beta2=0.999, eps=1e-8):
x, m, v, path = start, 0, 0, [start]
for t in range(1, n_steps + 1):
g = df(x)
m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * g**2
m_hat, v_hat = m / (1 - beta1**t), v / (1 - beta2**t)
x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
path.append(x)
return path
np.random.seed(0)
bgd = run_gd(lr=0.02, noisy=False)
sgd = run_gd(lr=0.02, noisy=True)
adam = run_adam(lr=0.1)
diverge = run_gd(lr=0.5, noisy=False)
x_range = np.linspace(-0.5, 4, 300)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].plot(x_range, f(x_range), 'k-', linewidth=2, label='f(x)')
for path, col, lab in [(bgd,'steelblue','BGD'), (sgd,'coral','SGD (noisy)'), (adam,'green','Adam')]:
axes[0].plot([p for p in path], [f(p) for p in path], 'o-', color=col, alpha=0.6, markersize=3, label=lab)
axes[0].set_title('Convergence Paths on f(x) = x⁴ - 3x³ + 2'); axes[0].legend(); axes[0].set_xlabel('x'); axes[0].set_ylabel('f(x)')
steps = range(len(bgd))
axes[1].plot(steps, [f(p) for p in bgd], label='BGD lr=0.02', color='steelblue')
axes[1].plot(steps, [f(p) for p in sgd], label='SGD lr=0.02 (noisy)', color='coral', alpha=0.7)
axes[1].plot(steps, [f(p) for p in adam], label='Adam lr=0.1', color='green')
axes[1].set_title('Loss vs Iterations'); axes[1].set_xlabel('Step'); axes[1].set_ylabel('f(x)'); axes[1].legend()
plt.tight_layout(); plt.show()
# The lr=0.5 run never settles — it overshoots and bounces between the clip bounds
print(f"lr=0.5 (too high): final f(x) = {f(diverge[-1]):.1f} vs Adam's {f(adam[-1]):.3f}")
Q9: L1 vs L2 Regularization
Both add a penalty to the loss function to prevent overfitting:

| | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Penalty | \(\lambda \sum_j \lvert\beta_j\rvert\) | \(\lambda \sum_j \beta_j^2\) |
| Effect | Sparse — exact zeros | Shrinks all coefficients smoothly |
| Use when | Feature selection needed | All features relevant |
| Geometry | Diamond constraint — corners | Sphere constraint — no corners |

Key intuition: The L1 constraint region has corners on the axes, so solutions often land on a corner (some coefficients exactly 0). The L2 sphere has no corners, so coefficients shrink toward zero but rarely hit it exactly.
Elastic Net: Combines both — a good default when unsure.
What interviewers want: Geometric intuition for why L1 gives sparsity. Know when to use each.
from sklearn.preprocessing import StandardScaler
# Dataset: 5 relevant features, 15 noise features
np.random.seed(42)
n, p_true, p_noise = 200, 5, 15
X_feat = np.random.randn(n, p_true + p_noise)
true_coef = np.array([3, -2, 1.5, -1, 0.5] + [0]*p_noise)
y_feat = X_feat @ true_coef + np.random.randn(n) * 0.5
X_feat = StandardScaler().fit_transform(X_feat)
alphas = np.logspace(-3, 1, 100)
lasso_coefs = [Lasso(alpha=a, max_iter=5000).fit(X_feat, y_feat).coef_ for a in alphas]
ridge_coefs = [Ridge(alpha=a).fit(X_feat, y_feat).coef_ for a in alphas]
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for coef_path, ax, title in [(lasso_coefs, axes[0], 'Lasso (L1) β Coefficient Paths'),
(ridge_coefs, axes[1], 'Ridge (L2) β Coefficient Paths')]:
coef_arr = np.array(coef_path)
for j in range(p_true):
ax.plot(np.log10(alphas), coef_arr[:, j], linewidth=2, label=f'Relevant feat {j+1}')
for j in range(p_true, p_true + p_noise):
ax.plot(np.log10(alphas), coef_arr[:, j], 'gray', alpha=0.3, linewidth=0.8)
ax.axhline(0, color='black', linestyle='--', linewidth=0.8)
ax.set_xlabel('log10(alpha)'); ax.set_ylabel('Coefficient value')
ax.set_title(title); ax.legend(fontsize=7)
ax.invert_xaxis()
plt.tight_layout(); plt.show()
print("Lasso: noise feature coefficients hit exactly 0 as alpha increases (sparsity).")
print("Ridge: all coefficients shrink toward 0 but never reach exactly 0.")
Q10: How Does Random Forest Work?
Two sources of randomness → decorrelated trees:
Bootstrap sampling: Each tree trained on a random sample with replacement (~63% unique samples).
Feature subsampling: At each split, only a random subset of features is considered (typically √p for classification).
Why averaging decorrelated trees reduces variance:
- For n identically distributed trees with variance σ² and pairwise correlation ρ: \(\text{Var(mean)} = \rho\sigma^2 + \frac{1-\rho}{n}\sigma^2\)
- Feature subsampling drives ρ down, pushing the variance toward σ²/n
Out-of-bag (OOB) error: The ~37% of samples not used to train each tree serve as a free validation set. No need for separate cross-validation.
Feature importance: Mean decrease in impurity (Gini/entropy) across all trees and splits — watch out for bias toward high-cardinality features.
What interviewers want: Both sources of randomness. Why it reduces variance not bias. OOB as built-in CV.
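The variance-reduction claim can be checked numerically. For n identically distributed predictions with variance σ² and pairwise correlation ρ, Var(mean) = ρσ² + (1−ρ)σ²/n; a minimal simulation with made-up numbers (not tied to any real forest):

```python
import numpy as np

def var_of_mean(rho, n_trees=50, sigma2=1.0, n_sims=20000, seed=0):
    """Empirical Var(mean) of n_trees correlated 'tree predictions'."""
    rng = np.random.default_rng(seed)
    cov = np.full((n_trees, n_trees), rho * sigma2)  # pairwise covariance rho*sigma^2
    np.fill_diagonal(cov, sigma2)                    # unit variance on the diagonal
    sims = rng.multivariate_normal(np.zeros(n_trees), cov, size=n_sims)
    return sims.mean(axis=1).var()

for rho in [0.0, 0.3, 0.8]:
    theory = rho + (1 - rho) / 50  # rho*sigma^2 + (1-rho)*sigma^2/n, with sigma^2 = 1
    print(f"rho={rho:.1f}: empirical Var(mean)={var_of_mean(rho):.4f}, theory={theory:.4f}")
```

With ρ=0 the variance collapses to σ²/n = 0.02; with ρ=0.8 it barely drops below 0.8 no matter how many trees you add — which is exactly why decorrelation matters more than tree count.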
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import cross_val_score
X_rf, y_rf = make_classification(n_samples=1000, n_features=10, n_informative=5,
n_redundant=2, random_state=42)
feature_names = [f'feat_{i}' for i in range(10)]
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf.fit(X_rf, y_rf)
cv_scores = cross_val_score(rf, X_rf, y_rf, cv=5)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Feature importance
importances = rf.feature_importances_
sorted_idx = np.argsort(importances)[::-1]
axes[0].bar(range(10), importances[sorted_idx], color='steelblue')
axes[0].set_xticks(range(10))
axes[0].set_xticklabels([feature_names[i] for i in sorted_idx], rotation=45)
axes[0].set_title('Random Forest Feature Importances'); axes[0].set_ylabel('Mean Decrease in Impurity')
# OOB vs CV
axes[1].bar(['OOB Score', '5-Fold CV Mean'], [rf.oob_score_, cv_scores.mean()],
yerr=[0, cv_scores.std()], color=['coral', 'steelblue'], capsize=5)
axes[1].set_ylim(0.8, 1.0); axes[1].set_ylabel('Accuracy')
axes[1].set_title('OOB Score ≈ Cross-Validation Score')
for i, v in enumerate([rf.oob_score_, cv_scores.mean()]):
axes[1].text(i, v + 0.002, f'{v:.3f}', ha='center', fontweight='bold')
plt.tight_layout(); plt.show()
print(f"OOB Score: {rf.oob_score_:.4f}")
print(f"5-Fold CV: {cv_scores.mean():.4f} Β± {cv_scores.std():.4f}")
print("Both are similar — OOB is a free and reliable validation estimate.")
Q11: Gradient Boosting vs Bagging

| | Bagging (Random Forest) | Boosting (GBM/XGBoost) |
|---|---|---|
| Tree construction | Parallel, independent | Sequential, each corrects the prior |
| Error reduced | Variance | Bias |
| Overfitting risk | Lower | Higher (needs early stopping) |
| Speed | Faster (parallelizable) | Slower (sequential) |
| Interpretability | Similar | Similar |
Gradient boosting mechanics:
Fit a weak learner to the data → get residuals
Fit next learner to residuals (= negative gradient of loss)
Add scaled prediction to ensemble
Repeat M times
Key hyperparameters: n_estimators, learning_rate, max_depth (shallower β less variance), subsample.
What interviewers want: Sequential vs parallel. Boosting reduces bias. Trade-off: needs careful tuning.
from sklearn.tree import DecisionTreeRegressor
# Manual 3-round gradient boosting on regression
np.random.seed(42)
X_gb = np.linspace(0, 2*np.pi, 100).reshape(-1, 1)
y_gb = np.sin(X_gb.ravel()) + np.random.randn(100) * 0.2
lr_gb = 0.5
ensemble_pred = np.zeros(100)
fig, axes = plt.subplots(1, 4, figsize=(18, 4))
for i, ax in enumerate(axes):
residuals = y_gb - ensemble_pred
tree = DecisionTreeRegressor(max_depth=3)
tree.fit(X_gb, residuals)
ensemble_pred += lr_gb * tree.predict(X_gb)
ax.scatter(X_gb, y_gb, s=15, alpha=0.5, label='Data')
ax.plot(X_gb, ensemble_pred, 'r-', linewidth=2, label=f'After {i+1} rounds')
mse = np.mean((y_gb - ensemble_pred)**2)
ax.set_title(f'Round {i+1}, MSE={mse:.3f}'); ax.legend(fontsize=7)
plt.suptitle('Manual Gradient Boosting: Fitting Residuals Sequentially', y=1.02)
plt.tight_layout(); plt.show()
# GBM vs RF learning curves
X_lc, y_lc = make_classification(n_samples=1000, n_features=20, random_state=42)
n_est_range = [10, 25, 50, 75, 100]
rf_scores = [cross_val_score(RandomForestClassifier(n_estimators=n, random_state=42), X_lc, y_lc, cv=3).mean() for n in n_est_range]
gbm_scores = [cross_val_score(GradientBoostingClassifier(n_estimators=n, random_state=42), X_lc, y_lc, cv=3).mean() for n in n_est_range]
print('n_estimators:', n_est_range)
print('RF CV scores: ', [f'{s:.3f}' for s in rf_scores])
print('GBM CV scores:', [f'{s:.3f}' for s in gbm_scores])
Q12: Classification Metrics — Which to Use When?

| Metric | Use when |
|---|---|
| Accuracy | Balanced classes, equal error costs |
| Precision | FP is costly (spam filter, irrelevant ads) |
| Recall | FN is costly (fraud, cancer detection) |
| F1 | Balance precision/recall, imbalanced data |
| ROC-AUC | Ranking quality, insensitive to threshold |
| PR-AUC | Imbalanced data, care about positive class |
| Log loss | Probabilistic model evaluation |
Accuracy paradox: On a 95/5 imbalanced dataset, always predicting the majority class gives 95% accuracy — but the model is useless for finding the minority class.
ROC-AUC vs PR-AUC: ROC-AUC can look optimistic on imbalanced data (the large TN count inflates it). PR-AUC is more informative when the positive class is rare.
What interviewers want: Know when accuracy is misleading. Know at least one imbalance-aware metric. Understand threshold-based vs threshold-free metrics.
# 95/5 imbalanced dataset
X_imb, y_imb = make_classification(n_samples=2000, weights=[0.95, 0.05],
n_features=10, random_state=42)
from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(X_imb, y_imb, test_size=0.3, random_state=42)
clf_imb = LogisticRegression(random_state=42).fit(X_tr, y_tr)
y_pred = clf_imb.predict(X_te)
y_prob = clf_imb.predict_proba(X_te)[:, 1]
y_majority = np.zeros_like(y_te) # Always predict majority class
print(f"Class distribution: {np.bincount(y_te)} (neg/pos)")
print()
print(f"{'Metric':<15} {'LR Model':>12} {'Always Negative':>18}")
print("-" * 48)
print(f"{'Accuracy':<15} {accuracy_score(y_te, y_pred):>12.3f} {accuracy_score(y_te, y_majority):>18.3f}")
print(f"{'Precision':<15} {precision_score(y_te, y_pred, zero_division=0):>12.3f} {precision_score(y_te, y_majority, zero_division=0):>18.3f}")
print(f"{'Recall':<15} {recall_score(y_te, y_pred, zero_division=0):>12.3f} {recall_score(y_te, y_majority, zero_division=0):>18.3f}")
print(f"{'F1':<15} {f1_score(y_te, y_pred, zero_division=0):>12.3f} {f1_score(y_te, y_majority, zero_division=0):>18.3f}")
print(f"{'ROC-AUC':<15} {roc_auc_score(y_te, y_prob):>12.3f} {'0.500':>18}")
print(f"{'PR-AUC':<15} {average_precision_score(y_te, y_prob):>12.3f} {'0.050 (base)':>18}")
print()
print("Accuracy paradox: 'Always negative' model is 95% accurate but catches 0 fraud cases.")
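To see the ROC-AUC vs PR-AUC point directly, we can hold the generative process fixed and make the positive class progressively rarer — a sketch with illustrative parameter choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

scores = {}
for pos_rate in [0.5, 0.1, 0.02]:
    # Same feature structure, increasingly rare positive class, 5% label noise
    X, y = make_classification(n_samples=4000, n_features=10,
                               weights=[1 - pos_rate, pos_rate],
                               flip_y=0.05, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=0)
    prob = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    scores[pos_rate] = (roc_auc_score(y_te, prob),
                        average_precision_score(y_te, prob))
    print(f"positives={pos_rate:4.0%}  ROC-AUC={scores[pos_rate][0]:.3f}  "
          f"PR-AUC={scores[pos_rate][1]:.3f}")
```

The expected pattern: ROC-AUC stays in roughly the same range across the three runs, while PR-AUC degrades as positives get rare — because PR-AUC's baseline is the positive rate itself, not 0.5.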
Q13: Class Imbalance — How Do You Handle It?
Strategies (in order of preference):
1. Change the metric first — don't optimize accuracy. Use PR-AUC, F1, recall.
2. Class weights (`class_weight='balanced'`) — penalize misclassifying the minority class more.
3. Threshold tuning — move the decision boundary below 0.5 to increase recall.
4. Resampling:
   - SMOTE: Synthesize new minority samples via interpolation
   - Undersampling: Remove majority samples (risk losing information)
5. Algorithm choice — tree-based methods with `class_weight` often outperform resampling.

Accuracy paradox: A model that always predicts the majority class gets 97% accuracy on a 97/3 split — but has zero predictive value.
What interviewers want: Know multiple strategies. Know SMOTE. Know that resampling is often less effective than people think. Always change the metric first.
try:
from imblearn.over_sampling import SMOTE
smote_available = True
except ImportError:
smote_available = False
X_ib, y_ib = make_classification(n_samples=3000, weights=[0.97, 0.03],
n_features=10, n_informative=5, random_state=42)
X_ibtr, X_ibte, y_ibtr, y_ibte = train_test_split(X_ib, y_ib, test_size=0.3, random_state=42)
results = {}
# No fix
m = LogisticRegression(random_state=42).fit(X_ibtr, y_ibtr)
results['No fix'] = average_precision_score(y_ibte, m.predict_proba(X_ibte)[:, 1])
# Class weight
m = LogisticRegression(class_weight='balanced', random_state=42).fit(X_ibtr, y_ibtr)
results['class_weight=balanced'] = average_precision_score(y_ibte, m.predict_proba(X_ibte)[:, 1])
# Threshold tuning changes threshold metrics (F1, recall) but NOT PR-AUC,
# which is threshold-free — so report F1 at two thresholds instead
m = LogisticRegression(random_state=42).fit(X_ibtr, y_ibtr)
prob = m.predict_proba(X_ibte)[:, 1]
f1_default = f1_score(y_ibte, (prob >= 0.5).astype(int))
f1_tuned = f1_score(y_ibte, (prob >= 0.2).astype(int))
# SMOTE
if smote_available:
    X_sm, y_sm = SMOTE(random_state=42).fit_resample(X_ibtr, y_ibtr)
    m = LogisticRegression(random_state=42).fit(X_sm, y_sm)
    results['SMOTE'] = average_precision_score(y_ibte, m.predict_proba(X_ibte)[:, 1])
else:
    results['SMOTE (not installed)'] = float('nan')
print(f"Dataset: {np.bincount(y_ibtr)} train samples (neg/pos) — 97/3 split")
print(f"Accuracy paradox: always predict negative = {np.mean(y_ibte==0)*100:.1f}% accuracy")
print()
print(f"{'Method':<30} {'PR-AUC':>10}")
print("-" * 42)
for method, score in results.items():
    print(f"{method:<30} {score:>10.4f}")
print()
print(f"Threshold tuning: F1 = {f1_default:.3f} at 0.5 vs {f1_tuned:.3f} at 0.2")
Q14: Cross-Validation — When NOT to Use k-Fold?
Standard k-fold assumes: Observations are i.i.d. (independent and identically distributed).
When this breaks:
- Time series data: Future data leaks into training folds. Use `TimeSeriesSplit` — always train on the past, validate on the future.
- Grouped data: Multiple rows from the same entity (patient, user). Use `GroupKFold` — keep all rows from one entity in the same fold.
- Very small datasets: With n=50 and 5-fold, each fold has 10 samples — noisy estimates. Use LOOCV instead.
- Spatial data: Nearby observations are correlated. Use spatial cross-validation.

What interviewers want: Know the i.i.d. assumption. Immediately flag time series as a case requiring special handling. Demonstrate awareness of data leakage.
# Simulate time series with temporal pattern
np.random.seed(42)
n_ts = 300
t = np.arange(n_ts)
trend = 0.05 * t
seasonal = 10 * np.sin(2 * np.pi * t / 30)
noise = np.random.randn(n_ts) * 2
y_ts = trend + seasonal + noise
# Features: lagged values (but we'll use trivial features to show leakage)
X_ts = np.column_stack([t, t**2, np.sin(2*np.pi*t/30), np.cos(2*np.pi*t/30)])
model_ts = Ridge(alpha=1.0)
# Regular k-fold (wrong for time series β causes data leakage)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
kf_scores = cross_val_score(model_ts, X_ts, y_ts, cv=kf, scoring='r2')
# TimeSeriesSplit (correct)
tss = TimeSeriesSplit(n_splits=5)
tss_scores = cross_val_score(model_ts, X_ts, y_ts, cv=tss, scoring='r2')
fig, axes = plt.subplots(1, 2, figsize=(14, 4))
axes[0].boxplot([kf_scores, tss_scores], labels=['K-Fold (leaky)', 'TimeSeriesSplit (correct)'])
axes[0].set_ylabel('R² score'); axes[0].set_title('CV R² Score Distribution')
axes[0].axhline(0, color='red', linestyle='--', alpha=0.5)
axes[1].plot(t, y_ts, alpha=0.5, label='Time series')
axes[1].set_title('Sample Time Series Data'); axes[1].set_xlabel('Time step'); axes[1].legend()
plt.tight_layout(); plt.show()
print(f"K-Fold CV R²: {kf_scores.mean():.3f} ± {kf_scores.std():.3f} → leakage inflates the score")
print(f"TimeSeriesSplit R²: {tss_scores.mean():.3f} ± {tss_scores.std():.3f} → honest estimate")
print()
print("When to avoid standard k-fold:")
print(" - Time series data (use TimeSeriesSplit)")
print(" - Grouped data (use GroupKFold)")
print(" - Very small datasets (use LOOCV)")
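`GroupKFold` deserves a quick sketch too: its guarantee is that no group (user, patient, ...) ever appears in both train and validation folds. A minimal example on synthetic, hypothetical user data:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(42)
n_users, rows_per_user = 20, 5
groups = np.repeat(np.arange(n_users), rows_per_user)  # user id for each row
X = rng.normal(size=(len(groups), 3))
y = rng.integers(0, 2, size=len(groups))

# GroupKFold keeps every row of a given user in exactly one fold
gkf = GroupKFold(n_splits=5)
for fold, (tr, te) in enumerate(gkf.split(X, y, groups=groups)):
    overlap = set(groups[tr]) & set(groups[te])
    print(f"Fold {fold}: {len(overlap)} users appear in both train and test")
```

Every fold should report 0 overlapping users — whereas a shuffled `KFold` on the same data would routinely split one user's rows across train and test, leaking entity-level signal.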
Q15: SQL Window Functions — ROW_NUMBER and LAG
Window functions perform calculations across a set of table rows related to the current row — without collapsing rows the way GROUP BY does.
function() OVER (
PARTITION BY column -- "restart" for each group
ORDER BY column -- defines row order within window
)
Common window functions:

| Function | Purpose |
|---|---|
| `ROW_NUMBER()` | Sequential rank within partition (no ties) |
| `RANK()` | Rank with gaps for ties |
| `DENSE_RANK()` | Rank without gaps |
| `LAG(col, n)` | Value from n rows before in window |
| `LEAD(col, n)` | Value from n rows after in window |
| `SUM(col) OVER (...)` | Running/cumulative sum |
What interviewers want: Know PARTITION BY vs ORDER BY. Be able to write a LAG for day-over-day or order-over-order comparisons. Know that window functions don't reduce row count.
import sqlite3
# Create synthetic orders table
conn = sqlite3.connect(':memory:')
conn.execute('''
CREATE TABLE orders (
order_id INTEGER, customer_id INTEGER,
order_date TEXT, amount REAL
)
''')
orders_data = [
(1, 101, '2024-01-05', 120.0), (2, 101, '2024-01-15', 85.0),
(3, 101, '2024-02-03', 200.0), (4, 102, '2024-01-08', 55.0),
(5, 102, '2024-01-20', 310.0), (6, 102, '2024-02-14', 90.0),
(7, 103, '2024-01-12', 175.0), (8, 103, '2024-02-01', 220.0),
]
conn.executemany('INSERT INTO orders VALUES (?, ?, ?, ?)', orders_data)
conn.commit()
# ROW_NUMBER + LAG window functions
query = '''
SELECT
order_id,
customer_id,
order_date,
amount,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY order_date
) AS order_rank,
LAG(amount) OVER (
PARTITION BY customer_id
ORDER BY order_date
) AS prev_amount,
ROUND(amount - LAG(amount) OVER (
PARTITION BY customer_id ORDER BY order_date
), 2) AS amount_change
FROM orders
ORDER BY customer_id, order_date
'''
result = pd.read_sql_query(query, conn)
conn.close()
print(result.to_string(index=False))
print()
print("ROW_NUMBER: restarts from 1 for each customer_id partition.")
print("LAG: NULL for first row per customer (no previous order).")
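The same in-memory SQLite approach also shows the RANK vs DENSE_RANK tie-handling difference from the table above — a small illustrative addendum with made-up scores:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE scores (student TEXT, score INTEGER)')
conn.executemany('INSERT INTO scores VALUES (?, ?)',
                 [('a', 90), ('b', 90), ('c', 85), ('d', 80)])  # note the tie at 90
query = '''
SELECT student, score,
       ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,
       RANK()       OVER (ORDER BY score DESC) AS rnk,
       DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk
FROM scores
'''
print(pd.read_sql_query(query, conn).to_string(index=False))
conn.close()
# RANK → 1, 1, 3, 4 (gap after the tie); DENSE_RANK → 1, 1, 2, 3 (no gap)
```

ROW_NUMBER breaks the tie arbitrarily (1, 2, 3, 4), which is why it's the one to use for "pick exactly one row per group" dedup queries.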