Statistics - Solutions
Solutions demonstrating NumPy's statistical functions for summarizing and analyzing data distributions. These cover order statistics (np.amin, np.amax, np.ptp, np.percentile), central tendency and dispersion (np.mean, np.median, np.std, np.var, np.average), correlation analysis (np.cov, np.corrcoef, np.correlate), and histogram computation (np.histogram, np.histogram2d, np.bincount, np.digitize). Together they form the foundation of exploratory data analysis and feature engineering in machine learning.
__author__ = "kyubyong. kbpark.linguist@gmail.com"
import numpy as np
np.__version__
Order Statistics
Solutions using np.amin(), np.amax(), np.ptp() (peak-to-peak, i.e., range), and np.percentile() to extract positional summaries from data. The axis and keepdims parameters control whether aggregation collapses the entire array or operates along a specific dimension, a pattern shared across nearly all NumPy reduction functions.
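A quick illustration of this shared axis/keepdims pattern (a sketch added here for orientation, not part of the original exercises):

```python
import numpy as np

a = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]

print(np.amin(a))                         # 0 -> whole array collapsed to a scalar
print(np.amin(a, axis=0))                 # [0 1 2] -> minimum of each column
print(np.amin(a, axis=1))                 # [0 3] -> minimum of each row
print(np.amin(a, axis=1, keepdims=True))  # [[0], [3]] -> shape (2, 1), broadcastable
```

The keepdims=True result keeps a size-one axis in place of the reduced one, so it broadcasts cleanly against the original array (e.g., for min-subtraction).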
Q1. Return the minimum value of x along the second axis.
x = np.arange(4).reshape((2, 2))
print("x=\n", x)
print("ans=\n", np.amin(x, 1))
Q2. Return the maximum value of x along the second axis. Reduce the second axis to a dimension of size one.
x = np.arange(4).reshape((2, 2))
print("x=\n", x)
print("ans=\n", np.amax(x, 1, keepdims=True))
Q3. Calculate the difference between the maximum and the minimum of x along the second axis.
x = np.arange(10).reshape((2, 5))
print("x=\n", x)
out1 = np.ptp(x, 1)
out2 = np.amax(x, 1) - np.amin(x, 1)
assert np.allclose(out1, out2)
print("ans=\n", out1)
Q4. Compute the 75th percentile of x along the second axis.
x = np.arange(1, 11).reshape((2, 5))
print("x=\n", x)
print("ans=\n", np.percentile(x, 75, 1))
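When a percentile falls between two data points, np.percentile interpolates linearly by default. Other estimators can be selected via the method keyword (assuming NumPy 1.22+; in older versions the keyword is interpolation):

```python
import numpy as np

a = np.array([1, 2, 3, 4])
# Default: linear interpolation between the two nearest order statistics.
print(np.percentile(a, 75))                    # 3.25
# 'nearest' / 'lower' / 'higher' return actual data points instead.
print(np.percentile(a, 75, method="nearest"))  # 3
```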
Averages and Variances
Solutions using np.mean(), np.median(), np.average() (with optional weights), np.std(), and np.var() for measuring central tendency and spread. Note that np.average() supports a weights parameter for computing weighted means, useful when samples have different importance, such as class-imbalanced datasets or time-decayed observations.
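As an illustrative sketch of the time-decay use case mentioned above (the observations and decay factor are made-up assumptions, not from the exercises):

```python
import numpy as np

# Exponentially time-decayed weighted mean: the most recent observation
# carries the largest weight.
obs = np.array([10.0, 12.0, 11.0, 15.0])            # oldest -> newest
decay = 0.5
weights = decay ** np.arange(len(obs) - 1, -1, -1)  # [0.125, 0.25, 0.5, 1.0]
print(np.average(obs, weights=weights))             # 13.2
```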
Q5. Compute the median of flattened x.
x = np.arange(1, 10).reshape((3, 3))
print("x=\n", x)
print("ans=\n", np.median(x))
Q6. Compute the weighted average of x.
x = np.arange(5)
weights = np.arange(1, 6)
out1 = np.average(x, weights=weights)
out2 = (x*(weights/weights.sum())).sum()
assert np.allclose(out1, out2)
print(out1)
Q7. Compute the mean, standard deviation, and variance of x.
x = np.arange(5)
print("x=\n", x)
out1 = np.mean(x)
out2 = np.average(x)
assert np.allclose(out1, out2)
print("mean=\n", out1)
out3 = np.std(x)
out4 = np.sqrt(np.mean((x - np.mean(x)) ** 2 ))
assert np.allclose(out3, out4)
print("std=\n", out3)
out5 = np.var(x)
out6 = np.mean((x - np.mean(x)) ** 2 )
assert np.allclose(out5, out6)
print("variance=\n", out5)
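One caveat worth knowing: np.std and np.var default to ddof=0, the population formula (divide by n). For the unbiased sample estimate (divide by n - 1), pass ddof=1:

```python
import numpy as np

x = np.arange(5)
print(np.var(x))          # 2.0 -> population variance, divide by n
print(np.var(x, ddof=1))  # 2.5 -> sample variance, divide by n - 1
# The two differ exactly by the factor n / (n - 1).
assert np.isclose(np.var(x, ddof=1), np.var(x) * len(x) / (len(x) - 1))
```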
Correlating
Solutions using np.cov(), np.corrcoef(), and np.correlate() to quantify relationships between variables. The covariance matrix captures how two variables move together, while Pearson correlation normalizes this to the [-1, 1] range for easier interpretation. These are fundamental to feature selection: highly correlated features are often redundant and can be pruned to reduce model complexity.
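A minimal sketch of the pruning idea (the synthetic feature matrix and the 0.95 threshold are illustrative assumptions, not part of the exercises):

```python
import numpy as np

rng = np.random.default_rng(0)
f0 = rng.normal(size=100)
# Feature 1 is a near-copy of feature 0; feature 2 is independent noise.
features = np.column_stack([
    f0,
    f0 * 2.0 + 0.01 * rng.normal(size=100),
    rng.normal(size=100),
])
corr = np.corrcoef(features, rowvar=False)  # rowvar=False: columns are variables
# Flag upper-triangle pairs whose |r| exceeds the threshold.
i, j = np.triu_indices_from(corr, k=1)
redundant = [(int(a), int(b)) for a, b, r in zip(i, j, corr[i, j]) if abs(r) > 0.95]
print(redundant)  # the (0, 1) pair is flagged as redundant
```

One member of each flagged pair could then be dropped before model fitting.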
Q8. Compute the covariance matrix of x and y.
x = np.array([0, 1, 2])
y = np.array([2, 1, 0])
print("ans=\n", np.cov(x, y))
Q9. In the above covariance matrix, what does the -1 mean?
It means x and y are perfectly anti-correlated: every increase in x is matched by an equal decrease in y.
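One way to verify this: normalizing the off-diagonal covariance by the two standard deviations recovers the Pearson coefficient (a quick check added here, not part of the original exercises):

```python
import numpy as np

x = np.array([0, 1, 2])
y = np.array([2, 1, 0])
C = np.cov(x, y)
# Pearson r = cov(x, y) / (std(x) * std(y)), computed from the matrix entries.
r = C[0, 1] / np.sqrt(C[0, 0] * C[1, 1])
print(r)  # -1.0 -> perfect negative linear relationship
```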
Q10. Compute Pearson product-moment correlation coefficients of x and y.
x = np.array([0, 1, 3])
y = np.array([2, 4, 5])
print("ans=\n", np.corrcoef(x, y))
Q11. Compute cross-correlation of x and y.
x = np.array([0, 1, 3])
y = np.array([2, 4, 5])
print("ans=\n", np.correlate(x, y))
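np.correlate defaults to mode='valid', which for equal-length inputs returns the single fully overlapping shift. The other modes expose partial overlaps as well (a short sketch for comparison):

```python
import numpy as np

x = np.array([0, 1, 3])
y = np.array([2, 4, 5])
# 'valid' (default): only full overlap -> one value, sum(x * y) = 19.
print(np.correlate(x, y))          # [19]
# 'same': output length matches the longer input.
print(np.correlate(x, y, "same"))  # [ 5 19 14]
# 'full': every overlapping shift.
print(np.correlate(x, y, "full"))  # [ 0  5 19 14  6]
```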
Histograms
Solutions using np.histogram(), np.histogram2d(), np.bincount(), and np.digitize() for discretizing continuous data into bins. Histograms reveal the shape of a distribution (skewness, modality, outliers) and are the basis for probability density estimation. np.bincount() is optimized for integer data, while np.digitize() maps each value to its bin index, which is useful for converting continuous features into categorical ones during feature engineering.
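An illustrative sketch of that feature-engineering use of np.digitize (the ages and bin edges are made-up assumptions):

```python
import numpy as np

# Bucket a continuous feature (age) into ordinal categories.
ages = np.array([3, 17, 25, 40, 67])
edges = np.array([18, 35, 60])   # boundaries: child | young adult | adult | senior
print(np.digitize(ages, edges))  # [0 0 1 2 3]
```

Each value is replaced by the index of the interval it falls into, giving a small ordinal feature that downstream models can treat as categorical.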
Q12. Compute the histogram of x against the bins.
x = np.array([0.5, 0.7, 1.0, 1.2, 1.3, 2.1])
bins = np.array([0, 1, 2, 3])
print("ans=\n", np.histogram(x, bins))
import matplotlib.pyplot as plt
%matplotlib inline
plt.hist(x, bins=bins)
plt.show()
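To turn the raw counts into a probability density estimate, np.histogram accepts density=True, which normalizes so the histogram integrates to 1 over the bins:

```python
import numpy as np

x = np.array([0.5, 0.7, 1.0, 1.2, 1.3, 2.1])
bins = np.array([0, 1, 2, 3])
# density=True divides each count by (total samples * bin width).
hist, edges = np.histogram(x, bins, density=True)
print(hist)  # counts [2, 3, 1] / (6 samples * bin width 1)
```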
Q13. Compute the 2d histogram of x and y.
xedges = [0, 1, 2, 3]
yedges = [0, 1, 2, 3, 4]
x = np.array([0, 0.1, 0.2, 1., 1.1, 2., 2.1])
y = np.array([0, 0.1, 0.2, 1., 1.1, 2., 3.3])
H, xedges, yedges = np.histogram2d(x, y, bins=(xedges, yedges))
print("ans=\n", H)
plt.scatter(x, y)
plt.grid()
Q14. Count number of occurrences of 0 through 7 in x.
x = np.array([0, 1, 1, 3, 2, 1, 7])
print("ans=\n", np.bincount(x))
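Two np.bincount options worth knowing: minlength pads the output to a fixed length even when the largest values are absent, and weights accumulates per-bin sums instead of counts (the weight values below are illustrative):

```python
import numpy as np

x = np.array([0, 1, 1, 3, 2, 1, 7])
# Fixed-length output regardless of the maximum value present.
print(np.bincount(x, minlength=10))  # [1 3 1 1 0 0 0 1 0 0]
# Per-bin weighted sums instead of raw counts.
w = np.array([0.5, 1.0, 1.0, 2.0, 1.5, 1.0, 0.25])
print(np.bincount(x, weights=w))     # [0.5 3. 1.5 2. 0. 0. 0. 0.25]
```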
Q15. Return the indices of the bins to which each value in x belongs.
x = np.array([0.2, 6.4, 3.0, 1.6])
bins = np.array([0.0, 1.0, 2.5, 4.0, 10.0])
print("ans=\n", np.digitize(x, bins))