
Statistics

NumPy’s statistics functions compute descriptive measures that summarize data distributions. From basic order statistics (min, max, percentiles) to averages, variances, correlations, and histograms, these tools form the backbone of Exploratory Data Analysis (EDA). Understanding your data’s central tendency, spread, and relationships between variables is a prerequisite for choosing appropriate models, detecting anomalies, and validating assumptions. Most data science projects therefore start with these summary statistics.

__author__ = "kyubyong. kbpark.linguist@gmail.com"

import numpy as np

np.__version__

Order Statistics

Order statistics describe the ranking and extremes of data. np.min() and np.max() find extreme values, np.ptp() computes the range (max - min), and np.percentile() computes arbitrary percentiles (e.g., the 75th percentile is the value below which 75% of observations fall). These are essential for understanding data spread, detecting outliers, and computing interquartile ranges (IQR) for box plots.
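As a rough illustration of the functions described above, here is a minimal sketch on made-up data (the array `data` and the variable names are assumptions for illustration, not part of the exercises). It shows per-row extremes, the effect of `keepdims=True`, and one way to derive the IQR from two percentiles.

```python
import numpy as np

# Illustrative data (hypothetical, not from the exercises): two rows of values.
data = np.array([[3, 1, 4, 1, 5],
                 [9, 2, 6, 5, 3]])

row_min = np.min(data, axis=1)                  # smallest value per row: [1, 2]
row_max = np.max(data, axis=1, keepdims=True)   # shape (2, 1): reduced axis kept with size one
row_range = np.ptp(data, axis=1)                # max - min per row: [4, 7]

# Passing a list of percentiles returns one result row per percentile.
q1, q3 = np.percentile(data, [25, 75], axis=1)
iqr = q3 - q1                                   # interquartile range per row: [3., 3.]

print("min per row:", row_min)
print("max shape with keepdims:", row_max.shape)
print("range per row:", row_range)
print("IQR per row:", iqr)
```

Note that `np.percentile` interpolates linearly between neighboring values by default, so a percentile need not be one of the observed values.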

Q1. Return the minimum value of x along the second axis.

x = np.arange(4).reshape((2, 2))
print("x=\n", x)

Q2. Return the maximum value of x along the second axis, keeping the reduced axis as a dimension of size one.

x = np.arange(4).reshape((2, 2))
print("x=\n", x)

Q3. Calculate the difference between the maximum and the minimum of x along the second axis.

x = np.arange(10).reshape((2, 5))
print("x=\n", x)

Q4. Compute the 75th percentile of x along the second axis.

x = np.arange(1, 11).reshape((2, 5))
print("x=\n", x)

Averages and Variances

np.median() finds the middle value (robust to outliers), np.mean() computes the arithmetic average, np.average() computes a weighted average, np.std() measures spread via standard deviation, and np.var() computes variance. The choice between mean and median matters: the mean is sensitive to outliers (a single extreme salary skews the average), while the median is resistant. Standard deviation and variance quantify how spread out the data is – critical for normalization, confidence intervals, and understanding model uncertainty.
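The mean-versus-median point can be seen on a small example. The sketch below uses hypothetical salary and score values (all names and numbers are assumptions for illustration) and also shows the `ddof` argument, which switches `np.var()` and `np.std()` from the population formula (the default, `ddof=0`) to the sample formula (`ddof=1`).

```python
import numpy as np

# Hypothetical salaries in thousands; the last value is a deliberate outlier.
salaries = np.array([40, 42, 45, 48, 500])

mean_salary = np.mean(salaries)      # 135.0, pulled upward by the outlier
median_salary = np.median(salaries)  # 45.0, unaffected by the outlier's size

# Hypothetical scores with weights: the middle score counts twice.
scores = np.array([80.0, 90.0, 70.0])
weights = np.array([1, 2, 1])
weighted = np.average(scores, weights=weights)  # (80 + 2*90 + 70) / 4 = 82.5

pop_var = np.var(scores)             # ddof=0: divides by n     -> 200/3
sample_var = np.var(scores, ddof=1)  # ddof=1: divides by n - 1 -> 100.0
pop_std = np.std(scores)             # square root of pop_var

print(mean_salary, median_salary, weighted)
print(pop_var, sample_var, pop_std)
```

Whether `ddof=0` or `ddof=1` is appropriate depends on whether the data is treated as the whole population or as a sample from it.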

Q5. Compute the median of flattened x.

x = np.arange(1, 10).reshape((3, 3))
print("x=\n", x)

Q6. Compute the weighted average of x.

x = np.arange(5)
weights = np.arange(1, 6)

Q7. Compute the mean, standard deviation, and variance of x.

x = np.arange(5)
print("x=\n", x)

Correlating

np.cov() computes the covariance matrix, which measures how two variables change together. np.corrcoef() normalizes covariance into Pearson correlation coefficients (ranging from -1 to +1), making it easier to compare relationships across different scales. np.correlate() computes cross-correlation, which measures similarity between two signals as one is shifted relative to the other. Correlation analysis is fundamental for feature selection (identifying redundant features), understanding relationships between variables, and validating that model inputs carry predictive information.
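To make the three functions concrete, here is a small sketch on made-up arrays (the names `a`, `b`, `s`, and `t` are assumptions for illustration). The first pair is perfectly linearly related with a negative slope, so the correlation coefficient should come out as -1. The second pair shows the sliding products that `np.correlate()` computes.

```python
import numpy as np

# Hypothetical variables: b decreases linearly as a increases (b = 12 - 2a).
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 8.0, 6.0, 4.0])

cov = np.cov(a, b)        # 2x2 matrix: variances on the diagonal, covariance off it
corr = np.corrcoef(a, b)  # same layout, rescaled so the diagonal is 1

# Cross-correlation of two short hypothetical signals.
s = np.array([1.0, 2.0, 3.0])
t = np.array([0.0, 1.0, 0.5])
full = np.correlate(s, t, mode="full")  # every overlap position: [0.5, 2., 3.5, 3., 0.]
valid = np.correlate(s, t)              # default mode "valid": full overlap only: [3.5]

print("cov=\n", cov)
print("corr=\n", corr)
print("full=", full, "valid=", valid)
```

`np.cov()` uses the sample normalization (divides by n - 1) by default, which is why the variance of `a` here is 5/3 rather than 5/4.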

Q8. Compute the covariance matrix of x and y.

x = np.array([0, 1, 2])
y = np.array([2, 1, 0])

Q9. In the above covariance matrix, what does the -1 mean?

Q10. Compute the Pearson product-moment correlation coefficients of x and y.

x = np.array([0, 1, 3])
y = np.array([2, 4, 5])

Q11. Compute the cross-correlation of x and y.

x = np.array([0, 1, 3])
y = np.array([2, 4, 5])

Histograms

np.histogram() bins data into intervals and counts how many values fall in each bin, producing the raw data for a histogram plot. np.histogram2d() extends this to two dimensions. np.bincount() counts occurrences of each non-negative integer value. np.digitize() returns bin indices for each value. Histograms are the most fundamental tool for visualizing data distributions – they reveal whether data is normally distributed, skewed, bimodal, or contains outliers, all of which inform modeling decisions.
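The four functions differ mainly in what they return, which a small sketch on made-up data can show (the arrays `values`, `edges`, `labels`, `px`, and `py` are assumptions for illustration). One detail worth noticing: `np.histogram()` treats the last bin as closed on the right, while `np.digitize()` with its default settings does not, so a value equal to the last edge is handled differently by the two.

```python
import numpy as np

# Hypothetical values and bin edges.
values = np.array([0.1, 0.4, 1.5, 2.2, 2.9, 3.0])
edges = np.array([0, 1, 2, 3])

# Counts per bin; bins are [0, 1), [1, 2), [2, 3] (last bin includes 3.0).
counts, returned_edges = np.histogram(values, bins=edges)  # counts: [2, 1, 3]

# 1-based bin index per value; 3.0 falls past the last edge and gets index 4.
bin_idx = np.digitize(values, edges)  # [1, 1, 2, 3, 3, 4]

# Occurrences of each non-negative integer from 0 up to the maximum value.
labels = np.array([0, 2, 2, 5])
label_counts = np.bincount(labels)  # [1, 0, 2, 0, 0, 1]

# 2D histogram: H[i, j] counts points in x-bin i and y-bin j.
px = np.array([0.5, 0.5, 1.5])
py = np.array([0.5, 1.5, 1.5])
H, xe, ye = np.histogram2d(px, py, bins=[[0, 1, 2], [0, 1, 2]])  # [[1, 1], [0, 1]]

print(counts, bin_idx, label_counts)
print(H)
```

If a fixed number of counts is needed regardless of the largest value present, `np.bincount()` accepts a `minlength` argument.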

Q12. Compute the histogram of x against the bins.

x = np.array([0.5, 0.7, 1.0, 1.2, 1.3, 2.1])
bins = np.array([0, 1, 2, 3])
print("ans=\n", ...)

import matplotlib.pyplot as plt
%matplotlib inline
plt.hist(x, bins=bins)
plt.show()

Q13. Compute the 2D histogram of x and y.

xedges = [0, 1, 2, 3]
yedges = [0, 1, 2, 3, 4]
x = np.array([0, 0.1, 0.2, 1., 1.1, 2., 2.1])
y = np.array([0, 0.1, 0.2, 1., 1.1, 2., 3.3])
...

plt.scatter(x, y)
plt.grid()

Q14. Count the number of occurrences of 0 through 7 in x.

x = np.array([0, 1, 1, 3, 2, 1, 7])

Q15. Return the indices of the bins to which each value in x belongs.

x = np.array([0.2, 6.4, 3.0, 1.6])
bins = np.array([0.0, 1.0, 2.5, 4.0, 10.0])