NumPy¶
Credits: Forked from Parallel Machine Learning with scikit-learn and IPython by Olivier Grisel
NumPy Arrays, dtype, and shape
Common Array Operations
Reshape and Update In-Place
Combine Arrays
Create Sample Data
import numpy as np
NumPy Arrays, dtypes, and shapes¶
NumPy's ndarray is the foundational data structure for numerical computing in Python. Every array has three key properties: shape (the dimensions, such as rows and columns), dtype (the data type of every element, like int64 or float32), and the underlying contiguous memory buffer that makes operations fast. Understanding these properties matters because NumPy enforces a single dtype per array, which eliminates the per-element type overhead of Python lists and enables vectorized C-level operations.
When you call np.array(), NumPy infers the dtype from the input values. You can also create pre-filled arrays with np.zeros() and np.ones(), optionally specifying the dtype and shape. These factory functions are commonly used to initialize weight matrices, feature arrays, and placeholder tensors in machine learning workflows.
a = np.array([1, 2, 3])
print(a)
print(a.shape)
print(a.dtype)
b = np.array([[0, 2, 4], [1, 3, 5]])
print(b)
print(b.shape)
print(b.dtype)
np.zeros(5)
np.ones(shape=(3, 4), dtype=np.int32)
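Beyond letting NumPy infer the dtype, you can pin it down explicitly at creation time or convert afterwards with astype(). A minimal sketch (the variable names here are just examples):

```python
import numpy as np

# Explicit dtype at creation time
counts = np.zeros(4, dtype=np.int64)

# astype() returns a new array with the converted dtype;
# the original is left unchanged
floats = counts.astype(np.float32)

print(counts.dtype)  # int64
print(floats.dtype)  # float32
```

Explicit dtypes matter when memory is tight (float32 halves the footprint of float64) or when an API downstream expects a specific type.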
Common Array Operations¶
NumPy supports element-wise arithmetic: when you multiply an array by a scalar or add two arrays together, the operation is applied to every element without writing explicit loops. This is called vectorization, and it runs orders of magnitude faster than equivalent Python for loops because the computation is dispatched to optimized C routines.
Broadcasting allows NumPy to perform arithmetic between arrays of different shapes by automatically expanding the smaller array to match. Below, array a (shape (3,)) is broadcast across the rows of c (shape (2, 3)) during the addition d = a + c. You can also slice arrays with bracket notation (d[0], d[:, 0]) and compute aggregations like sum() and mean() along specific axes. Axis 0 operates down the rows (collapsing rows), while axis 1 operates across columns, a pattern you will use constantly when computing feature-level statistics in data science.
c = b * 0.5
print(c)
print(c.shape)
print(c.dtype)
d = a + c
print(d)
d[0]
d[0, 0]
d[:, 0]
d.sum()
d.mean()
d.sum(axis=0)
d.mean(axis=1)
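To make the broadcasting rule above concrete, here is a small sketch (arrays chosen for illustration): trailing dimensions must be equal or 1, and a missing leading dimension is treated as 1.

```python
import numpy as np

row = np.array([10.0, 20.0, 30.0])        # shape (3,)
grid = np.array([[0.0, 1.0, 2.0],
                 [3.0, 4.0, 5.0]])        # shape (2, 3)

# (3,) is treated as (1, 3) and repeated along axis 0
result = grid + row
print(result.shape)   # (2, 3)
print(result[1])      # [13. 24. 35.]

# A (2,) array cannot broadcast against (2, 3): trailing dims 2 vs 3 clash.
col = np.array([100.0, 200.0])
# grid + col           # would raise ValueError

# Reshaping to (2, 1) makes it broadcast down the columns instead
print((grid + col.reshape(2, 1))[0])   # [100. 101. 102.]
```

When a broadcast fails, reshaping one operand (or adding an axis with np.newaxis) is usually the fix.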
Reshape and Update In-Place¶
np.arange() creates a flat 1-D array, and reshape() returns a view of the same underlying memory laid out with new dimensions (when the memory layout allows it, as it does here; otherwise NumPy falls back to a copy). Because it is a view and not a copy, modifying the original array also changes the reshaped version (and vice versa). You can verify this relationship by checking the OWNDATA flag in ndarray.flags.
Understanding views vs. copies is essential for avoiding subtle bugs in data pipelines. When you slice or reshape a large dataset, NumPy avoids duplicating memory, which is efficient but means mutations propagate. If you need an independent copy, call .copy() explicitly.
e = np.arange(12)
print(e)
# f is a view of contents of e
f = e.reshape(3, 4)
print(f)
# Set values of e from index 5 onwards to 0
e[5:] = 0
print(e)
# f is also updated
f
# OWNDATA shows f does not own its data
f.flags
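For contrast, a short sketch of .copy() behavior: an explicit copy owns its own buffer, so mutations of the original no longer propagate.

```python
import numpy as np

e = np.arange(12)
g = e.reshape(3, 4).copy()   # independent copy, not a view

e[:] = 0                     # wipe the original in place
print(g[0])                  # first row is still [0 1 2 3]
print(g.flags['OWNDATA'])    # True: g owns its own data
```

A common rule of thumb: slice and reshape freely while reading, but call .copy() before mutating anything you obtained from a larger array.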
Combine Arrays¶
np.concatenate() joins arrays along an existing axis, while np.vstack() and np.hstack() are convenience wrappers that stack arrays vertically (adding rows) or horizontally (adding columns). In machine learning, hstack is particularly useful for feature engineering: when you compute new features and need to append them to an existing feature matrix. vstack is the go-to for combining batches of training samples into a single dataset. Both functions require that the arrays match in size along all axes except the one being stacked.
a
b
d
np.concatenate([a, a, a])
# vstack promotes the 1-D array a to a (1, 3) row before stacking
np.vstack([a, b, d])
# In machine learning, useful to enrich or
# add new/concatenate features with hstack
np.hstack([b, d])
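For 2-D arrays, hstack is equivalent to concatenating along axis=1, which is handy to know when the axis is computed at runtime. A small sketch (the shapes here are chosen for illustration):

```python
import numpy as np

X = np.array([[1, 2], [3, 4]])          # a (2, 2) "feature matrix"
new_feature = np.array([[10], [20]])    # a (2, 1) new column

# For 2-D inputs these two calls produce the same result
a1 = np.hstack([X, new_feature])
a2 = np.concatenate([X, new_feature], axis=1)

print(a1)                        # [[ 1  2 10]
                                 #  [ 3  4 20]]
print(np.array_equal(a1, a2))    # True
```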
Create Sample Data¶
Generating synthetic data is a critical skill for prototyping and testing. np.linspace() creates evenly spaced values over a range, ideal for plotting smooth curves. np.random.uniform() draws from a uniform distribution, and np.random.normal() adds Gaussian noise, which is how real-world measurement error is commonly modeled.
Below, we generate a logarithmic relationship with added noise, a pattern that appears in economics (diminishing returns), biology (dose-response curves), and many other domains. Visualizing synthetic data with plt.scatter() is a standard first step before fitting a regression model.
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn
seaborn.set()
# Create evenly spaced numbers over the specified interval
x = np.linspace(0, 2, 10)
plt.plot(x, 'o-');
plt.show()
# Create sample data, add some noise
x = np.random.uniform(1, 100, 1000)
y = np.log(x) + np.random.normal(0, .3, 1000)
plt.scatter(x, y)
plt.show()
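As a possible next step (a sketch, not part of the original notebook), the logarithmic relationship can be recovered by fitting a straight line to log(x) with np.polyfit. Since the data were generated as y = log(x) + noise, the fitted slope should land near 1 and the intercept near 0:

```python
import numpy as np

# Regenerate the same kind of data with a seeded generator
# so the sketch is reproducible
rng = np.random.default_rng(0)
x = rng.uniform(1, 100, 1000)
y = np.log(x) + rng.normal(0, 0.3, 1000)

# Fit y = a * log(x) + b; a should be close to 1, b close to 0
a, b = np.polyfit(np.log(x), y, 1)
print(a, b)
```

This kind of closed-loop check, generate data with known parameters and confirm the fit recovers them, is a cheap way to validate a modeling pipeline before touching real data.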