
Load in NumPy

NumPy (Numerical Python) is the foundational library for scientific computing in Python. By convention, it is imported as np to keep code concise. Before using it, make sure it is installed in your environment with pip install numpy or conda install numpy. Much of the data science and ML ecosystem builds on it: Pandas and scikit-learn are built on NumPy arrays, and TensorFlow and PyTorch tensors convert to and from NumPy arrays.

import numpy as np

The Basics

Every NumPy array has key attributes that describe its structure: ndim (number of dimensions), shape (size along each dimension as a tuple), dtype (data type of the elements, such as int32 or float64), itemsize (bytes per element), size (total number of elements), and nbytes (total bytes, equal to size times itemsize). You can specify the data type at creation with the dtype parameter. Choosing the right dtype matters for memory efficiency: int32 uses half the memory of int64, which adds up fast when working with millions of data points.

a = np.array([1,2,3], dtype='int32')
print(a)

b = np.array([[9.0,8.0,7.0],[6.0,5.0,4.0]])
print(b)

# Get Dimension
a.ndim

# Get Shape
b.shape

# Get Type
a.dtype

# Get bytes per element
a.itemsize

# Get total size in bytes (size * itemsize)
a.nbytes

# Get number of elements
a.size
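
To make the memory claim concrete, here is a minimal sketch comparing the same number of elements stored as int32 and int64. The array names and the one-million length are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

# One million integers stored at two different widths (illustrative sizes).
small = np.zeros(1_000_000, dtype='int32')
large = np.zeros(1_000_000, dtype='int64')

print(small.itemsize, small.nbytes)  # 4 4000000
print(large.itemsize, large.nbytes)  # 8 8000000

# nbytes is always size * itemsize
print(small.nbytes == small.size * small.itemsize)  # True
```

Halving the element width halves the total footprint, which is why the dtype is worth choosing deliberately for large arrays.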

Accessing/Changing Specific Elements, Rows, Columns

NumPy arrays support powerful indexing using the [row, col] syntax. You can extract a single element, an entire row (a[0, :]), an entire column (a[:, 2]), or a slice with a step size (a[0, 1:-1:2], which follows the pattern start:end:step). Assignment works the same way: set a single element with a[1, 5] = 20, or overwrite an entire column with a[:, 2] = [1, 2], supplying one value per row. This direct element access is fundamental to data manipulation tasks like updating feature values, extracting subsets, and building training batches.

a = np.array([[1,2,3,4,5,6,7],[8,9,10,11,12,13,14]])
print(a)

# Get a specific element [r, c]
a[1, 5]

# Get a specific row 
a[0, :]

# Get a specific column
a[:, 2]

# Getting a little more fancy [startindex:endindex:stepsize]
a[0, 1:-1:2]

# Change a single element
a[1,5] = 20

# Change a whole column (one value per row)
a[:,2] = [1,2]
print(a)

3-D Array Example

Arrays can have three or more dimensions. A 3D array can represent a batch of matrices, an RGB image (height x width x channels), or a small tensor; the example here has shape (2, 2, 2). When indexing into higher-dimensional arrays, work outside in: the first index selects along the outermost dimension (e.g., which matrix in the batch), the next selects the row, and the last selects the column. This mental model of peeling dimensions from the outside is essential for deep learning, where 4D and 5D tensors are common.

b = np.array([[[1,2],[3,4]],[[5,6],[7,8]]])
print(b)

# Get specific element (work outside in)
b[0,1,1]

# Replace the second row of each matrix (the new values must have shape (2, 2))
b[:,1,:] = [[9,9],[8,8]]
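
As a further sketch of the outside-in rule, the example below peels one dimension at a time from a hypothetical batch of two 2x3 matrices. The name batch and its contents are assumptions chosen so each step is easy to check by eye.

```python
import numpy as np

# Hypothetical batch of two 2x3 matrices, so the shape is (2, 2, 3).
batch = np.arange(12).reshape(2, 2, 3)
print(batch.shape)            # (2, 2, 3)

# One index: pick a matrix from the batch.
second = batch[1]
print(second.tolist())        # [[6, 7, 8], [9, 10, 11]]

# Two indices: pick a row inside that matrix.
row = batch[1, 0]
print(row.tolist())           # [6, 7, 8]

# Three indices: pick a single element.
elem = batch[1, 0, 2]
print(elem)                   # 8
```

Each extra index removes one leading dimension from the result, so the shape goes from (2, 2, 3) to (2, 3) to (3,) to a scalar.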

Initializing Different Types of Arrays

NumPy provides several factory functions to create arrays with specific initial values. np.zeros(shape) creates an array filled with zeros (useful for initializing accumulators). np.ones(shape) fills with ones (useful for masks). np.full(shape, value) fills with any constant. np.random.rand() produces uniform random values in [0, 1), while np.random.randint() generates random integers. np.identity(n) creates an n x n identity matrix, which is fundamental in linear algebra. These initialization patterns appear constantly when setting up weight matrices, creating masks, and generating synthetic test data.

# All 0s matrix
np.zeros((2,3))

# All 1s matrix
np.ones((4,2,2), dtype='int32')

# Any other number
np.full((2,2), 99)

# Any other number (full_like)
np.full_like(a, 4)

# Random decimal numbers
np.random.rand(4,2)

# Random Integer values
np.random.randint(-4,8, size=(3,3))

# The identity matrix
np.identity(5)

# Repeat an array
arr = np.array([[1,2,3]])
r1 = np.repeat(arr,3, axis=0)
print(r1)

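# Build a 5x5 array of ones
output = np.ones((5,5))
print(output)

# Build a 3x3 array of zeros with a 9 in the center
z = np.zeros((3,3))
z[1,1] = 9
print(z)

# Overwrite the interior of output (everything except the border) with z
output[1:-1,1:-1] = z
print(output)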

Be Careful When Copying Arrays

In NumPy, assigning an array to another variable with b = a does not create an independent copy: both names refer to the same array object, so modifying b also modifies a. To create a truly independent copy, use b = a.copy(). This aliasing, together with the related copy-vs-view distinction (basic slices return views that share memory with the original), is one of the most common sources of bugs in NumPy code, especially in data preprocessing pipelines where you want to transform data without altering the original dataset.

a = np.array([1,2,3])
b = a.copy()
b[0] = 100

print(a)
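
The cell above only shows the safe case. As a complementary sketch, the following contrasts plain assignment, a slice (which NumPy returns as a view), and an explicit copy. The variable names alias, view, and independent are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3])

# Plain assignment: both names refer to the same array object.
alias = a
alias[0] = 100
print(a.tolist())            # [100, 2, 3]

# A basic slice is a view: it shares memory with the original.
view = a[1:]
view[0] = -1
print(a.tolist())            # [100, -1, 3]

# An explicit copy owns its own data, so a is left untouched.
independent = a.copy()
independent[2] = 0
print(a.tolist())            # [100, -1, 3]

# np.shares_memory reports whether two arrays overlap in memory.
print(alias is a)                          # True
print(np.shares_memory(a, view))           # True
print(np.shares_memory(a, independent))    # False
```

When in doubt about whether an operation returned a view or a copy, np.shares_memory is a quick way to check before mutating the result.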

Mathematics

NumPy’s arithmetic operations are element-wise by default and fully vectorized. Adding a scalar (a + 2) adds it to every element; adding two arrays of the same shape (a + b) adds corresponding elements. Operators like +, -, *, /, and **, as well as trigonometric functions (np.sin, np.cos), all work this way. Vectorized math avoids slow Python loops and runs in optimized compiled routines under the hood, which often makes NumPy one to two orders of magnitude faster than an equivalent pure-Python loop on large arrays.

a = np.array([1,2,3,4])
print(a)

a + 2

a - 2

a * 2

a / 2

b = np.array([1,0,1,0])
a + b

a ** 2

# Take the cosine (np.sin works the same way)
np.cos(a)

# For a lot more (https://docs.scipy.org/doc/numpy/reference/routines.math.html)
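
To illustrate what "vectorized" replaces, here is a small sketch that computes the same result with an explicit Python loop and with a single array expression. The formula 2*v + 1 and the names are arbitrary choices for illustration; no timing is claimed here, since speedups depend on array size and hardware.

```python
import numpy as np

values = np.arange(1, 6)          # [1, 2, 3, 4, 5]

# Pure-Python version: one interpreter-level operation per element.
looped = [v * 2 + 1 for v in values.tolist()]

# Vectorized version: one expression applied to the whole array at once.
vectorized = values * 2 + 1

print(looped)                     # [3, 5, 7, 9, 11]
print(vectorized.tolist())        # [3, 5, 7, 9, 11]
```

Both produce the same numbers; the vectorized form simply moves the loop out of Python and into NumPy's compiled code.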

Linear Algebra

NumPy’s linalg submodule provides essential linear algebra operations. np.matmul(a, b) (or the @ operator) performs matrix multiplication – the inner dimensions must match (e.g., 2x3 times 3x2). np.linalg.det() computes the determinant, np.linalg.inv() finds the inverse, and np.linalg.eig() computes eigenvalues and eigenvectors. Linear algebra is the mathematical backbone of machine learning: neural networks are chains of matrix multiplications, PCA relies on eigendecomposition, and least-squares regression solves a linear system.

a = np.ones((2,3))
print(a)

b = np.full((3,2), 2)
print(b)

np.matmul(a,b)

# Find the determinant
c = np.identity(3)
np.linalg.det(c)

## Reference docs (https://docs.scipy.org/doc/numpy/reference/routines.linalg.html)

# Determinant
# Trace
# Singular Value Decomposition
# Eigenvalues
# Matrix Norm
# Inverse
# Etc...
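
The cells above cover matmul and det only. As a sketch of the other functions mentioned, the example below applies np.linalg.inv and np.linalg.eig to a small diagonal matrix, chosen as an assumption because its inverse and eigenvalues are easy to verify by hand.

```python
import numpy as np

# Illustrative 2x2 diagonal matrix.
m = np.array([[2.0, 0.0],
              [0.0, 4.0]])

# Inverse: multiplying back should give the identity matrix.
m_inv = np.linalg.inv(m)
print(np.allclose(m @ m_inv, np.identity(2)))   # True

# Eigen-decomposition: for a diagonal matrix the eigenvalues are the
# diagonal entries (sorted here because eig does not guarantee an order).
eigenvalues, eigenvectors = np.linalg.eig(m)
print(np.sort(eigenvalues).tolist())            # [2.0, 4.0]

# Each column v of eigenvectors satisfies m @ v = eigenvalue * v.
print(np.allclose(m @ eigenvectors, eigenvectors * eigenvalues))  # True

# The @ operator is equivalent to np.matmul.
print(np.allclose(m @ m_inv, np.matmul(m, m_inv)))  # True
```

np.allclose is used instead of == because floating-point results can differ from the exact answer by tiny rounding errors.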

Statistics

NumPy provides fast statistical functions: np.min(), np.max(), np.sum(), np.mean(), np.std(), and np.median(). The axis parameter controls which dimension to aggregate along – axis=0 operates down columns, axis=1 operates across rows. Computing summary statistics is the first step in any exploratory data analysis (EDA) workflow, helping you understand distributions, detect outliers, and validate data quality before feeding it into models.

stats = np.array([[1,2,3],[4,5,6]])
stats

np.min(stats)

np.max(stats, axis=1)

np.sum(stats, axis=0)
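
The remaining functions follow the same pattern. A short sketch on the same stats array, showing how the axis argument changes the shape of the result:

```python
import numpy as np

stats = np.array([[1, 2, 3], [4, 5, 6]])

# No axis: aggregate over every element.
print(np.mean(stats))                    # 3.5
print(np.median(stats))                  # 3.5

# axis=0: collapse the rows, leaving one value per column.
print(np.mean(stats, axis=0).tolist())   # [2.5, 3.5, 4.5]

# axis=1: collapse the columns, leaving one value per row.
print(np.mean(stats, axis=1).tolist())   # [2.0, 5.0]

# np.std computes the population standard deviation by default (ddof=0).
row_std = np.std(stats, axis=1)
print(np.allclose(row_std, np.sqrt(2 / 3)))  # True
```

A useful way to remember it: the axis you pass is the dimension that disappears from the result's shape.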

Reorganizing Arrays

Reshaping and stacking operations let you rearrange array data without changing the underlying values. reshape(new_shape) changes the dimensions while preserving total element count. np.vstack() stacks arrays vertically (adding rows), while np.hstack() stacks horizontally (adding columns). These operations are essential for preparing data in the right shape for ML models – for example, reshaping a flat feature vector into an image tensor, or stacking multiple feature arrays into a single training matrix.

before = np.array([[1,2,3,4],[5,6,7,8]])
print(before)

# The new shape must hold the same number of elements (2*4 = 8 = 4*2)
after = before.reshape((4,2))
print(after)

# Vertically stacking vectors
v1 = np.array([1,2,3,4])
v2 = np.array([5,6,7,8])

np.vstack([v1,v2,v1,v2])

# Horizontal stack
h1 = np.ones((2,4))
h2 = np.zeros((2,2))

np.hstack((h1,h2))
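
The two use cases mentioned above can be sketched as follows. The 2x2 RGB "image" and the feature arrays f1 and f2 are made-up stand-ins for real data.

```python
import numpy as np

# Hypothetical flattened 2x2 RGB image: 2 * 2 * 3 = 12 values.
flat = np.arange(12)
image = flat.reshape((2, 2, 3))      # (height, width, channels)
print(image.shape)                   # (2, 2, 3)

# Passing -1 lets NumPy infer one dimension from the element count.
inferred = flat.reshape((-1, 3))
print(inferred.shape)                # (4, 3)

# Stack two feature arrays side by side into one training matrix.
f1 = np.array([[1], [2]])                # 2 samples, 1 feature
f2 = np.array([[10, 20], [30, 40]])      # 2 samples, 2 features
features = np.hstack((f1, f2))
print(features.tolist())             # [[1, 10, 20], [2, 30, 40]]
```

hstack requires the arrays to agree on the number of rows, just as vstack requires them to agree on the number of columns.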

Miscellaneous

Load Data from File

np.genfromtxt() reads data from text files (CSV, TSV, etc.) directly into NumPy arrays. You specify the delimiter (e.g., ',' for CSV) and can handle missing values with the filling_values parameter. The .astype() method converts the array to a specific dtype after loading. While Pandas read_csv() is more common for structured data, np.genfromtxt() is lightweight and useful when you just need raw numerical arrays without the overhead of a DataFrame.

filedata = np.genfromtxt('data.txt', delimiter=',')
filedata = filedata.astype('int32')
print(filedata)
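
Since data.txt is not included here, the following sketch uses an in-memory io.StringIO object as a stand-in for a file (np.genfromtxt accepts file-like objects). The sample text, with one empty field, is an assumption made to show filling_values in action.

```python
import io
import numpy as np

# In-memory stand-in for a CSV file; the second row has a missing middle value.
text = io.StringIO("1,2,3\n4,,6")

# Empty fields are treated as missing and replaced by filling_values.
data = np.genfromtxt(text, delimiter=',', filling_values=0)
print(data.dtype)            # float64

# Convert to integers once no missing values remain.
data = data.astype('int32')
print(data.tolist())         # [[1, 2, 3], [4, 0, 6]]
```

Without filling_values, the missing field would load as nan, which cannot be represented meaningfully after casting to an integer dtype.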

Boolean Masking and Advanced Indexing

Boolean masking is one of NumPy’s most powerful features. A comparison like filedata > 50 produces a boolean array of the same shape. Using this boolean array as an index selects only the elements where the condition is True. You can combine conditions with & (and), | (or), and ~ (not) – always wrapping each condition in parentheses. This technique is the NumPy equivalent of SQL’s WHERE clause and is used constantly for filtering outliers, selecting subsets, and applying conditional transformations.

# Boolean mask: True where a value is NOT strictly between 50 and 100
(~((filedata > 50) & (filedata < 100)))
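
The expression above only builds the mask. As a sketch of how such a mask is then used for selection and conditional assignment, the example below works on a small made-up array rather than filedata, so it runs without the input file.

```python
import numpy as np

# Hypothetical data standing in for filedata.
data = np.array([[10, 60, 120],
                 [75, 30, 99]])

# Combine conditions with &, wrapping each one in parentheses.
mask = (data > 50) & (data < 100)
print(mask.tolist())          # [[False, True, False], [True, False, True]]

# Indexing with the mask returns the matching elements as a 1-D array.
print(data[mask].tolist())    # [60, 75, 99]

# ~ inverts the mask.
print(data[~mask].tolist())   # [10, 120, 30]

# A mask can also be the target of an assignment (done on a copy here).
clipped = data.copy()
clipped[clipped > 100] = 100
print(clipped.tolist())       # [[10, 60, 100], [75, 30, 99]]
```

Selection with a mask always flattens the result, because the matching elements generally do not form a rectangular block.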