NumPy¶
NumPy (or Numpy) is a linear algebra library for Python. It is so important for data science with Python because almost all of the libraries in the PyData ecosystem rely on NumPy as one of their main building blocks.
NumPy is also incredibly fast, as it has bindings to C libraries. For more info on why you would want to use arrays instead of lists, check out this great StackOverflow post.
We will only learn the basics of NumPy. To get started, we need to install it!
Installation Instructions¶
It is highly recommended that you install Python using the Anaconda distribution, so that all underlying dependencies (such as linear algebra libraries) stay in sync when you use conda install. If you have Anaconda, install NumPy by going to your terminal or command prompt and typing:
conda install numpy
If you do not have Anaconda and cannot install it, please refer to NumPy’s official documentation for other installation instructions.
Using NumPy¶
Once you’ve installed NumPy you can import it as a library:
import numpy as np
NumPy has many built-in functions and capabilities. We won’t cover them all; instead, we will focus on some of the most important aspects of NumPy: vectors, arrays, matrices, and number generation. Let’s start by discussing arrays.
Numpy Arrays¶
NumPy arrays are the main way we will use NumPy throughout the course. NumPy arrays essentially come in two flavors: vectors and matrices. Vectors are strictly 1-d arrays and matrices are 2-d (but you should note a matrix can still have only one row or one column).
Let’s begin our introduction by exploring how to create NumPy arrays.
Creating NumPy Arrays¶
From a Python List¶
We can create an array by directly converting a list or list of lists:
# declare a python list
my_list = [1,2,3]
# print the list
my_list
# convert the list to a numpy array
np.array(my_list)
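To see what the conversion buys us, here is a small sketch comparing the same operation on a list and on an array. The variable names (my_array, repeated, scaled) are just illustrative choices, not part of NumPy.

```python
import numpy as np

my_list = [1, 2, 3]
my_array = np.array(my_list)

# Multiplying a list by 2 repeats the list.
repeated = my_list * 2

# Multiplying an array by 2 multiplies every element.
scaled = my_array * 2

print(repeated)  # the list [1, 2, 3, 1, 2, 3]
print(scaled)    # an array holding 2, 4, 6
```

This element-wise behavior is what people usually mean by a "vectorized" operation.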
Creating a 2D Array from Nested Lists¶
To create a 2D array (matrix), pass a list of lists to np.array(). Each inner list becomes a row. All inner lists must have the same length. If they do not, recent NumPy releases (1.24 and later) raise a ValueError unless you explicitly pass dtype=object, while older releases created an array of objects rather than a proper numeric matrix. This conversion from Python lists to NumPy arrays is typically the first step when preparing data for analysis – raw data arrives as lists, and you convert it to arrays to unlock vectorized operations.
# create a 2D python list (a list of lists)
my_matrix = [[1,2,3],[4,5,6],[7,8,9]]
# print the 2D list
my_matrix
# convert the 2D list to a numpy array
np.array(my_matrix)
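As a quick check of the "each inner list becomes a row" rule, the sketch below inspects the result and also shows what happens with inner lists of different lengths when we explicitly ask for an object array. The names matrix and ragged are illustrative only.

```python
import numpy as np

my_matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
matrix = np.array(my_matrix)

print(matrix.shape)  # (3, 3): three rows, three columns
print(matrix.ndim)   # 2

# Inner lists of different lengths cannot form a numeric matrix.
# Asking for dtype=object gives a 1-D array that holds two list objects.
ragged = np.array([[1, 2, 3], [4, 5]], dtype=object)
print(ragged.shape)  # (2,)
```

An object array like ragged does not support fast numeric operations, so it is usually better to fix the input data than to rely on it.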
Built-in Methods¶
NumPy provides a rich set of array creation functions beyond np.array(). These let you generate arrays with specific patterns, ranges, or random values without manually constructing Python lists first. Mastering these factory functions is essential for efficient data science work – they save time and clearly communicate intent.
arange¶
Return evenly spaced values within a given interval.
How arange Differs from Python’s range¶
np.arange() works similarly to Python’s built-in range() – it starts at the first argument and goes up to (but does not include) the second argument. The key difference is that np.arange() returns a NumPy array directly, so you can immediately apply vectorized operations. An optional third argument specifies the step size between values.
# create a numpy array with values from 0 to 9
np.arange(0,10)
You can also specify the step between consecutive values:
# create a numpy array with values from 0 to 10 with a step of 2
np.arange(0,11,2)
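To make the comparison with range() concrete, here is a minimal sketch that builds the same sequence both ways and then applies an element-wise operation to the array. The variable names are illustrative.

```python
import numpy as np

# Same start, stop, and step for both functions.
from_range = list(range(0, 11, 2))
from_arange = np.arange(0, 11, 2)

print(from_range)   # a list with the values 0, 2, 4, 6, 8, 10
print(from_arange)  # an array with the same values

# The array supports arithmetic on all elements at once.
squared = from_arange ** 2
print(squared)      # values 0, 4, 16, 36, 64, 100
```

With a plain range() we would need a loop or a list comprehension to square every value.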
zeros and ones¶
Generate arrays of zeros or ones
# create a numpy array of zeros with length 3
np.zeros(3)
# create a 2D numpy array of zeros with shape (5,5)
np.zeros((5,5))
# create a numpy array of ones with length 3
np.ones(3)
# create a 2D numpy array of ones with shape (3,3)
np.ones((3,3))
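One detail worth knowing is that np.zeros() and np.ones() produce floating-point arrays by default. The sketch below checks that, requests integers through the dtype argument, and scales an array of ones to fill it with a constant. The names used here are illustrative.

```python
import numpy as np

# The default data type is a 64-bit float.
zeros_float = np.zeros(3)
print(zeros_float.dtype)  # float64

# Pass dtype to get integers instead.
ones_int = np.ones((2, 3), dtype=int)
print(ones_int.dtype.kind)  # 'i' for a signed integer type

# Multiply an array of ones to fill an array with any constant.
sevens = np.ones((2, 2)) * 7
print(sevens)
```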
linspace¶
Return evenly spaced numbers over a specified interval.
linspace vs arange¶
Unlike np.arange() where the third argument is the step size, np.linspace() takes the number of points you want as the third argument. It then calculates the spacing automatically to produce that many evenly distributed values between start and stop (both inclusive). This is particularly useful for generating smooth x-axis values for plotting functions, creating evenly spaced thresholds for model evaluation, or defining grid points for numerical integration.
# create a numpy array of 3 evenly spaced numbers between 0 and 10
np.linspace(0,10,3)
# create a numpy array of 50 evenly spaced numbers between 0 and 10
np.linspace(0,10,50)
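The difference between "number of points" and "step size" is easiest to see side by side. In this sketch we ask linspace for 5 points and use its retstep option to report the spacing it chose, then pass that spacing to arange. The variable names are illustrative.

```python
import numpy as np

# Ask for 5 points; retstep=True also returns the computed spacing.
points, step = np.linspace(0, 10, 5, retstep=True)
print(points)  # 0, 2.5, 5, 7.5, 10 (both endpoints included)
print(step)    # 2.5, which is (10 - 0) / (5 - 1)

# arange with the same spacing stops before the end value.
stepped = np.arange(0, 10, 2.5)
print(stepped)  # 0, 2.5, 5, 7.5
```

Note that linspace returned five values while arange returned four, because arange excludes the stop value.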
eye¶
Creates an identity matrix
Identity Matrix¶
An identity matrix is a square matrix with 1s on the main diagonal and 0s everywhere else. Created with np.eye(n), it serves as the multiplicative identity in linear algebra – multiplying any matrix by an identity matrix of compatible size returns the original matrix unchanged. Identity matrices are used in regularization (ridge regression adds a scaled identity to prevent overfitting), as initialization for transformation matrices, and as a baseline in matrix decomposition algorithms.
# create a 4x4 identity matrix
np.eye(4)
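Here is a small sketch of the two properties mentioned above. The matrix A and the value lam are made-up example inputs; the last step only illustrates the "scaled identity" term that ridge regression adds, it is not a full ridge implementation.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])  # example matrix (made up)
identity = np.eye(2)

# Multiplying by the identity leaves the matrix unchanged.
product = A @ identity
unchanged = np.array_equal(product, A)
print(unchanged)  # True

# Ridge-style regularization adds a scaled identity to A^T A.
lam = 0.5  # hypothetical regularization strength
gram = A.T @ A
regularized = gram + lam * np.eye(2)
print(regularized)  # only the diagonal entries grow by 0.5
```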
Random¶
NumPy also has lots of ways to create random number arrays:
rand¶
Create an array of the given shape and populate it with random samples from a uniform distribution over [0, 1).
# create a numpy array of 2 random numbers between 0 and 1
np.random.rand(2)
Random 2D Arrays¶
For a 2D matrix of random values, pass two arguments to np.random.rand(rows, cols). Note the subtle syntax difference: rand() takes dimensions as separate arguments (not a tuple), while functions like np.zeros() take a tuple. This inconsistency is a common source of confusion, so pay attention to each function’s signature.
# create a 5x5 array of random numbers between 0 and 1
np.random.rand(5,5)
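The signature difference can be sketched as follows. We keep the shape in a tuple (the name shape is our own choice) and show one way to feed it to rand() by unpacking it. np.random.random_sample() is a related function that accepts the tuple directly.

```python
import numpy as np

shape = (2, 3)  # example shape (made up)

# zeros() takes the shape as a single tuple argument.
zeros = np.zeros(shape)

# rand() takes each dimension as a separate argument,
# so an existing tuple has to be unpacked with *.
uniform = np.random.rand(*shape)

# random_sample() draws from the same [0, 1) range but accepts a tuple.
uniform_from_tuple = np.random.random_sample(shape)

print(zeros.shape, uniform.shape, uniform_from_tuple.shape)
```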
randn¶
Return a sample (or samples) from the “standard normal” distribution (mean 0, standard deviation 1), unlike rand, which samples from a uniform distribution:
# create a numpy array of 2 random numbers sampled from a standard normal distribution
np.random.randn(2)
# create a 5x5 array of random numbers sampled from a standard normal distribution
np.random.randn(5,5)
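To see the difference between the two distributions, we can draw a large sample from each and look at summary statistics. This is only a sketch: the sample size and the call to np.random.seed() (used so the run is repeatable) are our own additions.

```python
import numpy as np

np.random.seed(0)  # fix the seed so the numbers are repeatable

uniform = np.random.rand(100_000)
normal = np.random.randn(100_000)

# Uniform samples stay inside [0, 1) and average about 0.5.
print(uniform.min(), uniform.max(), uniform.mean())

# Standard normal samples can be negative or larger than 1;
# their mean is close to 0 and their standard deviation close to 1.
print(normal.min(), normal.max())
print(normal.mean(), normal.std())
```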
randint¶
Return random integers from low (inclusive) to high (exclusive).
# print a random integer between 0 and 10
np.random.randint(0,11)
# create an array of 10 random integers from 0 up to (but not including) 50
np.random.randint(0,50,10)
Random Integers with Size¶
The third argument to np.random.randint(low, high, size) specifies how many random integers to generate. You can pass a single integer for a 1D array or a tuple for multi-dimensional arrays. This is useful for generating synthetic labels, random indices for sampling, or test data for algorithm development.
# create a numpy array of 10 random integers from 1 up to (but not including) 100
np.random.randint(1,100,10)
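The size argument can be sketched with both an integer and a tuple. The labels and grid below are made-up examples of the "synthetic labels" and "test data" use cases, and the seed is only there to make the run repeatable.

```python
import numpy as np

np.random.seed(0)  # fix the seed so the numbers are repeatable

# An integer size gives a 1-D array: here, 8 binary labels (0 or 1).
labels = np.random.randint(0, 2, 8)
print(labels.shape)  # (8,)

# A tuple size gives a multi-dimensional array: 3 rows, 4 columns.
grid = np.random.randint(1, 100, (3, 4))
print(grid.shape)    # (3, 4)
```

Because the upper bound is exclusive, labels only ever contains 0 and 1, and grid never contains 100.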
Array Attributes and Methods¶
Let’s discuss some useful attributes and methods of an array:
# create a numpy array of 25 numbers between 0 and 24
arr = np.arange(25)
# create an array of 10 random integers from 0 up to (but not including) 50
ranarr = np.random.randint(0,50,10)
# print the 25 numbers between 0 and 24
arr
# print the 10 random integers
ranarr
Reshape¶
Returns an array containing the same data with a new shape.
Reshape Constraints¶
When reshaping an array, the total number of elements must remain the same – you cannot reshape a 25-element array into a 5x10 matrix (which would need 50 elements). NumPy will raise a ValueError if the dimensions are incompatible. A useful trick is using -1 for one dimension (e.g., arr.reshape(-1, 5)), which tells NumPy to calculate that dimension automatically based on the total size.
# reshape the array of 25 numbers into a 5x5 array
arr.reshape(5,5)
# reshape the array of 25 numbers into a 5x10 array (this will throw an error because the total number of elements is not the same)
arr.reshape(5,10)
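The -1 trick and the error case can be sketched together. Here we catch the ValueError so the example runs to the end; the names auto_rows and reshape_failed are illustrative.

```python
import numpy as np

arr = np.arange(25)

# -1 lets NumPy work out the row count: 25 elements / 5 columns = 5 rows.
auto_rows = arr.reshape(-1, 5)
print(auto_rows.shape)  # (5, 5)

# A 5x10 shape needs 50 elements, so NumPy raises a ValueError.
try:
    arr.reshape(5, 10)
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)
```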
max,min,argmax,argmin¶
These are useful methods for finding the max or min values of an array, or for finding their index locations using argmax or argmin.
# print the 10 random integers
ranarr
# find the maximum value in the array of 10 random integers
ranarr.max()
argmax and argmin – Finding Index Positions¶
While max() and min() return the extreme values themselves, argmax() and argmin() return the index positions of those values. Knowing the position of the maximum is often more useful than the value itself – for example, argmax() on a model’s output probabilities tells you which class the model predicts, and argmin() on a loss array tells you which hyperparameter configuration performed best.
# find the index of the maximum value in the array of 10 random integers
ranarr.argmax()
# find the minimum value in the array of 10 random integers
ranarr.min()
Use argmin() to return the index location of the minimum value:
# find the index of the minimum value in the array of 10 random integers
ranarr.argmin()
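The two use cases described above can be sketched with small hand-written arrays. All the numbers here (probabilities, losses, batch) are made up for illustration.

```python
import numpy as np

# Hypothetical class probabilities from a model for one sample.
probabilities = np.array([0.1, 0.7, 0.2])
predicted_class = probabilities.argmax()
print(predicted_class)  # 1, the position of 0.7

# Hypothetical loss values for three configurations.
losses = np.array([0.9, 0.4, 0.6])
best_config = losses.argmin()
print(best_config)      # 1, the position of 0.4

# For a 2D array, axis=1 finds the position of the max in each row.
batch = np.array([[0.1, 0.7, 0.2],
                  [0.8, 0.1, 0.1]])
predictions = batch.argmax(axis=1)
print(predictions)      # positions 1 and 0
```

If the extreme value occurs more than once, argmax() and argmin() return the index of the first occurrence.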
Shape¶
Shape is an attribute that arrays have (not a method):
# Vector
arr.shape
# Notice the two sets of brackets
arr.reshape(1,25)
# Notice the shape has changed
arr.reshape(1,25).shape
# Reshape into a single column: 25 rows, 1 column
arr.reshape(25,1)
# Notice the shape has changed
arr.reshape(25,1).shape
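The "attribute, not a method" point and the difference between a vector and a one-row matrix can be sketched like this. The names row, column, and called_like_method are illustrative.

```python
import numpy as np

arr = np.arange(25)
row = arr.reshape(1, 25)
column = arr.reshape(25, 1)

# shape is a plain tuple; ndim is the number of dimensions.
print(arr.shape, arr.ndim)        # (25,) 1
print(row.shape, row.ndim)        # (1, 25) 2
print(column.shape, column.ndim)  # (25, 1) 2

# Because shape is an attribute, calling it like a method fails.
try:
    arr.shape()
    called_like_method = True
except TypeError:
    called_like_method = False
print(called_like_method)  # False
```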
dtype¶
You can also grab the data type of the object in the array:
# print the data type of the array
arr.dtype
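As a small sketch of how the data type is chosen, the arrays below are built from integers, from mixed integers and floats, and by an explicit conversion with astype(). The variable names are illustrative, and the exact integer type (for example int64) can depend on the platform.

```python
import numpy as np

# Integers in, integer dtype out (exact width depends on the platform).
ints = np.arange(25)
print(ints.dtype)

# Mixing integers and floats upgrades every element to float.
mixed = np.array([1, 2.5])
print(mixed.dtype)  # float64

# astype() returns a converted copy with the requested dtype.
as_float = ints.astype(float)
print(as_float.dtype)  # float64
```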
Importing Specific Functions¶
Instead of typing the full np.random.randint path every time, you can import individual functions directly with from numpy.random import randint. This lets you call randint() without the np.random. prefix. While convenient for frequently used functions, the standard import numpy as np convention is preferred in production code because it makes the source of each function explicit and avoids name collisions.
# import the randint function directly from numpy.random
from numpy.random import randint
# return a single random integer from 2 up to (but not including) 10
randint(2,10)