
matplotlib¶

Credits: Content forked from Parallel Machine Learning with scikit-learn and IPython by Olivier Grisel

Setting Global Parameters
Basic Line Plots
Histograms
Overlaying Two Histograms
Scatter Plots

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn

Setting Global Parameters¶

Before creating any plots, it is good practice to configure global defaults that apply to all subsequent figures. plt.rc('figure', figsize=(10, 5)) sets the default figure width and height (in inches), so you do not have to pass figsize for every figure. seaborn.set(), an alias of seaborn.set_theme() in recent seaborn releases, applies seaborn’s default theme (font sizes, grid lines, and color palette) on top of matplotlib’s defaults. In data science workflows, consistent styling makes it easier to compare visualizations across notebooks and share them in reports.

# Set the global default size of matplotlib figures
plt.rc('figure', figsize=(10, 5))

# Set seaborn aesthetic parameters to defaults
seaborn.set()
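When a single figure needs a different size, a temporary override can be less intrusive than changing the global default again. The sketch below uses plt.rc_context for this; the (4, 3) size and the variable names are illustrative choices, not part of the original notebook.

```python
import matplotlib.pyplot as plt

# Global default, as set in the cell above
plt.rc('figure', figsize=(10, 5))
global_size = list(plt.rcParams['figure.figsize'])

# Temporary override: applies only to figures created inside the with-block
with plt.rc_context({'figure.figsize': (4, 3)}):  # (4, 3) is an arbitrary example size
    fig_small = plt.figure()
    small_size = fig_small.get_size_inches().tolist()

# Outside the block, the global default is in effect again
fig_default = plt.figure()
default_size = fig_default.get_size_inches().tolist()

print(global_size, small_size, default_size)
plt.close('all')  # discard the empty example figures
```

The same rcParams mapping can be inspected at any time to check which defaults are currently active.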

Basic Line Plots¶

Line plots are the most fundamental visualization type, ideal for showing how a quantity changes across an ordered variable (typically time or a continuous input). plt.plot() accepts x and y arrays plus a format string – 'o-' means circular markers connected by solid lines, while 'x-' uses x-shaped markers. Adding legend(), title(), xlabel(), and ylabel() transforms a bare chart into a self-documenting figure. Below, we compare linear and quadratic growth side by side, a common pattern when evaluating algorithmic complexity or model scaling behavior.

x = np.linspace(0, 2, 10)

plt.plot(x, x, 'o-', label='linear')
plt.plot(x, x ** 2, 'x-', label='quadratic')

plt.legend(loc='best')
plt.title('Linear vs Quadratic progression')
plt.xlabel('Input')
plt.ylabel('Output');
plt.show()
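The format string is shorthand for separate marker and line-style settings. As a minimal sketch, the same plot can be written with the object-oriented interface (plt.subplots and Axes methods), spelling out the second series with keyword arguments; the variable names line_fmt and line_kw are only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2, 10)

fig, ax = plt.subplots()

# Format string: 'o' selects circle markers, '-' a solid line
(line_fmt,) = ax.plot(x, x, 'o-', label='linear')

# Equivalent keyword form, here with x-shaped markers
(line_kw,) = ax.plot(x, x ** 2, marker='x', linestyle='-', label='quadratic')

ax.legend(loc='best')
ax.set_title('Linear vs Quadratic progression')
ax.set_xlabel('Input')
ax.set_ylabel('Output')
plt.show()
```

Working on an explicit Axes object tends to scale better once a figure contains several subplots, since each call names the axes it modifies.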

Histograms¶

Histograms reveal the distribution of a single numeric variable by grouping values into bins and counting how many fall into each. They are essential for detecting skewness, multimodality, and outliers before building any model. np.random.normal() generates samples from a Gaussian distribution with specified loc (mean) and scale (standard deviation). The bins parameter controls resolution – too few bins hide structure, too many create noise. As a rule of thumb, start with 30-50 bins and adjust based on sample size.

# Gaussian, mean 1, stddev .5, 1000 elements
samples = np.random.normal(loc=1.0, scale=0.5, size=1000)
print(samples.shape)
print(samples.dtype)
print(samples[:30])
plt.hist(samples, bins=50);
plt.show()
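To see how the bins parameter changes the picture, one option is to draw the same sample with several bin counts side by side. This is a sketch under assumptions: the seed, the bin counts (5, 50, 500), and the figure size are arbitrary choices made for the illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # fixed seed, assumed here only for reproducibility
samples = rng.normal(loc=1.0, scale=0.5, size=1000)

bin_counts = (5, 50, 500)  # too few, moderate, too many (illustrative values)
fig, axes = plt.subplots(1, 3, figsize=(12, 3))

totals = []
edge_counts = []
for ax, n_bins in zip(axes, bin_counts):
    # hist returns the per-bin counts, the bin edges, and the drawn patches
    counts, edges, _ = ax.hist(samples, bins=n_bins)
    ax.set_title(f'bins={n_bins}')
    totals.append(int(counts.sum()))
    edge_counts.append(len(edges))

print(totals)       # every sample lands in exactly one bin
print(edge_counts)  # n bins are delimited by n + 1 edges
plt.show()
```

With an integer bins argument the edges span the minimum to the maximum of the data, so no sample is dropped regardless of the bin count.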

Overlaying Two Histograms¶

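Plotting two distributions on the same axes is the fastest way to visually compare them. The alpha parameter (0 to 1) controls transparency, so overlapping regions remain visible. Here, a Gaussian distribution (mean 1, standard deviation 0.5) and a Student’s t-distribution with 10 degrees of freedom (centered at 0) are drawn with shared bin edges created by np.linspace(-3, 3, 50), which yields 50 edges and therefore 49 bins; samples outside [-3, 3] are not drawn. Most of the visible difference in this plot comes from the different centers and spreads of the two samples. The t-distribution also has heavier tails than a Gaussian of comparable scale, meaning extreme values are more likely, a pattern that matters when modeling financial returns or any data where outliers are expected.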

samples_1 = np.random.normal(loc=1, scale=.5, size=10000)
samples_2 = np.random.standard_t(df=10, size=10000)
bins = np.linspace(-3, 3, 50)

# Set an alpha and use the same bins since we are plotting two hists
plt.hist(samples_1, bins=bins, alpha=0.5, label='samples 1')
plt.hist(samples_2, bins=bins, alpha=0.5, label='samples 2')
plt.legend(loc='upper left');
plt.show()
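The heavier tails are easier to see when the t-distribution is compared against a standard normal, which has the same center. The following sketch makes that comparison; the seed, the sample size, the threshold of 3, the wider bin range, and the log-scaled y-axis are all assumptions chosen for this illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # fixed seed, assumed only for reproducibility
n = 100_000                     # larger sample so the tails are populated

normal_samples = rng.standard_normal(n)
t_samples = rng.standard_t(df=10, size=n)

# Fraction of samples more than 3 units away from zero
normal_tail = np.mean(np.abs(normal_samples) > 3)
t_tail = np.mean(np.abs(t_samples) > 3)
print(f'normal: {normal_tail:.4f}, t (df=10): {t_tail:.4f}')

# Shared bin edges; density=True normalizes each histogram to unit area
bins = np.linspace(-6, 6, 61)
plt.hist(normal_samples, bins=bins, alpha=0.5, density=True, label='standard normal')
plt.hist(t_samples, bins=bins, alpha=0.5, density=True, label='t, df=10')
plt.yscale('log')  # a log axis makes the sparse tail bins visible
plt.legend(loc='upper left')
plt.show()
```

On the log-scaled axis the t-distribution histogram should sit above the normal one toward both ends of the range.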

Scatter Plots¶

Scatter plots display the relationship between two continuous variables, with each point representing one observation. They are the primary tool for visually detecting correlations, clusters, and outliers in bivariate data. The alpha parameter is especially important for large datasets: setting it low (e.g., 0.1) creates a density effect where darker regions indicate more data points, effectively turning the scatter plot into a rough density estimate without any binning. Because samples_1 and samples_2 were drawn independently, the plot shows an uncorrelated cloud centered near (1, 0) rather than a trend.

plt.scatter(samples_1, samples_2, alpha=0.1);
plt.show()
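To show what a correlation looks like next to an uncorrelated cloud, the sketch below plots an independent pair beside a pair built to be correlated. The construction y_dep = x + noise, the seed, and the variable names are assumptions made for this example, not part of the original notebook.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # fixed seed, assumed only for reproducibility
n = 10_000

x = rng.normal(loc=1.0, scale=0.5, size=n)
y_indep = rng.standard_t(df=10, size=n)    # drawn independently of x
y_dep = x + rng.normal(scale=0.5, size=n)  # depends on x by construction

# Pearson correlation coefficients for both pairs
r_indep = np.corrcoef(x, y_indep)[0, 1]
r_dep = np.corrcoef(x, y_dep)[0, 1]

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))
ax_left.scatter(x, y_indep, alpha=0.1)
ax_left.set_title(f'independent (r = {r_indep:.2f})')
ax_right.scatter(x, y_dep, alpha=0.1)
ax_right.set_title(f'dependent (r = {r_dep:.2f})')
plt.show()
```

The low alpha value keeps the dense core of each cloud distinguishable from its sparse fringe, which is where a tilt in the dependent pair becomes visible.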