
Principal Component Analysis (PCA) on Iris Dataset
==================================================

This example applies a well-known decomposition technique, Principal Component Analysis (PCA), to the `Iris dataset <https://en.wikipedia.org/wiki/Iris_flower_data_set>`_.

This dataset is made of 4 features: sepal length, sepal width, petal length, and petal width. We use PCA to project this 4-dimensional feature space onto a 3-dimensional space.


Principal Component Analysis as variance-maximizing projection: PCA finds the orthogonal directions (principal components) in feature space along which the data varies the most. Mathematically, it performs eigendecomposition of the covariance matrix (or equivalently, SVD of the centered data matrix), ranking components by their explained variance ratio. For the 4-dimensional iris dataset, projecting onto 3 principal components captures most of the total variance while enabling 3D visualization. The pairplot of original features reveals that petal length and width are the most discriminant, which aligns with PCA’s tendency to place these high-variance features prominently in the first component.
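To make the eigendecomposition view concrete, here is a small sketch (not part of the original example) that computes the explained variance ratios by hand from the covariance matrix of the data and compares them with what scikit-learn's `PCA` reports:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Eigendecomposition of the covariance matrix (np.cov centers the data itself)
cov = np.cov(X, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order
ratios_by_hand = eigenvalues / eigenvalues.sum()

# scikit-learn's PCA (which uses SVD internally) reports the same ratios
pca = PCA(n_components=4).fit(X)
print(ratios_by_hand)
print(pca.explained_variance_ratio_)
```

Both computations agree because the eigenvalues of the covariance matrix equal the squared singular values of the centered data matrix divided by ``n_samples - 1``.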

Unsupervised dimensionality reduction for visualization: Unlike supervised methods (e.g., LDA), PCA does not use class labels – it finds directions of maximum variance regardless of whether those directions separate classes. For the iris dataset, this works well because the directions of greatest variance happen to coincide with the directions that separate species, particularly the linearly separable Setosa. The 3D scatter plot colored by species shows that the first principal component alone nearly separates all three classes, demonstrating that PCA is an effective preprocessing step for visualization and can also serve as a feature extraction technique to reduce input dimensionality before classification.
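The contrast with a supervised method can be sketched in a few lines (this is an illustrative aside, not part of the original example): PCA is fit on the features alone, while LDA additionally consumes the class labels.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA ignores the labels: it only looks at the variance of X
X_pca = PCA(n_components=2).fit_transform(X)

# LDA uses the labels to find directions that separate the classes
# (at most n_classes - 1 = 2 components for the three iris species)
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)
```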

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

# %%
# Loading the Iris dataset
# ------------------------
#
# The Iris dataset is directly available as part of scikit-learn. It can be loaded
# using the :func:`~sklearn.datasets.load_iris` function. With the default parameters,
# a :class:`~sklearn.utils.Bunch` object is returned, containing the data, the
# target values, the feature names, and the target names.
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
print(iris.keys())

# %%
# Plot of pairs of features of the Iris dataset
# ---------------------------------------------
#
# Let's first plot the pairs of features of the Iris dataset.
import seaborn as sns

# Rename classes using the iris target names
iris.frame["target"] = iris.target_names[iris.target]
_ = sns.pairplot(iris.frame, hue="target")

# %%
# Each data point on each scatter plot refers to one of the 150 iris flowers
# in the dataset, with the color indicating their respective type
# (Setosa, Versicolor, and Virginica).
#
# You can already see a pattern regarding the Setosa type, which is
# easily identifiable based on its short and wide sepals. Considering
# only these two dimensions, sepal width and length, there is still
# overlap between the Versicolor and Virginica types.
#
# The diagonal of the plot shows the distribution of each feature. We observe
# that the petal width and the petal length are the most discriminant features
# for the three types.
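A quick numeric check (an illustrative aside, not part of the original example) backs up what the pairplot suggests: the per-class feature means differ much more for the petal measurements than for the sepal ones.

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.copy()
df["target"] = iris.target_names[iris.target]

# Per-class means: petal length and width spread the three species apart
# far more than the sepal measurements do
class_means = df.groupby("target").mean()
print(class_means)
```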
#
# Plot a PCA representation
# -------------------------
# Let's apply a Principal Component Analysis (PCA) to the iris dataset
# and then plot the irises across the first three principal components.
# This will allow us to better differentiate among the three types!

import matplotlib.pyplot as plt

# unused but required import for doing 3d projections with matplotlib < 3.2
import mpl_toolkits.mplot3d  # noqa: F401

from sklearn.decomposition import PCA

fig = plt.figure(1, figsize=(8, 6))
ax = fig.add_subplot(111, projection="3d", elev=-150, azim=110)

X_reduced = PCA(n_components=3).fit_transform(iris.data)
scatter = ax.scatter(
    X_reduced[:, 0],
    X_reduced[:, 1],
    X_reduced[:, 2],
    c=iris.target,
    s=40,
)

ax.set(
    title="First three principal components",
    xlabel="1st Principal Component",
    ylabel="2nd Principal Component",
    zlabel="3rd Principal Component",
)
ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])

# Add a legend
legend1 = ax.legend(
    scatter.legend_elements()[0],
    iris.target_names.tolist(),
    loc="upper right",
    title="Classes",
)
ax.add_artist(legend1)

plt.show()

# %%
# PCA creates 3 new features, each a linear combination of the 4 original
# features, ordered so that each successive component captures as much of the
# remaining variance as possible. With this transformation, we can identify
# each species using only the first principal component.
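The claim about the first principal component can be verified directly. The sketch below (a supplementary check, not part of the original example) prints the fraction of variance each component explains, along with the mean position of each species on the first component:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
pca = PCA(n_components=3).fit(iris.data)

# Fraction of the total variance captured by each component
print(pca.explained_variance_ratio_)

# Mean position of each species along the first principal component:
# the class means are well separated on this single axis
X1 = pca.transform(iris.data)[:, 0]
for label, name in enumerate(iris.target_names):
    print(name, round(X1[iris.target == label].mean(), 2))
```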