Incremental PCA
===============
Incremental principal component analysis (IPCA) is typically used as a replacement for principal component analysis (PCA) when the dataset to be decomposed is too large to fit in memory. IPCA builds a low-rank approximation of the input data using an amount of memory which is independent of the number of input samples. It still depends on the number of input features, but changing the batch size allows for control of memory usage.
This example serves as a visual check that IPCA is able to find a projection of the data similar to PCA (up to a sign flip), while only processing a few samples at a time. This can be considered a "toy example", as IPCA is intended for large datasets which do not fit in main memory, requiring incremental approaches.
Imports for Incremental PCA on Large Datasets
Memory-efficient PCA via mini-batches: IncrementalPCA computes the same principal components as standard PCA but processes data in chunks (batch_size=10 samples at a time), keeping memory usage independent of the total number of samples. This is essential for datasets that exceed available RAM: instead of loading all data and computing the full SVD, IncrementalPCA maintains a running estimate of the covariance structure using sequential partial SVD updates. The fit_transform method internally iterates over mini-batches, or you can call partial_fit repeatedly on streaming data.
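The streaming use of partial_fit mentioned above can be sketched as follows; here the iris data stands in for a stream, and the chunk size of 10 is an arbitrary choice for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import IncrementalPCA

X = load_iris().data  # 150 samples, 4 features

# Feed the data in chunks of 10 samples, as a stream would arrive.
# Only one chunk needs to be in memory at a time.
ipca = IncrementalPCA(n_components=2)
for batch in np.array_split(X, len(X) // 10):
    ipca.partial_fit(batch)

# After all chunks are seen, project the data onto the learned components.
X_reduced = ipca.transform(X)
print(X_reduced.shape)  # (150, 2)
```

With real out-of-core data, the loop body would instead read each chunk from disk or a socket before calling partial_fit.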
Equivalence to standard PCA: The projections from IncrementalPCA and PCA should be nearly identical (up to a sign flip on individual components, which is inherent to SVD). The mean absolute unsigned error between the two projections quantifies how close the incremental approximation is to the exact solution. On the small iris dataset this error is negligible, but on larger datasets with smaller batch sizes relative to the feature count, the approximation may be slightly less precise. The key tradeoff is memory (controlled by batch_size) versus accuracy of the decomposition.
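The equivalence described above can be checked directly; this minimal sketch compares absolute values of the two projections to cancel the arbitrary per-component sign, mirroring the error computed in the plot titles below:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, IncrementalPCA

X = load_iris().data

X_pca = PCA(n_components=2).fit_transform(X)
X_ipca = IncrementalPCA(n_components=2, batch_size=10).fit_transform(X)

# Compare absolute values to ignore the sign flip inherent to SVD.
err = np.abs(np.abs(X_pca) - np.abs(X_ipca)).mean()
print("Mean absolute unsigned error:", err)  # small on iris
```

On this small dataset the error is close to zero; shrinking batch_size toward n_components would make the approximation coarser.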
# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, IncrementalPCA
iris = load_iris()
X = iris.data
y = iris.target
n_components = 2
ipca = IncrementalPCA(n_components=n_components, batch_size=10)
X_ipca = ipca.fit_transform(X)
pca = PCA(n_components=n_components)
X_pca = pca.fit_transform(X)
colors = ["navy", "turquoise", "darkorange"]
for X_transformed, title in [(X_ipca, "Incremental PCA"), (X_pca, "PCA")]:
    plt.figure(figsize=(8, 8))
    for color, i, target_name in zip(colors, [0, 1, 2], iris.target_names):
        plt.scatter(
            X_transformed[y == i, 0],
            X_transformed[y == i, 1],
            color=color,
            lw=2,
            label=target_name,
        )

    if "Incremental" in title:
        err = np.abs(np.abs(X_pca) - np.abs(X_ipca)).mean()
        plt.title(title + " of iris dataset\nMean absolute unsigned error %.6f" % err)
    else:
        plt.title(title + " of iris dataset")
    plt.legend(loc="best", shadow=False, scatterpoints=1)
    plt.axis([-4, 4, -1.5, 1.5])

plt.show()