
Pipelining: chaining a PCA and a logistic regression

The PCA does an unsupervised dimensionality reduction, while the logistic regression does the prediction.

We use a GridSearchCV to set the dimensionality of the PCA.

Imports for PCA-Logistic Regression Pipeline with GridSearchCV

Chaining StandardScaler, PCA, and LogisticRegression in a Pipeline creates a single tunable estimator: the scaler normalizes each of the 64 pixel features to zero mean and unit variance (important because PCA is sensitive to feature scale), PCA projects the data onto its top principal components, and logistic regression classifies in the reduced space. GridSearchCV simultaneously searches over pca__n_components (5 to 60) and logistic__C (0.0001 to 10000), finding the combination that maximizes cross-validated accuracy. The double-underscore syntax (pca__n_components) lets GridSearchCV reach into named pipeline steps and set their parameters.
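As a minimal sketch of that double-underscore routing (demo_pipe is an illustrative stand-in mirroring the pipeline built below, not part of the example itself), get_params() exposes every nested parameter name, and set_params() writes through to the named step, which is exactly the mechanism GridSearchCV relies on:

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# demo_pipe is a hypothetical stand-in for the pipeline defined below
demo_pipe = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("pca", PCA()),
        ("logistic", LogisticRegression()),
    ]
)

# Every nested parameter is addressable as <step_name>__<parameter_name>
print("pca__n_components" in demo_pipe.get_params())  # True

# set_params() routes each value to the named step, as GridSearchCV does
demo_pipe.set_params(pca__n_components=30, logistic__C=1.0)
print(demo_pipe.named_steps["pca"].n_components)  # 30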

The PCA explained variance spectrum guides the choice of dimensionality: plotting pca.explained_variance_ratio_ against component index shows how much information each additional component captures. A sharp “elbow” indicates that most variance is concentrated in the first few components, while a gradual decline suggests the data has many relevant dimensions. The best n_components found by GridSearchCV is overlaid on this plot as a vertical line, so the dimensionality that maximizes cross-validated accuracy can be compared against the variance spectrum. Using polars.LazyFrame for results processing demonstrates sklearn’s compatibility with modern DataFrame libraries beyond pandas, filtering the CV results to find the best classifier score at each component count.
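As an aside (a minimal sketch, not part of the example below, assuming the same digits data; X_scaled, pca_95, and cumulative are illustrative names), PCA can also pick the dimensionality directly from the variance spectrum: passing a float between 0 and 1 as n_components keeps just enough components to explain that fraction of the variance.

import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = datasets.load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# A fractional n_components keeps the smallest number of components whose
# cumulative explained variance ratio reaches the requested fraction
pca_95 = PCA(n_components=0.95).fit(X_scaled)
print(pca_95.n_components_)

# The same count can be read off the cumulative spectrum by hand
cumulative = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)
print(int(np.searchsorted(cumulative, 0.95)) + 1)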

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

import matplotlib.pyplot as plt
import numpy as np
import polars as pl

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Define a pipeline to search for the best combination of PCA truncation
# and classifier regularization.
pca = PCA()
# Define a Standard Scaler to normalize inputs
scaler = StandardScaler()

# set the tolerance to a large value to make the example faster
logistic = LogisticRegression(max_iter=10000, tol=0.1)
pipe = Pipeline(steps=[("scaler", scaler), ("pca", pca), ("logistic", logistic)])

X_digits, y_digits = datasets.load_digits(return_X_y=True)
# Parameters of pipelines can be set using '__' separated parameter names:
param_grid = {
    "pca__n_components": [5, 15, 30, 45, 60],
    "logistic__C": np.logspace(-4, 4, 4),
}
search = GridSearchCV(pipe, param_grid, n_jobs=2)
search.fit(X_digits, y_digits)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
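
# With the default refit=True, the search object is itself a fitted
# estimator: search.best_estimator_ holds the pipeline refit on the full
# data, and search.predict() delegates to it.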

# Plot the PCA spectrum
pca.fit(X_digits)

fig, (ax0, ax1) = plt.subplots(nrows=2, sharex=True, figsize=(6, 6))
ax0.plot(
    np.arange(1, pca.n_components_ + 1), pca.explained_variance_ratio_, "+", linewidth=2
)
ax0.set_ylabel("PCA explained variance ratio")

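# Mark the number of components selected by the grid search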
ax0.axvline(
    search.best_estimator_.named_steps["pca"].n_components,
    linestyle=":",
    label="n_components chosen",
)
ax0.legend(prop=dict(size=12))

# For each number of components, find the best classifier results
components_col = "param_pca__n_components"
is_max_test_score = pl.col("mean_test_score") == pl.col("mean_test_score").max()
best_clfs = (
    pl.LazyFrame(search.cv_results_)
    .filter(is_max_test_score.over(components_col))
    .unique(components_col)
    .sort(components_col)
    .collect()
)
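# Plot mean validation accuracy per component count, with the standard
# deviation across CV folds as error bars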
ax1.errorbar(
    best_clfs[components_col],
    best_clfs["mean_test_score"],
    yerr=best_clfs["std_test_score"],
)
ax1.set_ylabel("Classification accuracy (val)")
ax1.set_xlabel("n_components")

plt.xlim(-1, 70)

plt.tight_layout()
plt.show()