Plot Grid Search Text Feature Extraction

Sample pipeline for text feature extraction and evaluation

The dataset used in this example is :ref:`20newsgroups_dataset` which will be automatically downloaded, cached and reused for the document classification example.

In this example, we tune the hyperparameters of a particular classifier using a :class:`~sklearn.model_selection.RandomizedSearchCV`. For a demo on the performance of some other classifiers, see the :ref:`sphx_glr_auto_examples_text_plot_document_classification_20newsgroups.py` notebook.

Helper for Cleaning Pipeline Parameter Names

Removing pipeline prefixes for readability: The shorten_param function strips the component prefix (e.g., vect__ or clf__) from parameter names in cv_results_, making column headers and plot labels cleaner. This is necessary because scikit-learn pipelines namespace parameters with double underscores to avoid ambiguity, but these prefixes clutter visualizations when the pipeline structure is already understood.

def shorten_param(param_name):
    """Remove components' prefixes in param_name."""
    if "__" in param_name:
        return param_name.rsplit("__", 1)[1]
    return param_name


import pandas as pd

# `random_search` is the fitted search object whose results are analyzed here.
cv_results = pd.DataFrame(random_search.cv_results_)
cv_results = cv_results.rename(shorten_param, axis=1)
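
# %%
# As a quick sanity check, the snippet below applies the same `rsplit` logic to
# a few hypothetical column names. The names are illustrative only, and the
# helper is repeated so that the snippet runs on its own.

```python
def shorten_param(param_name):
    """Remove components' prefixes in param_name."""
    if "__" in param_name:
        return param_name.rsplit("__", 1)[1]
    return param_name


# Hypothetical column names, similar in shape to those found in `cv_results_`.
example_names = [
    "param_vect__ngram_range",
    "param_clf__alpha",
    "mean_test_score",  # no double underscore: returned unchanged
    "param_union__vect__max_df",  # nested prefix: only the last part is kept
]
shortened = [shorten_param(name) for name in example_names]
print(shortened)
```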

# %%
# We can use a `plotly.express.scatter
# <https://plotly.com/python-api-reference/generated/plotly.express.scatter.html>`_
# to visualize the trade-off between scoring time and mean test score (i.e. "CV
# score"). Passing the cursor over a given point displays the corresponding
# parameters. Error bars correspond to one standard deviation as computed in the
# different folds of the cross-validation.

import plotly.express as px

param_names = [shorten_param(name) for name in parameter_grid.keys()]
labels = {
    "mean_score_time": "CV Score time (s)",
    "mean_test_score": "CV score (accuracy)",
}
fig = px.scatter(
    cv_results,
    x="mean_score_time",
    y="mean_test_score",
    error_x="std_score_time",
    error_y="std_test_score",
    hover_data=param_names,
    labels=labels,
)
fig.update_layout(
    title={
        "text": "Trade-off between scoring time and mean test score",
        "y": 0.95,
        "x": 0.5,
        "xanchor": "center",
        "yanchor": "top",
    }
)
fig
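
# %%
# To make the meaning of the error bars concrete, here is a minimal sketch that
# computes a mean and a standard deviation from per-fold scores. The fold scores
# are hypothetical, and the use of the population standard deviation (no Bessel
# correction) is an assumption about how `std_test_score` is computed.

```python
import statistics

# Hypothetical per-fold accuracies for one candidate (5-fold cross-validation).
fold_scores = [0.90, 0.92, 0.88, 0.91, 0.89]

# Center of the point on the score axis.
mean_score = statistics.fmean(fold_scores)
# Half-length of the error bar: population standard deviation across folds.
std_score = statistics.pstdev(fold_scores)

print(f"{mean_score:.3f} +/- {std_score:.3f}")
```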

# %%
# Notice that the cluster of models in the upper-left corner of the plot has
# the best trade-off between accuracy and scoring time. In this case, using
# bigrams increases the required scoring time without considerably improving
# the accuracy of the pipeline.
#
# .. note:: For more information on how to customize an automated tuning to
#    maximize score and minimize scoring time, see the example notebook
#    :ref:`sphx_glr_auto_examples_model_selection_plot_grid_search_digits.py`.
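#
# A minimal sketch of such a rule, using hypothetical numbers rather than the
# results of this search: keep the candidates whose mean score is within one
# standard deviation of the best one, then pick the fastest among them. The
# dictionary keys below are illustrative; with scikit-learn, a similar rule
# could be passed to the search as a callable `refit`.

```python
# Hypothetical candidates with their CV score, its std and the scoring time (s).
candidates = [
    {"name": "unigrams", "score": 0.950, "std": 0.010, "time": 0.20},
    {"name": "bigrams", "score": 0.955, "std": 0.008, "time": 0.45},
    {"name": "weak", "score": 0.900, "std": 0.012, "time": 0.15},
]

# Best candidate according to the mean score alone.
best = max(candidates, key=lambda c: c["score"])
# Any score above this threshold is treated as "as good as the best".
threshold = best["score"] - best["std"]
good_enough = [c for c in candidates if c["score"] >= threshold]
# Among those, prefer the candidate with the lowest scoring time.
selected = min(good_enough, key=lambda c: c["time"])
print(selected["name"])
```

# %%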
#
# We can also use a `plotly.express.parallel_coordinates
# <https://plotly.com/python-api-reference/generated/plotly.express.parallel_coordinates.html>`_
# to further visualize the mean test score as a function of the tuned
# hyperparameters. This helps to find interactions between more than two
# hyperparameters and provides intuition on their relevance for improving the
# performance of a pipeline.
#
# We apply a `math.log10` transformation on the `alpha` axis to spread the
# active range and improve the readability of the plot. A value :math:`x` on
# said axis is to be understood as :math:`10^x`.

import math

column_results = param_names + ["mean_test_score", "mean_score_time"]

transform_funcs = dict.fromkeys(column_results, lambda x: x)
# Using a logarithmic scale for alpha
transform_funcs["alpha"] = math.log10
# L1 norms are mapped to index 1, and L2 norms to index 2
transform_funcs["norm"] = lambda x: 2 if x == "l2" else 1
# Unigrams are mapped to index 1 and bigrams to index 2
transform_funcs["ngram_range"] = lambda x: x[1]

fig = px.parallel_coordinates(
    cv_results[column_results].apply(transform_funcs),
    color="mean_test_score",
    color_continuous_scale=px.colors.sequential.Viridis_r,
    labels=labels,
)
fig.update_layout(
    title={
        "text": "Parallel coordinates plot of text classifier pipeline",
        "y": 0.99,
        "x": 0.5,
        "xanchor": "center",
        "yanchor": "top",
    }
)
fig
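
# %%
# The effect of these per-column transformations can be checked on a single
# hypothetical candidate. The values below are made up for illustration; the
# mapping functions mirror the ones defined above.

```python
import math

# Hypothetical hyperparameter values for one candidate.
candidate = {"alpha": 1e-3, "norm": "l2", "ngram_range": (1, 2)}

transforms = {
    "alpha": math.log10,  # position on the axis is the exponent of 10
    "norm": lambda x: 2 if x == "l2" else 1,  # "l1" -> 1, "l2" -> 2
    "ngram_range": lambda x: x[1],  # upper bound of the n-gram range
}
# Numeric encoding used to place the candidate on each axis.
encoded = {name: transforms[name](value) for name, value in candidate.items()}
print(encoded)
```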

# %%
# The parallel coordinates plot displays the values of the hyperparameters on
# different columns while the performance metric is color coded. It is possible
# to select a range of results by clicking and holding on any axis of the
# parallel coordinate plot. You can then slide (move) the range selection and
# cross two selections to see the intersections. You can undo a selection by
# clicking once again on the same axis.
#
# In particular for this hyperparameter search, it is interesting to notice that
# the top performing models do not seem to depend on the regularization `norm`,
# but they do depend on a trade-off between `max_df`, `min_df` and the
# regularization strength `alpha`. The reason is that including noisy features
# (i.e. `max_df` close to :math:`1.0` or `min_df` close to :math:`0`) tends to
# cause overfitting and therefore requires stronger regularization to
# compensate. Having fewer features requires less regularization and less
# scoring time.
#
# The best accuracy scores are obtained when `alpha` is between :math:`10^{-6}`
# and :math:`10^0`, regardless of the hyperparameter `norm`.