Plot GMM PDF

Density Estimation for a Gaussian Mixture

Plot the density estimation of a mixture of two Gaussians. Data is generated from two Gaussians with different centers and covariance matrices.

Imports for GMM Density Estimation

A Gaussian mixture model is a flexible density estimator: GaussianMixture with covariance_type='full' models the data distribution as a weighted sum of multivariate Gaussians, where each component has its own mean vector and full covariance matrix. The EM algorithm alternates between computing soft cluster assignments (responsibilities) and updating the component parameters to increase the likelihood. With enough components, a GMM can approximate a continuous probability density to arbitrary accuracy, which makes it useful not just for clustering but for density estimation: p(x) can be evaluated at any point in the feature space via score_samples, which returns the log-likelihood.
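As a sanity check on the density-estimation claim, the following sketch (not part of the original example; the 1-D data and grid are illustrative choices) fits a two-component GMM and verifies that exponentiating score_samples yields a density that integrates to approximately 1:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# 1-D sample drawn from two well-separated unit-variance Gaussians
data = np.concatenate([rng.randn(200) - 3, rng.randn(200) + 3]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(data)

# score_samples returns log p(x); exponentiating gives the density itself,
# which should integrate to ~1 over a grid wide enough to capture the tails
grid = np.linspace(-10, 10, 2001).reshape(-1, 1)
density = np.exp(gmm.score_samples(grid))
integral = np.trapz(density.ravel(), grid.ravel())
```

The same identity underlies the contour plot below: negating score_samples on a 2-D grid turns the log-density surface into a negative log-likelihood surface.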

The negative log-likelihood contour plot reveals the learned density surface: calling clf.score_samples(XX) on a grid of points and negating the result produces a surface where valleys correspond to high-density regions (near cluster centers) and peaks correspond to low-density regions. The LogNorm colormap and logarithmically spaced contour levels (np.logspace(0, 3, 10)) are necessary because the likelihood varies over several orders of magnitude; linear spacing would compress all the interesting structure into a few contour lines. The two generated components (a shifted spherical Gaussian, and a stretched Gaussian produced by the linear transformation C) have very different shapes, demonstrating that covariance_type='full' can capture both isotropic and anisotropic clusters simultaneously.
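To see why logarithmic level spacing matters, a quick sketch (illustrative, not from the original example) of the levels used below:

```python
import numpy as np

# Ten logarithmically spaced contour levels from 10**0 to 10**3
levels = np.logspace(0, 3, 10)

# Consecutive levels differ by a constant *factor* (10**(3/9) ~ 2.15),
# so each contour step covers the same multiplicative slice of the
# negative log-likelihood range; linspace(1, 1000, 10) would instead
# place nine of its ten levels above 100, missing the low-valued
# structure near the cluster centers.
ratios = levels[1:] / levels[:-1]
```

These are exactly the values passed as levels= to plt.contour in the script below.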

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import LogNorm

from sklearn import mixture

n_samples = 300

# generate random sample, two components
np.random.seed(0)

# generate spherical data centered on (20, 20)
shifted_gaussian = np.random.randn(n_samples, 2) + np.array([20, 20])

# generate zero centered stretched Gaussian data
C = np.array([[0.0, -0.7], [3.5, 0.7]])
stretched_gaussian = np.dot(np.random.randn(n_samples, 2), C)

# concatenate the two datasets into the final training set
X_train = np.vstack([shifted_gaussian, stretched_gaussian])

# fit a Gaussian Mixture Model with two components
clf = mixture.GaussianMixture(n_components=2, covariance_type="full")
clf.fit(X_train)

# display predicted scores by the model as a contour plot
x = np.linspace(-20.0, 30.0)
y = np.linspace(-20.0, 40.0)
X, Y = np.meshgrid(x, y)
XX = np.array([X.ravel(), Y.ravel()]).T
Z = -clf.score_samples(XX)
Z = Z.reshape(X.shape)

CS = plt.contour(
    X, Y, Z, norm=LogNorm(vmin=1.0, vmax=1000.0), levels=np.logspace(0, 3, 10)
)
CB = plt.colorbar(CS, shrink=0.8, extend="both")
plt.scatter(X_train[:, 0], X_train[:, 1], s=0.8)

plt.title("Negative log-likelihood predicted by a GMM")
plt.axis("tight")
plt.show()