Density Estimation: Gaussian Mixture Models
Credits: Forked from PyCon 2015 Scikit-learn Tutorial by Jake VanderPlas
Here we'll explore Gaussian Mixture Models, an unsupervised technique for clustering and density estimation.
We'll start with our standard set of initial imports:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# use seaborn plotting defaults
import seaborn as sns; sns.set()
Introducing Gaussian Mixture Models
We previously saw an example of K-Means, a clustering algorithm most often fit using an expectation-maximization approach.
Here we'll consider an extension of this idea which is suitable for both clustering and density estimation.
For example, imagine we have some one-dimensional data in a particular distribution:
np.random.seed(2)
x = np.concatenate([np.random.normal(0, 2, 2000),
np.random.normal(5, 5, 2000),
np.random.normal(3, 0.5, 600)])
plt.hist(x, 80, density=True)
plt.xlim(-10, 20);
Fitting a GMM with 4 components approximates the density by finding the best combination of 4 Gaussians (each with its own mean, variance, and weight). The red curve shows the fitted density overlaid on the histogram. Unlike K-Means, which assigns each point to exactly one cluster, GMM assigns probabilistic memberships: each point has a probability of belonging to each component.
from sklearn.mixture import GaussianMixture
clf = GaussianMixture(4, max_iter=500, random_state=3).fit(x[:, None])
xpdf = np.linspace(-10, 20, 1000)
density = np.exp(clf.score_samples(xpdf[:, None]))
plt.hist(x, 80, density=True, alpha=0.5)
plt.plot(xpdf, density, '-r')
plt.xlim(-10, 20);
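The probabilistic memberships mentioned above can be inspected directly with `predict_proba`, which returns one row of component probabilities per point. A minimal sketch (the block refits the same synthetic mixture so it runs standalone; the three query points are chosen only for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Rebuild the synthetic 1D data and fit, as above
rng = np.random.RandomState(2)
x = np.concatenate([rng.normal(0, 2, 2000),
                    rng.normal(5, 5, 2000),
                    rng.normal(3, 0.5, 600)])
clf = GaussianMixture(4, max_iter=500, random_state=3).fit(x[:, None])

# Soft memberships: one row per query point, one column per component
probs = clf.predict_proba(np.array([[0.0], [3.0], [10.0]]))
print(probs.round(3))
print(probs.sum(axis=1))  # each row sums to 1
```

Compare this with K-Means' `predict`, which would return a single hard label per point.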
Note that this density is fit using a mixture of Gaussians, which we can examine by looking at the means_, covariances_, and weights_ attributes:
clf.means_
clf.covariances_
clf.weights_
plt.hist(x, 80, density=True, alpha=0.3)
plt.plot(xpdf, density, '-r')
for i in range(clf.n_components):
    pdf = clf.weights_[i] * stats.norm(clf.means_[i, 0],
                                       np.sqrt(clf.covariances_[i, 0, 0])).pdf(xpdf)
    plt.fill(xpdf, pdf, facecolor='gray',
             edgecolor='none', alpha=0.3)
plt.xlim(-10, 20);
These individual Gaussian distributions are fit using an expectation-maximization method, much as in K-Means, except that rather than an explicit cluster assignment, the posterior probabilities are used to compute the weighted means and covariances. Somewhat surprisingly, this algorithm provably converges to an optimum (though the optimum is not necessarily global).
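The EM procedure just described can be sketched in a few lines of NumPy. This is a minimal illustration for a two-component 1D mixture with hypothetical starting values, not sklearn's implementation:

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 300)])

# Hypothetical initial guesses for illustration
means = np.array([-1.0, 6.0])
stds = np.array([1.0, 1.0])
weights = np.array([0.5, 0.5])

for _ in range(50):
    # E-step: posterior responsibility of each component for each point
    dens = weights * stats.norm(means, stds).pdf(x[:, None])   # shape (n, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: responsibility-weighted updates (soft analogue of K-Means' means)
    nk = resp.sum(axis=0)
    weights = nk / len(x)
    means = (resp * x[:, None]).sum(axis=0) / nk
    stds = np.sqrt((resp * (x[:, None] - means) ** 2).sum(axis=0) / nk)

print(np.sort(means).round(2), weights.round(2))
```

The only difference from K-Means' update is that `resp` contains soft posterior probabilities rather than a 0/1 assignment.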
How many Gaussians?
Given a model, we can use one of several criteria to evaluate how well it fits the data. For example, there are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC):
print(clf.bic(x[:, None]))
print(clf.aic(x[:, None]))
Plotting BIC and AIC as a function of the number of components reveals the optimal model complexity. Both criteria penalize models for having too many parameters, balancing fit quality against complexity. The minimum of the curve indicates the best number of Gaussians. BIC tends to favor simpler models than AIC, which can be preferable when you want to avoid overfitting.
n_estimators = np.arange(1, 10)
clfs = [GaussianMixture(n, max_iter=1000).fit(x[:, None]) for n in n_estimators]
bics = [clf.bic(x[:, None]) for clf in clfs]
aics = [clf.aic(x[:, None]) for clf in clfs]
plt.plot(n_estimators, bics, label='BIC')
plt.plot(n_estimators, aics, label='AIC')
plt.legend();
Both AIC and BIC agree that 4 components is the optimal choice, one more than the 3 Gaussian distributions we used to generate the data (the algorithm found an additional component to handle the overlap region).
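Selecting the best model programmatically is then just a matter of taking the argmin of the criterion. A short sketch (refitting the models so the block runs standalone; the exact winner can vary with random initialization):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(2)
x = np.concatenate([rng.normal(0, 2, 2000),
                    rng.normal(5, 5, 2000),
                    rng.normal(3, 0.5, 600)])

n_components = np.arange(1, 10)
models = [GaussianMixture(n, max_iter=1000, random_state=0).fit(x[:, None])
          for n in n_components]
aics = [m.aic(x[:, None]) for m in models]

# The model minimizing the criterion is the preferred one
best = models[int(np.argmin(aics))]
print("best n_components by AIC:", best.n_components)
```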
Example: GMM For Outlier Detection
GMM is what's known as a Generative Model: it's a probabilistic model from which a dataset can be generated. One thing that generative models can be useful for is outlier detection: we can simply evaluate the likelihood of each point under the generative model; the points with a suitably low likelihood (where "suitably" is up to your own bias/variance preference) can be labeled outliers.
Let's take a look at this by defining a new dataset with some outliers:
np.random.seed(0)
# Add 20 outliers
true_outliers = np.sort(np.random.randint(0, len(x), 20))
y = x.copy()
y[true_outliers] += 50 * np.random.randn(20)
clf = GaussianMixture(4, max_iter=500, random_state=0).fit(y[:, None])
xpdf = np.linspace(-10, 20, 1000)
density_noise = np.exp(clf.score_samples(xpdf[:, None]))
plt.hist(y, 80, density=True, alpha=0.5)
plt.plot(xpdf, density_noise, '-r')
#plt.xlim(-10, 20);
The score_samples() method returns the log-likelihood of each data point under the fitted model. Points with very low log-likelihood are unlikely to have been generated by the model, making them candidate outliers. Plotting log-likelihood against the data values reveals a clear separation between normal points and outliers.
log_likelihood = clf.score_samples(y[:, None])
plt.plot(y, log_likelihood, '.k');
detected_outliers = np.where(log_likelihood < -9)[0]
print("true outliers:")
print(true_outliers)
print("\ndetected outliers:")
print(detected_outliers)
The algorithm misses a few of these points, which is to be expected (some of the "outliers" actually land in the middle of the distribution!)
Here are the outliers that were missed:
set(true_outliers) - set(detected_outliers)
Checking false positives: points that were flagged as outliers but were actually normal data. A small number of false positives is acceptable and expected, since some normal points may land in low-density regions by chance.
set(detected_outliers) - set(true_outliers)
Finally, we should note that although all of the above is done in one dimension, GMM does generalize to multiple dimensions, as we'll see in the breakout session.
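As a quick preview of the multi-dimensional case, a minimal 2D sketch (with two hypothetical well-separated blobs chosen only for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Two well-separated 2D blobs of 200 points each
X = np.vstack([rng.normal(0, 1, (200, 2)),
               rng.normal(6, 1, (200, 2))])

# Exactly the same API as in 1D; each component now has a 2D mean
# and a full 2x2 covariance matrix
gmm = GaussianMixture(2, random_state=0).fit(X)
print(gmm.means_.round(1))        # recovers the two blob centers
print(np.bincount(gmm.predict(X)))  # cluster sizes
```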
Other Density Estimators
The other main density estimator that you might find useful is Kernel Density Estimation, which is available via sklearn.neighbors.KernelDensity. In some ways, this can be thought of as a generalization of GMM in which a Gaussian is placed at the location of every training point!
from sklearn.neighbors import KernelDensity
kde = KernelDensity(bandwidth=0.15).fit(x[:, None])
density_kde = np.exp(kde.score_samples(xpdf[:, None]))
plt.hist(x, 80, density=True, alpha=0.5)
plt.plot(xpdf, density, '-b', label='GMM')
plt.plot(xpdf, density_kde, '-r', label='KDE')
plt.xlim(-10, 20)
plt.legend();
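The bandwidth above (0.15) was picked by hand; a common alternative is to choose it by cross-validation, since KernelDensity's `score` is the held-out log-likelihood. A sketch using GridSearchCV (the grid of candidate bandwidths is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(2)
x = np.concatenate([rng.normal(0, 2, 500), rng.normal(5, 5, 500)])

# Cross-validate over a small grid of candidate bandwidths,
# scoring each by held-out log-likelihood
grid = GridSearchCV(KernelDensity(),
                    {'bandwidth': np.linspace(0.1, 1.0, 10)},
                    cv=5)
grid.fit(x[:, None])
print("best bandwidth:", grid.best_params_['bandwidth'])
```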
All of these density estimators can be viewed as generative models of the data: that is, the model tells us how more data can be created which fits the model.