Introduction to scikit-learn
scikit-learn is the most widely used machine learning library in Python. It provides a consistent API for supervised learning (classification, regression) and unsupervised learning (clustering, dimensionality reduction), along with tools for model selection, preprocessing, and evaluation. This notebook introduces the core API pattern (fit, predict, score) and demonstrates it with the classic Iris dataset and a K-Nearest Neighbors classifier.
Credits: Forked from PyCon 2015 Scikit-learn Tutorial by Jake VanderPlas
Machine Learning Models Cheat Sheet
Estimators
Introduction: Iris Dataset
K-Nearest Neighbors Classifier
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn
from sklearn.linear_model import LinearRegression
from scipy import stats
import pylab as pl
seaborn.set()
Machine Learning Models Cheat Sheet
The scikit-learn algorithm cheat sheet is an invaluable reference for choosing the right model. It guides you through a decision tree based on your data characteristics: how many samples you have, whether you are predicting a category or a quantity, and whether your data is labeled. Bookmark this flowchart; experienced data scientists refer to it regularly when starting new projects.
from IPython.display import Image
Image("http://scikit-learn.org/dev/_static/ml_map.png", width=800)
The Estimator API
Every algorithm in scikit-learn follows the Estimator interface, which provides a uniform workflow regardless of the underlying model. The key methods are fit() (learn from data), predict() (generate predictions), and score() (evaluate performance). This consistency means that once you learn the API with one model, switching to another (from logistic regression to random forests to SVMs) requires changing only the constructor call while keeping the rest of your pipeline intact.
Given a scikit-learn estimator object named model, the following methods are available:
Available in all Estimators
model.fit(): fit training data. For supervised learning applications, this accepts two arguments: the data X and the labels y (e.g. model.fit(X, y)). For unsupervised learning applications, this accepts only a single argument, the data X (e.g. model.fit(X)).
Available in supervised estimators
model.predict(): given a trained model, predict the label of a new set of data. This method accepts one argument, the new data X_new (e.g. model.predict(X_new)), and returns the learned label for each object in the array.
model.predict_proba(): for classification problems, some estimators also provide this method, which returns the probability that a new observation has each categorical label. In this case, the label with the highest probability is returned by model.predict().
model.score(): for classification or regression problems, most estimators implement a score method, with a larger score indicating a better fit.
Available in unsupervised estimators
model.predict(): predict labels in clustering algorithms.
model.transform(): given an unsupervised model, transform new data into the new basis. This also accepts one argument X_new, and returns the new representation of the data based on the unsupervised model.
model.fit_transform(): some estimators implement this method, which more efficiently performs a fit and a transform on the same input data.
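To make this uniformity concrete, here is a minimal sketch showing that the same fit/score calls work unchanged across supervised models, and that unsupervised models follow the same pattern with transform. The particular estimators used here (KNeighborsClassifier, DecisionTreeClassifier, PCA) are arbitrary choices for illustration.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# the same two calls work for any supervised estimator;
# only the constructor changes between models
for model in (KNeighborsClassifier(n_neighbors=5), DecisionTreeClassifier()):
    model.fit(X, y)
    print(model.__class__.__name__, model.score(X, y))

# unsupervised estimators follow the same pattern with transform()
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)  # fit and transform in one step
print(X_2d.shape)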
The Iris Dataset
The Iris dataset is the "hello world" of machine learning: 150 samples of iris flowers from three species, each described by four measurements (sepal length, sepal width, petal length, petal width). It is included directly in scikit-learn via load_iris(). Inspecting the .data shape, .target values, and .feature_names is always the first step when working with a new dataset. The scatter plot below reveals that petal measurements separate the three species more cleanly than sepal measurements, which is a visual hint about which features will be most predictive.
from sklearn.datasets import load_iris
iris = load_iris()
n_samples, n_features = iris.data.shape
print(iris.keys())
print((n_samples, n_features))
print(iris.data.shape)
print(iris.target.shape)
print(iris.target_names)
print(iris.feature_names)
import numpy as np
import matplotlib.pyplot as plt
# 'sepal width (cm)'
x_index = 1
# 'petal length (cm)'
y_index = 2
# this formatter will label the colorbar with the correct target names
formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])
plt.scatter(iris.data[:, x_index], iris.data[:, y_index],
c=iris.target, cmap=plt.cm.get_cmap('RdYlBu', 3))
plt.colorbar(ticks=[0, 1, 2], format=formatter)
plt.clim(-0.5, 2.5)
plt.xlabel(iris.feature_names[x_index])
plt.ylabel(iris.feature_names[y_index]);
K-Nearest Neighbors Classifier
The K-Nearest Neighbors (KNN) algorithm can be used for either classification or regression. In both cases, the input consists of the k closest training examples in the feature space. Given a new, unknown observation, look up which training points have the closest features and assign the predominant class.
from sklearn import neighbors, datasets
iris = datasets.load_iris()
X, y = iris.data, iris.target
# create the model
knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform')
# fit the model
knn.fit(X, y)
# What kind of iris has 3cm x 5cm sepal and 4cm x 2cm petal?
X_pred = [3, 5, 4, 2]
result = knn.predict([X_pred, ])
print(iris.target_names[result])
print(iris.target_names)
print(knn.predict_proba([X_pred, ]))
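To see what is happening under the hood, here is a small NumPy sketch of the KNN decision rule itself: Euclidean distances plus a majority vote over the k nearest training points. This is only an illustration; scikit-learn's actual implementation uses optimized neighbor searches (e.g. KD-trees) rather than a brute-force sort.
def knn_predict(X_train, y_train, x_new, k=5):
    # Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # labels of the k nearest neighbors
    nearest_labels = y_train[np.argsort(distances)[:k]]
    # majority vote among those labels
    return np.bincount(nearest_labels).argmax()

# should agree with knn.predict above for this query point
print(iris.target_names[knn_predict(X, y, np.array(X_pred))])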
from fig_code import plot_iris_knn
plot_iris_knn()
Note that we see overfitting in the K-Nearest Neighbors model above. We'll be addressing overfitting and model validation in a later notebook.
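As a quick preview, one way to expose overfitting is to hold out part of the data and score the model on samples it never saw during training. Here is a minimal sketch; the 70/30 split and random_state are arbitrary choices, and train_test_split lives in sklearn.model_selection in recent scikit-learn versions.
from sklearn.model_selection import train_test_split

# hold out 30% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_train, y_train))  # accuracy on data the model has seen
print(knn.score(X_test, y_test))    # accuracy on held-out data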