Anomaly Detection

📹 Related Topic (covered in various lectures)

Detecting Unusual Patterns in Data

What is Anomaly Detection?

Goal: Identify data points that deviate significantly from the norm

Applications:

  • Fraud Detection: Unusual credit card transactions

  • Manufacturing: Defective products on assembly line

  • System Monitoring: Server failures, network intrusions

  • Healthcare: Abnormal patient vitals

Gaussian (Normal) Distribution Approach

Single Feature:

p(x) = (1/√(2πσ²)) exp(-(x-μ)²/(2σ²))

Multivariate Gaussian:

p(x) = (1/((2π)^(n/2) |Σ|^(1/2))) exp(-½ (x-μ)ᵀ Σ⁻¹ (x-μ))

Parameters:

  • μ: Mean vector

  • Σ: Covariance matrix
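The multivariate density above can be evaluated directly with SciPy's `multivariate_normal`; a minimal sketch with made-up parameters (the mean and covariance here are arbitrary toy values):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy parameters: 2-D Gaussian with correlated features
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])

rv = multivariate_normal(mean=mu, cov=Sigma)

# Density is highest at the mean and drops off for outlying points
p_center = rv.pdf([0.0, 0.0])
p_far = rv.pdf([3.0, -3.0])
print(p_center, p_far)  # p_center is far larger than p_far
```

Because Σ has off-diagonal terms, this density captures feature correlations that the per-feature (independent) model cannot.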

Anomaly Detection Algorithm

Training:

  1. Choose features xᵢ that might be indicative of anomalies

  2. Fit parameters μ, σ² (or Σ for multivariate)

    μⱼ = (1/m) Σᵢ xⱼ⁽ⁱ⁾
    σⱼ² = (1/m) Σᵢ (xⱼ⁽ⁱ⁾ - μⱼ)²
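The closed-form estimates above are one-liners in NumPy; a quick sketch on synthetic data (the loc/scale values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))  # m=1000 examples, n=3 features

# Closed-form maximum-likelihood estimates, matching the formulas above
mu = X.mean(axis=0)      # μⱼ = (1/m) Σᵢ xⱼ⁽ⁱ⁾
sigma2 = X.var(axis=0)   # σⱼ² = (1/m) Σᵢ (xⱼ⁽ⁱ⁾ - μⱼ)²

print(mu.round(2), sigma2.round(2))  # roughly [5, 5, 5] and [4, 4, 4]
```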
    

Detection:

  1. Compute p(x_test)

  2. If p(x_test) < ε, flag as anomaly

  3. Choose ε using a labeled cross-validation set (e.g. pick the value maximizing F1)
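A minimal sketch of the threshold-selection step: scan candidate ε values and keep the one with the best F1 on a small labeled validation set (all data here is synthetic, and the Gaussian fit is assumed to be μ=0, σ=1 for simplicity):

```python
import numpy as np
from scipy.stats import norm
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)
# Labeled validation set: mostly normal points plus a few anomalies
X_val = np.concatenate([rng.normal(0, 1, 200), rng.normal(6, 1, 10)])
y_val = np.concatenate([np.zeros(200), np.ones(10)])

# Densities under the Gaussian fit to normal training data
p_val = norm.pdf(X_val, loc=0.0, scale=1.0)

# Scan candidate thresholds; keep the one maximizing F1
best_eps, best_f1 = None, -1.0
for eps in np.linspace(p_val.min(), p_val.max(), 1000):
    f1 = f1_score(y_val, (p_val < eps).astype(int), zero_division=0)
    if f1 > best_f1:
        best_eps, best_f1 = eps, f1
print(best_eps, best_f1)
```

F1 is the right criterion here because the validation set is heavily imbalanced, so accuracy would favor a trivially high threshold.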

Anomaly Detection vs Supervised Learning

Use Anomaly Detection when:

  • Very small number of positive (anomalous) examples

  • Large number of negative (normal) examples

  • Many different “types” of anomalies

  • Future anomalies may look nothing like current ones

Use Supervised Learning when:

  • Large number of both positive and negative examples

  • Future positive examples likely similar to training set

  • Can learn from positive examples

Examples:

Application              Approach
---------------------    -------------------
Fraud detection          Anomaly Detection
Manufacturing defects    Anomaly Detection
Email spam               Supervised Learning
Disease classification   Supervised Learning

Choosing Features

Good features:

  • Take on unusually large or small values for anomalies

  • Transform non-Gaussian features: log(x), √x, x^(1/3)

Feature combinations:

  • CPU load / network traffic

  • (CPU load)² / network traffic
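A quick illustration of the transformation tip above: applying log(x) to a right-skewed feature (simulated here with a lognormal draw, a common model for quantities like transaction amounts) brings its skewness close to zero:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# A heavily right-skewed feature, e.g. transaction amounts
x = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

# The log transform pulls in the long tail, making the feature near-Gaussian
x_log = np.log(x)

print(skew(x), skew(x_log))  # large positive skew -> near zero
```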

Alternative Methods

1. One-Class SVM

  • Learns a boundary around normal data

  • No need to model p(x) explicitly

2. Isolation Forest

  • Ensemble of decision trees

  • Anomalies are easier to isolate (fewer splits needed)

3. Local Outlier Factor (LOF)

  • Density-based approach

  • Compares local density to neighbors

4. Autoencoders

  • Neural network approach

  • Anomalies have high reconstruction error
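As a brief illustration of method 3, scikit-learn's `LocalOutlierFactor` can be run on a toy cluster-plus-outliers dataset (the cluster parameters and contamination value here are arbitrary):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
# Dense normal cluster plus a few points placed well outside it
X_inliers = rng.normal(0, 0.5, size=(200, 2))
X_outliers = rng.uniform(3, 6, size=(10, 2))
X = np.vstack([X_inliers, X_outliers])

# LOF compares each point's local density to that of its neighbors;
# fit_predict returns -1 for outliers and 1 for inliers
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(X)

n_flagged = np.sum(labels == -1)
print(n_flagged)
```

Because LOF is density-relative, it can catch outliers near a sparse cluster that a global density threshold would miss.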

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.covariance import EllipticEnvelope
from sklearn.metrics import classification_report, confusion_matrix
from scipy.stats import multivariate_normal
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

print("Libraries loaded!")

10.1 Gaussian Anomaly Detection

What: Building a density-based anomaly detector from scratch

We model normal data with independent Gaussian distributions for each feature: \(p(x) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\sigma_j} \exp\!\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)\). A new point \(x\) is flagged as anomalous if \(p(x) < \varepsilon\), where \(\varepsilon\) is a threshold chosen on a validation set. The detector is trained only on normal examples – anomalies are defined as points that are unlikely under the learned distribution.

Why: The simplest probabilistic approach to anomaly detection

Gaussian anomaly detection is the baseline method for most anomaly detection tasks because it is fast (parameter estimation is closed-form), interpretable (you can inspect which features contributed to the low probability), and works well when features are approximately independent and normally distributed. Understanding this baseline is essential before moving to more complex methods like Isolation Forests or autoencoders, which relax the Gaussianity and independence assumptions.

class GaussianAnomalyDetector:
    def __init__(self, epsilon=0.01):
        self.epsilon = epsilon
        self.mu = None
        self.sigma = None
    
    def fit(self, X):
        self.mu = np.mean(X, axis=0)
        self.sigma = np.std(X, axis=0)
        return self
    
    def predict_proba(self, X):
        # Independent Gaussian for each feature
        p = np.ones(X.shape[0])
        for j in range(X.shape[1]):
            p *= (1 / (np.sqrt(2 * np.pi) * self.sigma[j])) * \
                np.exp(-((X[:, j] - self.mu[j])**2) / (2 * self.sigma[j]**2))
        return p
    
    def predict(self, X):
        p = self.predict_proba(X)
        return (p < self.epsilon).astype(int)  # 1 = anomaly

# Generate data with anomalies
np.random.seed(42)
X_normal = np.random.randn(300, 2) * 0.5 + np.array([0, 0])
X_anomaly = np.random.randn(20, 2) * 0.3 + np.array([3, 3])
X = np.vstack([X_normal, X_anomaly])
y_true = np.hstack([np.zeros(300), np.ones(20)])

# Fit detector
detector = GaussianAnomalyDetector(epsilon=0.005)
detector.fit(X_normal)  # Train only on normal data
y_pred = detector.predict(X)

# Visualize
plt.figure(figsize=(12, 8))

# Plot normal and detected
plt.scatter(X[y_pred == 0, 0], X[y_pred == 0, 1], 
           c='blue', marker='o', s=50, alpha=0.6, label='Normal')
plt.scatter(X[y_pred == 1, 0], X[y_pred == 1, 1], 
           c='red', marker='x', s=100, linewidths=2, label='Detected Anomaly')

# Plot decision boundary
xx, yy = np.meshgrid(np.linspace(-3, 5, 200), np.linspace(-3, 5, 200))
Z = detector.predict_proba(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contour(xx, yy, Z, levels=[detector.epsilon], colors='green', linewidths=2)

plt.xlabel('Feature 1', fontsize=12)
plt.ylabel('Feature 2', fontsize=12)
plt.title('Gaussian Anomaly Detection', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\nDetected {np.sum(y_pred)} anomalies")
print(f"True anomalies: {np.sum(y_true)}")
print(f"Accuracy: {np.mean(y_pred == y_true):.2%}")

10.2 Isolation Forest

What: Comparing tree-based and SVM-based anomaly detection methods

Isolation Forest detects anomalies by building an ensemble of random decision trees that recursively split the feature space. The key insight is that anomalies, being rare and different, require fewer random splits to isolate than normal points – their average path length in the tree is shorter. One-Class SVM takes a different approach: it learns a tight boundary (using a kernel function) around the normal data in a high-dimensional feature space.

Why: Handling non-Gaussian and non-linear anomaly patterns

Real-world anomalies rarely follow the neat Gaussian assumption. Isolation Forest requires no distributional assumptions and scales well with data size (training is roughly O(n log n)), making it the go-to method for large-scale anomaly detection in production systems. One-Class SVM is more flexible with kernel choices but slower and more sensitive to hyperparameters (\(\nu\), \(\gamma\)). Comparing these methods side-by-side reveals how different inductive biases affect detection accuracy.

# Isolation Forest
iso_forest = IsolationForest(contamination=0.1, random_state=42)
y_pred_iso = iso_forest.fit_predict(X)
y_pred_iso = (y_pred_iso == -1).astype(int)  # Convert to 0/1

# One-Class SVM
ocsvm = OneClassSVM(nu=0.1, gamma='auto')
y_pred_svm = ocsvm.fit_predict(X)
y_pred_svm = (y_pred_svm == -1).astype(int)

# Compare methods
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

methods = [
    ('Gaussian', y_pred),
    ('Isolation Forest', y_pred_iso),
    ('One-Class SVM', y_pred_svm)
]

for idx, (name, preds) in enumerate(methods):
    axes[idx].scatter(X[preds == 0, 0], X[preds == 0, 1],
                     c='blue', marker='o', s=50, alpha=0.6, label='Normal')
    axes[idx].scatter(X[preds == 1, 0], X[preds == 1, 1],
                     c='red', marker='x', s=100, linewidths=2, label='Anomaly')
    
    acc = np.mean(preds == y_true)
    axes[idx].set_xlabel('Feature 1', fontsize=12)
    axes[idx].set_ylabel('Feature 2', fontsize=12)
    axes[idx].set_title(f'{name}\nAccuracy: {acc:.2%}', fontsize=13, fontweight='bold')
    axes[idx].legend(fontsize=10)
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

10.3 Evaluation Metrics

What: Measuring anomaly detection performance with precision, recall, and F1

Standard accuracy is misleading for anomaly detection because the classes are heavily imbalanced (e.g., 95% normal, 5% anomalous – a model that predicts “normal” for everything achieves 95% accuracy). Instead, we use precision (what fraction of detected anomalies are real?), recall (what fraction of real anomalies are detected?), and the F1-score (their harmonic mean). The tradeoff between precision and recall is controlled by the detection threshold \(\varepsilon\).
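One way to see that tradeoff explicitly is scikit-learn's `precision_recall_curve`, sketched here on synthetic anomaly scores (higher score = more anomalous, e.g. −log p(x); the score distributions are invented for illustration):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
# Anomaly scores: normal points score low, anomalies score high
scores = np.concatenate([rng.normal(0, 1, 500),   # normal points
                         rng.normal(4, 1, 25)])   # anomalies
y = np.concatenate([np.zeros(500), np.ones(25)])

# Precision and recall at every candidate threshold
precision, recall, thresholds = precision_recall_curve(y, scores)

# Lowering the threshold raises recall but lowers precision
for p, r, t in list(zip(precision, recall, thresholds))[::100]:
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Scanning this curve against your application's cost structure is how ε (or the contamination parameter) should actually be chosen.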

Why: Imbalanced evaluation is critical in real-world anomaly detection

In fraud detection, a false negative (missed fraud) can cost thousands of dollars, while a false positive (blocking a legitimate transaction) causes inconvenience. In manufacturing quality control, the costs are reversed. Understanding how to evaluate and tune the precision-recall tradeoff for your specific cost structure is what separates a useful anomaly detection system from a toy experiment.

from sklearn.metrics import precision_score, recall_score, f1_score

print("Performance Comparison:\n")
print(f"{'Method':<20} {'Precision':<12} {'Recall':<12} {'F1-Score':<12}")
print("-" * 60)

for name, preds in methods:
    prec = precision_score(y_true, preds)
    rec = recall_score(y_true, preds)
    f1 = f1_score(y_true, preds)
    print(f"{name:<20} {prec:<12.3f} {rec:<12.3f} {f1:<12.3f}")

10.4 Credit Card Fraud Detection (Simulated)

What: End-to-end anomaly detection on a realistic fraud scenario

We simulate a credit card transaction dataset where normal transactions cluster around moderate amounts during business hours, while fraudulent transactions tend to involve unusually large amounts at unusual hours (late night). An Isolation Forest is trained on the full dataset (unsupervised – no labels needed) with a contamination parameter estimating the fraction of anomalies.

Why: Connecting theory to a real-world application

Credit card fraud detection is one of the most impactful applications of anomaly detection, processing billions of transactions daily at companies like Visa and Mastercard. The simulation captures two key challenges of real fraud detection: (1) the extreme class imbalance (fraudulent transactions are less than 1% of total volume), and (2) the fact that fraud patterns evolve over time, so a purely supervised approach would miss novel fraud strategies. Anomaly detection provides an unsupervised safety net that can flag previously unseen fraud patterns.

# Simulate credit card transactions
np.random.seed(42)
n_normal = 1000
n_fraud = 50

# Normal transactions
amounts_normal = np.abs(np.random.randn(n_normal) * 50 + 100)
times_normal = np.random.rand(n_normal) * 24
X_normal_cc = np.column_stack([amounts_normal, times_normal])

# Fraudulent transactions (unusual amounts and times)
amounts_fraud = np.abs(np.random.randn(n_fraud) * 100 + 500)
times_fraud = (np.random.choice([0, 1, 2, 3, 23], n_fraud) + np.random.randn(n_fraud) * 0.5) % 24  # late-night hours, kept in [0, 24)
X_fraud_cc = np.column_stack([amounts_fraud, times_fraud])

X_cc = np.vstack([X_normal_cc, X_fraud_cc])
y_cc = np.hstack([np.zeros(n_normal), np.ones(n_fraud)])

# Train Isolation Forest
iso_cc = IsolationForest(contamination=0.05, random_state=42)
y_pred_cc = iso_cc.fit_predict(X_cc)
y_pred_cc = (y_pred_cc == -1).astype(int)

# Visualize
plt.figure(figsize=(12, 8))
plt.scatter(X_cc[y_pred_cc == 0, 0], X_cc[y_pred_cc == 0, 1],
           c='green', marker='o', s=30, alpha=0.5, label='Normal')
plt.scatter(X_cc[y_pred_cc == 1, 0], X_cc[y_pred_cc == 1, 1],
           c='red', marker='x', s=100, linewidths=2, label='Fraud (Detected)')

plt.xlabel('Transaction Amount ($)', fontsize=12)
plt.ylabel('Hour of Day', fontsize=12)
plt.title('Credit Card Fraud Detection', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nFraud Detection Results:")
print(f"Precision: {precision_score(y_cc, y_pred_cc):.3f}")
print(f"Recall: {recall_score(y_cc, y_pred_cc):.3f}")
print(f"F1-Score: {f1_score(y_cc, y_pred_cc):.3f}")

Key Takeaways

1. Gaussian Method

  • Simple and interpretable

  • Assumes features are independent

  • Works well for normal distributions

  • Fast to train and predict

2. Isolation Forest

  • Tree-based, handles non-linear patterns

  • Fast and scalable

  • No assumptions about distribution

  • Good default choice

3. One-Class SVM

  • Kernel methods for complex boundaries

  • Slower than Isolation Forest

  • Requires parameter tuning (nu, gamma)

4. Evaluation

  • Precision: What % of detected anomalies are real?

  • Recall: What % of real anomalies are detected?

  • F1-Score: Harmonic mean of precision and recall

  • Trade-off between false positives and false negatives

5. When to Use

  • Gaussian: Fast baseline, interpretable

  • Isolation Forest: General-purpose, scalable

  • One-Class SVM: When you need non-linear boundaries

  • Deep learning: Very high-dimensional data (autoencoders)

Practice Exercises

  1. Implement multivariate Gaussian detector

  2. Apply to real credit card fraud dataset

  3. Build autoencoder for anomaly detection

  4. Compare contamination parameter values

  5. Time series anomaly detection

References

  1. CS229 Lecture Notes on Anomaly Detection

  2. “Isolation Forest” - Liu et al. (2008)

  3. “Anomaly Detection: A Survey” - Chandola et al. (2009)

Next: Lecture 11: Learning Theory