Anomaly Detection

📹 Related Topic (covered in various lectures)

Detecting Unusual Patterns in Data

What is Anomaly Detection?

Goal: Identify data points that deviate significantly from the norm

Applications:

  • Fraud Detection: Unusual credit card transactions

  • Manufacturing: Defective products on assembly line

  • System Monitoring: Server failures, network intrusions

  • Healthcare: Abnormal patient vitals

Gaussian (Normal) Distribution Approach

Single Feature:

p(x) = (1/√(2πσ²)) exp(-(x-μ)²/(2σ²))

Multivariate Gaussian:

p(x) = (1/((2π)^(n/2) |Σ|^(1/2))) exp(-½ (x-μ)ᵀ Σ⁻¹ (x-μ))

Parameters:

  • μ: Mean vector

  • Σ: Covariance matrix
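The multivariate density above can be evaluated directly with SciPy's `multivariate_normal`; a minimal sketch with made-up parameters (the mean and covariance here are arbitrary toy values):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy parameters: 2-D Gaussian with correlated features
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])

rv = multivariate_normal(mean=mu, cov=Sigma)

# Density is highest at the mean and drops off for outlying points
p_center = rv.pdf([0.0, 0.0])
p_far = rv.pdf([3.0, -3.0])
print(p_center, p_far)  # p_center is far larger than p_far
```

Because Σ has off-diagonal terms, this density captures feature correlations that the per-feature (independent) model cannot.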

Anomaly Detection Algorithm

Training:

  1. Choose features xᵢ that might be indicative of anomalies

  2. Fit parameters μ, σ² (or Σ for multivariate)

    μⱼ = (1/m) Σᵢ xⱼ⁽ⁱ⁾
    σⱼ² = (1/m) Σᵢ (xⱼ⁽ⁱ⁾ - μⱼ)²
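The closed-form estimates above are one-liners in NumPy; a quick sketch on synthetic data (the loc/scale values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))  # m=1000 examples, n=3 features

# Closed-form maximum-likelihood estimates, matching the formulas above
mu = X.mean(axis=0)      # μⱼ = (1/m) Σᵢ xⱼ⁽ⁱ⁾
sigma2 = X.var(axis=0)   # σⱼ² = (1/m) Σᵢ (xⱼ⁽ⁱ⁾ - μⱼ)²

print(mu.round(2), sigma2.round(2))  # roughly [5, 5, 5] and [4, 4, 4]
```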
    

Detection:

  1. Compute p(x_test)

  2. If p(x_test) < ε, flag as anomaly

  3. Choose ε using a labeled cross-validation set (e.g. pick the value maximizing F1)
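A minimal sketch of the threshold-selection step: scan candidate ε values and keep the one with the best F1 on a small labeled validation set (all data here is synthetic, and the Gaussian fit is assumed to be μ=0, σ=1 for simplicity):

```python
import numpy as np
from scipy.stats import norm
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)
# Labeled validation set: mostly normal points plus a few anomalies
X_val = np.concatenate([rng.normal(0, 1, 200), rng.normal(6, 1, 10)])
y_val = np.concatenate([np.zeros(200), np.ones(10)])

# Densities under the Gaussian fit to normal training data
p_val = norm.pdf(X_val, loc=0.0, scale=1.0)

# Scan candidate thresholds; keep the one maximizing F1
best_eps, best_f1 = None, -1.0
for eps in np.linspace(p_val.min(), p_val.max(), 1000):
    f1 = f1_score(y_val, (p_val < eps).astype(int), zero_division=0)
    if f1 > best_f1:
        best_eps, best_f1 = eps, f1
print(best_eps, best_f1)
```

F1 is the right criterion here because the validation set is heavily imbalanced, so accuracy would favor a trivially high threshold.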

Anomaly Detection vs Supervised Learning

Use Anomaly Detection when:

  • Very small number of positive (anomalous) examples

  • Large number of negative (normal) examples

  • Many different “types” of anomalies

  • Future anomalies may look nothing like current ones

Use Supervised Learning when:

  • Large number of both positive and negative examples

  • Future positive examples likely similar to training set

  • Can learn from positive examples

Examples:

Application              Approach
---------------------    -------------------
Fraud detection          Anomaly Detection
Manufacturing defects    Anomaly Detection
Email spam               Supervised Learning
Disease classification   Supervised Learning

Choosing Features

Good features:

  • Take on unusually large or small values for anomalies

  • Transform non-Gaussian features: log(x), √x, x^(1/3)

Feature combinations:

  • CPU load / network traffic

  • (CPU load)² / network traffic
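A quick illustration of the transformation tip above: applying log(x) to a right-skewed feature (simulated here with a lognormal draw, a common model for quantities like transaction amounts) brings its skewness close to zero:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# A heavily right-skewed feature, e.g. transaction amounts
x = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

# The log transform pulls in the long tail, making the feature near-Gaussian
x_log = np.log(x)

print(skew(x), skew(x_log))  # large positive skew -> near zero
```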

Alternative Methods

1. One-Class SVM

  • Learns a boundary around normal data

  • No need to model p(x) explicitly

2. Isolation Forest

  • Ensemble of decision trees

  • Anomalies are easier to isolate (fewer splits needed)

3. Local Outlier Factor (LOF)

  • Density-based approach

  • Compares local density to neighbors

4. Autoencoders

  • Neural network approach

  • Anomalies have high reconstruction error
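As a brief illustration of method 3, scikit-learn's `LocalOutlierFactor` can be run on a toy cluster-plus-outliers dataset (the cluster parameters and contamination value here are arbitrary):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
# Dense normal cluster plus a few points placed well outside it
X_inliers = rng.normal(0, 0.5, size=(200, 2))
X_outliers = rng.uniform(3, 6, size=(10, 2))
X = np.vstack([X_inliers, X_outliers])

# LOF compares each point's local density to that of its neighbors;
# fit_predict returns -1 for outliers and 1 for inliers
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(X)

n_flagged = np.sum(labels == -1)
print(n_flagged)
```

Because LOF is density-relative, it can catch outliers near a sparse cluster that a global density threshold would miss.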

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.covariance import EllipticEnvelope
from sklearn.metrics import classification_report, confusion_matrix
from scipy.stats import multivariate_normal
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

print("Libraries loaded!")

10.1 Gaussian Anomaly Detection

What: Building a density-based anomaly detector from scratch

We model normal data with independent Gaussian distributions for each feature: \(p(x) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\sigma_j} \exp\!\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)\). A new point \(x\) is flagged as anomalous if \(p(x) < \varepsilon\), where \(\varepsilon\) is a threshold chosen on a validation set. The detector is trained only on normal examples – anomalies are defined as points that are unlikely under the learned distribution.

Why: The simplest probabilistic approach to anomaly detection

Gaussian anomaly detection is the baseline method for most anomaly detection tasks because it is fast (parameter estimation is closed-form), interpretable (you can inspect which features contributed to the low probability), and works well when features are approximately independent and normally distributed. Understanding this baseline is essential before moving to more complex methods like Isolation Forests or autoencoders, which relax the Gaussianity and independence assumptions.

class GaussianAnomalyDetector:
    def __init__(self, epsilon=0.01):
        self.epsilon = epsilon
        self.mu = None
        self.sigma = None
    
    def fit(self, X):
        self.mu = np.mean(X, axis=0)
        self.sigma = np.std(X, axis=0)
        return self
    
    def predict_proba(self, X):
        # Independent Gaussian for each feature
        p = np.ones(X.shape[0])
        for j in range(X.shape[1]):
            p *= (1 / (np.sqrt(2 * np.pi) * self.sigma[j])) * \
                np.exp(-((X[:, j] - self.mu[j])**2) / (2 * self.sigma[j]**2))
        return p
    
    def predict(self, X):
        p = self.predict_proba(X)
        return (p < self.epsilon).astype(int)  # 1 = anomaly

# Generate data with anomalies
np.random.seed(42)
X_normal = np.random.randn(300, 2) * 0.5 + np.array([0, 0])
X_anomaly = np.random.randn(20, 2) * 0.3 + np.array([3, 3])
X = np.vstack([X_normal, X_anomaly])
y_true = np.hstack([np.zeros(300), np.ones(20)])

# Fit detector
detector = GaussianAnomalyDetector(epsilon=0.005)
detector.fit(X_normal)  # Train only on normal data
y_pred = detector.predict(X)

# Visualize
plt.figure(figsize=(12, 8))

# Plot normal and detected
plt.scatter(X[y_pred == 0, 0], X[y_pred == 0, 1], 
           c='blue', marker='o', s=50, alpha=0.6, label='Normal')
plt.scatter(X[y_pred == 1, 0], X[y_pred == 1, 1], 
           c='red', marker='x', s=100, linewidths=2, label='Detected Anomaly')

# Plot decision boundary
xx, yy = np.meshgrid(np.linspace(-3, 5, 200), np.linspace(-3, 5, 200))
Z = detector.predict_proba(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contour(xx, yy, Z, levels=[detector.epsilon], colors='green', linewidths=2)

plt.xlabel('Feature 1', fontsize=12)
plt.ylabel('Feature 2', fontsize=12)
plt.title('Gaussian Anomaly Detection', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\nDetected {np.sum(y_pred)} anomalies")
print(f"True anomalies: {np.sum(y_true)}")
print(f"Accuracy: {np.mean(y_pred == y_true):.2%}")

10.2 Isolation Forest

What: Comparing tree-based and SVM-based anomaly detection methods

Isolation Forest detects anomalies by building an ensemble of random decision trees that recursively split the feature space. The key insight is that anomalies, being rare and different, require fewer random splits to isolate than normal points – their average path length in the tree is shorter. One-Class SVM takes a different approach: it learns a tight boundary (using a kernel function) around the normal data in a high-dimensional feature space.

Why: Handling non-Gaussian and non-linear anomaly patterns

Real-world anomalies rarely follow the neat Gaussian assumption. Isolation Forest requires no distributional assumptions and scales well with data size (training is roughly O(n log n)), making it the go-to method for large-scale anomaly detection in production systems. One-Class SVM is more flexible with kernel choices but slower and more sensitive to hyperparameters (\(\nu\), \(\gamma\)). Comparing these methods side-by-side reveals how different inductive biases affect detection accuracy.

# Isolation Forest
iso_forest = IsolationForest(contamination=0.1, random_state=42)
y_pred_iso = iso_forest.fit_predict(X)
y_pred_iso = (y_pred_iso == -1).astype(int)  # Convert to 0/1

# One-Class SVM
ocsvm = OneClassSVM(nu=0.1, gamma='auto')
y_pred_svm = ocsvm.fit_predict(X)
y_pred_svm = (y_pred_svm == -1).astype(int)

# Compare methods
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

methods = [
    ('Gaussian', y_pred),
    ('Isolation Forest', y_pred_iso),
    ('One-Class SVM', y_pred_svm)
]

for idx, (name, preds) in enumerate(methods):
    axes[idx].scatter(X[preds == 0, 0], X[preds == 0, 1],
                     c='blue', marker='o', s=50, alpha=0.6, label='Normal')
    axes[idx].scatter(X[preds == 1, 0], X[preds == 1, 1],
                     c='red', marker='x', s=100, linewidths=2, label='Anomaly')
    
    acc = np.mean(preds == y_true)
    axes[idx].set_xlabel('Feature 1', fontsize=12)
    axes[idx].set_ylabel('Feature 2', fontsize=12)
    axes[idx].set_title(f'{name}\nAccuracy: {acc:.2%}', fontsize=13, fontweight='bold')
    axes[idx].legend(fontsize=10)
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

10.3 Evaluation Metrics

What: Measuring anomaly detection performance with precision, recall, and F1

Standard accuracy is misleading for anomaly detection because the classes are heavily imbalanced (e.g., 95% normal, 5% anomalous – a model that predicts “normal” for everything achieves 95% accuracy). Instead, we use precision (what fraction of detected anomalies are real?), recall (what fraction of real anomalies are detected?), and the F1-score (their harmonic mean). The tradeoff between precision and recall is controlled by the detection threshold \(\varepsilon\).
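One way to see that tradeoff explicitly is scikit-learn's `precision_recall_curve`, sketched here on synthetic anomaly scores (higher score = more anomalous, e.g. −log p(x); the score distributions are invented for illustration):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
# Anomaly scores: normal points score low, anomalies score high
scores = np.concatenate([rng.normal(0, 1, 500),   # normal points
                         rng.normal(4, 1, 25)])   # anomalies
y = np.concatenate([np.zeros(500), np.ones(25)])

# Precision and recall at every candidate threshold
precision, recall, thresholds = precision_recall_curve(y, scores)

# Lowering the threshold raises recall but lowers precision
for p, r, t in list(zip(precision, recall, thresholds))[::100]:
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Scanning this curve against your application's cost structure is how ε (or the contamination parameter) should actually be chosen.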

Why: Imbalanced evaluation is critical in real-world anomaly detection

In fraud detection, a false negative (missed fraud) can cost thousands of dollars, while a false positive (blocking a legitimate transaction) causes inconvenience. In manufacturing quality control, the costs are reversed. Understanding how to evaluate and tune the precision-recall tradeoff for your specific cost structure is what separates a useful anomaly detection system from a toy experiment.

from sklearn.metrics import precision_score, recall_score, f1_score

print("Performance Comparison:\n")
print(f"{'Method':<20} {'Precision':<12} {'Recall':<12} {'F1-Score':<12}")
print("-" * 60)

for name, preds in methods:
    prec = precision_score(y_true, preds)
    rec = recall_score(y_true, preds)
    f1 = f1_score(y_true, preds)
    print(f"{name:<20} {prec:<12.3f} {rec:<12.3f} {f1:<12.3f}")

10.4 Credit Card Fraud Detection (Simulated)

What: End-to-end anomaly detection on a realistic fraud scenario

We simulate a credit card transaction dataset where normal transactions cluster around moderate amounts during business hours, while fraudulent transactions tend to involve unusually large amounts at unusual hours (late night). An Isolation Forest is trained on the full dataset (unsupervised – no labels needed) with a contamination parameter estimating the fraction of anomalies.

Why: Connecting theory to a real-world application

Credit card fraud detection is one of the most impactful applications of anomaly detection, processing billions of transactions daily at companies like Visa and Mastercard. The simulation captures two key challenges of real fraud detection: (1) the extreme class imbalance (fraudulent transactions are less than 1% of total volume), and (2) the fact that fraud patterns evolve over time, so a purely supervised approach would miss novel fraud strategies. Anomaly detection provides an unsupervised safety net that can flag previously unseen fraud patterns.

# Simulate credit card transactions
np.random.seed(42)
n_normal = 1000
n_fraud = 50

# Normal transactions
amounts_normal = np.abs(np.random.randn(n_normal) * 50 + 100)
times_normal = np.random.rand(n_normal) * 24
X_normal_cc = np.column_stack([amounts_normal, times_normal])

# Fraudulent transactions (unusual amounts and times)
amounts_fraud = np.abs(np.random.randn(n_fraud) * 100 + 500)
times_fraud = (np.random.choice([0, 1, 2, 3, 23], n_fraud) + np.random.randn(n_fraud) * 0.5) % 24  # late-night hours, kept in [0, 24)
X_fraud_cc = np.column_stack([amounts_fraud, times_fraud])

X_cc = np.vstack([X_normal_cc, X_fraud_cc])
y_cc = np.hstack([np.zeros(n_normal), np.ones(n_fraud)])

# Train Isolation Forest
iso_cc = IsolationForest(contamination=0.05, random_state=42)
y_pred_cc = iso_cc.fit_predict(X_cc)
y_pred_cc = (y_pred_cc == -1).astype(int)

# Visualize
plt.figure(figsize=(12, 8))
plt.scatter(X_cc[y_pred_cc == 0, 0], X_cc[y_pred_cc == 0, 1],
           c='green', marker='o', s=30, alpha=0.5, label='Normal')
plt.scatter(X_cc[y_pred_cc == 1, 0], X_cc[y_pred_cc == 1, 1],
           c='red', marker='x', s=100, linewidths=2, label='Fraud (Detected)')

plt.xlabel('Transaction Amount ($)', fontsize=12)
plt.ylabel('Hour of Day', fontsize=12)
plt.title('Credit Card Fraud Detection', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nFraud Detection Results:")
print(f"Precision: {precision_score(y_cc, y_pred_cc):.3f}")
print(f"Recall: {recall_score(y_cc, y_pred_cc):.3f}")
print(f"F1-Score: {f1_score(y_cc, y_pred_cc):.3f}")

Key Takeaways

1. Gaussian Method

  • Simple and interpretable

  • Assumes features are independent

  • Works well for normal distributions

  • Fast to train and predict

2. Isolation Forest

  • Tree-based, handles non-linear patterns

  • Fast and scalable

  • No assumptions about distribution

  • Good default choice

3. One-Class SVM

  • Kernel methods for complex boundaries

  • Slower than Isolation Forest

  • Requires parameter tuning (nu, gamma)

4. Evaluation

  • Precision: What % of detected anomalies are real?

  • Recall: What % of real anomalies are detected?

  • F1-Score: Harmonic mean of precision and recall

  • Trade-off between false positives and false negatives

5. When to Use

  • Gaussian: Fast baseline, interpretable

  • Isolation Forest: General-purpose, scalable

  • One-Class SVM: When you need non-linear boundaries

  • Deep learning: Very high-dimensional data (autoencoders)

Practice Exercises

  1. Implement multivariate Gaussian detector

  2. Apply to real credit card fraud dataset

  3. Build autoencoder for anomaly detection

  4. Compare contamination parameter values

  5. Time series anomaly detection

References

  1. CS229 Lecture Notes on Anomaly Detection

  2. “Isolation Forest” - Liu et al. (2008)

  3. “Anomaly Detection: A Survey” - Chandola et al. (2009)

Next: Lecture 11: Learning Theory