Object Detection: YOLO, DETR & Beyond

Bounding box detection, segmentation, YOLO, DETR, and Grounding DINO for real-world object detection tasks.

# Install dependencies
# !pip install ultralytics opencv-python pillow matplotlib

Bounding Box Fundamentals

Object Detection: Advanced Theory and Architecture Evolution

1. Problem Formulation

Object detection combines classification and localization:

Input: Image \(I \in \mathbb{R}^{H \times W \times 3}\)

Output: Set of detections \(\mathcal{D} = \{(b_i, c_i, p_i)\}_{i=1}^N\) where:

  • \(b_i = (x, y, w, h)\): Bounding box coordinates

  • \(c_i \in \{1, \ldots, C\}\): Class label

  • \(p_i \in [0, 1]\): Confidence score

Challenges:

  1. Variable number of objects per image

  2. Different object scales

  3. Occlusion and truncation

  4. Real-time inference requirements

2. R-CNN Family: Two-Stage Detectors

A. R-CNN (2014) - Regions with CNN

Pipeline:

  1. Selective Search: Generate ~2000 region proposals

  2. Warp: Resize each region to a fixed size (e.g., 227×227)

  3. CNN: Extract features with AlexNet/VGG

  4. Classify: SVM classifier for each class

  5. Regress: Bounding box refinement

Loss Function:

\[\mathcal{L} = \mathcal{L}_{\text{cls}}(p, c) + \lambda [c \geq 1] \mathcal{L}_{\text{loc}}(t, g)\]

where:

  • \(\mathcal{L}_{\text{cls}}\): Classification loss (cross-entropy)

  • \(\mathcal{L}_{\text{loc}}\): Localization loss (smooth L1)

  • \(t\): Predicted box offsets

  • \(g\): Ground truth offsets

  • \([c \geq 1]\): Indicator (only regress for objects, not background)

Box Parameterization:

\[t_x = (x - x_a) / w_a, \quad t_y = (y - y_a) / h_a\]
\[t_w = \log(w / w_a), \quad t_h = \log(h / h_a)\]

where \((x_a, y_a, w_a, h_a)\) is the anchor box.
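In code, this parameterization and its inverse can be sketched as follows (the names `encode_box`/`decode_box` are illustrative, not from any particular library):

```python
import numpy as np

def encode_box(box, anchor):
    """Encode an (x, y, w, h) box as offsets relative to an anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)])

def decode_box(t, anchor):
    """Invert the encoding: recover (x, y, w, h) from offsets."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return np.array([tx * wa + xa, ty * ha + ya,
                     wa * np.exp(tw), ha * np.exp(th)])

anchor = (50.0, 50.0, 32.0, 32.0)
gt_box = (58.0, 46.0, 40.0, 24.0)

t = encode_box(gt_box, anchor)
recovered = decode_box(t, anchor)
print(t)          # small, roughly zero-centered regression targets
print(recovered)  # round-trips back to the ground-truth box
```

The log parameterization keeps width and height positive after decoding and makes the targets roughly scale-invariant.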

Limitations:

  • Slow: ~47s per image

  • Multi-stage training (CNN, SVM, bbox regressor)

  • Disk-heavy feature caching

B. Fast R-CNN (2015)

Key Innovation: Share conv computation across proposals

Architecture:

Image → CNN (entire image) → RoI Pooling → FC layers → {cls, bbox}
          ↓
      Feature Map
          ↑
    Region Proposals (Selective Search)

RoI Pooling:

For region \(r\) with size \(h_r \times w_r\), divide into \(H \times W\) grid:

\[\text{RoI-Pool}(r, F) = \max_{(i,j) \in \text{bin}(h,w)} F[i, j]\]

Output: Fixed \(H \times W\) feature map regardless of input size.

Multi-task Loss:

\[\mathcal{L} = \mathcal{L}_{\text{cls}}(p, u) + \lambda [u \geq 1] \mathcal{L}_{\text{loc}}(t^u, v)\]

where \(u\) is true class and \(v\) is true box.

Smooth L1 Loss:

\[\mathcal{L}_{\text{loc}}(t, v) = \sum_{i \in \{x,y,w,h\}} \text{smooth}_{L1}(t_i - v_i)\]
\[\begin{split}\text{smooth}_{L1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}\end{split}\]
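The piecewise definition translates directly to NumPy; a minimal sketch:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear in the tails."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)

errors = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(smooth_l1(errors))  # [2.5, 0.125, 0., 0.125, 2.5]
```

The linear tails bound the gradient at ±1, so large regression errors (e.g. from mislabeled boxes) cannot dominate training the way they do under L2.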

Advantages:

  • 9Γ— faster training, 140Γ— faster inference than R-CNN

  • End-to-end training

  • Higher mAP

Remaining Bottleneck: Selective Search (2s per image)

C. Faster R-CNN (2015)

Key Innovation: Region Proposal Network (RPN)

RPN Architecture:

For each position on feature map, use k anchors with different scales/ratios:

\[\text{Anchors}: \{(w_i, h_i)\}_{i=1}^k\]

Common: 3 scales × 3 ratios = 9 anchors per location

RPN Outputs:

  • Objectness score: \(p_{\text{obj}} \in [0, 1]\) (is object?)

  • Box refinement: \((t_x, t_y, t_w, t_h)\)

RPN Loss:

\[\mathcal{L}_{\text{RPN}} = \frac{1}{N_{\text{cls}}} \sum_i \mathcal{L}_{\text{cls}}(p_i, p_i^*) + \frac{\lambda}{N_{\text{reg}}} \sum_i p_i^* \mathcal{L}_{\text{reg}}(t_i, t_i^*)\]

where \(p_i^* = 1\) if anchor is positive (IoU > 0.7 with GT).
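The anchor-labeling rule can be sketched as follows (the 0.3 negative threshold follows the Faster R-CNN paper; anchors in between are ignored during training):

```python
def label_anchors(ious, pos_thresh=0.7, neg_thresh=0.3):
    """ious: max IoU with any ground-truth box, one value per anchor.
    Returns 1 (positive), 0 (negative/background), or -1 (ignored)."""
    labels = []
    for iou in ious:
        if iou > pos_thresh:
            labels.append(1)
        elif iou < neg_thresh:
            labels.append(0)
        else:
            labels.append(-1)   # ambiguous: excluded from the loss
    return labels

print(label_anchors([0.85, 0.5, 0.1]))  # [1, -1, 0]
```

In practice the anchor with the highest IoU for each ground-truth box is also marked positive, so every object gets at least one positive anchor.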

Training Strategy (4-step alternating):

  1. Train RPN

  2. Train Fast R-CNN with RPN proposals

  3. Fine-tune RPN with fixed detector

  4. Fine-tune Fast R-CNN with fixed RPN

Modern: Joint end-to-end training with shared conv layers.

Performance: 200ms per image (GPU), 73.2% mAP (PASCAL VOC)

3. YOLO Family: One-Stage Detectors

A. YOLOv1 (2016) - You Only Look Once

Philosophy: Treat detection as regression problem.

Architecture:

  1. Divide image into \(S \times S\) grid (e.g., 7×7)

  2. Each cell predicts:

    • \(B\) bounding boxes (e.g., 2)

    • Box confidence: \(P(\text{Object}) \times \text{IoU}\)

    • \(C\) class probabilities

Output Tensor: \(S \times S \times (B \cdot 5 + C)\)

For \(S=7, B=2, C=20\): \(7 \times 7 \times 30\)
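A quick NumPy sketch of how this output tensor is laid out and decoded (random values stand in for network output):

```python
import numpy as np

S, B, C = 7, 2, 20
pred = np.random.rand(S, S, B * 5 + C)   # raw network output: (7, 7, 30)

cell = pred[3, 4]                        # one grid cell
boxes = cell[:B * 5].reshape(B, 5)       # B boxes: (x, y, w, h, confidence)
class_probs = cell[B * 5:]               # C conditional class probabilities

# Class-specific confidence = box confidence * class probability
scores = boxes[:, 4:5] * class_probs[np.newaxis, :]   # shape (B, C)
print(pred.shape, boxes.shape, scores.shape)
```

Because the \(C\) class probabilities are shared per cell (not per box), each cell can only commit to one object class, which is the root of YOLOv1's one-object-per-cell limitation noted below.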

Loss Function (Multi-part):

\[\mathcal{L} = \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} [(x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2]\]
\[+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} [(\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2]\]
\[+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} (C_i - \hat{C}_i)^2\]
\[+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} (C_i - \hat{C}_i)^2\]
\[+ \sum_{i=0}^{S^2} \mathbb{1}_i^{\text{obj}} \sum_{c \in \text{classes}} (p_i(c) - \hat{p}_i(c))^2\]

Weight terms:

  • \(\lambda_{\text{coord}} = 5\): Increase localization loss importance

  • \(\lambda_{\text{noobj}} = 0.5\): Decrease background confidence loss

  • \(\sqrt{w}, \sqrt{h}\): Make loss more sensitive to small box errors

Advantages:

  • Extremely fast: 45 FPS (real-time)

  • Global context (sees entire image)

  • Fewer false positives on background

Limitations:

  • Struggles with small objects (grid limitation)

  • Each cell can only detect one object

  • Lower mAP than two-stage methods

B. YOLOv2 / YOLO9000 (2016)

Improvements:

  1. Batch Normalization: After every conv layer (+2% mAP)

  2. High-Resolution Classifier: Pre-train on 448×448 instead of 224×224

  3. Anchor Boxes: Like Faster R-CNN (use k-means on dataset to find anchors)

  4. Multi-Scale Training: Train on {320, 352, …, 608} randomly

  5. Passthrough Layer: Concat high-res features for small objects

Dimension Priors:

Run k-means (k=5) on training boxes with IoU distance:

\[d(\text{box}, \text{centroid}) = 1 - \text{IoU}(\text{box}, \text{centroid})\]

Learns dataset-specific anchor shapes (e.g., tall for people, wide for cars).

C. YOLOv3 (2018)

Multi-Scale Predictions:

Detect at 3 scales using Feature Pyramid Network (FPN):

  • Large objects: 13Γ—13 grid (stride 32)

  • Medium objects: 26Γ—26 grid (stride 16)

  • Small objects: 52Γ—52 grid (stride 8)

9 anchors total: 3 per scale

Darknet-53 Backbone:

53 conv layers with residual connections (similar to ResNet).

Logistic Regression for Objectness:

Replace softmax with sigmoid:

\[P(\text{obj}) = \sigma(t_o)\]

Allows one box to belong to multiple classes (e.g., "Woman" + "Person").
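A small numeric illustration of the difference (the logits are made up):

```python
import numpy as np

logits = np.array([2.0, 1.5, -3.0])   # e.g. scores for "person", "woman", "car"

softmax = np.exp(logits) / np.exp(logits).sum()   # classes compete, sums to 1
sigmoid = 1 / (1 + np.exp(-logits))               # independent per-class scores

print(softmax.round(3))  # probability mass split between the two plausible classes
print(sigmoid.round(3))  # "person" and "woman" can both score above 0.5
```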

Performance: 51 ms (~20 FPS), 57.9% AP50 (COCO, 608×608 input)

D. YOLOv4 (2020) - Bag of Freebies/Specials

Bag of Freebies (no inference cost):

  • Mosaic data augmentation (4 images β†’ 1)

  • Self-adversarial training

  • CIoU loss (Complete IoU)

  • Label smoothing

Bag of Specials (slight cost):

  • Mish activation: \(\text{Mish}(x) = x \cdot \tanh(\ln(1 + e^x))\)

  • CSPDarknet53 backbone (Cross-Stage Partial)

  • SPP (Spatial Pyramid Pooling)

  • PAN (Path Aggregation Network)

CIoU Loss (Complete IoU):

\[\mathcal{L}_{\text{CIoU}} = 1 - \text{IoU} + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v\]

where:

  • \(\rho\): Euclidean distance between box centers

  • \(c\): Diagonal of smallest enclosing box

  • \(v = \frac{4}{\pi^2} (\arctan \frac{w^{gt}}{h^{gt}} - \arctan \frac{w}{h})^2\): Aspect ratio consistency

Performance: 65 FPS, 43.5% AP (COCO)

E. YOLOv5-v8 (Modern)

YOLOv8 Architecture (2023):

Input (640×640)
    ↓
CSPDarknet Backbone (feature extraction)
    ↓
C2f modules (faster C3)
    ↓
PAN-FPN Neck (multi-scale fusion)
    ↓
Decoupled Head (separate cls/box branches)
    ↓
{bbox, objectness, class} predictions

Anchor-Free Detection:

Direct regression of box coordinates from grid cells (no predefined anchors).
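A minimal sketch of this style of decoding, in the spirit of FCOS/YOLOv8 heads that regress distances to the four box edges in stride units (details such as distribution focal loss are omitted, and the function name is illustrative):

```python
def decode_anchor_free(cx, cy, ltrb, stride):
    """Turn predicted (left, top, right, bottom) edge distances, in
    stride units, into an (x_min, y_min, x_max, y_max) box."""
    l, t, r, b = ltrb
    return (cx - l * stride, cy - t * stride,
            cx + r * stride, cy + b * stride)

# A cell centered at (100, 60) on the stride-8 feature map
box = decode_anchor_free(100.0, 60.0, (2.0, 1.0, 3.0, 2.5), stride=8)
print(box)  # (84.0, 52.0, 124.0, 80.0)
```

Removing anchors eliminates the anchor-tuning hyperparameters (scales, ratios, k-means priors) entirely.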

TAL (Task-Aligned Learning):

\[t = s^\alpha \cdot u^\beta\]

where:

  • \(s\): Classification score

  • \(u\): IoU

  • \(\alpha, \beta\): Hyperparameters

Aligns classification and localization quality.
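A sketch of the metric (the defaults \(\alpha=1, \beta=6\) follow the TOOD paper that introduced task-aligned learning):

```python
def task_aligned_metric(cls_score, iou, alpha=1.0, beta=6.0):
    """Task-aligned metric t = s^alpha * u^beta."""
    return cls_score ** alpha * iou ** beta

# A confident but poorly localized prediction vs. a well-aligned one
t_misaligned = task_aligned_metric(0.9, 0.5)
t_aligned = task_aligned_metric(0.7, 0.9)
print(t_misaligned, t_aligned)  # the aligned prediction wins the assignment
```

With \(\beta > \alpha\), a high classification score cannot compensate for poor localization, so label assignment favors predictions that are good at both tasks.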

4. Loss Functions Evolution

Loss      | Formula                                               | Focus
L1        | \(|x - \hat{x}|\)                                     | Simple, not scale-invariant
Smooth L1 | \(0.5x^2\) if \(|x| < 1\), else \(|x| - 0.5\)         | Less sensitive to outliers
IoU       | \(1 - \frac{\text{Intersection}}{\text{Union}}\)      | Invariant to scale
GIoU      | \(\text{IoU} - \frac{|C \setminus (A \cup B)|}{|C|}\) | Handles non-overlapping boxes
DIoU      | \(\text{IoU} - \frac{d^2}{c^2}\)                      | Minimizes center distance
CIoU      | \(\text{DIoU} - \alpha v\)                            | Aspect ratio consistency

Why IoU-based losses?

Traditional L1/L2 on \((x, y, w, h)\) don’t directly optimize detection metric (IoU).

GIoU (Generalized IoU) provides a gradient even when boxes don’t overlap:

\[\text{GIoU} = \text{IoU} - \frac{|C| - |A \cup B|}{|C|}\]

where \(C\) is smallest enclosing box.

5. Evaluation Metrics

Precision & Recall:

\[\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}\]

Average Precision (AP):

Area under Precision-Recall curve:

\[\text{AP} = \int_0^1 p(r) dr\]

mAP (mean AP): Average AP across all classes

AP@IoU=0.5 (AP50): Detection correct if IoU ≥ 0.5

AP@[0.5:0.95] (COCO metric): Average over IoU thresholds {0.5, 0.55, …, 0.95}

Why AP, not accuracy?

  • Handles class imbalance

  • Captures both precision and recall

  • Threshold-independent

6. Modern Techniques

A. Feature Pyramid Networks (FPN):

Combine features from multiple scales:

Bottom-up: C2 → C3 → C4 → C5
             ↓    ↓    ↓    ↓
Top-down:   P2 ← P3 ← P4 ← P5

Lateral connections with 1×1 conv for dimension matching.
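A minimal NumPy sketch of the top-down pathway, with random tensors standing in for backbone features and a random matrix standing in for each learned 1×1 conv:

```python
import numpy as np

# Backbone features at strides 8/16/32 (channel counts are illustrative)
c3 = np.random.rand(256, 80, 80)
c4 = np.random.rand(512, 40, 40)
c5 = np.random.rand(1024, 20, 20)

def conv1x1(x, out_ch):
    """Stand-in for a learned 1x1 conv: mixes channels, keeps H x W."""
    w = np.random.rand(out_ch, x.shape[0]) / x.shape[0]
    return np.tensordot(w, x, axes=1)

def upsample2x(x):
    """Nearest-neighbour 2x upsampling."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

p5 = conv1x1(c5, 256)
p4 = conv1x1(c4, 256) + upsample2x(p5)   # lateral + top-down merge
p3 = conv1x1(c3, 256) + upsample2x(p4)
print(p5.shape, p4.shape, p3.shape)      # all 256 channels, strides 32/16/8
```

Each pyramid level thus carries both high-resolution detail (from the lateral path) and deep semantics (from the top-down path).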

B. Focal Loss (RetinaNet):

Address class imbalance:

\[\text{FL}(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)\]

where \(p_t = p\) if \(y=1\), else \(1-p\).

Down-weights easy examples (high \(p_t\)), focuses on hard negatives.
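A sketch of the loss (\(\alpha=0.25, \gamma=2\) are the RetinaNet defaults), comparing an easy and a hard negative:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss on predicted probabilities p with labels y."""
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# Two background predictions (y=0): an easy negative (p=0.1)
# and a hard negative (p=0.9)
losses = focal_loss(np.array([0.1, 0.9]), np.array([0, 0]))
print(losses)  # the hard negative dominates the loss
```

The \((1 - p_t)^\gamma\) factor shrinks the easy example's contribution by orders of magnitude, which is what lets one-stage detectors train on all ~100k anchors without background swamping the loss.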

C. Deformable Convolutions:

Learn spatial sampling offsets:

\[y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)\]

where \(\Delta p_n\) are learned offsets. Adapts to object deformation.

D. Attention Mechanisms:

  • Spatial attention: Where to look (e.g., CBAM)

  • Channel attention: What features matter (e.g., SE-Net)

7. Comparison: Two-Stage vs One-Stage

Aspect        | Two-Stage (Faster R-CNN)  | One-Stage (YOLO/SSD)
Speed         | Slower (region proposals) | Faster (direct regression)
Accuracy      | Higher mAP                | Lower mAP (improving)
Small Objects | Better (RoI pooling)      | Challenging
Complexity    | More complex              | Simpler
Use Case      | High-accuracy needed      | Real-time critical

Modern Trend: The gap is closing; YOLOv8 approaches Faster R-CNN accuracy while being much faster.

import numpy as np
from typing import List, Tuple
from dataclasses import dataclass

@dataclass
class BoundingBox:
    """Bounding box representation"""
    x: float  # Top-left x
    y: float  # Top-left y
    width: float
    height: float
    confidence: float
    class_id: int
    class_name: str
    
    @property
    def x_min(self) -> float:
        return self.x
    
    @property
    def y_min(self) -> float:
        return self.y
    
    @property
    def x_max(self) -> float:
        return self.x + self.width
    
    @property
    def y_max(self) -> float:
        return self.y + self.height
    
    @property
    def area(self) -> float:
        return self.width * self.height
    
    def to_xyxy(self) -> Tuple[float, float, float, float]:
        """Convert to [x_min, y_min, x_max, y_max] format"""
        return (self.x_min, self.y_min, self.x_max, self.y_max)
    
    def to_xywh(self) -> Tuple[float, float, float, float]:
        """Convert to [x, y, width, height] format"""
        return (self.x, self.y, self.width, self.height)

def compute_iou(box1: BoundingBox, box2: BoundingBox) -> float:
    """Compute Intersection over Union (IoU)"""
    # Intersection coordinates
    x_min = max(box1.x_min, box2.x_min)
    y_min = max(box1.y_min, box2.y_min)
    x_max = min(box1.x_max, box2.x_max)
    y_max = min(box1.y_max, box2.y_max)
    
    # Intersection area
    if x_max < x_min or y_max < y_min:
        return 0.0
    
    intersection = (x_max - x_min) * (y_max - y_min)
    
    # Union area
    union = box1.area + box2.area - intersection
    
    return intersection / union if union > 0 else 0.0

# Test IoU
box1 = BoundingBox(10, 10, 50, 50, 0.9, 0, "person")
box2 = BoundingBox(30, 30, 50, 50, 0.8, 0, "person")
box3 = BoundingBox(100, 100, 50, 50, 0.7, 1, "car")

print(f"IoU(box1, box2) = {compute_iou(box1, box2):.3f}  # Overlapping")
print(f"IoU(box1, box3) = {compute_iou(box1, box3):.3f}  # Non-overlapping")

Non-Maximum Suppression (NMS)

def non_max_suppression(boxes: List[BoundingBox], iou_threshold: float = 0.5) -> List[BoundingBox]:
    """Apply Non-Maximum Suppression to remove duplicate detections"""
    if not boxes:
        return []
    
    # Sort by confidence (highest first)
    boxes = sorted(boxes, key=lambda b: b.confidence, reverse=True)
    
    keep = []
    
    while boxes:
        # Take box with highest confidence
        best_box = boxes.pop(0)
        keep.append(best_box)
        
        # Remove boxes with high IoU
        boxes = [
            box for box in boxes
            if compute_iou(best_box, box) < iou_threshold
            or box.class_id != best_box.class_id  # Different class
        ]
    
    return keep

# Test NMS
detections = [
    BoundingBox(10, 10, 50, 50, 0.95, 0, "person"),
    BoundingBox(12, 12, 50, 50, 0.90, 0, "person"),  # Similar to first
    BoundingBox(15, 15, 50, 50, 0.85, 0, "person"),  # Similar to first
    BoundingBox(100, 100, 50, 50, 0.92, 1, "car"),
]

filtered = non_max_suppression(detections, iou_threshold=0.5)

print(f"Before NMS: {len(detections)} boxes")
print(f"After NMS:  {len(filtered)} boxes")
print("\nKept boxes:")
for box in filtered:
    print(f"  {box.class_name} @ ({box.x:.0f}, {box.y:.0f}): {box.confidence:.2f}")
# Advanced NMS Variants and IoU Implementations

import numpy as np
from typing import List, Tuple
import matplotlib.pyplot as plt

# ============================================================
# 1. IoU Variants Implementation
# ============================================================

def compute_giou(box1: BoundingBox, box2: BoundingBox) -> float:
    r"""
    Generalized IoU (GIoU) - handles non-overlapping boxes.
    
    GIoU = IoU - |C \ (A ∪ B)| / |C|
    
    where C is the smallest enclosing box.
    """
    iou = compute_iou(box1, box2)
    
    # Smallest enclosing box
    x_min = min(box1.x_min, box2.x_min)
    y_min = min(box1.y_min, box2.y_min)
    x_max = max(box1.x_max, box2.x_max)
    y_max = max(box1.y_max, box2.y_max)
    c_area = (x_max - x_min) * (y_max - y_min)
    
    # Intersection and union of the two boxes
    x_min_i = max(box1.x_min, box2.x_min)
    y_min_i = max(box1.y_min, box2.y_min)
    x_max_i = min(box1.x_max, box2.x_max)
    y_max_i = min(box1.y_max, box2.y_max)
    intersection = max(0, x_max_i - x_min_i) * max(0, y_max_i - y_min_i)
    union = box1.area + box2.area - intersection
    
    giou = iou - (c_area - union) / c_area if c_area > 0 else iou
    
    return giou

def compute_diou(box1: BoundingBox, box2: BoundingBox) -> float:
    """
    Distance IoU (DIoU) - considers center distance.
    
    DIoU = IoU - ρ²(b, b_gt) / c²
    
    where ρ is Euclidean distance between centers,
    c is diagonal of smallest enclosing box.
    """
    iou = compute_iou(box1, box2)
    
    # Center points
    cx1 = box1.x + box1.width / 2
    cy1 = box1.y + box1.height / 2
    cx2 = box2.x + box2.width / 2
    cy2 = box2.y + box2.height / 2
    
    # Center distance
    center_dist_sq = (cx1 - cx2) ** 2 + (cy1 - cy2) ** 2
    
    # Smallest enclosing box diagonal
    x_min = min(box1.x_min, box2.x_min)
    y_min = min(box1.y_min, box2.y_min)
    x_max = max(box1.x_max, box2.x_max)
    y_max = max(box1.y_max, box2.y_max)
    
    c_diag_sq = (x_max - x_min) ** 2 + (y_max - y_min) ** 2
    
    diou = iou - center_dist_sq / c_diag_sq if c_diag_sq > 0 else iou
    
    return diou

def compute_ciou(box1: BoundingBox, box2: BoundingBox) -> float:
    """
    Complete IoU (CIoU) - includes aspect ratio consistency.
    
    CIoU = DIoU - α·v
    
    where v measures aspect ratio consistency,
    α is a trade-off parameter.
    """
    diou = compute_diou(box1, box2)
    
    # Aspect ratio term
    v = (4 / (np.pi ** 2)) * (
        np.arctan(box1.width / (box1.height + 1e-7)) -
        np.arctan(box2.width / (box2.height + 1e-7))
    ) ** 2
    
    # Trade-off parameter
    iou = compute_iou(box1, box2)
    alpha = v / (1 - iou + v + 1e-7)
    
    ciou = diou - alpha * v
    
    return ciou

# ============================================================
# 2. Advanced NMS Variants
# ============================================================

def soft_nms(boxes: List[BoundingBox], 
             sigma: float = 0.5,
             score_threshold: float = 0.001) -> List[BoundingBox]:
    """
    Soft-NMS: Decay scores instead of hard suppression.
    
    Instead of removing boxes, decay their confidence:
    
    s_i = s_i · exp(-IoU²(M, b_i) / σ)
    
    Better for occluded objects.
    """
    if not boxes:
        return []
    
    # Create mutable copy with scores
    boxes_with_scores = [(box, box.confidence) for box in boxes]
    keep = []
    
    while boxes_with_scores:
        # Find box with max score
        max_idx = max(range(len(boxes_with_scores)), 
                     key=lambda i: boxes_with_scores[i][1])
        best_box, best_score = boxes_with_scores.pop(max_idx)
        
        if best_score < score_threshold:
            break
        
        keep.append(best_box)
        
        # Decay scores of remaining boxes
        updated = []
        for box, score in boxes_with_scores:
            if box.class_id == best_box.class_id:
                iou = compute_iou(best_box, box)
                # Gaussian decay
                new_score = score * np.exp(-(iou ** 2) / sigma)
                updated.append((box, new_score))
            else:
                updated.append((box, score))
        
        boxes_with_scores = updated
    
    return keep

def nms_with_giou(boxes: List[BoundingBox], 
                  iou_threshold: float = 0.5) -> List[BoundingBox]:
    """NMS using GIoU instead of IoU for better overlap handling."""
    if not boxes:
        return []
    
    boxes = sorted(boxes, key=lambda b: b.confidence, reverse=True)
    keep = []
    
    while boxes:
        best_box = boxes.pop(0)
        keep.append(best_box)
        
        boxes = [
            box for box in boxes
            if compute_giou(best_box, box) < iou_threshold
            or box.class_id != best_box.class_id
        ]
    
    return keep

# ============================================================
# 3. Visualization of IoU Variants
# ============================================================

# Create test boxes
box_a = BoundingBox(20, 20, 60, 60, 0.9, 0, "obj")
box_b = BoundingBox(50, 50, 60, 60, 0.8, 0, "obj")  # Overlapping
box_c = BoundingBox(100, 20, 40, 80, 0.85, 0, "obj")  # Non-overlapping

boxes_to_test = [
    ("Overlapping", box_a, box_b),
    ("Non-overlapping", box_a, box_c),
]

print("="*70)
print("IoU VARIANT COMPARISON")
print("="*70)

for scenario, b1, b2 in boxes_to_test:
    iou = compute_iou(b1, b2)
    giou = compute_giou(b1, b2)
    diou = compute_diou(b1, b2)
    ciou = compute_ciou(b1, b2)
    
    print(f"\n{scenario}:")
    print(f"  IoU:   {iou:7.4f}")
    print(f"  GIoU:  {giou:7.4f} (informative even when boxes don't overlap)")
    print(f"  DIoU:  {diou:7.4f} (considers center distance)")
    print(f"  CIoU:  {ciou:7.4f} (aspect ratio consistency)")

print("\n" + "="*70)
print("KEY INSIGHTS")
print("="*70)
print("• IoU:  Classic metric, but gradient vanishes when boxes don't overlap")
print("• GIoU: Provides gradient even for non-overlapping boxes")
print("• DIoU: Faster convergence by minimizing center distance")
print("• CIoU: Best for training - matches aspect ratio + position + overlap")
print("="*70)

# ============================================================
# 4. NMS Variants Comparison
# ============================================================

# Create clustered detections (simulating multiple detections of same object)
detections_clustered = [
    BoundingBox(50, 50, 100, 100, 0.95, 0, "person"),
    BoundingBox(52, 52, 102, 98, 0.93, 0, "person"),
    BoundingBox(48, 51, 98, 102, 0.91, 0, "person"),
    BoundingBox(55, 48, 95, 105, 0.88, 0, "person"),
    BoundingBox(200, 200, 80, 80, 0.90, 1, "car"),
    BoundingBox(202, 198, 82, 82, 0.87, 1, "car"),
]

print("\n" + "="*70)
print("NMS VARIANT COMPARISON")
print("="*70)
print(f"Original detections: {len(detections_clustered)}")

# Standard NMS
standard_nms = non_max_suppression(detections_clustered.copy(), iou_threshold=0.5)
print(f"\nStandard NMS:  {len(standard_nms)} boxes kept")

# Soft-NMS
soft_nms_result = soft_nms(detections_clustered.copy(), sigma=0.5, score_threshold=0.3)
print(f"Soft-NMS:      {len(soft_nms_result)} boxes kept (gentler suppression)")

# GIoU-NMS
giou_nms_result = nms_with_giou(detections_clustered.copy(), iou_threshold=0.5)
print(f"GIoU-NMS:      {len(giou_nms_result)} boxes kept (better for difficult cases)")

print("\n" + "="*70)
print("RECOMMENDATIONS")
print("="*70)
print("• Standard NMS:  Fast, works well for most cases")
print("• Soft-NMS:      Better for occluded/crowded scenes (keeps more boxes)")
print("• GIoU-NMS:      More robust to box orientation/aspect ratio")
print("="*70)

YOLO Object Detector

# YOLO with Ultralytics (requires installation)
'''
from ultralytics import YOLO
import cv2

# Load YOLOv8 model
model = YOLO('yolov8n.pt')  # nano model (fastest)
# Other options: yolov8s, yolov8m, yolov8l, yolov8x

# Detect objects in image
results = model('path/to/image.jpg')

# Process results
for result in results:
    boxes = result.boxes  # Bounding boxes
    for box in boxes:
        x1, y1, x2, y2 = box.xyxy[0]
        conf = box.conf[0]
        cls = box.cls[0]
        print(f"Detected {model.names[int(cls)]} at ({x1:.0f}, {y1:.0f}) with confidence {conf:.2f}")

# Real-time detection from webcam
cap = cv2.VideoCapture(0)
while True:
    ret, frame = cap.read()
    if not ret:
        break
    
    results = model(frame, stream=True)
    
    for result in results:
        annotated = result.plot()  # Draw boxes
        cv2.imshow('YOLO', annotated)
    
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
'''

print("YOLOv8 detection example (commented - requires ultralytics)")
print("\nYOLO Models:")
print("  yolov8n - Nano (fastest, least accurate)")
print("  yolov8s - Small")
print("  yolov8m - Medium")
print("  yolov8l - Large")
print("  yolov8x - Extra Large (slowest, most accurate)")

Custom Object Detector

class ObjectDetector:
    """Simple object detection wrapper"""
    
    def __init__(self, conf_threshold: float = 0.5, iou_threshold: float = 0.5):
        self.conf_threshold = conf_threshold
        self.iou_threshold = iou_threshold
        self.class_names = self._load_class_names()
    
    def _load_class_names(self) -> List[str]:
        """Load COCO class names"""
        # COCO 80 classes (subset shown)
        return [
            'person', 'bicycle', 'car', 'motorcycle', 'airplane',
            'bus', 'train', 'truck', 'boat', 'traffic light',
            'cat', 'dog', 'horse', 'sheep', 'cow'
            # ... 65 more classes
        ]
    
    def detect(self, image: np.ndarray) -> List[BoundingBox]:
        """Detect objects in image"""
        # Simulate detections
        raw_detections = self._simulate_detections()
        
        # Filter by confidence
        filtered = [d for d in raw_detections if d.confidence >= self.conf_threshold]
        
        # Apply NMS
        final_detections = non_max_suppression(filtered, self.iou_threshold)
        
        return final_detections
    
    def _simulate_detections(self) -> List[BoundingBox]:
        """Simulate raw model output"""
        # In production: actual model inference
        return [
            BoundingBox(50, 50, 100, 150, 0.95, 0, "person"),
            BoundingBox(52, 52, 100, 150, 0.92, 0, "person"),  # Duplicate
            BoundingBox(200, 100, 80, 60, 0.88, 2, "car"),
            BoundingBox(150, 300, 50, 50, 0.76, 10, "cat"),
            BoundingBox(400, 200, 120, 100, 0.42, 3, "motorcycle"),  # Low conf
        ]
    
    def visualize(self, image: np.ndarray, boxes: List[BoundingBox]) -> np.ndarray:
        """Draw boxes on image"""
        # In production: use cv2.rectangle() to draw boxes
        print(f"\nWould draw {len(boxes)} boxes on image:")
        for box in boxes:
            print(f"  {box.class_name}: ({box.x:.0f}, {box.y:.0f}, {box.width:.0f}, {box.height:.0f}) - {box.confidence:.2f}")
        return image

# Test detector
detector = ObjectDetector(conf_threshold=0.5, iou_threshold=0.5)

# Dummy image
image = np.zeros((640, 640, 3), dtype=np.uint8)
detections = detector.detect(image)

print(f"\nDetected {len(detections)} objects:")
for det in detections:
    print(f"  {det.class_name}: {det.confidence:.2%} at ({det.x:.0f}, {det.y:.0f})")

# Visualize
annotated = detector.visualize(image, detections)

Evaluation Metrics

def compute_precision_recall(predictions: List[BoundingBox], 
                              ground_truth: List[BoundingBox],
                              iou_threshold: float = 0.5) -> Tuple[float, float]:
    """Compute precision and recall"""
    true_positives = 0
    matched_gt = set()
    
    for pred in predictions:
        best_iou = 0
        best_gt_idx = -1
        
        for idx, gt in enumerate(ground_truth):
            if gt.class_id != pred.class_id:
                continue
            
            iou = compute_iou(pred, gt)
            if iou > best_iou:
                best_iou = iou
                best_gt_idx = idx
        
        if best_iou >= iou_threshold and best_gt_idx not in matched_gt:
            true_positives += 1
            matched_gt.add(best_gt_idx)
    
    false_positives = len(predictions) - true_positives
    false_negatives = len(ground_truth) - len(matched_gt)
    
    precision = true_positives / (true_positives + false_positives) if predictions else 0
    recall = true_positives / (true_positives + false_negatives) if ground_truth else 0
    
    return precision, recall

def compute_ap(precisions: List[float], recalls: List[float]) -> float:
    """Compute Average Precision (AP)"""
    # Sort by recall
    sorted_indices = np.argsort(recalls)
    recalls = np.array(recalls)[sorted_indices]
    precisions = np.array(precisions)[sorted_indices]
    
    # Compute AP using 11-point interpolation
    ap = 0
    for t in np.arange(0, 1.1, 0.1):
        if np.sum(recalls >= t) == 0:
            p = 0
        else:
            p = np.max(precisions[recalls >= t])
        ap += p / 11
    
    return ap

# Test metrics
pred_boxes = [
    BoundingBox(10, 10, 50, 50, 0.9, 0, "person"),
    BoundingBox(100, 100, 50, 50, 0.8, 1, "car"),
]

gt_boxes = [
    BoundingBox(12, 12, 50, 50, 1.0, 0, "person"),
    BoundingBox(102, 102, 50, 50, 1.0, 1, "car"),
    BoundingBox(200, 200, 50, 50, 1.0, 2, "dog"),  # Missed
]

precision, recall = compute_precision_recall(pred_boxes, gt_boxes)
print(f"Precision: {precision:.2%}")
print(f"Recall: {recall:.2%}")
print(f"F1 Score: {2 * precision * recall / (precision + recall):.2%}")

Production Deployment

import time
from collections import deque

class ProductionDetector:
    """Production-ready object detector"""
    
    def __init__(self, model_name: str = "yolov8n"):
        self.model_name = model_name
        self.detector = ObjectDetector()
        self.stats = {
            "total_images": 0,
            "total_detections": 0,
            "avg_inference_time": 0,
            "fps_history": deque(maxlen=30)
        }
    
    def detect_with_timing(self, image: np.ndarray) -> Tuple[List[BoundingBox], float]:
        """Detect with performance tracking"""
        start = time.time()
        detections = self.detector.detect(image)
        inference_time = time.time() - start
        
        # Update stats
        self.stats["total_images"] += 1
        self.stats["total_detections"] += len(detections)
        self.stats["fps_history"].append(1 / inference_time if inference_time > 0 else 0)
        self.stats["avg_inference_time"] = (
            (self.stats["avg_inference_time"] * (self.stats["total_images"] - 1) + inference_time)
            / self.stats["total_images"]
        )
        
        return detections, inference_time
    
    def get_performance_stats(self) -> dict:
        """Get performance statistics"""
        avg_fps = np.mean(self.stats["fps_history"]) if self.stats["fps_history"] else 0
        
        return {
            "total_images": self.stats["total_images"],
            "total_detections": self.stats["total_detections"],
            "avg_detections_per_image": (
                self.stats["total_detections"] / max(self.stats["total_images"], 1)
            ),
            "avg_inference_time_ms": self.stats["avg_inference_time"] * 1000,
            "avg_fps": avg_fps
        }

# Test production detector
prod_detector = ProductionDetector()

# Process images
for i in range(10):
    image = np.zeros((640, 640, 3), dtype=np.uint8)
    detections, elapsed = prod_detector.detect_with_timing(image)  # elapsed is in seconds

# Print stats
stats = prod_detector.get_performance_stats()
print("\nPerformance Statistics:")
print(f"  Total Images: {stats['total_images']}")
print(f"  Total Detections: {stats['total_detections']}")
print(f"  Avg Detections/Image: {stats['avg_detections_per_image']:.1f}")
print(f"  Avg Inference Time: {stats['avg_inference_time_ms']:.2f}ms")
print(f"  Avg FPS: {stats['avg_fps']:.1f}")
# Anchor Generation and Assignment

import numpy as np
import matplotlib.pyplot as plt
from typing import List, Tuple

# ============================================================
# 1. Anchor Box Generation (Faster R-CNN style)
# ============================================================

def generate_anchors(base_size: int = 16,
                     ratios: List[float] = [0.5, 1.0, 2.0],
                     scales: List[int] = [8, 16, 32]) -> np.ndarray:
    """
    Generate anchor boxes with different scales and aspect ratios.
    
    Parameters:
    -----------
    base_size : Base anchor size
    ratios : Aspect ratios (h/w)
    scales : Scales relative to base_size
    
    Returns:
    --------
    anchors : (k, 4) array of anchors in (x_min, y_min, x_max, y_max) format
    """
    anchors = []
    
    for scale in scales:
        for ratio in ratios:
            # Compute width and height
            h = base_size * scale * np.sqrt(ratio)
            w = base_size * scale / np.sqrt(ratio)
            
            # Center at (0, 0)
            x_min = -w / 2
            y_min = -h / 2
            x_max = w / 2
            y_max = h / 2
            
            anchors.append([x_min, y_min, x_max, y_max])
    
    return np.array(anchors)

# Generate default anchors
anchors = generate_anchors(base_size=16, ratios=[0.5, 1.0, 2.0], scales=[8, 16, 32])

print("="*70)
print("ANCHOR BOX GENERATION")
print("="*70)
print(f"Generated {len(anchors)} anchors:")
print("  3 aspect ratios × 3 scales = 9 anchors per position")
print("\nAnchor dimensions (width × height):")
for idx, anchor in enumerate(anchors):
    w = anchor[2] - anchor[0]
    h = anchor[3] - anchor[1]
    ratio = h / w
    print(f"  Anchor {idx+1}: {w:6.1f} × {h:6.1f}  (ratio: {ratio:.2f})")

# ============================================================
# 2. K-Means Anchor Clustering (YOLO style)
# ============================================================

def kmeans_anchors(boxes: np.ndarray, k: int = 9, max_iters: int = 100) -> np.ndarray:
    """
    Run k-means clustering on box dimensions using IoU distance.
    
    Parameters:
    -----------
    boxes : (n, 2) array of (width, height)
    k : Number of clusters
    
    Returns:
    --------
    anchors : (k, 2) array of anchor (width, height)
    """
    n = boxes.shape[0]
    
    # Random initialization
    np.random.seed(42)
    anchors = boxes[np.random.choice(n, k, replace=False)]
    
    def iou_wh(wh1, wh2):
        """IoU for width-height pairs (assuming aligned at center)"""
        w1, h1 = wh1
        w2, h2 = wh2
        
        inter = np.minimum(w1, w2) * np.minimum(h1, h2)
        union = w1 * h1 + w2 * h2 - inter
        
        return inter / (union + 1e-7)
    
    for iteration in range(max_iters):
        # Assign boxes to nearest anchor
        distances = np.zeros((n, k))
        for i, box in enumerate(boxes):
            for j, anchor in enumerate(anchors):
                distances[i, j] = 1 - iou_wh(box, anchor)  # Distance = 1 - IoU
        
        assignments = np.argmin(distances, axis=1)
        
        # Update anchors
        new_anchors = np.zeros((k, 2))
        for j in range(k):
            cluster_boxes = boxes[assignments == j]
            if len(cluster_boxes) > 0:
                new_anchors[j] = cluster_boxes.mean(axis=0)
            else:
                new_anchors[j] = anchors[j]  # Keep old if no assignment
        
        # Check convergence
        if np.allclose(anchors, new_anchors):
            break
        
        anchors = new_anchors
    
    # Sort by area
    areas = anchors[:, 0] * anchors[:, 1]
    sorted_indices = np.argsort(areas)
    anchors = anchors[sorted_indices]
    
    return anchors

# Simulate COCO-like box distribution
np.random.seed(42)
n_boxes = 1000

# Generate realistic box distributions
# Small objects (people, animals): 30-100 pixels
small_boxes = np.random.uniform(30, 100, (400, 2))

# Medium objects (cars, furniture): 100-250 pixels  
medium_boxes = np.random.uniform(100, 250, (400, 2))

# Large objects (buildings, scenes): 250-500 pixels
large_boxes = np.random.uniform(250, 500, (200, 2))

all_boxes = np.vstack([small_boxes, medium_boxes, large_boxes])

# Run k-means
learned_anchors = kmeans_anchors(all_boxes, k=9, max_iters=50)

print("\n" + "="*70)
print("K-MEANS ANCHOR LEARNING (YOLO-style)")
print("="*70)
print(f"Learned {len(learned_anchors)} anchors from {len(all_boxes)} boxes:")
print("\nAnchor dimensions (width × height):")
for idx, (w, h) in enumerate(learned_anchors):
    ratio = h / w
    area = w * h
    print(f"  Anchor {idx+1}: {w:6.1f} × {h:6.1f}  (ratio: {ratio:.2f}, area: {area:8.0f})")

# ============================================================
# 3. Anchor Assignment Strategy
# ============================================================

def assign_anchors_to_gt(gt_boxes: np.ndarray,
                         anchors: np.ndarray,
                         pos_iou_thresh: float = 0.7,
                         neg_iou_thresh: float = 0.3) -> Tuple[np.ndarray, np.ndarray]:
    """
    Assign anchors to ground truth boxes (Faster R-CNN strategy).
    
    Parameters:
    -----------
    gt_boxes : (m, 4) ground truth boxes [x_min, y_min, x_max, y_max]
    anchors : (n, 4) anchor boxes
    
    Returns:
    --------
    labels : (n,) array {-1: ignore, 0: background, 1: object}
    targets : (n, 4) box regression targets
    """
    n_anchors = len(anchors)
    n_gt = len(gt_boxes)
    
    labels = -np.ones(n_anchors, dtype=np.int32)  # -1 = ignore
    targets = np.zeros((n_anchors, 4))
    
    if n_gt == 0:
        labels[:] = 0  # All background
        return labels, targets
    
    # Compute IoU matrix
    ious = np.zeros((n_anchors, n_gt))
    for i, anchor in enumerate(anchors):
        for j, gt in enumerate(gt_boxes):
            # Compute IoU (simplified for demonstration)
            x_min = max(anchor[0], gt[0])
            y_min = max(anchor[1], gt[1])
            x_max = min(anchor[2], gt[2])
            y_max = min(anchor[3], gt[3])
            
            inter = max(0, x_max - x_min) * max(0, y_max - y_min)
            
            area_a = (anchor[2] - anchor[0]) * (anchor[3] - anchor[1])
            area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
            union = area_a + area_g - inter
            
            ious[i, j] = inter / (union + 1e-7)
    
    # Assign labels
    max_iou_per_anchor = ious.max(axis=1)
    max_gt_per_anchor = ious.argmax(axis=1)
    
    # Rule 1: IoU >= pos_iou_thresh → positive (object)
    labels[max_iou_per_anchor >= pos_iou_thresh] = 1
    
    # Rule 2: IoU < neg_iou_thresh → negative (background)
    labels[max_iou_per_anchor < neg_iou_thresh] = 0
    
    # Rule 3: each GT's highest-IoU anchor is positive,
    # guaranteeing every GT box gets at least one match
    for j in range(n_gt):
        best_anchor = ious[:, j].argmax()
        labels[best_anchor] = 1
        max_gt_per_anchor[best_anchor] = j
    
    # Compute box regression targets (for positive anchors)
    for i in range(n_anchors):
        if labels[i] == 1:
            anchor = anchors[i]
            gt = gt_boxes[max_gt_per_anchor[i]]
            
            # Parameterized offsets
            ax_ctr = (anchor[0] + anchor[2]) / 2
            ay_ctr = (anchor[1] + anchor[3]) / 2
            aw = anchor[2] - anchor[0]
            ah = anchor[3] - anchor[1]
            
            gx_ctr = (gt[0] + gt[2]) / 2
            gy_ctr = (gt[1] + gt[3]) / 2
            gw = gt[2] - gt[0]
            gh = gt[3] - gt[1]
            
            targets[i, 0] = (gx_ctr - ax_ctr) / aw
            targets[i, 1] = (gy_ctr - ay_ctr) / ah
            targets[i, 2] = np.log(gw / aw)
            targets[i, 3] = np.log(gh / ah)
    
    return labels, targets

# Test anchor assignment
test_gt = np.array([[100, 100, 200, 200], [300, 150, 450, 300]])
test_anchors = np.array([
    [90, 90, 210, 210],    # High IoU with GT1
    [150, 150, 250, 250],  # Medium IoU with GT1
    [295, 145, 455, 305],  # High IoU with GT2
    [500, 500, 550, 550],  # No overlap (background)
])

labels, targets = assign_anchors_to_gt(test_gt, test_anchors, pos_iou_thresh=0.7, neg_iou_thresh=0.3)

print("\n" + "="*70)
print("ANCHOR ASSIGNMENT EXAMPLE")
print("="*70)
print(f"Ground Truth boxes: {len(test_gt)}")
print(f"Anchors: {len(test_anchors)}")
print("\nAssignment results:")
for i, (label, target) in enumerate(zip(labels, targets)):
    status = {-1: "IGNORE", 0: "BACKGROUND", 1: "OBJECT"}[label]
    print(f"  Anchor {i+1}: {status:12s}", end="")
    if label == 1:
        print(f" → targets: ({target[0]:6.3f}, {target[1]:6.3f}, {target[2]:6.3f}, {target[3]:6.3f})")
    else:
        print()

print("\n" + "="*70)
print("ASSIGNMENT RULES")
print("="*70)
print("1. IoU ≥ 0.7 with any GT → POSITIVE (object)")
print("2. IoU < 0.3 with all GT → NEGATIVE (background)")
print("3. 0.3 ≤ IoU < 0.7 → IGNORE (ambiguous, don't train)")
print("4. Best anchor for each GT → POSITIVE (ensures every GT has a match)")
print("="*70)
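At inference time the predicted offsets must be mapped back to absolute coordinates. A minimal sketch of the inverse transform (`decode_targets` is an illustrative helper, written to match the \((t_x, t_y, t_w, t_h)\) parameterization used in `assign_anchors_to_gt` above):

```python
import numpy as np

def decode_targets(anchors: np.ndarray, targets: np.ndarray) -> np.ndarray:
    """Invert the (tx, ty, tw, th) parameterization back to corner boxes."""
    # Anchor widths, heights, and centers
    aw = anchors[:, 2] - anchors[:, 0]
    ah = anchors[:, 3] - anchors[:, 1]
    ax = (anchors[:, 0] + anchors[:, 2]) / 2
    ay = (anchors[:, 1] + anchors[:, 3]) / 2

    # Apply predicted offsets: centers shift linearly, sizes scale exponentially
    gx = targets[:, 0] * aw + ax
    gy = targets[:, 1] * ah + ay
    gw = np.exp(targets[:, 2]) * aw
    gh = np.exp(targets[:, 3]) * ah

    # Back to (x_min, y_min, x_max, y_max)
    return np.stack([gx - gw / 2, gy - gh / 2,
                     gx + gw / 2, gy + gh / 2], axis=1)

# Round trip: encoding a GT box against an anchor, then decoding, recovers the GT
anchor = np.array([[90.0, 90.0, 210.0, 210.0]])  # 120x120, centered at (150, 150)
t = np.array([[0.0, 0.0, np.log(100 / 120), np.log(100 / 120)]])
decoded = decode_targets(anchor, t)
print(decoded)  # β†’ [[100. 100. 200. 200.]]
```

Training regresses the offsets; this decoder turns them back into boxes before NMS.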

Data Augmentation for Object DetectionΒΆ

Data augmentation is crucial for training robust object detectors. Unlike classification, augmentation must transform both images AND bounding box annotations consistently.

1. Geometric AugmentationsΒΆ

Random Horizontal FlipΒΆ

  • Operation: Flip image left-right

  • Box transformation: \(x'_{min} = W - x_{max}\), \(x'_{max} = W - x_{min}\), where \(W\) is image width (min and max swap so that \(x'_{min} < x'_{max}\))

  • Use case: Objects with horizontal symmetry (cars, people)

  • Probability: Typically 0.5

Random Scaling and TranslationΒΆ

  • Scale: Resize image by factor \(s \in [s_{min}, s_{max}]\)

  • Translate: Shift by \((\Delta x, \Delta y)\)

  • Box update: \(x_{new} = s \cdot x + \Delta x\), \(y_{new} = s \cdot y + \Delta y\)

  • Typical ranges: \(s \in [0.8, 1.2]\), \(|\Delta| \leq 0.1W\)

Random RotationΒΆ

  • Operation: Rotate by angle \(\theta\)

  • Box transformation: Compute corners, rotate, find new axis-aligned bbox

  • Challenge: Bounding box becomes larger after rotation

  • Alternative: Use oriented bounding boxes (OBB)
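The corner-rotation step above can be sketched as follows (`rotate_box` is an illustrative helper, not from any library): rotate the four corners about a center point, then take the enclosing axis-aligned box.

```python
import numpy as np

def rotate_box(box, angle_deg, center):
    """Rotate a box's 4 corners about `center`, return the axis-aligned hull."""
    x1, y1, x2, y2 = box
    corners = np.array([[x1, y1], [x2, y1], [x2, y2], [x1, y2]], dtype=float)

    theta = np.deg2rad(angle_deg)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])

    # Rotate each corner about the center
    rotated = (corners - center) @ R.T + center

    # The new axis-aligned bbox must enclose all rotated corners,
    # so it can only grow (the "larger box" challenge noted above)
    return np.array([rotated[:, 0].min(), rotated[:, 1].min(),
                     rotated[:, 0].max(), rotated[:, 1].max()])

# A 10x10 box rotated 45 degrees about its own center grows to ~14.1x14.1
print(rotate_box([45, 45, 55, 55], 45, center=np.array([50.0, 50.0])))
```

For a 45° rotation the enclosing box widens by a factor of \(\sqrt{2}\), which is exactly why oriented bounding boxes are the tighter alternative.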

2. Mosaic Augmentation (YOLO v4+)ΒΆ

Combines 4 images into one mosaic:

\[\begin{split}I_{mosaic} = \begin{bmatrix} I_1 & I_2 \\ I_3 & I_4 \end{bmatrix}\end{split}\]

Procedure:

  1. Sample 4 images

  2. Resize to random scales

  3. Place at 4 quadrants with random center point

  4. Adjust all bounding boxes to new coordinates

Benefits:

  • Exposes model to more objects per batch

  • Forces model to learn from different scales simultaneously

  • Improves small object detection

  • Reduces reliance on large mini-batches for batch-norm statistics (each sample already mixes 4 images)

3. MixUp for Object DetectionΒΆ

Blend two images with ratio \(\lambda \sim Beta(\alpha, \alpha)\):

\[I_{mix} = \lambda I_1 + (1-\lambda) I_2\]
\[\mathcal{B}_{mix} = \mathcal{B}_1 \cup \mathcal{B}_2\]

Modifications for detection:

  • Keep all bounding boxes from both images

  • Optional: Weight box confidence by \(\lambda\)

  • Typical \(\alpha = 1.5\) for detection (classification MixUp typically uses a much smaller \(\alpha\), e.g. 0.2)

4. Copy-Paste AugmentationΒΆ

From instance segmentation masks:

  1. Extract object from image 1 using mask

  2. Paste onto image 2 at random location

  3. Add bounding box to image 2 annotations

Advanced: Use Poisson blending for seamless integration
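The three steps above can be sketched as a simple mask-based paste; `copy_paste` is an illustrative helper (hard compositing only, no Poisson blending or boundary checks):

```python
import numpy as np

def copy_paste(src_img, src_mask, src_box, dst_img, dst_boxes, top_left):
    """Paste the masked object from src_img onto dst_img at `top_left`
    and append the corresponding bounding box to dst's annotations."""
    x1, y1, x2, y2 = map(int, src_box)
    ox, oy = top_left

    patch = src_img[y1:y2, x1:x2]
    mask = src_mask[y1:y2, x1:x2].astype(bool)

    out = dst_img.copy()
    region = out[oy:oy + patch.shape[0], ox:ox + patch.shape[1]]
    region[mask] = patch[mask]  # copy only the masked (object) pixels

    # Step 3: add the pasted object's box to the destination annotations
    new_box = np.array([ox, oy, ox + (x2 - x1), oy + (y2 - y1)], dtype=float)
    new_boxes = np.vstack([dst_boxes, new_box]) if len(dst_boxes) else new_box[None]
    return out, new_boxes

# Demo: paste a 5x5 object from one image into another
src = np.zeros((20, 20, 3), dtype=np.uint8); src[5:10, 5:10] = 200
mask = np.zeros((20, 20), dtype=np.uint8); mask[5:10, 5:10] = 1
dst = np.zeros((50, 50, 3), dtype=np.uint8)
pasted, new_boxes = copy_paste(src, mask, [5, 5, 10, 10], dst, np.zeros((0, 4)), (30, 30))
print(new_boxes)  # β†’ [[30. 30. 35. 35.]]
```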

5. Color/Photometric AugmentationsΒΆ

These don’t affect bounding boxes:

| Augmentation | Operation            | Range                     |
|--------------|----------------------|---------------------------|
| Brightness   | \(I' = I + \beta\)   | \(\beta \in [-30, 30]\)   |
| Contrast     | \(I' = \alpha I\)    | \(\alpha \in [0.8, 1.2]\) |
| Saturation   | Adjust in HSV space  | \(\times [0.7, 1.3]\)     |
| Hue          | Shift hue channel    | \(\pm 10^\circ\)          |

HSV transformation:

\[\begin{split}I_{HSV}' = \begin{bmatrix} H + \Delta H \\ \alpha_S \cdot S \\ \alpha_V \cdot V \end{bmatrix}\end{split}\]

6. Random Crop and PaddingΒΆ

IoU-based Cropping (SSD-style):ΒΆ

  1. Sample crop with IoU \(\in \{0.1, 0.3, 0.5, 0.7, 0.9, 1.0\}\) with some GT box

  2. Reject if crop doesn’t contain any object center

  3. Adjust boxes: clip to crop boundaries, remove if center outside
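A minimal sketch of the three steps above (`iou_crop` is an illustrative helper; corner-format boxes assumed, and the IoU target is simplified to a single minimum rather than the full SSD sampling set):

```python
import numpy as np

def iou_crop(image, boxes, min_iou=0.3, max_tries=50, seed=0):
    """SSD-style random crop: accept a crop only if it reaches `min_iou` with
    some GT box, then keep boxes whose centers fall inside the crop."""
    rng = np.random.default_rng(seed)
    H, W = image.shape[:2]
    for _ in range(max_tries):
        cw = int(rng.uniform(0.5, 1.0) * W)
        ch = int(rng.uniform(0.5, 1.0) * H)
        cx1 = int(rng.integers(0, W - cw + 1))
        cy1 = int(rng.integers(0, H - ch + 1))
        crop = np.array([cx1, cy1, cx1 + cw, cy1 + ch])

        # Step 1: IoU of the crop with each GT box
        ix1 = np.maximum(crop[0], boxes[:, 0]); iy1 = np.maximum(crop[1], boxes[:, 1])
        ix2 = np.minimum(crop[2], boxes[:, 2]); iy2 = np.minimum(crop[3], boxes[:, 3])
        inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
        area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        ious = inter / (cw * ch + area_b - inter + 1e-7)
        if ious.max() < min_iou:
            continue

        # Step 2: reject crops containing no object center
        cxs = (boxes[:, 0] + boxes[:, 2]) / 2
        cys = (boxes[:, 1] + boxes[:, 3]) / 2
        keep = (cxs >= crop[0]) & (cxs < crop[2]) & (cys >= crop[1]) & (cys < crop[3])
        if not keep.any():
            continue

        # Step 3: shift kept boxes into crop coordinates and clip
        new_boxes = boxes[keep] - np.array([crop[0], crop[1], crop[0], crop[1]])
        new_boxes[:, [0, 2]] = np.clip(new_boxes[:, [0, 2]], 0, cw)
        new_boxes[:, [1, 3]] = np.clip(new_boxes[:, [1, 3]], 0, ch)
        return image[crop[1]:crop[3], crop[0]:crop[2]], new_boxes
    return image, boxes  # fall back to the original if no valid crop is found

img = np.zeros((100, 100, 3), dtype=np.uint8)
gt = np.array([[20.0, 20.0, 60.0, 60.0]])
crop_img, crop_boxes = iou_crop(img, gt)
print(crop_img.shape, crop_boxes)
```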

Padding:ΒΆ

  • Add borders: top, bottom, left, right

  • Keep boxes unchanged (still valid in larger canvas)

  • Useful for preserving aspect ratio

7. Advanced TechniquesΒΆ

CutOut / Random ErasingΒΆ

  • Randomly mask rectangular regions

  • For detection: Avoid erasing object centers

  • Forces model to use partial information
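The detection-specific twist (never erase an object center) can be sketched as follows; `random_erase` is an illustrative helper that resamples the erase rectangle until it misses every box center:

```python
import numpy as np

def random_erase(image, boxes, scale=(0.02, 0.1), max_tries=20, seed=0):
    """Erase one random square patch, resampling if it would cover a box center."""
    rng = np.random.default_rng(seed)
    H, W = image.shape[:2]
    cxs = (boxes[:, 0] + boxes[:, 2]) / 2
    cys = (boxes[:, 1] + boxes[:, 3]) / 2

    for _ in range(max_tries):
        area = rng.uniform(*scale) * H * W
        ew = eh = int(np.sqrt(area))
        ex = int(rng.integers(0, W - ew))
        ey = int(rng.integers(0, H - eh))

        # Reject patches that would hide any object center
        hides = ((cxs >= ex) & (cxs < ex + ew) &
                 (cys >= ey) & (cys < ey + eh)).any()
        if hides:
            continue

        out = image.copy()
        out[ey:ey + eh, ex:ex + ew] = rng.integers(0, 256, (eh, ew, image.shape[2]))
        return out
    return image  # give up rather than erase an object center

img = np.zeros((100, 100, 3), dtype=np.uint8)
gt = np.array([[88.0, 88.0, 99.0, 99.0]])
erased = random_erase(img, gt)
```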

AutoAugment for DetectionΒΆ

Learn augmentation policy via RL:

  • Search space: 20+ operations with magnitude

  • Optimize for mAP on validation set

  • Computationally expensive but effective

Test-Time Augmentation (TTA)ΒΆ

At inference:

  1. Apply multiple augmentations to input

  2. Run detector on each

  3. Aggregate predictions (NMS across all)
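For a horizontal flip, step 3 requires mapping the flipped detections back before pooling. A minimal sketch (`tta_flip` and the `detector` callable are illustrative; `detector(img)` is assumed to return an (n, 5) array of `[x1, y1, x2, y2, score]`):

```python
import numpy as np

def tta_flip(detector, image):
    """Run `detector` on the image and its horizontal flip, map the flipped
    detections back to original coordinates, and pool everything for NMS."""
    W = image.shape[1]

    dets = detector(image)
    dets_flip = detector(np.fliplr(image))

    # Un-flip: x' = W - x, swapping x1/x2 so that x1 < x2 still holds
    unflipped = dets_flip.copy()
    unflipped[:, 0] = W - dets_flip[:, 2]
    unflipped[:, 2] = W - dets_flip[:, 0]

    return np.vstack([dets, unflipped])  # feed this pooled set to NMS

# Toy detector that always reports one fixed box
fake = lambda img: np.array([[10.0, 10.0, 30.0, 40.0, 0.9]])
pooled = tta_flip(fake, np.zeros((100, 100, 3), dtype=np.uint8))
print(pooled)
```

In this toy case the flipped pass contributes `[70, 10, 90, 40]`: the same box mirrored about the image's vertical midline.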

8. Augmentation PipelineΒΆ

Training loop:

For each image:
  1. Mosaic (prob=0.5)
  2. Random scale (0.5-1.5×)
  3. Random flip (prob=0.5)
  4. HSV adjustment (always)
  5. MixUp (prob=0.1)
  6. Random crop (IoU-based)
  7. Normalize and resize to input size

Key principles:

  • Always maintain box-image consistency

  • Discard boxes with area < threshold (e.g., 16 pixels)

  • Clip boxes to image boundaries

  • Remove boxes left with zero area (e.g. pushed entirely outside the image)

9. Augmentation Strength SchedulingΒΆ

Warm-up phase (epochs 1-10):

  • Mild augmentation: flip, small scale, color jitter

  • Helps training start stably

Main phase (epochs 10-270):

  • Full augmentation: mosaic, mixup, aggressive scaling

Fine-tuning phase (epochs 270-300):

  • Disable mosaic/mixup

  • Only flip and mild scaling

  • Helps model adapt to real data distribution
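The three-phase schedule above can be expressed as a simple lookup; the numbers here are illustrative, matching the phases described (warm-up, main, fine-tuning with mosaic/mixup disabled for the last 30 epochs):

```python
def augmentation_schedule(epoch: int, total_epochs: int = 300) -> dict:
    """Return augmentation settings for the current epoch (illustrative values)."""
    if epoch < 10:
        # Warm-up: mild augmentation only
        return {"flip": 0.5, "scale": (0.9, 1.1), "mosaic": 0.0, "mixup": 0.0}
    elif epoch < total_epochs - 30:
        # Main phase: full augmentation
        return {"flip": 0.5, "scale": (0.5, 1.5), "mosaic": 0.5, "mixup": 0.1}
    else:
        # Fine-tuning: disable mosaic/mixup, keep flip and mild scaling
        return {"flip": 0.5, "scale": (0.9, 1.1), "mosaic": 0.0, "mixup": 0.0}

print(augmentation_schedule(5), augmentation_schedule(100), augmentation_schedule(295))
```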

10. Domain-Specific ConsiderationsΒΆ

| Domain | Key Augmentations | Avoid |
|---|---|---|
| Autonomous Driving | Flip, scale, cutout, weather effects | Rotation (road is horizontal) |
| Retail (shelf detection) | Scale, brightness, crop | Flip (text orientation matters) |
| Aerial Imagery | Rotation, flip, scale | Extreme crops (context matters) |
| Medical Imaging | Rotation, flip, elastic deformation | Color jitter (color carries diagnostic info) |

Best practice: Analyze failure cases, add targeted augmentations to address them.

# Data Augmentation Implementations

import numpy as np
import matplotlib.pyplot as plt
from typing import List, Tuple
import cv2

# ============================================================
# 1. Geometric Augmentations with Box Updates
# ============================================================

def horizontal_flip(image: np.ndarray, boxes: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """
    Flip image horizontally and update boxes.
    
    Parameters:
    -----------
    image : (H, W, C) image array
    boxes : (n, 4) boxes in [x_min, y_min, x_max, y_max] format
    
    Returns:
    --------
    flipped_image, flipped_boxes
    """
    H, W = image.shape[:2]
    
    # Flip image
    flipped_image = np.fliplr(image)
    
    # Update boxes: x' = W - x
    flipped_boxes = boxes.copy()
    flipped_boxes[:, [0, 2]] = W - boxes[:, [2, 0]]  # Swap and flip x coordinates
    
    return flipped_image, flipped_boxes

def random_scale_translate(image: np.ndarray, 
                           boxes: np.ndarray,
                           scale_range: Tuple[float, float] = (0.8, 1.2),
                           translate_range: float = 0.1) -> Tuple[np.ndarray, np.ndarray]:
    """
    Randomly scale and translate image and boxes.
    """
    H, W = image.shape[:2]
    
    # Random scale and translation
    scale = np.random.uniform(*scale_range)
    tx = np.random.uniform(-translate_range * W, translate_range * W)
    ty = np.random.uniform(-translate_range * H, translate_range * H)
    
    # Transformation matrix
    M = np.array([[scale, 0, tx],
                  [0, scale, ty]])
    
    # Transform image
    new_H, new_W = int(H * scale), int(W * scale)
    transformed = cv2.warpAffine(image, M, (new_W, new_H))
    
    # Transform boxes
    transformed_boxes = boxes.copy()
    transformed_boxes[:, [0, 2]] = transformed_boxes[:, [0, 2]] * scale + tx
    transformed_boxes[:, [1, 3]] = transformed_boxes[:, [1, 3]] * scale + ty
    
    # Clip to image boundaries
    transformed_boxes[:, [0, 2]] = np.clip(transformed_boxes[:, [0, 2]], 0, new_W)
    transformed_boxes[:, [1, 3]] = np.clip(transformed_boxes[:, [1, 3]], 0, new_H)
    
    return transformed, transformed_boxes

# ============================================================
# 2. Mosaic Augmentation
# ============================================================

def create_mosaic(images: List[np.ndarray], 
                  boxes_list: List[np.ndarray],
                  output_size: int = 640) -> Tuple[np.ndarray, np.ndarray]:
    """
    Create mosaic from 4 images (YOLO-style).
    
    Parameters:
    -----------
    images : List of 4 images
    boxes_list : List of 4 box arrays (each n_i Γ— 4)
    output_size : Output mosaic size
    
    Returns:
    --------
    mosaic_image, mosaic_boxes
    """
    assert len(images) == 4, "Need exactly 4 images for mosaic"
    
    # Random center point
    cx = np.random.randint(output_size // 4, 3 * output_size // 4)
    cy = np.random.randint(output_size // 4, 3 * output_size // 4)
    
    mosaic = np.zeros((output_size, output_size, 3), dtype=np.uint8)
    mosaic_boxes = []
    
    # Quadrant offsets: top-left, top-right, bottom-left, bottom-right
    quadrants = [
        (0, 0, cx, cy),          # Top-left
        (cx, 0, output_size, cy),  # Top-right
        (0, cy, cx, output_size),  # Bottom-left
        (cx, cy, output_size, output_size)  # Bottom-right
    ]
    
    for idx, (img, boxes) in enumerate(zip(images, boxes_list)):
        x1, y1, x2, y2 = quadrants[idx]
        quad_w, quad_h = x2 - x1, y2 - y1
        
        # Resize image to fit quadrant
        img_resized = cv2.resize(img, (quad_w, quad_h))
        
        # Place in mosaic
        mosaic[y1:y2, x1:x2] = img_resized
        
        # Transform boxes
        if len(boxes) > 0:
            H, W = img.shape[:2]
            scale_x = quad_w / W
            scale_y = quad_h / H
            
            transformed_boxes = boxes.copy()
            transformed_boxes[:, [0, 2]] = boxes[:, [0, 2]] * scale_x + x1
            transformed_boxes[:, [1, 3]] = boxes[:, [1, 3]] * scale_y + y1
            
            # Clip to mosaic boundaries
            transformed_boxes[:, [0, 2]] = np.clip(transformed_boxes[:, [0, 2]], 0, output_size)
            transformed_boxes[:, [1, 3]] = np.clip(transformed_boxes[:, [1, 3]], 0, output_size)
            
            mosaic_boxes.append(transformed_boxes)
    
    mosaic_boxes = np.vstack(mosaic_boxes) if mosaic_boxes else np.zeros((0, 4))
    
    return mosaic, mosaic_boxes

# ============================================================
# 3. MixUp for Detection
# ============================================================

def mixup_detection(image1: np.ndarray, boxes1: np.ndarray,
                    image2: np.ndarray, boxes2: np.ndarray,
                    alpha: float = 1.5) -> Tuple[np.ndarray, np.ndarray]:
    """
    Apply MixUp to two images and combine their boxes.
    """
    # Sample mixing ratio
    lam = np.random.beta(alpha, alpha)
    
    # Mix images (compute in float, cast back to uint8)
    mixed_image = (lam * image1.astype(np.float32)
                   + (1 - lam) * image2.astype(np.float32)).astype(np.uint8)
    
    # Keep all boxes from both images (either set may be empty)
    non_empty = [b for b in (boxes1, boxes2) if len(b) > 0]
    mixed_boxes = np.vstack(non_empty) if non_empty else np.zeros((0, 4))
    
    return mixed_image, mixed_boxes

# ============================================================
# 4. Color Jittering
# ============================================================

def color_jitter(image: np.ndarray,
                 brightness: float = 30,
                 contrast: Tuple[float, float] = (0.8, 1.2),
                 saturation: Tuple[float, float] = (0.7, 1.3),
                 hue: float = 10) -> np.ndarray:
    """
    Apply random color jittering in HSV space.
    """
    # Convert to HSV
    hsv = cv2.cvtColor(image, cv2.COLOR_RGB2HSV).astype(np.float32)
    
    # Hue shift (OpenCV hue range is [0, 180); wrap around instead of clipping)
    hsv[:, :, 0] = (hsv[:, :, 0] + np.random.uniform(-hue, hue)) % 180
    
    # Saturation scaling
    hsv[:, :, 1] *= np.random.uniform(*saturation)
    hsv[:, :, 1] = np.clip(hsv[:, :, 1], 0, 255)
    
    # Value channel: additive brightness, then multiplicative contrast
    hsv[:, :, 2] += np.random.uniform(-brightness, brightness)
    hsv[:, :, 2] *= np.random.uniform(*contrast)
    hsv[:, :, 2] = np.clip(hsv[:, :, 2], 0, 255)
    
    # Convert back to RGB
    jittered = cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2RGB)
    
    return jittered

# ============================================================
# 5. Demonstration
# ============================================================

# Create synthetic images with boxes
def create_test_image(color, box):
    """Helper to create test image with one box"""
    img = np.full((300, 300, 3), color, dtype=np.uint8)
    x1, y1, x2, y2 = box
    cv2.rectangle(img, (int(x1), int(y1)), (int(x2), int(y2)), (255, 255, 255), 2)
    return img

# Test images
img1 = create_test_image([200, 100, 100], [50, 50, 150, 150])
boxes1 = np.array([[50, 50, 150, 150]])

img2 = create_test_image([100, 200, 100], [180, 180, 280, 280])
boxes2 = np.array([[180, 180, 280, 280]])

img3 = create_test_image([100, 100, 200], [60, 120, 160, 220])
boxes3 = np.array([[60, 120, 160, 220]])

img4 = create_test_image([200, 200, 100], [100, 50, 250, 150])
boxes4 = np.array([[100, 50, 250, 150]])

# Apply augmentations
flipped_img, flipped_boxes = horizontal_flip(img1, boxes1)
mosaic_img, mosaic_boxes = create_mosaic([img1, img2, img3, img4], 
                                         [boxes1, boxes2, boxes3, boxes4], 
                                         output_size=400)
mixed_img, mixed_boxes = mixup_detection(img1, boxes1, img2, boxes2, alpha=1.5)
jittered_img = color_jitter(img1.copy())

# Visualization
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('Object Detection Data Augmentation Examples', fontsize=16, fontweight='bold')

# Original
axes[0, 0].imshow(img1)
axes[0, 0].add_patch(plt.Rectangle((boxes1[0, 0], boxes1[0, 1]), 
                                   boxes1[0, 2] - boxes1[0, 0], 
                                   boxes1[0, 3] - boxes1[0, 1],
                                   fill=False, edgecolor='yellow', linewidth=2))
axes[0, 0].set_title('Original Image')
axes[0, 0].axis('off')

# Horizontal Flip
axes[0, 1].imshow(flipped_img)
axes[0, 1].add_patch(plt.Rectangle((flipped_boxes[0, 0], flipped_boxes[0, 1]), 
                                   flipped_boxes[0, 2] - flipped_boxes[0, 0], 
                                   flipped_boxes[0, 3] - flipped_boxes[0, 1],
                                   fill=False, edgecolor='yellow', linewidth=2))
axes[0, 1].set_title('Horizontal Flip')
axes[0, 1].axis('off')

# Mosaic
axes[0, 2].imshow(mosaic_img)
for box in mosaic_boxes:
    axes[0, 2].add_patch(plt.Rectangle((box[0], box[1]), box[2] - box[0], box[3] - box[1],
                                       fill=False, edgecolor='yellow', linewidth=2))
axes[0, 2].set_title(f'Mosaic ({len(mosaic_boxes)} boxes)')
axes[0, 2].axis('off')

# MixUp
axes[1, 0].imshow(mixed_img)
for box in mixed_boxes:
    axes[1, 0].add_patch(plt.Rectangle((box[0], box[1]), box[2] - box[0], box[3] - box[1],
                                       fill=False, edgecolor='yellow', linewidth=2))
axes[1, 0].set_title(f'MixUp ({len(mixed_boxes)} boxes)')
axes[1, 0].axis('off')

# Color Jitter
axes[1, 1].imshow(jittered_img)
axes[1, 1].add_patch(plt.Rectangle((boxes1[0, 0], boxes1[0, 1]), 
                                   boxes1[0, 2] - boxes1[0, 0], 
                                   boxes1[0, 3] - boxes1[0, 1],
                                   fill=False, edgecolor='yellow', linewidth=2))
axes[1, 1].set_title('Color Jitter (HSV)')
axes[1, 1].axis('off')

# Statistics
axes[1, 2].axis('off')
stats_text = f"""
AUGMENTATION STATISTICS

Original: {len(boxes1)} box
Flip: {len(flipped_boxes)} box
Mosaic: {len(mosaic_boxes)} boxes
MixUp: {len(mixed_boxes)} boxes

KEY INSIGHTS:
• Mosaic combines 4 images
• All boxes preserved
• Coordinates transformed
• Boundaries clipped

RECOMMENDATIONS:
✓ Flip: 50% probability
✓ Mosaic: 50% in training
✓ MixUp: 10% probability
✓ Color: Always apply
✓ Disable mosaic in last
  10 epochs for fine-tuning
"""
axes[1, 2].text(0.1, 0.5, stats_text, fontsize=11, verticalalignment='center',
                family='monospace', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.3))

plt.tight_layout()
plt.show()

print("="*70)
print("DATA AUGMENTATION SUMMARY")
print("="*70)
print("✓ Horizontal flip: Boxes transformed correctly")
print(f"✓ Mosaic: Combined {len(mosaic_boxes)} boxes from 4 images")
print(f"✓ MixUp: Merged {len(mixed_boxes)} boxes")
print("✓ Color jitter: No box transformation needed")
print("="*70)

Best PracticesΒΆ

1. Model SelectionΒΆ

  • Real-time (>30 FPS): YOLOv8n, YOLOv8s

  • Balanced: YOLOv8m, DETR

  • High accuracy: YOLOv8x, Faster R-CNN

  • Edge devices: YOLOv8n with TensorRT

2. HyperparametersΒΆ

  • Confidence threshold: 0.25-0.5 (lower = more detections)

  • IoU threshold (NMS): 0.45-0.65 (lower = fewer duplicates)

  • Image size: 640×640 (YOLO), can go lower for speed
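The confidence and NMS IoU thresholds above plug directly into post-processing. A minimal greedy NMS sketch (`nms` is an illustrative helper, not the ultralytics API; defaults match the recommended ranges):

```python
import numpy as np

def nms(boxes, scores, conf_thresh=0.25, iou_thresh=0.45):
    """Greedy NMS: drop low-confidence boxes, then suppress overlaps."""
    keep_conf = scores >= conf_thresh
    boxes, scores = boxes[keep_conf], scores[keep_conf]

    order = scores.argsort()[::-1]  # highest confidence first
    keep = []
    while len(order) > 0:
        i = order[0]
        keep.append(i)
        # IoU of the winning box with all remaining candidates
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter + 1e-7)
        # Keep only candidates that do not overlap the winner too much
        order = order[1:][iou < iou_thresh]
    return boxes[keep], scores[keep]

boxes = np.array([[10., 10., 50., 50.], [12., 12., 52., 52.], [80., 80., 120., 120.]])
scores = np.array([0.9, 0.8, 0.7])
kept_boxes, kept_scores = nms(boxes, scores)
print(kept_boxes)  # the near-duplicate second box is suppressed
```

Lowering `iou_thresh` suppresses more aggressively (fewer duplicates, but adjacent objects risk being merged); lowering `conf_thresh` admits more detections.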

3. Training TipsΒΆ

  • Use data augmentation (mosaic, mixup)

  • Balance classes with weighted sampling

  • Train on high-resolution images

  • Freeze backbone initially, then fine-tune

4. OptimizationΒΆ

  • Convert to ONNX/TensorRT for faster inference

  • Batch processing when possible

  • Resize images to smaller sizes (320×320, 416×416)

  • Use FP16 precision on GPUs

Common Use CasesΒΆ

  • Autonomous vehicles: Detect pedestrians, cars, traffic signs

  • Surveillance: People counting, intrusion detection

  • Retail: Product detection, shelf monitoring

  • Manufacturing: Defect detection, quality control

  • Agriculture: Crop monitoring, pest detection

Key TakeawaysΒΆ

✅ Object detection = Classification + Localization

✅ YOLO-family models are the usual choice for real-time applications

✅ NMS removes duplicate detections

✅ IoU measures box overlap (used in NMS and evaluation)

✅ mAP (mean Average Precision) is the standard metric

✅ Balance speed vs. accuracy based on use case

Next: 03_clip_embeddings.ipynb - Multimodal understanding