Object Detection: YOLO, DETR & Beyond

Bounding box detection, segmentation, YOLO, DETR, and Grounding DINO for real-world object detection tasks.

# Install dependencies
# !pip install ultralytics opencv-python pillow matplotlib

Bounding Box Fundamentals

Object Detection: Advanced Theory and Architecture Evolution

1. Problem Formulation

Object detection combines classification and localization:

Input: Image \(I \in \mathbb{R}^{H \times W \times 3}\)

Output: Set of detections \(\mathcal{D} = \{(b_i, c_i, p_i)\}_{i=1}^N\) where:

  • \(b_i = (x, y, w, h)\): Bounding box coordinates

  • \(c_i \in \{1, \ldots, C\}\): Class label

  • \(p_i \in [0, 1]\): Confidence score

Challenges:

  1. Variable number of objects per image

  2. Different object scales

  3. Occlusion and truncation

  4. Real-time inference requirements

2. R-CNN Family: Two-Stage Detectors

A. R-CNN (2014) - Regions with CNN

Pipeline:

  1. Selective Search: Generate ~2000 region proposals

  2. Warp: Resize each region to a fixed size (e.g., 227×227)

  3. CNN: Extract features with AlexNet/VGG

  4. Classify: SVM classifier for each class

  5. Regress: Bounding box refinement

Loss Function:

\[\mathcal{L} = \mathcal{L}_{\text{cls}}(p, c) + \lambda [c \geq 1] \mathcal{L}_{\text{loc}}(t, g)\]

where:

  • \(\mathcal{L}_{\text{cls}}\): Classification loss (cross-entropy)

  • \(\mathcal{L}_{\text{loc}}\): Localization loss (smooth L1)

  • \(t\): Predicted box offsets

  • \(g\): Ground truth offsets

  • \([c \geq 1]\): Indicator (only regress for objects, not background)

Box Parameterization:

\[t_x = (x - x_a) / w_a, \quad t_y = (y - y_a) / h_a\]
\[t_w = \log(w / w_a), \quad t_h = \log(h / h_a)\]

where \((x_a, y_a, w_a, h_a)\) is the anchor box.
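In code, this parameterization and its inverse can be sketched as follows (the names `encode_box`/`decode_box` are illustrative, not from any particular library):

```python
import numpy as np

def encode_box(box, anchor):
    """Encode an (x, y, w, h) box as offsets relative to an anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)])

def decode_box(t, anchor):
    """Invert the encoding: recover (x, y, w, h) from offsets."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return np.array([tx * wa + xa, ty * ha + ya,
                     wa * np.exp(tw), ha * np.exp(th)])

anchor = (50.0, 50.0, 32.0, 32.0)
gt_box = (58.0, 46.0, 40.0, 24.0)

t = encode_box(gt_box, anchor)
recovered = decode_box(t, anchor)
print(t)          # small, roughly zero-centered regression targets
print(recovered)  # round-trips back to the ground-truth box
```

The log parameterization keeps width and height positive after decoding and makes the targets roughly scale-invariant.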

Limitations:

  • Slow: ~47s per image

  • Multi-stage training (CNN, SVM, bbox regressor)

  • Disk-heavy feature caching

B. Fast R-CNN (2015)

Key Innovation: Share conv computation across proposals

Architecture:

Image → CNN (entire image) → RoI Pooling → FC layers → {cls, bbox}
          ↓
      Feature Map
          ↑
    Region Proposals (Selective Search)

RoI Pooling:

For region \(r\) with size \(h_r \times w_r\), divide into \(H \times W\) grid:

\[\text{RoI-Pool}(r, F) = \max_{(i,j) \in \text{bin}(h,w)} F[i, j]\]

Output: Fixed \(H \times W\) feature map regardless of input size.

Multi-task Loss:

\[\mathcal{L} = \mathcal{L}_{\text{cls}}(p, u) + \lambda [u \geq 1] \mathcal{L}_{\text{loc}}(t^u, v)\]

where \(u\) is true class and \(v\) is true box.

Smooth L1 Loss:

\[\mathcal{L}_{\text{loc}}(t, v) = \sum_{i \in \{x,y,w,h\}} \text{smooth}_{L1}(t_i - v_i)\]
\[\begin{split}\text{smooth}_{L1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}\end{split}\]
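The piecewise definition translates directly to NumPy; a minimal sketch:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear in the tails."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)

errors = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(smooth_l1(errors))  # [2.5, 0.125, 0., 0.125, 2.5]
```

The linear tails bound the gradient at ±1, so large regression errors (e.g. from mislabeled boxes) cannot dominate training the way they do under L2.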

Advantages:

  • 9Γ— faster training, 140Γ— faster inference than R-CNN

  • End-to-end training

  • Higher mAP

Remaining Bottleneck: Selective Search (2s per image)

C. Faster R-CNN (2015)

Key Innovation: Region Proposal Network (RPN)

RPN Architecture:

For each position on feature map, use k anchors with different scales/ratios:

\[\text{Anchors}: \{(w_i, h_i)\}_{i=1}^k\]

Common: 3 scales × 3 ratios = 9 anchors per location

RPN Outputs:

  • Objectness score: \(p_{\text{obj}} \in [0, 1]\) (is object?)

  • Box refinement: \((t_x, t_y, t_w, t_h)\)

RPN Loss:

\[\mathcal{L}_{\text{RPN}} = \frac{1}{N_{\text{cls}}} \sum_i \mathcal{L}_{\text{cls}}(p_i, p_i^*) + \frac{\lambda}{N_{\text{reg}}} \sum_i p_i^* \mathcal{L}_{\text{reg}}(t_i, t_i^*)\]

where \(p_i^* = 1\) if anchor is positive (IoU > 0.7 with GT).
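The anchor-labeling rule can be sketched as follows (the 0.3 negative threshold follows the Faster R-CNN paper; anchors in between are ignored during training):

```python
def label_anchors(ious, pos_thresh=0.7, neg_thresh=0.3):
    """ious: max IoU with any ground-truth box, one value per anchor.
    Returns 1 (positive), 0 (negative/background), or -1 (ignored)."""
    labels = []
    for iou in ious:
        if iou > pos_thresh:
            labels.append(1)
        elif iou < neg_thresh:
            labels.append(0)
        else:
            labels.append(-1)   # ambiguous: excluded from the loss
    return labels

print(label_anchors([0.85, 0.5, 0.1]))  # [1, -1, 0]
```

In practice the anchor with the highest IoU for each ground-truth box is also marked positive, so every object gets at least one positive anchor.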

Training Strategy (4-step alternating):

  1. Train RPN

  2. Train Fast R-CNN with RPN proposals

  3. Fine-tune RPN with fixed detector

  4. Fine-tune Fast R-CNN with fixed RPN

Modern: Joint end-to-end training with shared conv layers.

Performance: 200ms per image (GPU), 73.2% mAP (PASCAL VOC)

3. YOLO Family: One-Stage Detectors

A. YOLOv1 (2016) - You Only Look Once

Philosophy: Treat detection as regression problem.

Architecture:

  1. Divide image into \(S \times S\) grid (e.g., 7×7)

  2. Each cell predicts:

    • \(B\) bounding boxes (e.g., 2)

    • Box confidence: \(P(\text{Object}) \times \text{IoU}\)

    • \(C\) class probabilities

Output Tensor: \(S \times S \times (B \cdot 5 + C)\)

For \(S=7, B=2, C=20\): \(7 \times 7 \times 30\)
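A quick NumPy sketch of how this output tensor is laid out and decoded (random values stand in for network output):

```python
import numpy as np

S, B, C = 7, 2, 20
pred = np.random.rand(S, S, B * 5 + C)   # raw network output: (7, 7, 30)

cell = pred[3, 4]                        # one grid cell
boxes = cell[:B * 5].reshape(B, 5)       # B boxes: (x, y, w, h, confidence)
class_probs = cell[B * 5:]               # C conditional class probabilities

# Class-specific confidence = box confidence * class probability
scores = boxes[:, 4:5] * class_probs[np.newaxis, :]   # shape (B, C)
print(pred.shape, boxes.shape, scores.shape)
```

Because the \(C\) class probabilities are shared per cell (not per box), each cell can only commit to one object class, which is the root of YOLOv1's one-object-per-cell limitation noted below.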

Loss Function (Multi-part):

\[\mathcal{L} = \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} [(x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2]\]
\[+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} [(\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2]\]
\[+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} (C_i - \hat{C}_i)^2\]
\[+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} (C_i - \hat{C}_i)^2\]
\[+ \sum_{i=0}^{S^2} \mathbb{1}_i^{\text{obj}} \sum_{c \in \text{classes}} (p_i(c) - \hat{p}_i(c))^2\]

Weight terms:

  • \(\lambda_{\text{coord}} = 5\): Increase localization loss importance

  • \(\lambda_{\text{noobj}} = 0.5\): Decrease background confidence loss

  • \(\sqrt{w}, \sqrt{h}\): Make loss more sensitive to small box errors

Advantages:

  • Extremely fast: 45 FPS (real-time)

  • Global context (sees entire image)

  • Fewer false positives on background

Limitations:

  • Struggles with small objects (grid limitation)

  • Each cell can only detect one object

  • Lower mAP than two-stage methods

B. YOLOv2 / YOLO9000 (2016)

Improvements:

  1. Batch Normalization: After every conv layer (+2% mAP)

  2. High-Resolution Classifier: Pre-train on 448×448 instead of 224×224

  3. Anchor Boxes: Like Faster R-CNN (use k-means on dataset to find anchors)

  4. Multi-Scale Training: Train on {320, 352, …, 608} randomly

  5. Passthrough Layer: Concat high-res features for small objects

Dimension Priors:

Run k-means (k=5) on training boxes with IoU distance:

\[d(\text{box}, \text{centroid}) = 1 - \text{IoU}(\text{box}, \text{centroid})\]

Learns dataset-specific anchor shapes (e.g., tall for people, wide for cars).

C. YOLOv3 (2018)

Multi-Scale Predictions:

Detect at 3 scales using Feature Pyramid Network (FPN):

  • Large objects: 13Γ—13 grid (stride 32)

  • Medium objects: 26Γ—26 grid (stride 16)

  • Small objects: 52Γ—52 grid (stride 8)

9 anchors total: 3 per scale

Darknet-53 Backbone:

53 conv layers with residual connections (similar to ResNet).

Logistic Regression for Objectness:

Replace softmax with sigmoid:

\[P(\text{obj}) = \sigma(t_o)\]

Allows one box to belong to multiple classes (e.g., "Woman" + "Person").
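A small numeric illustration of the difference (the logits are made up):

```python
import numpy as np

logits = np.array([2.0, 1.5, -3.0])   # e.g. scores for "person", "woman", "car"

softmax = np.exp(logits) / np.exp(logits).sum()   # classes compete, sums to 1
sigmoid = 1 / (1 + np.exp(-logits))               # independent per-class scores

print(softmax.round(3))  # probability mass split between the two plausible classes
print(sigmoid.round(3))  # "person" and "woman" can both score above 0.5
```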

Performance: 51 ms (~20 FPS), 57.9% AP50 (COCO, 608×608 input)

D. YOLOv4 (2020) - Bag of Freebies/Specials

Bag of Freebies (no inference cost):

  • Mosaic data augmentation (4 images β†’ 1)

  • Self-adversarial training

  • CIoU loss (Complete IoU)

  • Label smoothing

Bag of Specials (slight cost):

  • Mish activation: \(\text{Mish}(x) = x \cdot \tanh(\ln(1 + e^x))\)

  • CSPDarknet53 backbone (Cross-Stage Partial)

  • SPP (Spatial Pyramid Pooling)

  • PAN (Path Aggregation Network)

CIoU Loss (Complete IoU):

\[\mathcal{L}_{\text{CIoU}} = 1 - \text{IoU} + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v\]

where:

  • \(\rho\): Euclidean distance between box centers

  • \(c\): Diagonal of smallest enclosing box

  • \(v = \frac{4}{\pi^2} (\arctan \frac{w^{gt}}{h^{gt}} - \arctan \frac{w}{h})^2\): Aspect ratio consistency

Performance: 65 FPS, 43.5% AP (COCO)

E. YOLOv5-v8 (Modern)

YOLOv8 Architecture (2023):

Input (640×640)
    ↓
CSPDarknet Backbone (feature extraction)
    ↓
C2f modules (faster C3)
    ↓
PAN-FPN Neck (multi-scale fusion)
    ↓
Decoupled Head (separate cls/box branches)
    ↓
{bbox, objectness, class} predictions

Anchor-Free Detection:

Direct regression of box coordinates from grid cells (no predefined anchors).
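A minimal sketch of this style of decoding, in the spirit of FCOS/YOLOv8 heads that regress distances to the four box edges in stride units (details such as distribution focal loss are omitted, and the function name is illustrative):

```python
def decode_anchor_free(cx, cy, ltrb, stride):
    """Turn predicted (left, top, right, bottom) edge distances, in
    stride units, into an (x_min, y_min, x_max, y_max) box."""
    l, t, r, b = ltrb
    return (cx - l * stride, cy - t * stride,
            cx + r * stride, cy + b * stride)

# A cell centered at (100, 60) on the stride-8 feature map
box = decode_anchor_free(100.0, 60.0, (2.0, 1.0, 3.0, 2.5), stride=8)
print(box)  # (84.0, 52.0, 124.0, 80.0)
```

Removing anchors eliminates the anchor-tuning hyperparameters (scales, ratios, k-means priors) entirely.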

TAL (Task-Aligned Learning):

\[t = s^\alpha \cdot u^\beta\]

where:

  • \(s\): Classification score

  • \(u\): IoU

  • \(\alpha, \beta\): Hyperparameters

Aligns classification and localization quality.
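A sketch of the metric (the defaults \(\alpha=1, \beta=6\) follow the TOOD paper that introduced task-aligned learning):

```python
def task_aligned_metric(cls_score, iou, alpha=1.0, beta=6.0):
    """Task-aligned metric t = s^alpha * u^beta."""
    return cls_score ** alpha * iou ** beta

# A confident but poorly localized prediction vs. a well-aligned one
t_misaligned = task_aligned_metric(0.9, 0.5)
t_aligned = task_aligned_metric(0.7, 0.9)
print(t_misaligned, t_aligned)  # the aligned prediction wins the assignment
```

With \(\beta > \alpha\), a high classification score cannot compensate for poor localization, so label assignment favors predictions that are good at both tasks.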

4. Loss Functions Evolution

Loss      | Formula                                               | Focus
L1        | \(|x - \hat{x}|\)                                     | Simple, not scale-invariant
Smooth L1 | \(0.5x^2\) if \(|x| < 1\), else \(|x| - 0.5\)         | Less sensitive to outliers
IoU       | \(1 - \frac{\text{Intersection}}{\text{Union}}\)      | Invariant to scale
GIoU      | \(\text{IoU} - \frac{|C \setminus (A \cup B)|}{|C|}\) | Handles non-overlapping boxes
DIoU      | \(\text{IoU} - \frac{d^2}{c^2}\)                      | Minimizes center distance
CIoU      | \(\text{DIoU} - \alpha v\)                            | Aspect ratio consistency

Why IoU-based losses?

Traditional L1/L2 on \((x, y, w, h)\) don’t directly optimize detection metric (IoU).

GIoU (Generalized IoU) provides a gradient even when boxes don’t overlap:

\[\text{GIoU} = \text{IoU} - \frac{|C| - |A \cup B|}{|C|}\]

where \(C\) is smallest enclosing box.

5. Evaluation Metrics

Precision & Recall:

\[\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}\]

Average Precision (AP):

Area under Precision-Recall curve:

\[\text{AP} = \int_0^1 p(r) dr\]

mAP (mean AP): Average AP across all classes

AP@IoU=0.5 (AP50): Detection correct if IoU ≥ 0.5

AP@[0.5:0.95] (COCO metric): Average over IoU thresholds {0.5, 0.55, …, 0.95}

Why AP, not accuracy?

  • Handles class imbalance

  • Captures both precision and recall

  • Threshold-independent

6. Modern Techniques

A. Feature Pyramid Networks (FPN):

Combine features from multiple scales:

Bottom-up: C2 → C3 → C4 → C5
             ↓    ↓    ↓    ↓
Top-down:   P2 ← P3 ← P4 ← P5

Lateral connections with 1×1 conv for dimension matching.
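A minimal NumPy sketch of the top-down pathway, with random tensors standing in for backbone features and a random matrix standing in for each learned 1×1 conv:

```python
import numpy as np

# Backbone features at strides 8/16/32 (channel counts are illustrative)
c3 = np.random.rand(256, 80, 80)
c4 = np.random.rand(512, 40, 40)
c5 = np.random.rand(1024, 20, 20)

def conv1x1(x, out_ch):
    """Stand-in for a learned 1x1 conv: mixes channels, keeps H x W."""
    w = np.random.rand(out_ch, x.shape[0]) / x.shape[0]
    return np.tensordot(w, x, axes=1)

def upsample2x(x):
    """Nearest-neighbour 2x upsampling."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

p5 = conv1x1(c5, 256)
p4 = conv1x1(c4, 256) + upsample2x(p5)   # lateral + top-down merge
p3 = conv1x1(c3, 256) + upsample2x(p4)
print(p5.shape, p4.shape, p3.shape)      # all 256 channels, strides 32/16/8
```

Each pyramid level thus carries both high-resolution detail (from the lateral path) and deep semantics (from the top-down path).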

B. Focal Loss (RetinaNet):

Address class imbalance:

\[\text{FL}(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)\]

where \(p_t = p\) if \(y=1\), else \(1-p\).

Down-weights easy examples (high \(p_t\)), focuses on hard negatives.
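A sketch of the loss (\(\alpha=0.25, \gamma=2\) are the RetinaNet defaults), comparing an easy and a hard negative:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss on predicted probabilities p with labels y."""
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# Two background predictions (y=0): an easy negative (p=0.1)
# and a hard negative (p=0.9)
losses = focal_loss(np.array([0.1, 0.9]), np.array([0, 0]))
print(losses)  # the hard negative dominates the loss
```

The \((1 - p_t)^\gamma\) factor shrinks the easy example's contribution by orders of magnitude, which is what lets one-stage detectors train on all ~100k anchors without background swamping the loss.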

C. Deformable Convolutions:

Learn spatial sampling offsets:

\[y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)\]

where \(\Delta p_n\) are learned offsets. Adapts to object deformation.

D. Attention Mechanisms:

  • Spatial attention: Where to look (e.g., CBAM)

  • Channel attention: What features matter (e.g., SE-Net)

7. Comparison: Two-Stage vs One-Stage

Aspect        | Two-Stage (Faster R-CNN)  | One-Stage (YOLO/SSD)
Speed         | Slower (region proposals) | Faster (direct regression)
Accuracy      | Higher mAP                | Lower mAP (improving)
Small Objects | Better (RoI pooling)      | Challenging
Complexity    | More complex              | Simpler
Use Case      | High-accuracy needed      | Real-time critical

Modern Trend: The gap is closing; YOLOv8 approaches Faster R-CNN accuracy while being much faster.

import numpy as np
from typing import List, Tuple
from dataclasses import dataclass

@dataclass
class BoundingBox:
    """Bounding box representation"""
    x: float  # Top-left x
    y: float  # Top-left y
    width: float
    height: float
    confidence: float
    class_id: int
    class_name: str
    
    @property
    def x_min(self) -> float:
        return self.x
    
    @property
    def y_min(self) -> float:
        return self.y
    
    @property
    def x_max(self) -> float:
        return self.x + self.width
    
    @property
    def y_max(self) -> float:
        return self.y + self.height
    
    @property
    def area(self) -> float:
        return self.width * self.height
    
    def to_xyxy(self) -> Tuple[float, float, float, float]:
        """Convert to [x_min, y_min, x_max, y_max] format"""
        return (self.x_min, self.y_min, self.x_max, self.y_max)
    
    def to_xywh(self) -> Tuple[float, float, float, float]:
        """Convert to [x, y, width, height] format"""
        return (self.x, self.y, self.width, self.height)

def compute_iou(box1: BoundingBox, box2: BoundingBox) -> float:
    """Compute Intersection over Union (IoU)"""
    # Intersection coordinates
    x_min = max(box1.x_min, box2.x_min)
    y_min = max(box1.y_min, box2.y_min)
    x_max = min(box1.x_max, box2.x_max)
    y_max = min(box1.y_max, box2.y_max)
    
    # Intersection area
    if x_max < x_min or y_max < y_min:
        return 0.0
    
    intersection = (x_max - x_min) * (y_max - y_min)
    
    # Union area
    union = box1.area + box2.area - intersection
    
    return intersection / union if union > 0 else 0.0

# Test IoU
box1 = BoundingBox(10, 10, 50, 50, 0.9, 0, "person")
box2 = BoundingBox(30, 30, 50, 50, 0.8, 0, "person")
box3 = BoundingBox(100, 100, 50, 50, 0.7, 1, "car")

print(f"IoU(box1, box2) = {compute_iou(box1, box2):.3f}  # Overlapping")
print(f"IoU(box1, box3) = {compute_iou(box1, box3):.3f}  # Non-overlapping")

Non-Maximum Suppression (NMS)

def non_max_suppression(boxes: List[BoundingBox], iou_threshold: float = 0.5) -> List[BoundingBox]:
    """Apply Non-Maximum Suppression to remove duplicate detections"""
    if not boxes:
        return []
    
    # Sort by confidence (highest first)
    boxes = sorted(boxes, key=lambda b: b.confidence, reverse=True)
    
    keep = []
    
    while boxes:
        # Take box with highest confidence
        best_box = boxes.pop(0)
        keep.append(best_box)
        
        # Remove boxes with high IoU
        boxes = [
            box for box in boxes
            if compute_iou(best_box, box) < iou_threshold
            or box.class_id != best_box.class_id  # Different class
        ]
    
    return keep

# Test NMS
detections = [
    BoundingBox(10, 10, 50, 50, 0.95, 0, "person"),
    BoundingBox(12, 12, 50, 50, 0.90, 0, "person"),  # Similar to first
    BoundingBox(15, 15, 50, 50, 0.85, 0, "person"),  # Similar to first
    BoundingBox(100, 100, 50, 50, 0.92, 1, "car"),
]

filtered = non_max_suppression(detections, iou_threshold=0.5)

print(f"Before NMS: {len(detections)} boxes")
print(f"After NMS:  {len(filtered)} boxes")
print("\nKept boxes:")
for box in filtered:
    print(f"  {box.class_name} @ ({box.x:.0f}, {box.y:.0f}): {box.confidence:.2f}")
# Advanced NMS Variants and IoU Implementations

import numpy as np
from typing import List, Tuple
import matplotlib.pyplot as plt

# ============================================================
# 1. IoU Variants Implementation
# ============================================================

def compute_giou(box1: BoundingBox, box2: BoundingBox) -> float:
    r"""
    Generalized IoU (GIoU) - handles non-overlapping boxes.
    
    GIoU = IoU - |C \ (A ∪ B)| / |C|
    
    where C is the smallest enclosing box.
    """
    iou = compute_iou(box1, box2)
    
    # Smallest enclosing box
    x_min = min(box1.x_min, box2.x_min)
    y_min = min(box1.y_min, box2.y_min)
    x_max = max(box1.x_max, box2.x_max)
    y_max = max(box1.y_max, box2.y_max)
    c_area = (x_max - x_min) * (y_max - y_min)
    
    # Intersection and union of the two boxes
    x_min_i = max(box1.x_min, box2.x_min)
    y_min_i = max(box1.y_min, box2.y_min)
    x_max_i = min(box1.x_max, box2.x_max)
    y_max_i = min(box1.y_max, box2.y_max)
    intersection = max(0, x_max_i - x_min_i) * max(0, y_max_i - y_min_i)
    union = box1.area + box2.area - intersection
    
    giou = iou - (c_area - union) / c_area if c_area > 0 else iou
    
    return giou

def compute_diou(box1: BoundingBox, box2: BoundingBox) -> float:
    """
    Distance IoU (DIoU) - considers center distance.
    
    DIoU = IoU - ρ²(b, b_gt) / c²
    
    where ρ is Euclidean distance between centers,
    c is diagonal of smallest enclosing box.
    """
    iou = compute_iou(box1, box2)
    
    # Center points
    cx1 = box1.x + box1.width / 2
    cy1 = box1.y + box1.height / 2
    cx2 = box2.x + box2.width / 2
    cy2 = box2.y + box2.height / 2
    
    # Center distance
    center_dist_sq = (cx1 - cx2) ** 2 + (cy1 - cy2) ** 2
    
    # Smallest enclosing box diagonal
    x_min = min(box1.x_min, box2.x_min)
    y_min = min(box1.y_min, box2.y_min)
    x_max = max(box1.x_max, box2.x_max)
    y_max = max(box1.y_max, box2.y_max)
    
    c_diag_sq = (x_max - x_min) ** 2 + (y_max - y_min) ** 2
    
    diou = iou - center_dist_sq / c_diag_sq if c_diag_sq > 0 else iou
    
    return diou

def compute_ciou(box1: BoundingBox, box2: BoundingBox) -> float:
    """
    Complete IoU (CIoU) - includes aspect ratio consistency.
    
    CIoU = DIoU - α·v
    
    where v measures aspect ratio consistency,
    α is a trade-off parameter.
    """
    diou = compute_diou(box1, box2)
    
    # Aspect ratio term
    v = (4 / (np.pi ** 2)) * (
        np.arctan(box1.width / (box1.height + 1e-7)) -
        np.arctan(box2.width / (box2.height + 1e-7))
    ) ** 2
    
    # Trade-off parameter
    iou = compute_iou(box1, box2)
    alpha = v / (1 - iou + v + 1e-7)
    
    ciou = diou - alpha * v
    
    return ciou

# ============================================================
# 2. Advanced NMS Variants
# ============================================================

def soft_nms(boxes: List[BoundingBox], 
             sigma: float = 0.5,
             score_threshold: float = 0.001) -> List[BoundingBox]:
    """
    Soft-NMS: Decay scores instead of hard suppression.
    
    Instead of removing boxes, decay their confidence:
    
    s_i = s_i · exp(-IoU²(M, b_i) / σ)
    
    Better for occluded objects.
    """
    if not boxes:
        return []
    
    # Create mutable copy with scores
    boxes_with_scores = [(box, box.confidence) for box in boxes]
    keep = []
    
    while boxes_with_scores:
        # Find box with max score
        max_idx = max(range(len(boxes_with_scores)), 
                     key=lambda i: boxes_with_scores[i][1])
        best_box, best_score = boxes_with_scores.pop(max_idx)
        
        if best_score < score_threshold:
            break
        
        keep.append(best_box)
        
        # Decay scores of remaining boxes
        updated = []
        for box, score in boxes_with_scores:
            if box.class_id == best_box.class_id:
                iou = compute_iou(best_box, box)
                # Gaussian decay
                new_score = score * np.exp(-(iou ** 2) / sigma)
                updated.append((box, new_score))
            else:
                updated.append((box, score))
        
        boxes_with_scores = updated
    
    return keep

def nms_with_giou(boxes: List[BoundingBox], 
                  iou_threshold: float = 0.5) -> List[BoundingBox]:
    """NMS using GIoU instead of IoU for better overlap handling."""
    if not boxes:
        return []
    
    boxes = sorted(boxes, key=lambda b: b.confidence, reverse=True)
    keep = []
    
    while boxes:
        best_box = boxes.pop(0)
        keep.append(best_box)
        
        boxes = [
            box for box in boxes
            if compute_giou(best_box, box) < iou_threshold
            or box.class_id != best_box.class_id
        ]
    
    return keep

# ============================================================
# 3. Visualization of IoU Variants
# ============================================================

# Create test boxes
box_a = BoundingBox(20, 20, 60, 60, 0.9, 0, "obj")
box_b = BoundingBox(50, 50, 60, 60, 0.8, 0, "obj")  # Overlapping
box_c = BoundingBox(100, 20, 40, 80, 0.85, 0, "obj")  # Non-overlapping

boxes_to_test = [
    ("Overlapping", box_a, box_b),
    ("Non-overlapping", box_a, box_c),
]

print("="*70)
print("IoU VARIANT COMPARISON")
print("="*70)

for scenario, b1, b2 in boxes_to_test:
    iou = compute_iou(b1, b2)
    giou = compute_giou(b1, b2)
    diou = compute_diou(b1, b2)
    ciou = compute_ciou(b1, b2)
    
    print(f"\n{scenario}:")
    print(f"  IoU:   {iou:7.4f}")
    print(f"  GIoU:  {giou:7.4f} (informative even when boxes don't overlap)")
    print(f"  DIoU:  {diou:7.4f} (considers center distance)")
    print(f"  CIoU:  {ciou:7.4f} (aspect ratio consistency)")

print("\n" + "="*70)
print("KEY INSIGHTS")
print("="*70)
print("• IoU:  Classic metric, but gradient vanishes when boxes don't overlap")
print("• GIoU: Provides gradient even for non-overlapping boxes")
print("• DIoU: Faster convergence by minimizing center distance")
print("• CIoU: Best for training - matches aspect ratio + position + overlap")
print("="*70)

# ============================================================
# 4. NMS Variants Comparison
# ============================================================

# Create clustered detections (simulating multiple detections of same object)
detections_clustered = [
    BoundingBox(50, 50, 100, 100, 0.95, 0, "person"),
    BoundingBox(52, 52, 102, 98, 0.93, 0, "person"),
    BoundingBox(48, 51, 98, 102, 0.91, 0, "person"),
    BoundingBox(55, 48, 95, 105, 0.88, 0, "person"),
    BoundingBox(200, 200, 80, 80, 0.90, 1, "car"),
    BoundingBox(202, 198, 82, 82, 0.87, 1, "car"),
]

print("\n" + "="*70)
print("NMS VARIANT COMPARISON")
print("="*70)
print(f"Original detections: {len(detections_clustered)}")

# Standard NMS
standard_nms = non_max_suppression(detections_clustered.copy(), iou_threshold=0.5)
print(f"\nStandard NMS:  {len(standard_nms)} boxes kept")

# Soft-NMS
soft_nms_result = soft_nms(detections_clustered.copy(), sigma=0.5, score_threshold=0.3)
print(f"Soft-NMS:      {len(soft_nms_result)} boxes kept (gentler suppression)")

# GIoU-NMS
giou_nms_result = nms_with_giou(detections_clustered.copy(), iou_threshold=0.5)
print(f"GIoU-NMS:      {len(giou_nms_result)} boxes kept (better for difficult cases)")

print("\n" + "="*70)
print("RECOMMENDATIONS")
print("="*70)
print("• Standard NMS:  Fast, works well for most cases")
print("• Soft-NMS:      Better for occluded/crowded scenes (keeps more boxes)")
print("• GIoU-NMS:      More robust to box orientation/aspect ratio")
print("="*70)

YOLO Object Detector

# YOLO with Ultralytics (requires installation)
'''
from ultralytics import YOLO
import cv2

# Load YOLOv8 model
model = YOLO('yolov8n.pt')  # nano model (fastest)
# Other options: yolov8s, yolov8m, yolov8l, yolov8x

# Detect objects in image
results = model('path/to/image.jpg')

# Process results
for result in results:
    boxes = result.boxes  # Bounding boxes
    for box in boxes:
        x1, y1, x2, y2 = box.xyxy[0]
        conf = box.conf[0]
        cls = box.cls[0]
        print(f"Detected {model.names[int(cls)]} at ({x1:.0f}, {y1:.0f}) with confidence {conf:.2f}")

# Real-time detection from webcam
cap = cv2.VideoCapture(0)
while True:
    ret, frame = cap.read()
    if not ret:
        break
    
    results = model(frame, stream=True)
    
    for result in results:
        annotated = result.plot()  # Draw boxes
        cv2.imshow('YOLO', annotated)
    
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
'''

print("YOLOv8 detection example (commented - requires ultralytics)")
print("\nYOLO Models:")
print("  yolov8n - Nano (fastest, least accurate)")
print("  yolov8s - Small")
print("  yolov8m - Medium")
print("  yolov8l - Large")
print("  yolov8x - Extra Large (slowest, most accurate)")

Custom Object Detector

class ObjectDetector:
    """Simple object detection wrapper"""
    
    def __init__(self, conf_threshold: float = 0.5, iou_threshold: float = 0.5):
        self.conf_threshold = conf_threshold
        self.iou_threshold = iou_threshold
        self.class_names = self._load_class_names()
    
    def _load_class_names(self) -> List[str]:
        """Load COCO class names"""
        # COCO 80 classes (subset shown)
        return [
            'person', 'bicycle', 'car', 'motorcycle', 'airplane',
            'bus', 'train', 'truck', 'boat', 'traffic light',
            'cat', 'dog', 'horse', 'sheep', 'cow'
            # ... 65 more classes
        ]
    
    def detect(self, image: np.ndarray) -> List[BoundingBox]:
        """Detect objects in image"""
        # Simulate detections
        raw_detections = self._simulate_detections()
        
        # Filter by confidence
        filtered = [d for d in raw_detections if d.confidence >= self.conf_threshold]
        
        # Apply NMS
        final_detections = non_max_suppression(filtered, self.iou_threshold)
        
        return final_detections
    
    def _simulate_detections(self) -> List[BoundingBox]:
        """Simulate raw model output"""
        # In production: actual model inference
        return [
            BoundingBox(50, 50, 100, 150, 0.95, 0, "person"),
            BoundingBox(52, 52, 100, 150, 0.92, 0, "person"),  # Duplicate
            BoundingBox(200, 100, 80, 60, 0.88, 2, "car"),
            BoundingBox(150, 300, 50, 50, 0.76, 10, "cat"),
            BoundingBox(400, 200, 120, 100, 0.42, 3, "motorcycle"),  # Low conf
        ]
    
    def visualize(self, image: np.ndarray, boxes: List[BoundingBox]) -> np.ndarray:
        """Draw boxes on image"""
        # In production: use cv2.rectangle() to draw boxes
        print(f"\nWould draw {len(boxes)} boxes on image:")
        for box in boxes:
            print(f"  {box.class_name}: ({box.x:.0f}, {box.y:.0f}, {box.width:.0f}, {box.height:.0f}) - {box.confidence:.2f}")
        return image

# Test detector
detector = ObjectDetector(conf_threshold=0.5, iou_threshold=0.5)

# Dummy image
image = np.zeros((640, 640, 3), dtype=np.uint8)
detections = detector.detect(image)

print(f"\nDetected {len(detections)} objects:")
for det in detections:
    print(f"  {det.class_name}: {det.confidence:.2%} at ({det.x:.0f}, {det.y:.0f})")

# Visualize
annotated = detector.visualize(image, detections)

Evaluation Metrics

def compute_precision_recall(predictions: List[BoundingBox], 
                              ground_truth: List[BoundingBox],
                              iou_threshold: float = 0.5) -> Tuple[float, float]:
    """Compute precision and recall"""
    true_positives = 0
    matched_gt = set()
    
    for pred in predictions:
        best_iou = 0
        best_gt_idx = -1
        
        for idx, gt in enumerate(ground_truth):
            if gt.class_id != pred.class_id:
                continue
            
            iou = compute_iou(pred, gt)
            if iou > best_iou:
                best_iou = iou
                best_gt_idx = idx
        
        if best_iou >= iou_threshold and best_gt_idx not in matched_gt:
            true_positives += 1
            matched_gt.add(best_gt_idx)
    
    false_positives = len(predictions) - true_positives
    false_negatives = len(ground_truth) - len(matched_gt)
    
    precision = true_positives / (true_positives + false_positives) if predictions else 0
    recall = true_positives / (true_positives + false_negatives) if ground_truth else 0
    
    return precision, recall

def compute_ap(precisions: List[float], recalls: List[float]) -> float:
    """Compute Average Precision (AP)"""
    # Sort by recall
    sorted_indices = np.argsort(recalls)
    recalls = np.array(recalls)[sorted_indices]
    precisions = np.array(precisions)[sorted_indices]
    
    # Compute AP using 11-point interpolation
    ap = 0
    for t in np.arange(0, 1.1, 0.1):
        if np.sum(recalls >= t) == 0:
            p = 0
        else:
            p = np.max(precisions[recalls >= t])
        ap += p / 11
    
    return ap

# Test metrics
pred_boxes = [
    BoundingBox(10, 10, 50, 50, 0.9, 0, "person"),
    BoundingBox(100, 100, 50, 50, 0.8, 1, "car"),
]

gt_boxes = [
    BoundingBox(12, 12, 50, 50, 1.0, 0, "person"),
    BoundingBox(102, 102, 50, 50, 1.0, 1, "car"),
    BoundingBox(200, 200, 50, 50, 1.0, 2, "dog"),  # Missed
]

precision, recall = compute_precision_recall(pred_boxes, gt_boxes)
print(f"Precision: {precision:.2%}")
print(f"Recall: {recall:.2%}")
print(f"F1 Score: {2 * precision * recall / (precision + recall):.2%}")

Production Deployment

import time
from collections import deque

class ProductionDetector:
    """Production-ready object detector"""
    
    def __init__(self, model_name: str = "yolov8n"):
        self.model_name = model_name
        self.detector = ObjectDetector()
        self.stats = {
            "total_images": 0,
            "total_detections": 0,
            "avg_inference_time": 0,
            "fps_history": deque(maxlen=30)
        }
    
    def detect_with_timing(self, image: np.ndarray) -> Tuple[List[BoundingBox], float]:
        """Detect with performance tracking"""
        start = time.time()
        detections = self.detector.detect(image)
        inference_time = time.time() - start
        
        # Update stats
        self.stats["total_images"] += 1
        self.stats["total_detections"] += len(detections)
        self.stats["fps_history"].append(1 / inference_time if inference_time > 0 else 0)
        self.stats["avg_inference_time"] = (
            (self.stats["avg_inference_time"] * (self.stats["total_images"] - 1) + inference_time)
            / self.stats["total_images"]
        )
        
        return detections, inference_time
    
    def get_performance_stats(self) -> dict:
        """Get performance statistics"""
        avg_fps = np.mean(self.stats["fps_history"]) if self.stats["fps_history"] else 0
        
        return {
            "total_images": self.stats["total_images"],
            "total_detections": self.stats["total_detections"],
            "avg_detections_per_image": (
                self.stats["total_detections"] / max(self.stats["total_images"], 1)
            ),
            "avg_inference_time_ms": self.stats["avg_inference_time"] * 1000,
            "avg_fps": avg_fps
        }

# Test production detector
prod_detector = ProductionDetector()

# Process images
for i in range(10):
    image = np.zeros((640, 640, 3), dtype=np.uint8)
    detections, elapsed = prod_detector.detect_with_timing(image)  # elapsed is in seconds

# Print stats
stats = prod_detector.get_performance_stats()
print("\nPerformance Statistics:")
print(f"  Total Images: {stats['total_images']}")
print(f"  Total Detections: {stats['total_detections']}")
print(f"  Avg Detections/Image: {stats['avg_detections_per_image']:.1f}")
print(f"  Avg Inference Time: {stats['avg_inference_time_ms']:.2f}ms")
print(f"  Avg FPS: {stats['avg_fps']:.1f}")
# Anchor Generation and Assignment

import numpy as np
import matplotlib.pyplot as plt
from typing import List, Tuple

# ============================================================
# 1. Anchor Box Generation (Faster R-CNN style)
# ============================================================

def generate_anchors(base_size: int = 16,
                     ratios: List[float] = [0.5, 1.0, 2.0],
                     scales: List[int] = [8, 16, 32]) -> np.ndarray:
    """
    Generate anchor boxes with different scales and aspect ratios.
    
    Parameters:
    -----------
    base_size : Base anchor size
    ratios : Aspect ratios (h/w)
    scales : Scales relative to base_size
    
    Returns:
    --------
    anchors : (k, 4) array of anchors in (x_min, y_min, x_max, y_max) format
    """
    anchors = []
    
    for scale in scales:
        for ratio in ratios:
            # Compute width and height
            h = base_size * scale * np.sqrt(ratio)
            w = base_size * scale / np.sqrt(ratio)
            
            # Center at (0, 0)
            x_min = -w / 2
            y_min = -h / 2
            x_max = w / 2
            y_max = h / 2
            
            anchors.append([x_min, y_min, x_max, y_max])
    
    return np.array(anchors)

# Generate default anchors
anchors = generate_anchors(base_size=16, ratios=[0.5, 1.0, 2.0], scales=[8, 16, 32])

print("="*70)
print("ANCHOR BOX GENERATION")
print("="*70)
print(f"Generated {len(anchors)} anchors:")
print("  3 aspect ratios × 3 scales = 9 anchors per position")
print("\nAnchor dimensions (width × height):")
for idx, anchor in enumerate(anchors):
    w = anchor[2] - anchor[0]
    h = anchor[3] - anchor[1]
    ratio = h / w
    print(f"  Anchor {idx+1}: {w:6.1f} × {h:6.1f}  (ratio: {ratio:.2f})")

# ============================================================
# 2. K-Means Anchor Clustering (YOLO style)
# ============================================================

def kmeans_anchors(boxes: np.ndarray, k: int = 9, max_iters: int = 100) -> np.ndarray:
    """
    Run k-means clustering on box dimensions using IoU distance.
    
    Parameters:
    -----------
    boxes : (n, 2) array of (width, height)
    k : Number of clusters
    
    Returns:
    --------
    anchors : (k, 2) array of anchor (width, height)
    """
    n = boxes.shape[0]
    
    # Random initialization
    np.random.seed(42)
    anchors = boxes[np.random.choice(n, k, replace=False)]
    
    def iou_wh(wh1, wh2):
        """IoU for width-height pairs (assuming aligned at center)"""
        w1, h1 = wh1
        w2, h2 = wh2
        
        inter = np.minimum(w1, w2) * np.minimum(h1, h2)
        union = w1 * h1 + w2 * h2 - inter
        
        return inter / (union + 1e-7)
    
    for iteration in range(max_iters):
        # Assign boxes to nearest anchor
        distances = np.zeros((n, k))
        for i, box in enumerate(boxes):
            for j, anchor in enumerate(anchors):
                distances[i, j] = 1 - iou_wh(box, anchor)  # Distance = 1 - IoU
        
        assignments = np.argmin(distances, axis=1)
        
        # Update anchors
        new_anchors = np.zeros((k, 2))
        for j in range(k):
            cluster_boxes = boxes[assignments == j]
            if len(cluster_boxes) > 0:
                new_anchors[j] = cluster_boxes.mean(axis=0)
            else:
                new_anchors[j] = anchors[j]  # Keep old if no assignment
        
        # Check convergence
        if np.allclose(anchors, new_anchors):
            break
        
        anchors = new_anchors
    
    # Sort by area
    areas = anchors[:, 0] * anchors[:, 1]
    sorted_indices = np.argsort(areas)
    anchors = anchors[sorted_indices]
    
    return anchors

# Simulate COCO-like box distribution
np.random.seed(42)
n_boxes = 1000

# Generate realistic box distributions
# Small objects (people, animals): 30-100 pixels
small_boxes = np.random.uniform(30, 100, (400, 2))

# Medium objects (cars, furniture): 100-250 pixels  
medium_boxes = np.random.uniform(100, 250, (400, 2))

# Large objects (buildings, scenes): 250-500 pixels
large_boxes = np.random.uniform(250, 500, (200, 2))

all_boxes = np.vstack([small_boxes, medium_boxes, large_boxes])

# Run k-means
learned_anchors = kmeans_anchors(all_boxes, k=9, max_iters=50)

print("\n" + "="*70)
print("K-MEANS ANCHOR LEARNING (YOLO-style)")
print("="*70)
print(f"Learned {len(learned_anchors)} anchors from {len(all_boxes)} boxes:")
print("\nAnchor dimensions (width × height):")
for idx, (w, h) in enumerate(learned_anchors):
    ratio = h / w
    area = w * h
    print(f"  Anchor {idx+1}: {w:6.1f} × {h:6.1f}  (ratio: {ratio:.2f}, area: {area:8.0f})")

# ============================================================
# 3. Anchor Assignment Strategy
# ============================================================

def assign_anchors_to_gt(gt_boxes: np.ndarray,
                         anchors: np.ndarray,
                         pos_iou_thresh: float = 0.7,
                         neg_iou_thresh: float = 0.3) -> Tuple[np.ndarray, np.ndarray]:
    """
    Assign anchors to ground truth boxes (Faster R-CNN strategy).
    
    Parameters:
    -----------
    gt_boxes : (m, 4) ground truth boxes [x_min, y_min, x_max, y_max]
    anchors : (n, 4) anchor boxes
    
    Returns:
    --------
    labels : (n,) array {-1: ignore, 0: background, 1: object}
    targets : (n, 4) box regression targets
    """
    n_anchors = len(anchors)
    n_gt = len(gt_boxes)
    
    labels = -np.ones(n_anchors, dtype=np.int32)  # -1 = ignore
    targets = np.zeros((n_anchors, 4))
    
    if n_gt == 0:
        labels[:] = 0  # All background
        return labels, targets
    
    # Compute IoU matrix
    ious = np.zeros((n_anchors, n_gt))
    for i, anchor in enumerate(anchors):
        for j, gt in enumerate(gt_boxes):
            # Compute IoU (simplified for demonstration)
            x_min = max(anchor[0], gt[0])
            y_min = max(anchor[1], gt[1])
            x_max = min(anchor[2], gt[2])
            y_max = min(anchor[3], gt[3])
            
            inter = max(0, x_max - x_min) * max(0, y_max - y_min)
            
            area_a = (anchor[2] - anchor[0]) * (anchor[3] - anchor[1])
            area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
            union = area_a + area_g - inter
            
            ious[i, j] = inter / (union + 1e-7)
    
    # Assign labels
    max_iou_per_anchor = ious.max(axis=1)
    max_gt_per_anchor = ious.argmax(axis=1)
    
    # Rule 1: IoU >= pos_iou_thresh → positive (object)
    labels[max_iou_per_anchor >= pos_iou_thresh] = 1
    
    # Rule 2: IoU < neg_iou_thresh → negative (background)
    labels[max_iou_per_anchor < neg_iou_thresh] = 0
    
    # Rule 3: each GT's highest-IoU anchor is positive,
    # guaranteeing every GT box gets at least one match
    for j in range(n_gt):
        best_anchor = ious[:, j].argmax()
        labels[best_anchor] = 1
        max_gt_per_anchor[best_anchor] = j
    
    # Compute box regression targets (for positive anchors)
    for i in range(n_anchors):
        if labels[i] == 1:
            anchor = anchors[i]
            gt = gt_boxes[max_gt_per_anchor[i]]
            
            # Parameterized offsets
            ax_ctr = (anchor[0] + anchor[2]) / 2
            ay_ctr = (anchor[1] + anchor[3]) / 2
            aw = anchor[2] - anchor[0]
            ah = anchor[3] - anchor[1]
            
            gx_ctr = (gt[0] + gt[2]) / 2
            gy_ctr = (gt[1] + gt[3]) / 2
            gw = gt[2] - gt[0]
            gh = gt[3] - gt[1]
            
            targets[i, 0] = (gx_ctr - ax_ctr) / aw
            targets[i, 1] = (gy_ctr - ay_ctr) / ah
            targets[i, 2] = np.log(gw / aw)
            targets[i, 3] = np.log(gh / ah)
    
    return labels, targets

# Test anchor assignment
test_gt = np.array([[100, 100, 200, 200], [300, 150, 450, 300]])
test_anchors = np.array([
    [90, 90, 210, 210],    # High IoU with GT1
    [150, 150, 250, 250],  # Medium IoU with GT1
    [295, 145, 455, 305],  # High IoU with GT2
    [500, 500, 550, 550],  # No overlap (background)
])

labels, targets = assign_anchors_to_gt(test_gt, test_anchors, pos_iou_thresh=0.7, neg_iou_thresh=0.3)

print("\n" + "="*70)
print("ANCHOR ASSIGNMENT EXAMPLE")
print("="*70)
print(f"Ground Truth boxes: {len(test_gt)}")
print(f"Anchors: {len(test_anchors)}")
print("\nAssignment results:")
for i, (label, target) in enumerate(zip(labels, targets)):
    status = {-1: "IGNORE", 0: "BACKGROUND", 1: "OBJECT"}[label]
    print(f"  Anchor {i+1}: {status:12s}", end="")
    if label == 1:
        print(f" → targets: ({target[0]:6.3f}, {target[1]:6.3f}, {target[2]:6.3f}, {target[3]:6.3f})")
    else:
        print()

print("\n" + "="*70)
print("ASSIGNMENT RULES")
print("="*70)
print("1. IoU ≥ 0.7 with any GT → POSITIVE (object)")
print("2. IoU < 0.3 with all GT → NEGATIVE (background)")
print("3. 0.3 ≤ IoU < 0.7 → IGNORE (ambiguous, don't train)")
print("4. Best anchor for each GT → POSITIVE (ensures every GT has a match)")
print("="*70)
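At inference time the predicted offsets must be mapped back to absolute coordinates. A minimal sketch of the inverse transform (`decode_targets` is an illustrative helper, written to match the \((t_x, t_y, t_w, t_h)\) parameterization used in `assign_anchors_to_gt` above):

```python
import numpy as np

def decode_targets(anchors: np.ndarray, targets: np.ndarray) -> np.ndarray:
    """Invert the (tx, ty, tw, th) parameterization back to corner boxes."""
    # Anchor widths, heights, and centers
    aw = anchors[:, 2] - anchors[:, 0]
    ah = anchors[:, 3] - anchors[:, 1]
    ax = (anchors[:, 0] + anchors[:, 2]) / 2
    ay = (anchors[:, 1] + anchors[:, 3]) / 2

    # Apply predicted offsets: centers shift linearly, sizes scale exponentially
    gx = targets[:, 0] * aw + ax
    gy = targets[:, 1] * ah + ay
    gw = np.exp(targets[:, 2]) * aw
    gh = np.exp(targets[:, 3]) * ah

    # Back to (x_min, y_min, x_max, y_max)
    return np.stack([gx - gw / 2, gy - gh / 2,
                     gx + gw / 2, gy + gh / 2], axis=1)

# Round trip: encoding a GT box against an anchor, then decoding, recovers the GT
anchor = np.array([[90.0, 90.0, 210.0, 210.0]])  # 120x120, centered at (150, 150)
t = np.array([[0.0, 0.0, np.log(100 / 120), np.log(100 / 120)]])
decoded = decode_targets(anchor, t)
print(decoded)  # β†’ [[100. 100. 200. 200.]]
```

Training regresses the offsets; this decoder turns them back into boxes before NMS.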

Data Augmentation for Object DetectionΒΆ

Data augmentation is crucial for training robust object detectors. Unlike classification, augmentation must transform both images AND bounding box annotations consistently.

1. Geometric AugmentationsΒΆ

Random Horizontal FlipΒΆ

  • Operation: Flip image left-right

  • Box transformation: \(x'_{min} = W - x_{max}\), \(x'_{max} = W - x_{min}\), where \(W\) is image width (min and max swap so that \(x'_{min} < x'_{max}\))

  • Use case: Objects with horizontal symmetry (cars, people)

  • Probability: Typically 0.5

Random Scaling and TranslationΒΆ

  • Scale: Resize image by factor \(s \in [s_{min}, s_{max}]\)

  • Translate: Shift by \((\Delta x, \Delta y)\)

  • Box update: \(x_{new} = s \cdot x + \Delta x\), \(y_{new} = s \cdot y + \Delta y\)

  • Typical ranges: \(s \in [0.8, 1.2]\), \(|\Delta| \leq 0.1W\)

Random RotationΒΆ

  • Operation: Rotate by angle \(\theta\)

  • Box transformation: Compute corners, rotate, find new axis-aligned bbox

  • Challenge: Bounding box becomes larger after rotation

  • Alternative: Use oriented bounding boxes (OBB)
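The corner-rotation step above can be sketched as follows (`rotate_box` is an illustrative helper, not from any library): rotate the four corners about a center point, then take the enclosing axis-aligned box.

```python
import numpy as np

def rotate_box(box, angle_deg, center):
    """Rotate a box's 4 corners about `center`, return the axis-aligned hull."""
    x1, y1, x2, y2 = box
    corners = np.array([[x1, y1], [x2, y1], [x2, y2], [x1, y2]], dtype=float)

    theta = np.deg2rad(angle_deg)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])

    # Rotate each corner about the center
    rotated = (corners - center) @ R.T + center

    # The new axis-aligned bbox must enclose all rotated corners,
    # so it can only grow (the "larger box" challenge noted above)
    return np.array([rotated[:, 0].min(), rotated[:, 1].min(),
                     rotated[:, 0].max(), rotated[:, 1].max()])

# A 10x10 box rotated 45 degrees about its own center grows to ~14.1x14.1
print(rotate_box([45, 45, 55, 55], 45, center=np.array([50.0, 50.0])))
```

For a 45° rotation the enclosing box widens by a factor of \(\sqrt{2}\), which is exactly why oriented bounding boxes are the tighter alternative.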

2. Mosaic Augmentation (YOLO v4+)ΒΆ

Combines 4 images into one mosaic:

\[\begin{split}I_{mosaic} = \begin{bmatrix} I_1 & I_2 \\ I_3 & I_4 \end{bmatrix}\end{split}\]

Procedure:

  1. Sample 4 images

  2. Resize to random scales

  3. Place at 4 quadrants with random center point

  4. Adjust all bounding boxes to new coordinates

Benefits:

  • Exposes model to more objects per batch

  • Forces model to learn from different scales simultaneously

  • Improves small object detection

  • Reduces reliance on large mini-batches for batch-norm statistics (each sample already mixes 4 images)

3. MixUp for Object DetectionΒΆ

Blend two images with ratio \(\lambda \sim Beta(\alpha, \alpha)\):

\[I_{mix} = \lambda I_1 + (1-\lambda) I_2\]
\[\mathcal{B}_{mix} = \mathcal{B}_1 \cup \mathcal{B}_2\]

Modifications for detection:

  • Keep all bounding boxes from both images

  • Optional: Weight box confidence by \(\lambda\)

  • Typical \(\alpha = 1.5\) for detection (classification MixUp typically uses a much smaller \(\alpha\), e.g. 0.2)

4. Copy-Paste AugmentationΒΆ

From instance segmentation masks:

  1. Extract object from image 1 using mask

  2. Paste onto image 2 at random location

  3. Add bounding box to image 2 annotations

Advanced: Use Poisson blending for seamless integration
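The three steps above can be sketched as a simple mask-based paste; `copy_paste` is an illustrative helper (hard compositing only, no Poisson blending or boundary checks):

```python
import numpy as np

def copy_paste(src_img, src_mask, src_box, dst_img, dst_boxes, top_left):
    """Paste the masked object from src_img onto dst_img at `top_left`
    and append the corresponding bounding box to dst's annotations."""
    x1, y1, x2, y2 = map(int, src_box)
    ox, oy = top_left

    patch = src_img[y1:y2, x1:x2]
    mask = src_mask[y1:y2, x1:x2].astype(bool)

    out = dst_img.copy()
    region = out[oy:oy + patch.shape[0], ox:ox + patch.shape[1]]
    region[mask] = patch[mask]  # copy only the masked (object) pixels

    # Step 3: add the pasted object's box to the destination annotations
    new_box = np.array([ox, oy, ox + (x2 - x1), oy + (y2 - y1)], dtype=float)
    new_boxes = np.vstack([dst_boxes, new_box]) if len(dst_boxes) else new_box[None]
    return out, new_boxes

# Demo: paste a 5x5 object from one image into another
src = np.zeros((20, 20, 3), dtype=np.uint8); src[5:10, 5:10] = 200
mask = np.zeros((20, 20), dtype=np.uint8); mask[5:10, 5:10] = 1
dst = np.zeros((50, 50, 3), dtype=np.uint8)
pasted, new_boxes = copy_paste(src, mask, [5, 5, 10, 10], dst, np.zeros((0, 4)), (30, 30))
print(new_boxes)  # β†’ [[30. 30. 35. 35.]]
```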

5. Color/Photometric AugmentationsΒΆ

These don’t affect bounding boxes:

| Augmentation | Operation            | Range                     |
|--------------|----------------------|---------------------------|
| Brightness   | \(I' = I + \beta\)   | \(\beta \in [-30, 30]\)   |
| Contrast     | \(I' = \alpha I\)    | \(\alpha \in [0.8, 1.2]\) |
| Saturation   | Adjust in HSV space  | \(\times [0.7, 1.3]\)     |
| Hue          | Shift hue channel    | \(\pm 10^\circ\)          |

HSV transformation:

\[\begin{split}I_{HSV}' = \begin{bmatrix} H + \Delta H \\ \alpha_S \cdot S \\ \alpha_V \cdot V \end{bmatrix}\end{split}\]

6. Random Crop and PaddingΒΆ

IoU-based Cropping (SSD-style):ΒΆ

  1. Sample crop with IoU \(\in \{0.1, 0.3, 0.5, 0.7, 0.9, 1.0\}\) with some GT box

  2. Reject if crop doesn’t contain any object center

  3. Adjust boxes: clip to crop boundaries, remove if center outside
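A minimal sketch of the three steps above (`iou_crop` is an illustrative helper; corner-format boxes assumed, and the IoU target is simplified to a single minimum rather than the full SSD sampling set):

```python
import numpy as np

def iou_crop(image, boxes, min_iou=0.3, max_tries=50, seed=0):
    """SSD-style random crop: accept a crop only if it reaches `min_iou` with
    some GT box, then keep boxes whose centers fall inside the crop."""
    rng = np.random.default_rng(seed)
    H, W = image.shape[:2]
    for _ in range(max_tries):
        cw = int(rng.uniform(0.5, 1.0) * W)
        ch = int(rng.uniform(0.5, 1.0) * H)
        cx1 = int(rng.integers(0, W - cw + 1))
        cy1 = int(rng.integers(0, H - ch + 1))
        crop = np.array([cx1, cy1, cx1 + cw, cy1 + ch])

        # Step 1: IoU of the crop with each GT box
        ix1 = np.maximum(crop[0], boxes[:, 0]); iy1 = np.maximum(crop[1], boxes[:, 1])
        ix2 = np.minimum(crop[2], boxes[:, 2]); iy2 = np.minimum(crop[3], boxes[:, 3])
        inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
        area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        ious = inter / (cw * ch + area_b - inter + 1e-7)
        if ious.max() < min_iou:
            continue

        # Step 2: reject crops containing no object center
        cxs = (boxes[:, 0] + boxes[:, 2]) / 2
        cys = (boxes[:, 1] + boxes[:, 3]) / 2
        keep = (cxs >= crop[0]) & (cxs < crop[2]) & (cys >= crop[1]) & (cys < crop[3])
        if not keep.any():
            continue

        # Step 3: shift kept boxes into crop coordinates and clip
        new_boxes = boxes[keep] - np.array([crop[0], crop[1], crop[0], crop[1]])
        new_boxes[:, [0, 2]] = np.clip(new_boxes[:, [0, 2]], 0, cw)
        new_boxes[:, [1, 3]] = np.clip(new_boxes[:, [1, 3]], 0, ch)
        return image[crop[1]:crop[3], crop[0]:crop[2]], new_boxes
    return image, boxes  # fall back to the original if no valid crop is found

img = np.zeros((100, 100, 3), dtype=np.uint8)
gt = np.array([[20.0, 20.0, 60.0, 60.0]])
crop_img, crop_boxes = iou_crop(img, gt)
print(crop_img.shape, crop_boxes)
```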

Padding:ΒΆ

  • Add borders: top, bottom, left, right

  • Keep boxes unchanged (still valid in larger canvas)

  • Useful for preserving aspect ratio

7. Advanced TechniquesΒΆ

CutOut / Random ErasingΒΆ

  • Randomly mask rectangular regions

  • For detection: Avoid erasing object centers

  • Forces model to use partial information
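The detection-specific twist (never erase an object center) can be sketched as follows; `random_erase` is an illustrative helper that resamples the erase rectangle until it misses every box center:

```python
import numpy as np

def random_erase(image, boxes, scale=(0.02, 0.1), max_tries=20, seed=0):
    """Erase one random square patch, resampling if it would cover a box center."""
    rng = np.random.default_rng(seed)
    H, W = image.shape[:2]
    cxs = (boxes[:, 0] + boxes[:, 2]) / 2
    cys = (boxes[:, 1] + boxes[:, 3]) / 2

    for _ in range(max_tries):
        area = rng.uniform(*scale) * H * W
        ew = eh = int(np.sqrt(area))
        ex = int(rng.integers(0, W - ew))
        ey = int(rng.integers(0, H - eh))

        # Reject patches that would hide any object center
        hides = ((cxs >= ex) & (cxs < ex + ew) &
                 (cys >= ey) & (cys < ey + eh)).any()
        if hides:
            continue

        out = image.copy()
        out[ey:ey + eh, ex:ex + ew] = rng.integers(0, 256, (eh, ew, image.shape[2]))
        return out
    return image  # give up rather than erase an object center

img = np.zeros((100, 100, 3), dtype=np.uint8)
gt = np.array([[88.0, 88.0, 99.0, 99.0]])
erased = random_erase(img, gt)
```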

AutoAugment for DetectionΒΆ

Learn augmentation policy via RL:

  • Search space: 20+ operations with magnitude

  • Optimize for mAP on validation set

  • Computationally expensive but effective

Test-Time Augmentation (TTA)ΒΆ

At inference:

  1. Apply multiple augmentations to input

  2. Run detector on each

  3. Aggregate predictions (NMS across all)
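For a horizontal flip, step 3 requires mapping the flipped detections back before pooling. A minimal sketch (`tta_flip` and the `detector` callable are illustrative; `detector(img)` is assumed to return an (n, 5) array of `[x1, y1, x2, y2, score]`):

```python
import numpy as np

def tta_flip(detector, image):
    """Run `detector` on the image and its horizontal flip, map the flipped
    detections back to original coordinates, and pool everything for NMS."""
    W = image.shape[1]

    dets = detector(image)
    dets_flip = detector(np.fliplr(image))

    # Un-flip: x' = W - x, swapping x1/x2 so that x1 < x2 still holds
    unflipped = dets_flip.copy()
    unflipped[:, 0] = W - dets_flip[:, 2]
    unflipped[:, 2] = W - dets_flip[:, 0]

    return np.vstack([dets, unflipped])  # feed this pooled set to NMS

# Toy detector that always reports one fixed box
fake = lambda img: np.array([[10.0, 10.0, 30.0, 40.0, 0.9]])
pooled = tta_flip(fake, np.zeros((100, 100, 3), dtype=np.uint8))
print(pooled)
```

In this toy case the flipped pass contributes `[70, 10, 90, 40]`: the same box mirrored about the image's vertical midline.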

8. Augmentation PipelineΒΆ

Training loop:

For each image:
  1. Mosaic (prob=0.5)
  2. Random scale (0.5-1.5×)
  3. Random flip (prob=0.5)
  4. HSV adjustment (always)
  5. MixUp (prob=0.1)
  6. Random crop (IoU-based)
  7. Normalize and resize to input size

Key principles:

  • Always maintain box-image consistency

  • Discard boxes with area < threshold (e.g., 16 pixels)

  • Clip boxes to image boundaries

  • Remove boxes left with zero area (e.g. pushed entirely outside the image)

9. Augmentation Strength SchedulingΒΆ

Warm-up phase (epochs 1-10):

  • Mild augmentation: flip, small scale, color jitter

  • Helps training start stably

Main phase (epochs 10-270):

  • Full augmentation: mosaic, mixup, aggressive scaling

Fine-tuning phase (epochs 270-300):

  • Disable mosaic/mixup

  • Only flip and mild scaling

  • Helps model adapt to real data distribution
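The three-phase schedule above can be expressed as a simple lookup; the numbers here are illustrative, matching the phases described (warm-up, main, fine-tuning with mosaic/mixup disabled for the last 30 epochs):

```python
def augmentation_schedule(epoch: int, total_epochs: int = 300) -> dict:
    """Return augmentation settings for the current epoch (illustrative values)."""
    if epoch < 10:
        # Warm-up: mild augmentation only
        return {"flip": 0.5, "scale": (0.9, 1.1), "mosaic": 0.0, "mixup": 0.0}
    elif epoch < total_epochs - 30:
        # Main phase: full augmentation
        return {"flip": 0.5, "scale": (0.5, 1.5), "mosaic": 0.5, "mixup": 0.1}
    else:
        # Fine-tuning: disable mosaic/mixup, keep flip and mild scaling
        return {"flip": 0.5, "scale": (0.9, 1.1), "mosaic": 0.0, "mixup": 0.0}

print(augmentation_schedule(5), augmentation_schedule(100), augmentation_schedule(295))
```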

10. Domain-Specific ConsiderationsΒΆ

| Domain | Key Augmentations | Avoid |
|---|---|---|
| Autonomous Driving | Flip, scale, cutout, weather effects | Rotation (road is horizontal) |
| Retail (shelf detection) | Scale, brightness, crop | Flip (text orientation matters) |
| Aerial Imagery | Rotation, flip, scale | Extreme crops (context matters) |
| Medical Imaging | Rotation, flip, elastic deformation | Color jitter (color carries diagnostic info) |

Best practice: Analyze failure cases, add targeted augmentations to address them.

# Data Augmentation Implementations

import numpy as np
import matplotlib.pyplot as plt
from typing import List, Tuple
import cv2

# ============================================================
# 1. Geometric Augmentations with Box Updates
# ============================================================

def horizontal_flip(image: np.ndarray, boxes: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """
    Flip image horizontally and update boxes.
    
    Parameters:
    -----------
    image : (H, W, C) image array
    boxes : (n, 4) boxes in [x_min, y_min, x_max, y_max] format
    
    Returns:
    --------
    flipped_image, flipped_boxes
    """
    H, W = image.shape[:2]
    
    # Flip image
    flipped_image = np.fliplr(image)
    
    # Update boxes: x' = W - x
    flipped_boxes = boxes.copy()
    flipped_boxes[:, [0, 2]] = W - boxes[:, [2, 0]]  # Swap and flip x coordinates
    
    return flipped_image, flipped_boxes

def random_scale_translate(image: np.ndarray, 
                           boxes: np.ndarray,
                           scale_range: Tuple[float, float] = (0.8, 1.2),
                           translate_range: float = 0.1) -> Tuple[np.ndarray, np.ndarray]:
    """
    Randomly scale and translate image and boxes.
    """
    H, W = image.shape[:2]
    
    # Random scale and translation
    scale = np.random.uniform(*scale_range)
    tx = np.random.uniform(-translate_range * W, translate_range * W)
    ty = np.random.uniform(-translate_range * H, translate_range * H)
    
    # Transformation matrix
    M = np.array([[scale, 0, tx],
                  [0, scale, ty]])
    
    # Transform image
    new_H, new_W = int(H * scale), int(W * scale)
    transformed = cv2.warpAffine(image, M, (new_W, new_H))
    
    # Transform boxes
    transformed_boxes = boxes.copy()
    transformed_boxes[:, [0, 2]] = transformed_boxes[:, [0, 2]] * scale + tx
    transformed_boxes[:, [1, 3]] = transformed_boxes[:, [1, 3]] * scale + ty
    
    # Clip to image boundaries
    transformed_boxes[:, [0, 2]] = np.clip(transformed_boxes[:, [0, 2]], 0, new_W)
    transformed_boxes[:, [1, 3]] = np.clip(transformed_boxes[:, [1, 3]], 0, new_H)
    
    return transformed, transformed_boxes

# ============================================================
# 2. Mosaic Augmentation
# ============================================================

def create_mosaic(images: List[np.ndarray], 
                  boxes_list: List[np.ndarray],
                  output_size: int = 640) -> Tuple[np.ndarray, np.ndarray]:
    """
    Create mosaic from 4 images (YOLO-style).
    
    Parameters:
    -----------
    images : List of 4 images
    boxes_list : List of 4 box arrays (each n_i Γ— 4)
    output_size : Output mosaic size
    
    Returns:
    --------
    mosaic_image, mosaic_boxes
    """
    assert len(images) == 4, "Need exactly 4 images for mosaic"
    
    # Random center point
    cx = np.random.randint(output_size // 4, 3 * output_size // 4)
    cy = np.random.randint(output_size // 4, 3 * output_size // 4)
    
    mosaic = np.zeros((output_size, output_size, 3), dtype=np.uint8)
    mosaic_boxes = []
    
    # Quadrant offsets: top-left, top-right, bottom-left, bottom-right
    quadrants = [
        (0, 0, cx, cy),          # Top-left
        (cx, 0, output_size, cy),  # Top-right
        (0, cy, cx, output_size),  # Bottom-left
        (cx, cy, output_size, output_size)  # Bottom-right
    ]
    
    for idx, (img, boxes) in enumerate(zip(images, boxes_list)):
        x1, y1, x2, y2 = quadrants[idx]
        quad_w, quad_h = x2 - x1, y2 - y1
        
        # Resize image to fit quadrant
        img_resized = cv2.resize(img, (quad_w, quad_h))
        
        # Place in mosaic
        mosaic[y1:y2, x1:x2] = img_resized
        
        # Transform boxes
        if len(boxes) > 0:
            H, W = img.shape[:2]
            scale_x = quad_w / W
            scale_y = quad_h / H
            
            transformed_boxes = boxes.copy()
            transformed_boxes[:, [0, 2]] = boxes[:, [0, 2]] * scale_x + x1
            transformed_boxes[:, [1, 3]] = boxes[:, [1, 3]] * scale_y + y1
            
            # Clip to mosaic boundaries
            transformed_boxes[:, [0, 2]] = np.clip(transformed_boxes[:, [0, 2]], 0, output_size)
            transformed_boxes[:, [1, 3]] = np.clip(transformed_boxes[:, [1, 3]], 0, output_size)
            
            mosaic_boxes.append(transformed_boxes)
    
    mosaic_boxes = np.vstack(mosaic_boxes) if mosaic_boxes else np.zeros((0, 4))
    
    return mosaic, mosaic_boxes

# ============================================================
# 3. MixUp for Detection
# ============================================================

def mixup_detection(image1: np.ndarray, boxes1: np.ndarray,
                    image2: np.ndarray, boxes2: np.ndarray,
                    alpha: float = 1.5) -> Tuple[np.ndarray, np.ndarray]:
    """
    Apply MixUp to two images and combine their boxes.
    """
    # Sample mixing ratio
    lam = np.random.beta(alpha, alpha)
    
    # Mix images (compute in float, cast back to uint8)
    mixed_image = (lam * image1.astype(np.float32)
                   + (1 - lam) * image2.astype(np.float32)).astype(np.uint8)
    
    # Keep all boxes from both images (either set may be empty)
    non_empty = [b for b in (boxes1, boxes2) if len(b) > 0]
    mixed_boxes = np.vstack(non_empty) if non_empty else np.zeros((0, 4))
    
    return mixed_image, mixed_boxes

# ============================================================
# 4. Color Jittering
# ============================================================

def color_jitter(image: np.ndarray,
                 brightness: float = 30,
                 contrast: Tuple[float, float] = (0.8, 1.2),
                 saturation: Tuple[float, float] = (0.7, 1.3),
                 hue: float = 10) -> np.ndarray:
    """
    Apply random color jittering in HSV space.
    """
    # Convert to HSV
    hsv = cv2.cvtColor(image, cv2.COLOR_RGB2HSV).astype(np.float32)
    
    # Hue shift (OpenCV hue range is [0, 180); wrap around instead of clipping)
    hsv[:, :, 0] = (hsv[:, :, 0] + np.random.uniform(-hue, hue)) % 180
    
    # Saturation scaling
    hsv[:, :, 1] *= np.random.uniform(*saturation)
    hsv[:, :, 1] = np.clip(hsv[:, :, 1], 0, 255)
    
    # Value channel: additive brightness, then multiplicative contrast
    hsv[:, :, 2] += np.random.uniform(-brightness, brightness)
    hsv[:, :, 2] *= np.random.uniform(*contrast)
    hsv[:, :, 2] = np.clip(hsv[:, :, 2], 0, 255)
    
    # Convert back to RGB
    jittered = cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2RGB)
    
    return jittered

# ============================================================
# 5. Demonstration
# ============================================================

# Create synthetic images with boxes
def create_test_image(color, box):
    """Helper to create test image with one box"""
    img = np.full((300, 300, 3), color, dtype=np.uint8)
    x1, y1, x2, y2 = box
    cv2.rectangle(img, (int(x1), int(y1)), (int(x2), int(y2)), (255, 255, 255), 2)
    return img

# Test images
img1 = create_test_image([200, 100, 100], [50, 50, 150, 150])
boxes1 = np.array([[50, 50, 150, 150]])

img2 = create_test_image([100, 200, 100], [180, 180, 280, 280])
boxes2 = np.array([[180, 180, 280, 280]])

img3 = create_test_image([100, 100, 200], [60, 120, 160, 220])
boxes3 = np.array([[60, 120, 160, 220]])

img4 = create_test_image([200, 200, 100], [100, 50, 250, 150])
boxes4 = np.array([[100, 50, 250, 150]])

# Apply augmentations
flipped_img, flipped_boxes = horizontal_flip(img1, boxes1)
mosaic_img, mosaic_boxes = create_mosaic([img1, img2, img3, img4], 
                                         [boxes1, boxes2, boxes3, boxes4], 
                                         output_size=400)
mixed_img, mixed_boxes = mixup_detection(img1, boxes1, img2, boxes2, alpha=1.5)
jittered_img = color_jitter(img1.copy())

# Visualization
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('Object Detection Data Augmentation Examples', fontsize=16, fontweight='bold')

# Original
axes[0, 0].imshow(img1)
axes[0, 0].add_patch(plt.Rectangle((boxes1[0, 0], boxes1[0, 1]), 
                                   boxes1[0, 2] - boxes1[0, 0], 
                                   boxes1[0, 3] - boxes1[0, 1],
                                   fill=False, edgecolor='yellow', linewidth=2))
axes[0, 0].set_title('Original Image')
axes[0, 0].axis('off')

# Horizontal Flip
axes[0, 1].imshow(flipped_img)
axes[0, 1].add_patch(plt.Rectangle((flipped_boxes[0, 0], flipped_boxes[0, 1]), 
                                   flipped_boxes[0, 2] - flipped_boxes[0, 0], 
                                   flipped_boxes[0, 3] - flipped_boxes[0, 1],
                                   fill=False, edgecolor='yellow', linewidth=2))
axes[0, 1].set_title('Horizontal Flip')
axes[0, 1].axis('off')

# Mosaic
axes[0, 2].imshow(mosaic_img)
for box in mosaic_boxes:
    axes[0, 2].add_patch(plt.Rectangle((box[0], box[1]), box[2] - box[0], box[3] - box[1],
                                       fill=False, edgecolor='yellow', linewidth=2))
axes[0, 2].set_title(f'Mosaic ({len(mosaic_boxes)} boxes)')
axes[0, 2].axis('off')

# MixUp
axes[1, 0].imshow(mixed_img)
for box in mixed_boxes:
    axes[1, 0].add_patch(plt.Rectangle((box[0], box[1]), box[2] - box[0], box[3] - box[1],
                                       fill=False, edgecolor='yellow', linewidth=2))
axes[1, 0].set_title(f'MixUp ({len(mixed_boxes)} boxes)')
axes[1, 0].axis('off')

# Color Jitter
axes[1, 1].imshow(jittered_img)
axes[1, 1].add_patch(plt.Rectangle((boxes1[0, 0], boxes1[0, 1]), 
                                   boxes1[0, 2] - boxes1[0, 0], 
                                   boxes1[0, 3] - boxes1[0, 1],
                                   fill=False, edgecolor='yellow', linewidth=2))
axes[1, 1].set_title('Color Jitter (HSV)')
axes[1, 1].axis('off')

# Statistics
axes[1, 2].axis('off')
stats_text = f"""
AUGMENTATION STATISTICS

Original: {len(boxes1)} box
Flip: {len(flipped_boxes)} box
Mosaic: {len(mosaic_boxes)} boxes
MixUp: {len(mixed_boxes)} boxes

KEY INSIGHTS:
• Mosaic combines 4 images
• All boxes preserved
• Coordinates transformed
• Boundaries clipped

RECOMMENDATIONS:
✓ Flip: 50% probability
✓ Mosaic: 50% in training
✓ MixUp: 10% probability
✓ Color: Always apply
✓ Disable mosaic in last
  10 epochs for fine-tuning
"""
axes[1, 2].text(0.1, 0.5, stats_text, fontsize=11, verticalalignment='center',
                family='monospace', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.3))

plt.tight_layout()
plt.show()

print("="*70)
print("DATA AUGMENTATION SUMMARY")
print("="*70)
print("✓ Horizontal flip: Boxes transformed correctly")
print(f"✓ Mosaic: Combined {len(mosaic_boxes)} boxes from 4 images")
print(f"✓ MixUp: Merged {len(mixed_boxes)} boxes")
print("✓ Color jitter: No box transformation needed")
print("="*70)

Best PracticesΒΆ

1. Model SelectionΒΆ

  • Real-time (>30 FPS): YOLOv8n, YOLOv8s

  • Balanced: YOLOv8m, DETR

  • High accuracy: YOLOv8x, Faster R-CNN

  • Edge devices: YOLOv8n with TensorRT

2. HyperparametersΒΆ

  • Confidence threshold: 0.25-0.5 (lower = more detections)

  • IoU threshold (NMS): 0.45-0.65 (lower = fewer duplicates)

  • Image size: 640×640 (YOLO), can go lower for speed
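The confidence and NMS IoU thresholds above plug directly into post-processing. A minimal greedy NMS sketch (`nms` is an illustrative helper, not the ultralytics API; defaults match the recommended ranges):

```python
import numpy as np

def nms(boxes, scores, conf_thresh=0.25, iou_thresh=0.45):
    """Greedy NMS: drop low-confidence boxes, then suppress overlaps."""
    keep_conf = scores >= conf_thresh
    boxes, scores = boxes[keep_conf], scores[keep_conf]

    order = scores.argsort()[::-1]  # highest confidence first
    keep = []
    while len(order) > 0:
        i = order[0]
        keep.append(i)
        # IoU of the winning box with all remaining candidates
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter + 1e-7)
        # Keep only candidates that do not overlap the winner too much
        order = order[1:][iou < iou_thresh]
    return boxes[keep], scores[keep]

boxes = np.array([[10., 10., 50., 50.], [12., 12., 52., 52.], [80., 80., 120., 120.]])
scores = np.array([0.9, 0.8, 0.7])
kept_boxes, kept_scores = nms(boxes, scores)
print(kept_boxes)  # the near-duplicate second box is suppressed
```

Lowering `iou_thresh` suppresses more aggressively (fewer duplicates, but adjacent objects risk being merged); lowering `conf_thresh` admits more detections.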

3. Training TipsΒΆ

  • Use data augmentation (mosaic, mixup)

  • Balance classes with weighted sampling

  • Train on high-resolution images

  • Freeze backbone initially, then fine-tune

4. OptimizationΒΆ

  • Convert to ONNX/TensorRT for faster inference

  • Batch processing when possible

  • Resize images to smaller sizes (320×320, 416×416)

  • Use FP16 precision on GPUs

Common Use CasesΒΆ

  • Autonomous vehicles: Detect pedestrians, cars, traffic signs

  • Surveillance: People counting, intrusion detection

  • Retail: Product detection, shelf monitoring

  • Manufacturing: Defect detection, quality control

  • Agriculture: Crop monitoring, pest detection

Key TakeawaysΒΆ

✅ Object detection = Classification + Localization

✅ YOLO-family models are the usual choice for real-time applications

✅ NMS removes duplicate detections

✅ IoU measures box overlap (used in NMS and evaluation)

✅ mAP (mean Average Precision) is the standard metric

✅ Balance speed vs. accuracy based on use case

Next: 03_clip_embeddings.ipynb - Multimodal understanding