Object Detection: YOLO, DETR & Beyond
Bounding-box detection, segmentation, YOLO, DETR, and Grounding DINO for real-world object detection tasks.
# Install dependencies
# !pip install ultralytics opencv-python pillow matplotlib
Bounding Box Fundamentals
Object Detection: Advanced Theory and Architecture Evolution
1. Problem Formulation
Object detection combines classification and localization:
Input: Image \(I \in \mathbb{R}^{H \times W \times 3}\)
Output: Set of detections \(\mathcal{D} = \{(b_i, c_i, p_i)\}_{i=1}^N\) where:
\(b_i = (x, y, w, h)\): Bounding box coordinates
\(c_i \in \{1, \ldots, C\}\): Class label
\(p_i \in [0, 1]\): Confidence score
Challenges:
Variable number of objects per image
Different object scales
Occlusion and truncation
Real-time inference requirements
2. R-CNN Family: Two-Stage Detectors
A. R-CNN (2014) - Regions with CNN
Pipeline:
Selective Search: Generate ~2000 region proposals
Warp: Resize each region to a fixed size (e.g., 227×227)
CNN: Extract features with AlexNet/VGG
Classify: SVM classifier for each class
Regress: Bounding box refinement
Loss Function:
\[\mathcal{L} = \mathcal{L}_{\text{cls}} + \lambda \, [c \geq 1] \, \mathcal{L}_{\text{loc}}(t, g)\]
where:
\(\mathcal{L}_{\text{cls}}\): Classification loss (cross-entropy)
\(\mathcal{L}_{\text{loc}}\): Localization loss (smooth L1)
\(t\): Predicted box offsets
\(g\): Ground truth offsets
\([c \geq 1]\): Indicator (only regress for objects, not background)
Box Parameterization:
\[t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}, \quad t_w = \log\frac{w}{w_a}, \quad t_h = \log\frac{h}{h_a}\]
where \((x_a, y_a, w_a, h_a)\) is the anchor box.
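The parameterization above can be sketched as an encode/decode pair (a minimal numpy sketch; `encode_box` and `decode_box` are illustrative names, not a library API):

```python
import numpy as np

def encode_box(box, anchor):
    """Offsets (t_x, t_y, t_w, t_h) of a center-format box (x, y, w, h)
    relative to an anchor (x_a, y_a, w_a, h_a)."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)])

def decode_box(t, anchor):
    """Invert encode_box: recover (x, y, w, h) from offsets and an anchor."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return np.array([tx * wa + xa, ty * ha + ya,
                     wa * np.exp(tw), ha * np.exp(th)])

anchor = (100.0, 100.0, 50.0, 80.0)   # x_a, y_a, w_a, h_a (center format)
gt = (110.0, 95.0, 60.0, 70.0)
offsets = encode_box(gt, anchor)
recovered = decode_box(offsets, anchor)
```

The log-scale width/height terms make the offsets scale-invariant: doubling both the anchor and the box leaves \(t_w, t_h\) unchanged.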
Limitations:
Slow: ~47s per image
Multi-stage training (CNN, SVM, bbox regressor)
Disk-heavy feature caching
B. Fast R-CNN (2015)
Key Innovation: Share conv computation across proposals
Architecture:
Image → CNN (entire image) → Feature Map → RoI Pooling → FC layers → {cls, bbox}
                                               ↑
                             Region Proposals (Selective Search)
RoI Pooling:
For region \(r\) of size \(h_r \times w_r\), divide it into an \(H \times W\) grid of bins and max-pool within each bin:
\[y_{ij} = \max_{(u, v) \in \text{bin}(i, j)} x_{uv}\]
Output: a fixed \(H \times W\) feature map regardless of input region size.
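A naive numpy sketch of RoI max pooling under the quantized-bin assumption (RoI Align, used in later detectors, interpolates instead of rounding):

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Crop roi = (x_min, y_min, x_max, y_max) from a 2-D feature map and
    max-pool it into an out_h x out_w grid. Bin edges are integer-rounded
    (quantized), as in the original Fast R-CNN."""
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    ys = np.linspace(0, h, out_h + 1).astype(int)
    xs = np.linspace(0, w, out_w + 1).astype(int)
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = region[ys[i]:ys[i+1], xs[j]:xs[j+1]].max()
    return out

fm = np.arange(64, dtype=float).reshape(8, 8)   # toy feature map
pooled = roi_max_pool(fm, roi=(0, 0, 4, 4))     # 4x4 region -> 2x2 output
```

Note that a 4×4 region and an 8×8 region both produce a 2×2 output, which is exactly what lets arbitrary proposals feed fixed-size FC layers.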
Multi-task Loss:
\[\mathcal{L}(p, u, t^u, v) = \mathcal{L}_{\text{cls}}(p, u) + \lambda \, [u \geq 1] \, \mathcal{L}_{\text{loc}}(t^u, v)\]
where \(u\) is the true class and \(v\) is the true box.
Smooth L1 Loss:
\[\text{smooth}_{L1}(x) = \begin{cases} 0.5 x^2 & |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}\]
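A direct numpy rendering of the smooth L1 definition (quadratic near zero, linear in the tails):

```python
import numpy as np

def smooth_l1(x):
    """Elementwise smooth L1 (Huber-style) loss:
    0.5 * x^2 if |x| < 1, else |x| - 0.5."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x ** 2, ax - 0.5)

diffs = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
losses = smooth_l1(diffs)
```

The linear tail is what makes it less sensitive to outliers than L2: a residual of 3 costs 2.5 instead of 4.5.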
Advantages:
9× faster training, 140× faster inference than R-CNN
End-to-end training
Higher mAP
Remaining Bottleneck: Selective Search (2s per image)
C. Faster R-CNN (2015)
Key Innovation: Region Proposal Network (RPN)
RPN Architecture:
For each position on feature map, use k anchors with different scales/ratios:
Common: 3 scales × 3 ratios = 9 anchors per location
RPN Outputs:
Objectness score: \(p_{\text{obj}} \in [0, 1]\) (is object?)
Box refinement: \((t_x, t_y, t_w, t_h)\)
RPN Loss:
\[\mathcal{L}(\{p_i\}, \{t_i\}) = \frac{1}{N_{\text{cls}}} \sum_i \mathcal{L}_{\text{cls}}(p_i, p_i^*) + \lambda \frac{1}{N_{\text{reg}}} \sum_i p_i^* \, \mathcal{L}_{\text{reg}}(t_i, t_i^*)\]
where \(p_i^* = 1\) if the anchor is positive (IoU > 0.7 with a ground-truth box).
Training Strategy (4-step alternating):
Train RPN
Train Fast R-CNN with RPN proposals
Fine-tune RPN with fixed detector
Fine-tune Fast R-CNN with fixed RPN
Modern: Joint end-to-end training with shared conv layers.
Performance: 200ms per image (GPU), 73.2% mAP (PASCAL VOC)
3. YOLO Family: One-Stage Detectors
A. YOLOv1 (2016) - You Only Look Once
Philosophy: Treat detection as regression problem.
Architecture:
Divide image into \(S \times S\) grid (e.g., 7×7)
Each cell predicts:
\(B\) bounding boxes (e.g., 2)
Box confidence: \(P(\text{Object}) \times \text{IoU}\)
\(C\) class probabilities
Output Tensor: \(S \times S \times (B \cdot 5 + C)\)
For \(S=7, B=2, C=20\): \(7 \times 7 \times 30\)
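The tensor bookkeeping above can be checked numerically. The per-cell layout assumed below, `[box1(5), box2(5), class_probs(20)]`, is one common convention for illustration:

```python
import numpy as np

S, B, C = 7, 2, 20
depth = B * 5 + C                 # each box: (x, y, w, h, confidence)
output = np.zeros((S, S, depth))  # full YOLOv1 output tensor

# Decode a single cell under the assumed layout:
rng = np.random.default_rng(0)
cell = rng.random(depth)
boxes = cell[:B * 5].reshape(B, 5)   # rows: (x, y, w, h, confidence)
class_probs = cell[B * 5:]           # conditional class probabilities
# Class-specific confidence = P(object) * IoU * P(class | object)
scores = boxes[:, 4:5] * class_probs[np.newaxis, :]   # shape (B, C)
```

At test time, thresholding `scores` (then NMS) yields the final detections; the class probabilities are shared across the cell's \(B\) boxes.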
Loss Function (Multi-part):
\[\begin{aligned}\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[(x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2\right] \\ &+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[\left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2\right] \\ &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} (C_i - \hat{C}_i)^2 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} (C_i - \hat{C}_i)^2 \\ &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} (p_i(c) - \hat{p}_i(c))^2\end{aligned}\]
Weight terms:
\(\lambda_{\text{coord}} = 5\): Increase localization loss importance
\(\lambda_{\text{noobj}} = 0.5\): Decrease background confidence loss
\(\sqrt{w}, \sqrt{h}\): Make loss more sensitive to small box errors
Advantages:
Extremely fast: 45 FPS (real-time)
Global context (sees entire image)
Fewer false positives on background
Limitations:
Struggles with small objects (grid limitation)
Each cell can only detect one object
Lower mAP than two-stage methods
B. YOLOv2 / YOLO9000 (2016)
Improvements:
Batch Normalization: After every conv layer (+2% mAP)
High-Resolution Classifier: Pre-train on 448×448 instead of 224×224
Anchor Boxes: Like Faster R-CNN (use k-means on dataset to find anchors)
Multi-Scale Training: Train on {320, 352, …, 608} randomly
Passthrough Layer: Concat high-res features for small objects
Dimension Priors:
Run k-means (k=5) on training boxes with IoU distance:
\[d(\text{box}, \text{centroid}) = 1 - \text{IoU}(\text{box}, \text{centroid})\]
Learns dataset-specific anchor shapes (e.g., tall for people, wide for cars).
C. YOLOv3 (2018)
Multi-Scale Predictions:
Detect at 3 scales using Feature Pyramid Network (FPN):
Large objects: 13×13 grid (stride 32)
Medium objects: 26×26 grid (stride 16)
Small objects: 52×52 grid (stride 8)
9 anchors total: 3 per scale
Darknet-53 Backbone:
53 conv layers with residual connections (similar to ResNet).
Logistic Regression for Objectness:
Replace softmax over classes with independent sigmoids:
\[P(\text{class}_c) = \sigma(t_c)\]
This allows one box to belong to multiple classes (e.g., "Woman" + "Person").
Performance: 33 ms (30 FPS), 57.9% AP50 (COCO)
D. YOLOv4 (2020) - Bag of Freebies/Specials
Bag of Freebies (no inference cost):
Mosaic data augmentation (4 images → 1)
Self-adversarial training
CIoU loss (Complete IoU)
Label smoothing
Bag of Specials (slight cost):
Mish activation: \(\text{Mish}(x) = x \cdot \tanh(\ln(1 + e^x))\)
CSPDarknet53 backbone (Cross-Stage Partial)
SPP (Spatial Pyramid Pooling)
PAN (Path Aggregation Network)
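Mish from the list above is a one-liner in numpy. This sketch uses `log1p` for accuracy near zero; a production version would also guard `exp` against overflow for large positive inputs:

```python
import numpy as np

def mish(x):
    """Mish activation: x * tanh(softplus(x)) = x * tanh(ln(1 + e^x)).
    Smooth, non-monotonic, and bounded below (unlike ReLU)."""
    return x * np.tanh(np.log1p(np.exp(x)))

vals = mish(np.array([-2.0, 0.0, 2.0]))
```

Unlike ReLU, Mish lets small negative values pass through with a small negative response, which is credited with smoother loss landscapes.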
CIoU Loss (Complete IoU):
\[\mathcal{L}_{\text{CIoU}} = 1 - \text{IoU} + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v\]
where:
\(\rho\): Euclidean distance between box centers
\(c\): Diagonal of smallest enclosing box
\(v = \frac{4}{\pi^2} (\arctan \frac{w^{gt}}{h^{gt}} - \arctan \frac{w}{h})^2\): Aspect ratio consistency
Performance: 65 FPS, 43.5% AP (COCO)
E. YOLOv5-v8 (Modern)
YOLOv8 Architecture (latest):
Input (640×640)
    ↓
CSPDarknet Backbone (feature extraction)
    ↓
C2f modules (faster C3)
    ↓
PAN-FPN Neck (multi-scale fusion)
    ↓
Decoupled Head (separate cls/box branches)
    ↓
{bbox, objectness, class} predictions
Anchor-Free Detection:
Direct regression of box coordinates from grid cells (no predefined anchors).
TAL (Task-Aligned Learning):
\[t = s^{\alpha} \cdot u^{\beta}\]
where:
\(s\): Classification score
\(u\): IoU
\(\alpha, \beta\): Hyperparameters
Aligns classification and localization quality.
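A tiny sketch of the alignment metric; the `alpha=1, beta=6` defaults follow the TOOD paper, and should be treated as tunable hyperparameters:

```python
def task_alignment(s, u, alpha=1.0, beta=6.0):
    """Task-alignment metric t = s^alpha * u^beta, where s is the
    classification score and u the IoU of the predicted box."""
    return s ** alpha * u ** beta

# A confidently classified but poorly localized candidate ranks below one
# that is reasonably good at both tasks:
t_cls_only = task_alignment(0.9, 0.5)   # high score, mediocre IoU
t_balanced = task_alignment(0.6, 0.8)   # moderate score, good IoU
```

With \(\beta \gg \alpha\), localization quality dominates the ranking, which is exactly the "alignment" the loss is designed to enforce.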
4. Loss Functions Evolution
| Loss | Formula | Focus |
|---|---|---|
| L1 | \(\lvert x - \hat{x} \rvert\) | Simple, not scale-invariant |
| Smooth L1 | \(\begin{cases} 0.5x^2 & \lvert x \rvert < 1 \\ \lvert x \rvert - 0.5 & \text{else} \end{cases}\) | Less sensitive to outliers |
| IoU | \(1 - \frac{\text{Intersection}}{\text{Union}}\) | Invariant to scale |
| GIoU | \(\text{IoU} - \frac{\lvert C \setminus (A \cup B) \rvert}{\lvert C \rvert}\) | Handles non-overlapping boxes |
| DIoU | \(\text{IoU} - \frac{d^2}{c^2}\) | Minimizes center distance |
| CIoU | \(\text{DIoU} - \alpha v\) | Aspect ratio consistency |
Why IoU-based losses?
Traditional L1/L2 losses on \((x, y, w, h)\) don't directly optimize the detection metric (IoU).
GIoU (Generalized IoU) provides a gradient even when boxes don't overlap:
\[\text{GIoU} = \text{IoU} - \frac{|C \setminus (A \cup B)|}{|C|}\]
where \(C\) is the smallest enclosing box.
5. Evaluation Metrics
Precision & Recall:
\[\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}\]
Average Precision (AP):
Area under the Precision-Recall curve:
\[\text{AP} = \int_0^1 p(r) \, dr\]
mAP (mean AP): Average AP across all classes
AP@IoU=0.5 (AP50): Detection correct if IoU ≥ 0.5
AP@[0.5:0.95] (COCO metric): Average over IoU thresholds {0.5, 0.55, …, 0.95}
Why AP, not accuracy?
Handles class imbalance
Captures both precision and recall
Threshold-independent
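A sketch of how the COCO metric averages AP over IoU thresholds. `ap_at` here is a hypothetical stand-in: it pretends AP degrades linearly as the IoU threshold tightens, whereas real values come from the PR curve at each threshold:

```python
import numpy as np

# COCO-style AP: average AP over IoU thresholds 0.50, 0.55, ..., 0.95
thresholds = np.linspace(0.5, 0.95, 10)

def ap_at(iou_thresh):
    """Hypothetical per-threshold AP for a detector whose boxes are
    good enough at IoU=0.5 but degrade at stricter thresholds."""
    return max(0.0, 0.8 - (iou_thresh - 0.5))

coco_ap = np.mean([ap_at(t) for t in thresholds])  # AP@[0.5:0.95]
ap50 = ap_at(0.5)                                   # AP50
```

The spread between `ap50` and `coco_ap` is informative: a large gap means the detector finds objects but localizes them loosely.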
6. Modern Techniques
A. Feature Pyramid Networks (FPN):
Combine features from multiple scales:
Bottom-up:  C2 → C3 → C4 → C5
             ↓    ↓    ↓    ↓
Top-down:   P2 ← P3 ← P4 ← P5
Lateral connections with 1Γ1 conv for dimension matching.
B. Focal Loss (RetinaNet):
Address class imbalance:
\[\text{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)\]
where \(p_t = p\) if \(y=1\), else \(1-p\).
Down-weights easy examples (high \(p_t\)), focuses on hard negatives.
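A numpy sketch of binary focal loss showing the down-weighting effect; `alpha=0.25, gamma=2` are the RetinaNet defaults:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t),
    with p the predicted foreground probability and y in {0, 1}."""
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t + 1e-12)

# An easy, well-classified negative contributes far less than a hard one:
easy_neg = focal_loss(np.array([0.1]), np.array([0]))[0]
hard_neg = focal_loss(np.array([0.9]), np.array([0]))[0]
```

With \(\gamma = 2\), the easy negative's loss is suppressed by a factor of \((1 - p_t)^2 = 0.01\), so the thousands of background anchors no longer drown out the few hard examples.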
C. Deformable Convolutions:
Learn spatial sampling offsets:
\[y(p_0) = \sum_{n} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)\]
where \(\Delta p_n\) are learned offsets. Adapts to object deformation.
D. Attention Mechanisms:
Spatial attention: Where to look (e.g., CBAM)
Channel attention: What features matter (e.g., SE-Net)
7. Comparison: Two-Stage vs One-Stage
| Aspect | Two-Stage (Faster R-CNN) | One-Stage (YOLO/SSD) |
|---|---|---|
| Speed | Slower (region proposals) | Faster (direct regression) |
| Accuracy | Higher mAP | Lower mAP (improving) |
| Small Objects | Better (RoI pooling) | Challenging |
| Complexity | More complex | Simpler |
| Use Case | High-accuracy needed | Real-time critical |
Modern Trend: The gap is closing; YOLOv8 matches Faster R-CNN accuracy while being much faster.
import numpy as np
from typing import List, Tuple
from dataclasses import dataclass
@dataclass
class BoundingBox:
"""Bounding box representation"""
x: float # Top-left x
y: float # Top-left y
width: float
height: float
confidence: float
class_id: int
class_name: str
@property
def x_min(self) -> float:
return self.x
@property
def y_min(self) -> float:
return self.y
@property
def x_max(self) -> float:
return self.x + self.width
@property
def y_max(self) -> float:
return self.y + self.height
@property
def area(self) -> float:
return self.width * self.height
def to_xyxy(self) -> Tuple[float, float, float, float]:
"""Convert to [x_min, y_min, x_max, y_max] format"""
return (self.x_min, self.y_min, self.x_max, self.y_max)
def to_xywh(self) -> Tuple[float, float, float, float]:
"""Convert to [x, y, width, height] format"""
return (self.x, self.y, self.width, self.height)
def compute_iou(box1: BoundingBox, box2: BoundingBox) -> float:
"""Compute Intersection over Union (IoU)"""
# Intersection coordinates
x_min = max(box1.x_min, box2.x_min)
y_min = max(box1.y_min, box2.y_min)
x_max = min(box1.x_max, box2.x_max)
y_max = min(box1.y_max, box2.y_max)
# Intersection area
if x_max < x_min or y_max < y_min:
return 0.0
intersection = (x_max - x_min) * (y_max - y_min)
# Union area
union = box1.area + box2.area - intersection
return intersection / union if union > 0 else 0.0
# Test IoU
box1 = BoundingBox(10, 10, 50, 50, 0.9, 0, "person")
box2 = BoundingBox(30, 30, 50, 50, 0.8, 0, "person")
box3 = BoundingBox(100, 100, 50, 50, 0.7, 1, "car")
print(f"IoU(box1, box2) = {compute_iou(box1, box2):.3f} # Overlapping")
print(f"IoU(box1, box3) = {compute_iou(box1, box3):.3f} # Non-overlapping")
Non-Maximum Suppression (NMS)
def non_max_suppression(boxes: List[BoundingBox], iou_threshold: float = 0.5) -> List[BoundingBox]:
"""Apply Non-Maximum Suppression to remove duplicate detections"""
if not boxes:
return []
# Sort by confidence (highest first)
boxes = sorted(boxes, key=lambda b: b.confidence, reverse=True)
keep = []
while boxes:
# Take box with highest confidence
best_box = boxes.pop(0)
keep.append(best_box)
# Remove boxes with high IoU
boxes = [
box for box in boxes
if compute_iou(best_box, box) < iou_threshold
or box.class_id != best_box.class_id # Different class
]
return keep
# Test NMS
detections = [
BoundingBox(10, 10, 50, 50, 0.95, 0, "person"),
BoundingBox(12, 12, 50, 50, 0.90, 0, "person"), # Similar to first
BoundingBox(15, 15, 50, 50, 0.85, 0, "person"), # Similar to first
BoundingBox(100, 100, 50, 50, 0.92, 1, "car"),
]
filtered = non_max_suppression(detections, iou_threshold=0.5)
print(f"Before NMS: {len(detections)} boxes")
print(f"After NMS: {len(filtered)} boxes")
print("\nKept boxes:")
for box in filtered:
print(f" {box.class_name} @ ({box.x:.0f}, {box.y:.0f}): {box.confidence:.2f}")
# Advanced NMS Variants and IoU Implementations
import numpy as np
from typing import List, Tuple
import matplotlib.pyplot as plt
# ============================================================
# 1. IoU Variants Implementation
# ============================================================
def compute_giou(box1: BoundingBox, box2: BoundingBox) -> float:
"""
Generalized IoU (GIoU) - handles non-overlapping boxes.
    GIoU = IoU - |C \ (A ∪ B)| / |C|
where C is the smallest enclosing box.
"""
# Standard IoU calculation
iou = compute_iou(box1, box2)
# Smallest enclosing box
x_min = min(box1.x_min, box2.x_min)
y_min = min(box1.y_min, box2.y_min)
x_max = max(box1.x_max, box2.x_max)
y_max = max(box1.y_max, box2.y_max)
c_area = (x_max - x_min) * (y_max - y_min)
    # Union (recompute intersection explicitly)
    x_min_i = max(box1.x_min, box2.x_min)
    y_min_i = max(box1.y_min, box2.y_min)
    x_max_i = min(box1.x_max, box2.x_max)
    y_max_i = min(box1.y_max, box2.y_max)
    intersection = max(0, x_max_i - x_min_i) * max(0, y_max_i - y_min_i)
    union = box1.area + box2.area - intersection
giou = iou - (c_area - union) / c_area if c_area > 0 else iou
return giou
def compute_diou(box1: BoundingBox, box2: BoundingBox) -> float:
"""
Distance IoU (DIoU) - considers center distance.
    DIoU = IoU - ρ²(b, b_gt) / c²
    where ρ is the Euclidean distance between centers,
    c is the diagonal of the smallest enclosing box.
"""
iou = compute_iou(box1, box2)
# Center points
cx1 = box1.x + box1.width / 2
cy1 = box1.y + box1.height / 2
cx2 = box2.x + box2.width / 2
cy2 = box2.y + box2.height / 2
# Center distance
center_dist_sq = (cx1 - cx2) ** 2 + (cy1 - cy2) ** 2
# Smallest enclosing box diagonal
x_min = min(box1.x_min, box2.x_min)
y_min = min(box1.y_min, box2.y_min)
x_max = max(box1.x_max, box2.x_max)
y_max = max(box1.y_max, box2.y_max)
c_diag_sq = (x_max - x_min) ** 2 + (y_max - y_min) ** 2
diou = iou - center_dist_sq / c_diag_sq if c_diag_sq > 0 else iou
return diou
def compute_ciou(box1: BoundingBox, box2: BoundingBox) -> float:
"""
Complete IoU (CIoU) - includes aspect ratio consistency.
    CIoU = DIoU - α·v
    where v measures aspect ratio consistency,
    α is a trade-off parameter.
"""
diou = compute_diou(box1, box2)
# Aspect ratio term
v = (4 / (np.pi ** 2)) * (
np.arctan(box1.width / (box1.height + 1e-7)) -
np.arctan(box2.width / (box2.height + 1e-7))
) ** 2
# Trade-off parameter
iou = compute_iou(box1, box2)
alpha = v / (1 - iou + v + 1e-7)
ciou = diou - alpha * v
return ciou
# ============================================================
# 2. Advanced NMS Variants
# ============================================================
def soft_nms(boxes: List[BoundingBox],
sigma: float = 0.5,
score_threshold: float = 0.001) -> List[BoundingBox]:
"""
Soft-NMS: Decay scores instead of hard suppression.
Instead of removing boxes, decay their confidence:
    s_i = s_i · exp(-IoU²(M, b_i) / σ)
Better for occluded objects.
"""
if not boxes:
return []
# Create mutable copy with scores
boxes_with_scores = [(box, box.confidence) for box in boxes]
keep = []
while boxes_with_scores:
# Find box with max score
max_idx = max(range(len(boxes_with_scores)),
key=lambda i: boxes_with_scores[i][1])
best_box, best_score = boxes_with_scores.pop(max_idx)
if best_score < score_threshold:
break
keep.append(best_box)
# Decay scores of remaining boxes
updated = []
for box, score in boxes_with_scores:
if box.class_id == best_box.class_id:
iou = compute_iou(best_box, box)
# Gaussian decay
new_score = score * np.exp(-(iou ** 2) / sigma)
updated.append((box, new_score))
else:
updated.append((box, score))
boxes_with_scores = updated
return keep
def nms_with_giou(boxes: List[BoundingBox],
iou_threshold: float = 0.5) -> List[BoundingBox]:
"""NMS using GIoU instead of IoU for better overlap handling."""
if not boxes:
return []
boxes = sorted(boxes, key=lambda b: b.confidence, reverse=True)
keep = []
while boxes:
best_box = boxes.pop(0)
keep.append(best_box)
boxes = [
box for box in boxes
if compute_giou(best_box, box) < iou_threshold
or box.class_id != best_box.class_id
]
return keep
# ============================================================
# 3. Visualization of IoU Variants
# ============================================================
# Create test boxes
box_a = BoundingBox(20, 20, 60, 60, 0.9, 0, "obj")
box_b = BoundingBox(50, 50, 60, 60, 0.8, 0, "obj") # Overlapping
box_c = BoundingBox(100, 20, 40, 80, 0.85, 0, "obj") # Non-overlapping
boxes_to_test = [
("Overlapping", box_a, box_b),
("Non-overlapping", box_a, box_c),
]
print("="*70)
print("IoU VARIANT COMPARISON")
print("="*70)
for scenario, b1, b2 in boxes_to_test:
iou = compute_iou(b1, b2)
giou = compute_giou(b1, b2)
diou = compute_diou(b1, b2)
ciou = compute_ciou(b1, b2)
print(f"\n{scenario}:")
print(f" IoU: {iou:7.4f}")
print(f" GIoU: {giou:7.4f} (gradient for non-overlap: {giou if iou == 0 else 'N/A'})")
print(f" DIoU: {diou:7.4f} (considers center distance)")
print(f" CIoU: {ciou:7.4f} (aspect ratio consistency)")
print("\n" + "="*70)
print("KEY INSIGHTS")
print("="*70)
print("• IoU: Classic metric, but gradient vanishes when boxes don't overlap")
print("• GIoU: Provides gradient even for non-overlapping boxes")
print("• DIoU: Faster convergence by minimizing center distance")
print("• CIoU: Best for training: matches aspect ratio + position + overlap")
print("="*70)
# ============================================================
# 4. NMS Variants Comparison
# ============================================================
# Create clustered detections (simulating multiple detections of same object)
detections_clustered = [
BoundingBox(50, 50, 100, 100, 0.95, 0, "person"),
BoundingBox(52, 52, 102, 98, 0.93, 0, "person"),
BoundingBox(48, 51, 98, 102, 0.91, 0, "person"),
BoundingBox(55, 48, 95, 105, 0.88, 0, "person"),
BoundingBox(200, 200, 80, 80, 0.90, 1, "car"),
BoundingBox(202, 198, 82, 82, 0.87, 1, "car"),
]
print("\n" + "="*70)
print("NMS VARIANT COMPARISON")
print("="*70)
print(f"Original detections: {len(detections_clustered)}")
# Standard NMS
standard_nms = non_max_suppression(detections_clustered.copy(), iou_threshold=0.5)
print(f"\nStandard NMS: {len(standard_nms)} boxes kept")
# Soft-NMS
soft_nms_result = soft_nms(detections_clustered.copy(), sigma=0.5, score_threshold=0.3)
print(f"Soft-NMS: {len(soft_nms_result)} boxes kept (gentler suppression)")
# GIoU-NMS
giou_nms_result = nms_with_giou(detections_clustered.copy(), iou_threshold=0.5)
print(f"GIoU-NMS: {len(giou_nms_result)} boxes kept (better for difficult cases)")
print("\n" + "="*70)
print("RECOMMENDATIONS")
print("="*70)
print("• Standard NMS: Fast, works well for most cases")
print("• Soft-NMS: Better for occluded/crowded scenes (keeps more boxes)")
print("• GIoU-NMS: More robust to box orientation/aspect ratio")
print("="*70)
YOLO Object Detector
# YOLO with Ultralytics (requires installation)
'''
from ultralytics import YOLO
import cv2
# Load YOLOv8 model
model = YOLO('yolov8n.pt') # nano model (fastest)
# Other options: yolov8s, yolov8m, yolov8l, yolov8x
# Detect objects in image
results = model('path/to/image.jpg')
# Process results
for result in results:
boxes = result.boxes # Bounding boxes
for box in boxes:
x1, y1, x2, y2 = box.xyxy[0]
conf = box.conf[0]
cls = box.cls[0]
print(f"Detected {model.names[int(cls)]} at ({x1:.0f}, {y1:.0f}) with confidence {conf:.2f}")
# Real-time detection from webcam
cap = cv2.VideoCapture(0)
while True:
ret, frame = cap.read()
if not ret:
break
results = model(frame, stream=True)
for result in results:
annotated = result.plot() # Draw boxes
cv2.imshow('YOLO', annotated)
if cv2.waitKey(1) & 0xFF == ord('q'):
break
cap.release()
cv2.destroyAllWindows()
'''
print("YOLOv8 detection example (commented - requires ultralytics)")
print("\nYOLO Models:")
print(" yolov8n - Nano (fastest, least accurate)")
print(" yolov8s - Small")
print(" yolov8m - Medium")
print(" yolov8l - Large")
print(" yolov8x - Extra Large (slowest, most accurate)")
Custom Object Detector
class ObjectDetector:
"""Simple object detection wrapper"""
def __init__(self, conf_threshold: float = 0.5, iou_threshold: float = 0.5):
self.conf_threshold = conf_threshold
self.iou_threshold = iou_threshold
self.class_names = self._load_class_names()
def _load_class_names(self) -> List[str]:
"""Load COCO class names"""
# COCO 80 classes (subset shown)
return [
'person', 'bicycle', 'car', 'motorcycle', 'airplane',
'bus', 'train', 'truck', 'boat', 'traffic light',
'cat', 'dog', 'horse', 'sheep', 'cow'
# ... 65 more classes
]
def detect(self, image: np.ndarray) -> List[BoundingBox]:
"""Detect objects in image"""
# Simulate detections
raw_detections = self._simulate_detections()
# Filter by confidence
filtered = [d for d in raw_detections if d.confidence >= self.conf_threshold]
# Apply NMS
final_detections = non_max_suppression(filtered, self.iou_threshold)
return final_detections
def _simulate_detections(self) -> List[BoundingBox]:
"""Simulate raw model output"""
# In production: actual model inference
return [
BoundingBox(50, 50, 100, 150, 0.95, 0, "person"),
BoundingBox(52, 52, 100, 150, 0.92, 0, "person"), # Duplicate
BoundingBox(200, 100, 80, 60, 0.88, 2, "car"),
BoundingBox(150, 300, 50, 50, 0.76, 10, "cat"),
BoundingBox(400, 200, 120, 100, 0.42, 3, "motorcycle"), # Low conf
]
def visualize(self, image: np.ndarray, boxes: List[BoundingBox]) -> np.ndarray:
"""Draw boxes on image"""
# In production: use cv2.rectangle() to draw boxes
print(f"\nWould draw {len(boxes)} boxes on image:")
for box in boxes:
print(f" {box.class_name}: ({box.x:.0f}, {box.y:.0f}, {box.width:.0f}, {box.height:.0f}) - {box.confidence:.2f}")
return image
# Test detector
detector = ObjectDetector(conf_threshold=0.5, iou_threshold=0.5)
# Dummy image
image = np.zeros((640, 640, 3), dtype=np.uint8)
detections = detector.detect(image)
print(f"\nDetected {len(detections)} objects:")
for det in detections:
print(f" {det.class_name}: {det.confidence:.2%} at ({det.x:.0f}, {det.y:.0f})")
# Visualize
annotated = detector.visualize(image, detections)
Evaluation Metrics
def compute_precision_recall(predictions: List[BoundingBox],
ground_truth: List[BoundingBox],
iou_threshold: float = 0.5) -> Tuple[float, float]:
"""Compute precision and recall"""
true_positives = 0
matched_gt = set()
for pred in predictions:
best_iou = 0
best_gt_idx = -1
for idx, gt in enumerate(ground_truth):
if gt.class_id != pred.class_id:
continue
iou = compute_iou(pred, gt)
if iou > best_iou:
best_iou = iou
best_gt_idx = idx
if best_iou >= iou_threshold and best_gt_idx not in matched_gt:
true_positives += 1
matched_gt.add(best_gt_idx)
false_positives = len(predictions) - true_positives
false_negatives = len(ground_truth) - len(matched_gt)
precision = true_positives / (true_positives + false_positives) if predictions else 0
recall = true_positives / (true_positives + false_negatives) if ground_truth else 0
return precision, recall
def compute_ap(precisions: List[float], recalls: List[float]) -> float:
"""Compute Average Precision (AP)"""
# Sort by recall
sorted_indices = np.argsort(recalls)
recalls = np.array(recalls)[sorted_indices]
precisions = np.array(precisions)[sorted_indices]
# Compute AP using 11-point interpolation
ap = 0
for t in np.arange(0, 1.1, 0.1):
if np.sum(recalls >= t) == 0:
p = 0
else:
p = np.max(precisions[recalls >= t])
ap += p / 11
return ap
# Test metrics
pred_boxes = [
BoundingBox(10, 10, 50, 50, 0.9, 0, "person"),
BoundingBox(100, 100, 50, 50, 0.8, 1, "car"),
]
gt_boxes = [
BoundingBox(12, 12, 50, 50, 1.0, 0, "person"),
BoundingBox(102, 102, 50, 50, 1.0, 1, "car"),
BoundingBox(200, 200, 50, 50, 1.0, 2, "dog"), # Missed
]
precision, recall = compute_precision_recall(pred_boxes, gt_boxes)
print(f"Precision: {precision:.2%}")
print(f"Recall: {recall:.2%}")
print(f"F1 Score: {2 * precision * recall / (precision + recall):.2%}")
Production Deployment
import time
from collections import deque
class ProductionDetector:
"""Production-ready object detector"""
def __init__(self, model_name: str = "yolov8n"):
self.model_name = model_name
self.detector = ObjectDetector()
self.stats = {
"total_images": 0,
"total_detections": 0,
"avg_inference_time": 0,
"fps_history": deque(maxlen=30)
}
def detect_with_timing(self, image: np.ndarray) -> Tuple[List[BoundingBox], float]:
"""Detect with performance tracking"""
start = time.time()
detections = self.detector.detect(image)
inference_time = time.time() - start
# Update stats
self.stats["total_images"] += 1
self.stats["total_detections"] += len(detections)
self.stats["fps_history"].append(1 / inference_time if inference_time > 0 else 0)
self.stats["avg_inference_time"] = (
(self.stats["avg_inference_time"] * (self.stats["total_images"] - 1) + inference_time)
/ self.stats["total_images"]
)
return detections, inference_time
def get_performance_stats(self) -> dict:
"""Get performance statistics"""
avg_fps = np.mean(self.stats["fps_history"]) if self.stats["fps_history"] else 0
return {
"total_images": self.stats["total_images"],
"total_detections": self.stats["total_detections"],
"avg_detections_per_image": (
self.stats["total_detections"] / max(self.stats["total_images"], 1)
),
"avg_inference_time_ms": self.stats["avg_inference_time"] * 1000,
"avg_fps": avg_fps
}
# Test production detector
prod_detector = ProductionDetector()
# Process images
for i in range(10):
image = np.zeros((640, 640, 3), dtype=np.uint8)
    detections, inference_time = prod_detector.detect_with_timing(image)
# Print stats
stats = prod_detector.get_performance_stats()
print("\nPerformance Statistics:")
print(f" Total Images: {stats['total_images']}")
print(f" Total Detections: {stats['total_detections']}")
print(f" Avg Detections/Image: {stats['avg_detections_per_image']:.1f}")
print(f" Avg Inference Time: {stats['avg_inference_time_ms']:.2f}ms")
print(f" Avg FPS: {stats['avg_fps']:.1f}")
# Anchor Generation and Assignment
import numpy as np
import matplotlib.pyplot as plt
from typing import List, Tuple
# ============================================================
# 1. Anchor Box Generation (Faster R-CNN style)
# ============================================================
def generate_anchors(base_size: int = 16,
ratios: List[float] = [0.5, 1.0, 2.0],
scales: List[int] = [8, 16, 32]) -> np.ndarray:
"""
Generate anchor boxes with different scales and aspect ratios.
Parameters:
-----------
base_size : Base anchor size
ratios : Aspect ratios (h/w)
scales : Scales relative to base_size
Returns:
--------
anchors : (k, 4) array of anchors in (x_min, y_min, x_max, y_max) format
"""
anchors = []
for scale in scales:
for ratio in ratios:
# Compute width and height
h = base_size * scale * np.sqrt(ratio)
w = base_size * scale / np.sqrt(ratio)
# Center at (0, 0)
x_min = -w / 2
y_min = -h / 2
x_max = w / 2
y_max = h / 2
anchors.append([x_min, y_min, x_max, y_max])
return np.array(anchors)
# Generate default anchors
anchors = generate_anchors(base_size=16, ratios=[0.5, 1.0, 2.0], scales=[8, 16, 32])
print("="*70)
print("ANCHOR BOX GENERATION")
print("="*70)
print(f"Generated {len(anchors)} anchors:")
print(f"  3 aspect ratios × 3 scales = 9 anchors per position")
print("\nAnchor dimensions (width × height):")
for idx, anchor in enumerate(anchors):
w = anchor[2] - anchor[0]
h = anchor[3] - anchor[1]
ratio = h / w
    print(f"  Anchor {idx+1}: {w:6.1f} × {h:6.1f} (ratio: {ratio:.2f})")
# ============================================================
# 2. K-Means Anchor Clustering (YOLO style)
# ============================================================
def kmeans_anchors(boxes: np.ndarray, k: int = 9, max_iters: int = 100) -> np.ndarray:
"""
Run k-means clustering on box dimensions using IoU distance.
Parameters:
-----------
boxes : (n, 2) array of (width, height)
k : Number of clusters
Returns:
--------
anchors : (k, 2) array of anchor (width, height)
"""
n = boxes.shape[0]
# Random initialization
np.random.seed(42)
anchors = boxes[np.random.choice(n, k, replace=False)]
def iou_wh(wh1, wh2):
"""IoU for width-height pairs (assuming aligned at center)"""
w1, h1 = wh1
w2, h2 = wh2
inter = np.minimum(w1, w2) * np.minimum(h1, h2)
union = w1 * h1 + w2 * h2 - inter
return inter / (union + 1e-7)
for iteration in range(max_iters):
# Assign boxes to nearest anchor
distances = np.zeros((n, k))
for i, box in enumerate(boxes):
for j, anchor in enumerate(anchors):
distances[i, j] = 1 - iou_wh(box, anchor) # Distance = 1 - IoU
assignments = np.argmin(distances, axis=1)
# Update anchors
new_anchors = np.zeros((k, 2))
for j in range(k):
cluster_boxes = boxes[assignments == j]
if len(cluster_boxes) > 0:
new_anchors[j] = cluster_boxes.mean(axis=0)
else:
new_anchors[j] = anchors[j] # Keep old if no assignment
# Check convergence
if np.allclose(anchors, new_anchors):
break
anchors = new_anchors
# Sort by area
areas = anchors[:, 0] * anchors[:, 1]
sorted_indices = np.argsort(areas)
anchors = anchors[sorted_indices]
return anchors
# Simulate COCO-like box distribution
np.random.seed(42)
n_boxes = 1000
# Generate realistic box distributions
# Small objects (people, animals): 30-100 pixels
small_boxes = np.random.uniform(30, 100, (400, 2))
# Medium objects (cars, furniture): 100-250 pixels
medium_boxes = np.random.uniform(100, 250, (400, 2))
# Large objects (buildings, scenes): 250-500 pixels
large_boxes = np.random.uniform(250, 500, (200, 2))
all_boxes = np.vstack([small_boxes, medium_boxes, large_boxes])
# Run k-means
learned_anchors = kmeans_anchors(all_boxes, k=9, max_iters=50)
print("\n" + "="*70)
print("K-MEANS ANCHOR LEARNING (YOLO-style)")
print("="*70)
print(f"Learned {len(learned_anchors)} anchors from {len(all_boxes)} boxes:")
print("\nAnchor dimensions (width × height):")
for idx, (w, h) in enumerate(learned_anchors):
ratio = h / w
area = w * h
    print(f"  Anchor {idx+1}: {w:6.1f} × {h:6.1f} (ratio: {ratio:.2f}, area: {area:8.0f})")
# ============================================================
# 3. Anchor Assignment Strategy
# ============================================================
def assign_anchors_to_gt(gt_boxes: np.ndarray,
anchors: np.ndarray,
pos_iou_thresh: float = 0.7,
neg_iou_thresh: float = 0.3) -> Tuple[np.ndarray, np.ndarray]:
"""
Assign anchors to ground truth boxes (Faster R-CNN strategy).
Parameters:
-----------
gt_boxes : (m, 4) ground truth boxes [x_min, y_min, x_max, y_max]
anchors : (n, 4) anchor boxes
Returns:
--------
labels : (n,) array {-1: ignore, 0: background, 1: object}
targets : (n, 4) box regression targets
"""
n_anchors = len(anchors)
n_gt = len(gt_boxes)
labels = -np.ones(n_anchors, dtype=np.int32) # -1 = ignore
targets = np.zeros((n_anchors, 4))
if n_gt == 0:
labels[:] = 0 # All background
return labels, targets
# Compute IoU matrix
ious = np.zeros((n_anchors, n_gt))
for i, anchor in enumerate(anchors):
for j, gt in enumerate(gt_boxes):
# Compute IoU (simplified for demonstration)
x_min = max(anchor[0], gt[0])
y_min = max(anchor[1], gt[1])
x_max = min(anchor[2], gt[2])
y_max = min(anchor[3], gt[3])
inter = max(0, x_max - x_min) * max(0, y_max - y_min)
area_a = (anchor[2] - anchor[0]) * (anchor[3] - anchor[1])
area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
union = area_a + area_g - inter
ious[i, j] = inter / (union + 1e-7)
# Assign labels
max_iou_per_anchor = ious.max(axis=1)
max_gt_per_anchor = ious.argmax(axis=1)
    # Rule 1: IoU ≥ pos_thresh → positive
    labels[max_iou_per_anchor >= pos_iou_thresh] = 1
    # Rule 2: IoU < neg_thresh → negative
    labels[max_iou_per_anchor < neg_iou_thresh] = 0
# Rule 3: For each GT, assign anchor with highest IoU
max_iou_per_gt = ious.max(axis=0)
for j in range(n_gt):
best_anchor = ious[:, j].argmax()
labels[best_anchor] = 1
max_gt_per_anchor[best_anchor] = j
# Compute box regression targets (for positive anchors)
for i in range(n_anchors):
if labels[i] == 1:
anchor = anchors[i]
gt = gt_boxes[max_gt_per_anchor[i]]
# Parameterized offsets
ax_ctr = (anchor[0] + anchor[2]) / 2
ay_ctr = (anchor[1] + anchor[3]) / 2
aw = anchor[2] - anchor[0]
ah = anchor[3] - anchor[1]
gx_ctr = (gt[0] + gt[2]) / 2
gy_ctr = (gt[1] + gt[3]) / 2
gw = gt[2] - gt[0]
gh = gt[3] - gt[1]
targets[i, 0] = (gx_ctr - ax_ctr) / aw
targets[i, 1] = (gy_ctr - ay_ctr) / ah
targets[i, 2] = np.log(gw / aw)
targets[i, 3] = np.log(gh / ah)
return labels, targets
# Test anchor assignment
test_gt = np.array([[100, 100, 200, 200], [300, 150, 450, 300]])
test_anchors = np.array([
[90, 90, 210, 210], # High IoU with GT1
[150, 150, 250, 250], # Medium IoU with GT1
[295, 145, 455, 305], # High IoU with GT2
[500, 500, 550, 550], # No overlap (background)
])
labels, targets = assign_anchors_to_gt(test_gt, test_anchors, pos_iou_thresh=0.7, neg_iou_thresh=0.3)
print("\n" + "="*70)
print("ANCHOR ASSIGNMENT EXAMPLE")
print("="*70)
print(f"Ground Truth boxes: {len(test_gt)}")
print(f"Anchors: {len(test_anchors)}")
print("\nAssignment results:")
for i, (label, target) in enumerate(zip(labels, targets)):
status = {-1: "IGNORE", 0: "BACKGROUND", 1: "OBJECT"}[label]
print(f" Anchor {i+1}: {status:12s}", end="")
if label == 1:
        print(f" → targets: ({target[0]:6.3f}, {target[1]:6.3f}, {target[2]:6.3f}, {target[3]:6.3f})")
else:
print()
print("\n" + "="*70)
print("ASSIGNMENT RULES")
print("="*70)
print("1. IoU ≥ 0.7 with any GT → POSITIVE (object)")
print("2. IoU < 0.3 with all GT → NEGATIVE (background)")
print("3. 0.3 ≤ IoU < 0.7 → IGNORE (ambiguous, don't train)")
print("4. Best anchor for each GT → POSITIVE (ensure every GT has a match)")
print("="*70)
Data Augmentation for Object DetectionΒΆ
Data augmentation is crucial for training robust object detectors. Unlike classification, augmentation must transform both images AND bounding box annotations consistently.
1. Geometric AugmentationsΒΆ
Random Horizontal FlipΒΆ
Operation: Flip image left-right
Box transformation: \(x' = W - x\), where \(W\) is image width
Use case: Objects with horizontal symmetry (cars, people)
Probability: Typically 0.5
Random Scaling and TranslationΒΆ
Scale: Resize image by factor \(s \in [s_{min}, s_{max}]\)
Translate: Shift by \((\Delta x, \Delta y)\)
Box update: \(x' = s \cdot x + \Delta x\), \(y' = s \cdot y + \Delta y\)
Typical ranges: \(s \in [0.8, 1.2]\), \(|\Delta| \leq 0.1W\)
Random RotationΒΆ
Operation: Rotate by angle \(\theta\)
Box transformation: Compute corners, rotate, find new axis-aligned bbox
Challenge: Bounding box becomes larger after rotation
Alternative: Use oriented bounding boxes (OBB)
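The corner-rotation step above can be sketched as follows. This is a minimal NumPy sketch, assuming boxes in `[x_min, y_min, x_max, y_max]` format; `rotate_boxes` is an illustrative helper name, and the sign convention of the rotation matrix depends on whether the y-axis points down.

```python
import numpy as np

def rotate_boxes(boxes: np.ndarray, angle_deg: float, W: int, H: int) -> np.ndarray:
    """Rotate axis-aligned boxes about the image center and return the
    enclosing axis-aligned boxes (which are generally larger)."""
    theta = np.deg2rad(angle_deg)
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    cx, cy = W / 2.0, H / 2.0
    out = np.empty_like(boxes, dtype=np.float64)
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        # Four corners of the original box
        corners = np.array([[x1, y1], [x2, y1], [x2, y2], [x1, y2]], dtype=np.float64)
        # Rotate each corner about the image center
        shifted = corners - [cx, cy]
        rotated = shifted @ np.array([[cos_t, -sin_t], [sin_t, cos_t]]).T + [cx, cy]
        # New axis-aligned bbox = min/max over rotated corners
        out[i] = [rotated[:, 0].min(), rotated[:, 1].min(),
                  rotated[:, 0].max(), rotated[:, 1].max()]
    return out
```

Note that for any angle that is not a multiple of 90°, the enclosing box is strictly larger than the original, which is why oriented bounding boxes are sometimes preferred.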
2. Mosaic Augmentation (YOLO v4+)ΒΆ
Combines 4 images into one mosaic:
Procedure:
Sample 4 images
Resize to random scales
Place at 4 quadrants with random center point
Adjust all bounding boxes to new coordinates
Benefits:
Exposes model to more objects per batch
Forces model to learn from different scales simultaneously
Improves small object detection
Reduces batch normalization artifacts
3. MixUp for Object DetectionΒΆ
Blend two images with ratio \(\lambda \sim \mathrm{Beta}(\alpha, \alpha)\):
Modifications for detection:
Keep all bounding boxes from both images
Optional: Weight box confidence by \(\lambda\)
Typical \(\alpha = 1.5\) (vs. \(\alpha = 1.0\) in classification)
4. Copy-Paste AugmentationΒΆ
From instance segmentation masks:
Extract object from image 1 using mask
Paste onto image 2 at random location
Add bounding box to image 2 annotations
Advanced: Use Poisson blending for seamless integration
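The extract-and-paste steps above can be sketched as a small helper. This is a simplified sketch (hard paste rather than Poisson blending); `copy_paste` and its argument layout are illustrative, and masks are assumed to be binary arrays aligned with the source image.

```python
import numpy as np

def copy_paste(src_img, src_mask, dst_img, dst_boxes, paste_xy):
    """Paste the masked object from src_img onto dst_img at paste_xy
    (top-left corner) and append its new bounding box."""
    ys, xs = np.nonzero(src_mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    patch = src_img[y0:y1, x0:x1]
    patch_mask = src_mask[y0:y1, x0:x1].astype(bool)
    px, py = paste_xy
    h, w = patch.shape[:2]
    H, W = dst_img.shape[:2]
    # Clip the patch so it stays inside the destination image
    h, w = min(h, H - py), min(w, W - px)
    out = dst_img.copy()
    region = out[py:py + h, px:px + w]
    region[patch_mask[:h, :w]] = patch[:h, :w][patch_mask[:h, :w]]
    new_box = np.array([[px, py, px + w, py + h]], dtype=np.float64)
    new_boxes = np.vstack([dst_boxes, new_box]) if len(dst_boxes) else new_box
    return out, new_boxes
```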
5. Color/Photometric AugmentationsΒΆ
These don't affect bounding boxes:
| Augmentation | Operation | Range |
|---|---|---|
| Brightness | \(I' = I + \beta\) | \(\beta \in [-30, 30]\) |
| Contrast | \(I' = \alpha I\) | \(\alpha \in [0.8, 1.2]\) |
| Saturation | Adjust in HSV space | \(\times [0.7, 1.3]\) |
| Hue | Shift hue channel | \(\pm 10°\) |
HSV transformation: \(I_{HSV}' = \begin{bmatrix} H + \Delta H \\ \alpha_S \cdot S \\ \alpha_V \cdot V \end{bmatrix}\)
6. Random Crop and PaddingΒΆ
IoU-based Cropping (SSD-style):ΒΆ
Sample crop with IoU \(\in \{0.1, 0.3, 0.5, 0.7, 0.9, 1.0\}\) with some GT box
Reject if crop doesn't contain any object center
Adjust boxes: clip to crop boundaries, remove if center outside
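The sampling loop above can be sketched as follows. This is a simplified sketch of SSD-style cropping (a single IoU floor rather than the full sampled set of thresholds); `iou_crop` and its defaults are illustrative.

```python
import numpy as np

def iou_crop(image, boxes, min_iou=0.3, max_tries=50, rng=None):
    """Sample a random crop whose IoU with at least one GT box meets min_iou,
    then keep only boxes whose center lies inside the crop."""
    rng = np.random.default_rng() if rng is None else rng
    H, W = image.shape[:2]
    for _ in range(max_tries):
        cw = int(rng.uniform(0.3, 1.0) * W)
        ch = int(rng.uniform(0.3, 1.0) * H)
        cx0 = int(rng.integers(0, W - cw + 1))
        cy0 = int(rng.integers(0, H - ch + 1))
        crop = np.array([cx0, cy0, cx0 + cw, cy0 + ch], dtype=np.float64)
        # IoU between the crop window and each GT box
        ix0 = np.maximum(crop[0], boxes[:, 0]); iy0 = np.maximum(crop[1], boxes[:, 1])
        ix1 = np.minimum(crop[2], boxes[:, 2]); iy1 = np.minimum(crop[3], boxes[:, 3])
        inter = np.clip(ix1 - ix0, 0, None) * np.clip(iy1 - iy0, 0, None)
        areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        ious = inter / (areas + cw * ch - inter + 1e-7)
        if ious.max() < min_iou:
            continue  # crop overlaps no object well enough
        # Keep boxes whose center falls inside the crop, shift and clip them
        cxs = (boxes[:, 0] + boxes[:, 2]) / 2
        cys = (boxes[:, 1] + boxes[:, 3]) / 2
        keep = (cxs >= crop[0]) & (cxs < crop[2]) & (cys >= crop[1]) & (cys < crop[3])
        if not keep.any():
            continue  # reject crops containing no object center
        new_boxes = boxes[keep] - [crop[0], crop[1], crop[0], crop[1]]
        new_boxes[:, [0, 2]] = np.clip(new_boxes[:, [0, 2]], 0, cw)
        new_boxes[:, [1, 3]] = np.clip(new_boxes[:, [1, 3]], 0, ch)
        return image[cy0:cy0 + ch, cx0:cx0 + cw], new_boxes
    return image, boxes  # fall back to the original if no valid crop is found
```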
Padding:ΒΆ
Add borders: top, bottom, left, right
Shift boxes by the (left, top) padding offsets; boxes stay unchanged only when padding is added on the bottom/right
Useful for preserving aspect ratio
7. Advanced TechniquesΒΆ
CutOut / Random ErasingΒΆ
Randomly mask rectangular regions
For detection: Avoid erasing object centers
Forces model to use partial information
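The center-avoiding constraint above can be sketched as a rejection loop. This is a minimal sketch; `random_erase`, the constant fill, and the fixed square size are illustrative (noise fill and random aspect ratios are also common).

```python
import numpy as np

def random_erase(image, boxes, size=40, max_tries=20, rng=None):
    """Erase one random square region, rejecting placements that cover
    any object's center, so every object stays partially visible."""
    rng = np.random.default_rng() if rng is None else rng
    H, W = image.shape[:2]
    centers_x = (boxes[:, 0] + boxes[:, 2]) / 2
    centers_y = (boxes[:, 1] + boxes[:, 3]) / 2
    out = image.copy()
    for _ in range(max_tries):
        x0 = int(rng.integers(0, max(1, W - size)))
        y0 = int(rng.integers(0, max(1, H - size)))
        covers = ((centers_x >= x0) & (centers_x < x0 + size) &
                  (centers_y >= y0) & (centers_y < y0 + size))
        if covers.any():
            continue  # would hide an object center entirely: resample
        out[y0:y0 + size, x0:x0 + size] = 0
        break
    return out
```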
AutoAugment for DetectionΒΆ
Learn augmentation policy via RL:
Search space: 20+ operations with magnitude
Optimize for mAP on validation set
Computationally expensive but effective
Test-Time Augmentation (TTA)ΒΆ
At inference:
Apply multiple augmentations to input
Run detector on each
Aggregate predictions (NMS across all)
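For the horizontal-flip case, the aggregation above can be sketched as follows. This is a minimal sketch, assuming the detector returns `(n, 5)` arrays of `[x1, y1, x2, y2, score]`; `tta_flip` is an illustrative helper, and a final NMS pass over the merged set would follow.

```python
import numpy as np

def tta_flip(detect_fn, image):
    """Run a detector on the image and its horizontal flip, map flipped
    detections back to original coordinates, and concatenate results."""
    W = image.shape[1]
    dets = detect_fn(image)
    dets_flip = detect_fn(np.fliplr(image).copy()).copy()
    # Un-flip x coordinates: x' = W - x, with x1/x2 swapped to stay ordered
    dets_flip[:, [0, 2]] = W - dets_flip[:, [2, 0]]
    return np.vstack([dets, dets_flip])
```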
8. Augmentation PipelineΒΆ
Training loop:
For each image:
1. Mosaic (prob=0.5)
2. Random scale (0.5-1.5×)
3. Random flip (prob=0.5)
4. HSV adjustment (always)
5. MixUp (prob=0.1)
6. Random crop (IoU-based)
7. Normalize and resize to input size
Key principles:
Always maintain box-image consistency
Discard boxes with area < threshold (e.g., 16 pixels)
Clip boxes to image boundaries
Remove boxes that fall entirely outside the image (zero overlap with it)
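The cleanup rules above can be sketched as one helper applied after every augmentation. This is a minimal sketch; `sanitize_boxes` and the 16-pixel default are illustrative.

```python
import numpy as np

def sanitize_boxes(boxes, W, H, min_area=16.0):
    """Clip boxes to the image, then drop boxes that are degenerate or
    smaller than min_area after clipping."""
    boxes = boxes.astype(np.float64).copy()
    boxes[:, [0, 2]] = np.clip(boxes[:, [0, 2]], 0, W)
    boxes[:, [1, 3]] = np.clip(boxes[:, [1, 3]], 0, H)
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    keep = (w > 0) & (h > 0) & (w * h >= min_area)
    return boxes[keep]
```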
9. Augmentation Strength SchedulingΒΆ
Warm-up phase (epochs 1-10):
Mild augmentation: flip, small scale, color jitter
Helps stable training start
Main phase (epochs 10-270):
Full augmentation: mosaic, mixup, aggressive scaling
Fine-tuning phase (epochs 270-300):
Disable mosaic/mixup
Only flip and mild scaling
Helps model adapt to real data distribution
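The three-phase schedule above can be expressed as a simple epoch-to-config mapping. This is a sketch; `augmentation_config`, the dictionary keys, and the exact strengths are illustrative.

```python
def augmentation_config(epoch: int, total_epochs: int = 300) -> dict:
    """Map training epoch to augmentation strengths: warm-up, main, fine-tuning."""
    if epoch < 10:  # warm-up: mild augmentation for a stable start
        return {"mosaic": 0.0, "mixup": 0.0, "flip": 0.5, "scale": (0.9, 1.1), "hsv": True}
    if epoch < total_epochs - 30:  # main phase: full augmentation
        return {"mosaic": 0.5, "mixup": 0.1, "flip": 0.5, "scale": (0.5, 1.5), "hsv": True}
    # fine-tuning: disable mosaic/mixup to match the real data distribution
    return {"mosaic": 0.0, "mixup": 0.0, "flip": 0.5, "scale": (0.9, 1.1), "hsv": True}
```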
10. Domain-Specific ConsiderationsΒΆ
| Domain | Key Augmentations | Avoid |
|---|---|---|
| Autonomous driving | Flip, scale, cutout, weather effects | Rotation (road is horizontal) |
| Retail (shelf detection) | Scale, brightness, crop | Flip (text orientation matters) |
| Aerial imagery | Rotation, flip, scale | Extreme crops (context matters) |
| Medical imaging | Rotation, flip, elastic deformation | Color jitter (color carries diagnostic info) |
Best practice: Analyze failure cases, add targeted augmentations to address them.
# Data Augmentation Implementations
import numpy as np
import matplotlib.pyplot as plt
from typing import List, Tuple
import cv2
# ============================================================
# 1. Geometric Augmentations with Box Updates
# ============================================================
def horizontal_flip(image: np.ndarray, boxes: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
"""
Flip image horizontally and update boxes.
Parameters:
-----------
image : (H, W, C) image array
boxes : (n, 4) boxes in [x_min, y_min, x_max, y_max] format
Returns:
--------
flipped_image, flipped_boxes
"""
H, W = image.shape[:2]
# Flip image
flipped_image = np.fliplr(image)
# Update boxes: x' = W - x
flipped_boxes = boxes.copy()
flipped_boxes[:, [0, 2]] = W - boxes[:, [2, 0]] # Swap and flip x coordinates
return flipped_image, flipped_boxes
def random_scale_translate(image: np.ndarray,
boxes: np.ndarray,
scale_range: Tuple[float, float] = (0.8, 1.2),
translate_range: float = 0.1) -> Tuple[np.ndarray, np.ndarray]:
"""
Randomly scale and translate image and boxes.
"""
H, W = image.shape[:2]
# Random scale and translation
scale = np.random.uniform(*scale_range)
tx = np.random.uniform(-translate_range * W, translate_range * W)
ty = np.random.uniform(-translate_range * H, translate_range * H)
# Transformation matrix
M = np.array([[scale, 0, tx],
[0, scale, ty]])
# Transform image
new_H, new_W = int(H * scale), int(W * scale)
transformed = cv2.warpAffine(image, M, (new_W, new_H))
# Transform boxes
transformed_boxes = boxes.copy()
transformed_boxes[:, [0, 2]] = transformed_boxes[:, [0, 2]] * scale + tx
transformed_boxes[:, [1, 3]] = transformed_boxes[:, [1, 3]] * scale + ty
# Clip to image boundaries
transformed_boxes[:, [0, 2]] = np.clip(transformed_boxes[:, [0, 2]], 0, new_W)
transformed_boxes[:, [1, 3]] = np.clip(transformed_boxes[:, [1, 3]], 0, new_H)
return transformed, transformed_boxes
# ============================================================
# 2. Mosaic Augmentation
# ============================================================
def create_mosaic(images: List[np.ndarray],
boxes_list: List[np.ndarray],
output_size: int = 640) -> Tuple[np.ndarray, np.ndarray]:
"""
Create mosaic from 4 images (YOLO-style).
Parameters:
-----------
images : List of 4 images
boxes_list : List of 4 box arrays (each n_i × 4)
output_size : Output mosaic size
Returns:
--------
mosaic_image, mosaic_boxes
"""
assert len(images) == 4, "Need exactly 4 images for mosaic"
# Random center point
cx = np.random.randint(output_size // 4, 3 * output_size // 4)
cy = np.random.randint(output_size // 4, 3 * output_size // 4)
mosaic = np.zeros((output_size, output_size, 3), dtype=np.uint8)
mosaic_boxes = []
# Quadrant offsets: top-left, top-right, bottom-left, bottom-right
quadrants = [
(0, 0, cx, cy), # Top-left
(cx, 0, output_size, cy), # Top-right
(0, cy, cx, output_size), # Bottom-left
(cx, cy, output_size, output_size) # Bottom-right
]
for idx, (img, boxes) in enumerate(zip(images, boxes_list)):
x1, y1, x2, y2 = quadrants[idx]
quad_w, quad_h = x2 - x1, y2 - y1
# Resize image to fit quadrant
img_resized = cv2.resize(img, (quad_w, quad_h))
# Place in mosaic
mosaic[y1:y2, x1:x2] = img_resized
# Transform boxes
if len(boxes) > 0:
H, W = img.shape[:2]
scale_x = quad_w / W
scale_y = quad_h / H
transformed_boxes = boxes.copy()
transformed_boxes[:, [0, 2]] = boxes[:, [0, 2]] * scale_x + x1
transformed_boxes[:, [1, 3]] = boxes[:, [1, 3]] * scale_y + y1
# Clip to mosaic boundaries
transformed_boxes[:, [0, 2]] = np.clip(transformed_boxes[:, [0, 2]], 0, output_size)
transformed_boxes[:, [1, 3]] = np.clip(transformed_boxes[:, [1, 3]], 0, output_size)
mosaic_boxes.append(transformed_boxes)
mosaic_boxes = np.vstack(mosaic_boxes) if mosaic_boxes else np.zeros((0, 4))
return mosaic, mosaic_boxes
# ============================================================
# 3. MixUp for Detection
# ============================================================
def mixup_detection(image1: np.ndarray, boxes1: np.ndarray,
image2: np.ndarray, boxes2: np.ndarray,
alpha: float = 1.5) -> Tuple[np.ndarray, np.ndarray]:
"""
Apply MixUp to two images and combine their boxes.
"""
# Sample mixing ratio
lam = np.random.beta(alpha, alpha)
# Mix images
mixed_image = (lam * image1 + (1 - lam) * image2).astype(np.uint8)
    # Combine boxes from both images (keep whichever sets are non-empty)
    parts = [b for b in (boxes1, boxes2) if len(b) > 0]
    mixed_boxes = np.vstack(parts) if parts else np.zeros((0, 4))
return mixed_image, mixed_boxes
# ============================================================
# 4. Color Jittering
# ============================================================
def color_jitter(image: np.ndarray,
brightness: float = 30,
contrast: Tuple[float, float] = (0.8, 1.2),
saturation: Tuple[float, float] = (0.7, 1.3),
hue: float = 10) -> np.ndarray:
"""
Apply random color jittering in HSV space.
"""
# Convert to HSV
hsv = cv2.cvtColor(image, cv2.COLOR_RGB2HSV).astype(np.float32)
# Hue shift
hsv[:, :, 0] += np.random.uniform(-hue, hue)
hsv[:, :, 0] = np.clip(hsv[:, :, 0], 0, 179)
# Saturation scaling
hsv[:, :, 1] *= np.random.uniform(*saturation)
hsv[:, :, 1] = np.clip(hsv[:, :, 1], 0, 255)
    # Value channel: multiplicative contrast scaling, then additive brightness shift
    hsv[:, :, 2] *= np.random.uniform(*contrast)
    hsv[:, :, 2] += np.random.uniform(-brightness, brightness)
    hsv[:, :, 2] = np.clip(hsv[:, :, 2], 0, 255)
# Convert back to RGB
jittered = cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2RGB)
return jittered
# ============================================================
# 5. Demonstration
# ============================================================
# Create synthetic images with boxes
def create_test_image(color, box):
"""Helper to create test image with one box"""
img = np.full((300, 300, 3), color, dtype=np.uint8)
x1, y1, x2, y2 = box
cv2.rectangle(img, (int(x1), int(y1)), (int(x2), int(y2)), (255, 255, 255), 2)
return img
# Test images
img1 = create_test_image([200, 100, 100], [50, 50, 150, 150])
boxes1 = np.array([[50, 50, 150, 150]])
img2 = create_test_image([100, 200, 100], [180, 180, 280, 280])
boxes2 = np.array([[180, 180, 280, 280]])
img3 = create_test_image([100, 100, 200], [60, 120, 160, 220])
boxes3 = np.array([[60, 120, 160, 220]])
img4 = create_test_image([200, 200, 100], [100, 50, 250, 150])
boxes4 = np.array([[100, 50, 250, 150]])
# Apply augmentations
flipped_img, flipped_boxes = horizontal_flip(img1, boxes1)
mosaic_img, mosaic_boxes = create_mosaic([img1, img2, img3, img4],
[boxes1, boxes2, boxes3, boxes4],
output_size=400)
mixed_img, mixed_boxes = mixup_detection(img1, boxes1, img2, boxes2, alpha=1.5)
jittered_img = color_jitter(img1.copy())
# Visualization
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('Object Detection Data Augmentation Examples', fontsize=16, fontweight='bold')
# Original
axes[0, 0].imshow(img1)
axes[0, 0].add_patch(plt.Rectangle((boxes1[0, 0], boxes1[0, 1]),
boxes1[0, 2] - boxes1[0, 0],
boxes1[0, 3] - boxes1[0, 1],
fill=False, edgecolor='yellow', linewidth=2))
axes[0, 0].set_title('Original Image')
axes[0, 0].axis('off')
# Horizontal Flip
axes[0, 1].imshow(flipped_img)
axes[0, 1].add_patch(plt.Rectangle((flipped_boxes[0, 0], flipped_boxes[0, 1]),
flipped_boxes[0, 2] - flipped_boxes[0, 0],
flipped_boxes[0, 3] - flipped_boxes[0, 1],
fill=False, edgecolor='yellow', linewidth=2))
axes[0, 1].set_title('Horizontal Flip')
axes[0, 1].axis('off')
# Mosaic
axes[0, 2].imshow(mosaic_img)
for box in mosaic_boxes:
axes[0, 2].add_patch(plt.Rectangle((box[0], box[1]), box[2] - box[0], box[3] - box[1],
fill=False, edgecolor='yellow', linewidth=2))
axes[0, 2].set_title(f'Mosaic ({len(mosaic_boxes)} boxes)')
axes[0, 2].axis('off')
# MixUp
axes[1, 0].imshow(mixed_img)
for box in mixed_boxes:
axes[1, 0].add_patch(plt.Rectangle((box[0], box[1]), box[2] - box[0], box[3] - box[1],
fill=False, edgecolor='yellow', linewidth=2))
axes[1, 0].set_title(f'MixUp ({len(mixed_boxes)} boxes)')
axes[1, 0].axis('off')
# Color Jitter
axes[1, 1].imshow(jittered_img)
axes[1, 1].add_patch(plt.Rectangle((boxes1[0, 0], boxes1[0, 1]),
boxes1[0, 2] - boxes1[0, 0],
boxes1[0, 3] - boxes1[0, 1],
fill=False, edgecolor='yellow', linewidth=2))
axes[1, 1].set_title('Color Jitter (HSV)')
axes[1, 1].axis('off')
# Statistics
axes[1, 2].axis('off')
stats_text = f"""
AUGMENTATION STATISTICS
Original: {len(boxes1)} box
Flip: {len(flipped_boxes)} box
Mosaic: {len(mosaic_boxes)} boxes
MixUp: {len(mixed_boxes)} boxes
KEY INSIGHTS:
• Mosaic combines 4 images
• All boxes preserved
• Coordinates transformed
• Boundaries clipped
RECOMMENDATIONS:
✓ Flip: 50% probability
✓ Mosaic: 50% in training
✓ MixUp: 10% probability
✓ Color: Always apply
✓ Disable mosaic in last
  10 epochs for fine-tuning
"""
axes[1, 2].text(0.1, 0.5, stats_text, fontsize=11, verticalalignment='center',
family='monospace', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.3))
plt.tight_layout()
plt.show()
print("="*70)
print("DATA AUGMENTATION SUMMARY")
print("="*70)
print("✓ Horizontal flip: boxes transformed correctly")
print(f"✓ Mosaic: combined {len(mosaic_boxes)} boxes from 4 images")
print(f"✓ MixUp: merged {len(mixed_boxes)} boxes")
print("✓ Color jitter: no box transformation needed")
print("="*70)
Best PracticesΒΆ
1. Model SelectionΒΆ
Real-time (>30 FPS): YOLOv8n, YOLOv8s
Balanced: YOLOv8m, DETR
High accuracy: YOLOv8x, Faster R-CNN
Edge devices: YOLOv8n with TensorRT
2. HyperparametersΒΆ
Confidence threshold: 0.25-0.5 (lower = more detections)
IoU threshold (NMS): 0.45-0.65 (lower = fewer duplicates)
Image size: 640×640 (YOLO), can go lower for speed
3. Training TipsΒΆ
Use data augmentation (mosaic, mixup)
Balance classes with weighted sampling
Train on high-resolution images
Freeze backbone initially, then fine-tune
4. OptimizationΒΆ
Convert to ONNX/TensorRT for faster inference
Batch processing when possible
Resize images to smaller sizes (320×320, 416×416)
Use FP16 precision on GPUs
Common Use CasesΒΆ
Autonomous vehicles: Detect pedestrians, cars, traffic signs
Surveillance: People counting, intrusion detection
Retail: Product detection, shelf monitoring
Manufacturing: Defect detection, quality control
Agriculture: Crop monitoring, pest detection
Key TakeawaysΒΆ
✅ Object detection = classification + localization
✅ YOLO-family models are a strong default for real-time applications
✅ NMS removes duplicate detections
✅ IoU measures box overlap (used in NMS and evaluation)
✅ mAP (mean Average Precision) is the standard metric
✅ Balance speed vs. accuracy based on use case
Next: 03_clip_embeddings.ipynb - Multimodal understanding