Computer Vision Specialization – Start Here¶
Overview of the Computer Vision track: image classification, object detection, CLIP embeddings, Stable Diffusion, and multimodal RAG.
# Install dependencies
# !pip install torch torchvision transformers pillow matplotlib
Quick Start: Image Classification¶
from PIL import Image
import requests
from io import BytesIO
# Load a sample image
def load_image_from_url(url: str) -> Image.Image:
    """Download an image from a URL and return it as an RGB PIL Image"""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return Image.open(BytesIO(response.content)).convert("RGB")
# Sample image URLs (replace with your own)
sample_images = {
    "cat": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/400px-Cat03.jpg",
    "dog": "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a3/June_odd-eyed-cat.jpg/300px-June_odd-eyed-cat.jpg"
}
# For this demo, we'll create a placeholder
print("Image loading utility ready")
print("In production, load real images using PIL or OpenCV")
Simple Classification with Transformers¶
# Example with Hugging Face Transformers (requires installation)
'''
from transformers import pipeline

# Load an image-classification model
classifier = pipeline("image-classification", model="microsoft/resnet-50")

# Classify an image
image = load_image_from_url(sample_images["cat"])
results = classifier(image)

print("Top predictions:")
for result in results[:3]:
    print(f"  {result['label']}: {result['score']:.2%}")
'''
print("Image classification example (commented - requires transformers)")
print("\nTypical output:")
print(" Egyptian cat: 87.32%")
print(" Tabby cat: 8.45%")
print(" Tiger cat: 2.11%")
Understanding Image Preprocessing¶
import numpy as np
from typing import Tuple
class ImagePreprocessor:
    """Preprocess images for model input"""

    def __init__(self, target_size: Tuple[int, int] = (224, 224)):
        self.target_size = target_size

    def resize(self, image: Image.Image) -> Image.Image:
        """Resize to target size"""
        return image.resize(self.target_size, Image.Resampling.LANCZOS)

    def normalize(self, image_array: np.ndarray) -> np.ndarray:
        """Normalize pixel values to [0, 1]"""
        return image_array.astype(np.float32) / 255.0

    def standardize(self, image_array: np.ndarray,
                    mean: Tuple[float, float, float] = (0.485, 0.456, 0.406),
                    std: Tuple[float, float, float] = (0.229, 0.224, 0.225)) -> np.ndarray:
        """Standardize using ImageNet statistics"""
        mean_array = np.array(mean).reshape(1, 1, 3)
        std_array = np.array(std).reshape(1, 1, 3)
        return (image_array - mean_array) / std_array

    def to_tensor(self, image_array: np.ndarray) -> np.ndarray:
        """Convert from HWC (Height, Width, Channels) to CHW (channels first)"""
        return np.transpose(image_array, (2, 0, 1))

    def preprocess(self, image: Image.Image) -> np.ndarray:
        """Full preprocessing pipeline"""
        # 1. Resize
        img_resized = self.resize(image)
        # 2. Convert to array
        img_array = np.array(img_resized)
        # 3. Normalize to [0, 1]
        img_normalized = self.normalize(img_array)
        # 4. Standardize (ImageNet stats)
        img_standardized = self.standardize(img_normalized)
        # 5. Convert to CHW format
        return self.to_tensor(img_standardized)
# Test preprocessor
preprocessor = ImagePreprocessor(target_size=(224, 224))
# Create a dummy image
dummy_image = Image.new('RGB', (500, 500), color='red')
processed = preprocessor.preprocess(dummy_image)
print(f"Original image size: {dummy_image.size}")
print(f"Processed tensor shape: {processed.shape}") # Should be (3, 224, 224)
print(f"Value range: [{processed.min():.2f}, {processed.max():.2f}]")
Computer Vision Pipeline¶
A typical CV application follows this flow:
┌─────────────┐
│ Input Image │
└──────┬──────┘
       │
┌──────┴──────────┐
│  Preprocessing  │ ← Resize, normalize, augment
└──────┬──────────┘
       │
┌──────┴──────────┐
│ Feature Extract │ ← CNN or ViT backbone
└──────┬──────────┘
       │
┌──────┴──────────┐
│  Task-Specific  │ ← Classification/Detection/etc.
│      Head       │
└──────┬──────────┘
       │
┌──────┴──────────┐
│ Post-processing │ ← NMS, thresholding
└──────┬──────────┘
       │
┌──────┴──────────┐
│     Output      │ ← Labels, boxes, masks
└─────────────────┘
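The flow above can be expressed as a chain of stage functions. This is a minimal sketch: `compose` and the four placeholder stages are illustrative names standing in for real preprocessing, backbone, head, and post-processing code.

```python
from typing import Callable, List

def compose(stages: List[Callable]) -> Callable:
    """Chain pipeline stages left to right."""
    def run(x):
        for stage in stages:
            x = stage(x)
        return x
    return run

# Placeholder stages standing in for the boxes in the diagram
preprocess = lambda img: {"tensor": img}                       # resize/normalize
extract_features = lambda d: {**d, "features": "backbone out"} # CNN or ViT
head = lambda d: {**d, "logits": [0.1, 0.9]}                   # task-specific head
postprocess = lambda d: {"label": "cat" if d["logits"][1] > 0.5 else "dog"}

pipeline = compose([preprocess, extract_features, head, postprocess])
print(pipeline("raw image"))  # {'label': 'cat'}
```

Keeping each stage as a separate function makes it easy to swap one box (say, the backbone) without touching the rest of the pipeline.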
Modern CV Architectures¶
1. Convolutional Neural Networks (CNNs)¶
ResNet: Skip connections, 50-152 layers
EfficientNet: Balanced scaling (width, depth, resolution)
MobileNet: Lightweight for mobile devices
2. Vision Transformers (ViT)¶
Split image into patches
Apply self-attention
Better for large datasets
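The "split image into patches" step can be sketched with plain NumPy. A patch size of 16 matches ViT-Base; `image_to_patches` is an illustrative helper, not a library function.

```python
import numpy as np

def image_to_patches(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an HWC image into flattened, non-overlapping patches of shape (N, P*P*C)."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "dims must divide evenly"
    patches = image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
    # Group the two grid axes together, then flatten each patch's pixels
    return patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch_size * patch_size * c)

# A 224x224 RGB image with 16x16 patches yields 14*14 = 196 patches of length 768
demo = np.zeros((224, 224, 3), dtype=np.float32)
print(image_to_patches(demo).shape)  # (196, 768)
```

In a real ViT each 768-dimensional patch vector is then linearly projected to the model dimension and a position embedding is added before self-attention.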
3. Hybrid Models¶
CLIP: Text + Vision understanding
DINO: Self-supervised learning
SAM: Segment Anything Model
# Popular model choices for different tasks
model_recommendations = {
    "Classification": [
        "ResNet-50 (fast, accurate)",
        "EfficientNet-B0 (efficient)",
        "ViT-Base (transformer-based)"
    ],
    "Object Detection": [
        "YOLOv8 (real-time)",
        "DETR (transformer-based)",
        "Faster R-CNN (accurate)"
    ],
    "Segmentation": [
        "SAM (Segment Anything)",
        "Mask R-CNN",
        "U-Net (medical images)"
    ],
    "Multimodal": [
        "CLIP (OpenAI)",
        "BLIP (Salesforce)",
        "LLaVA (visual chatbot)"
    ],
    "Generation": [
        "Stable Diffusion",
        "DALL-E 3",
        "Midjourney"
    ]
}
print("MODEL RECOMMENDATIONS BY TASK")
for task, models in model_recommendations.items():
    print(f"\n{task}:")
    for model in models:
        print(f"  • {model}")
Series Roadmap¶
Module 1: Image Classification¶
CNN architectures (ResNet, EfficientNet)
Transfer learning
Fine-tuning strategies
Data augmentation
Module 2: Object Detection¶
YOLO architecture
Bounding box prediction
Non-Maximum Suppression (NMS)
Real-time detection
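The Non-Maximum Suppression step listed above can be sketched in pure Python. `iou` and `nms` are illustrative names; boxes are (x1, y1, x2, y2) tuples, and production code would use a vectorized implementation such as `torchvision.ops.nms`.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes that overlap it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2] -- box 1 overlaps box 0 too much and is dropped
```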
Module 3: Image Embeddings (CLIP)¶
Multimodal embeddings
Zero-shot classification
Visual search
Text-to-image retrieval
Module 4: Stable Diffusion¶
Diffusion models
Text-to-image generation
Image-to-image translation
ControlNet and LoRA
Module 5: Multimodal RAG¶
Visual question answering
Image + text retrieval
Document understanding
Multimodal chatbots
Module 6: Production Deployment¶
Model optimization (ONNX, TensorRT)
Batch processing
API deployment
Monitoring and scaling
Key Concepts¶
1. Transfer Learning¶
Using pre-trained models and fine-tuning for your task:
import torch.nn as nn
from torchvision import models

# Load a pre-trained model
model = models.resnet50(weights="IMAGENET1K_V2")

# Freeze base layers
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer for your task
model.fc = nn.Linear(2048, num_classes)
2. Data Augmentation¶
Increase training data diversity:
Random crops and flips
Color jittering
Rotation and scaling
Cutout and mixup
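Random crops and flips, the first two items above, can be sketched with NumPy alone. `random_flip` and `random_crop` are illustrative helpers; in practice you would use `torchvision.transforms` or `albumentations`.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_flip(img: np.ndarray, p: float = 0.5) -> np.ndarray:
    """Horizontally flip an HWC image with probability p."""
    return img[:, ::-1, :] if rng.random() < p else img

def random_crop(img: np.ndarray, size: int) -> np.ndarray:
    """Crop a random size x size window from an HWC image."""
    h, w, _ = img.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size, :]

img = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
aug = random_crop(random_flip(img), size=24)
print(aug.shape)  # (24, 24, 3)
```

Augmentations are applied only at training time; evaluation typically uses a deterministic resize and center crop.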
3. Embeddings¶
Fixed-size vector representations:
# Extract features
features = model.encode_image(image) # Shape: (1, 512)
# Compare images
similarity = cosine_similarity(features1, features2)
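The `cosine_similarity` call above is a one-liner with NumPy: normalize both vectors and take their dot product.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = np.array([1.0, 0.0, 1.0])
v2 = np.array([1.0, 0.0, 1.0])
v3 = np.array([0.0, 1.0, 0.0])
print(cosine_similarity(v1, v2))  # 1.0 (identical direction)
print(cosine_similarity(v1, v3))  # 0.0 (orthogonal)
```

Because cosine similarity ignores vector magnitude, it compares the *direction* of embeddings, which is what most visual-search systems index on.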
4. Zero-Shot Learning¶
Classify without training examples:
# CLIP can classify ANY category
labels = ["cat", "dog", "car", "airplane"]
predictions = model(image, labels) # No training needed!
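Under the hood, CLIP-style zero-shot classification embeds the image and one text prompt per label, then softmaxes the scaled cosine similarities. A sketch with dummy embeddings (`zero_shot_scores` is an illustrative function, and real CLIP would supply the vectors):

```python
import numpy as np

def zero_shot_scores(image_emb: np.ndarray, text_embs: np.ndarray) -> np.ndarray:
    """Softmax over cosine similarities between one image and N label embeddings."""
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    logits = 100.0 * norm(text_embs) @ norm(image_emb)  # CLIP uses a learned scale
    exp = np.exp(logits - logits.max())                 # numerically stable softmax
    return exp / exp.sum()

labels = ["cat", "dog", "car", "airplane"]
# Dummy embeddings: the image vector points closest to the "cat" text vector
image_emb = np.array([1.0, 0.0, 0.0])
text_embs = np.array([[0.9, 0.1, 0.0],   # cat
                      [0.5, 0.5, 0.0],   # dog
                      [0.0, 1.0, 0.0],   # car
                      [0.0, 0.0, 1.0]])  # airplane
probs = zero_shot_scores(image_emb, text_embs)
print(labels[int(np.argmax(probs))])  # cat
```

Changing the label set requires only new text prompts, no retraining, which is why "ANY category" works.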
Real-World Applications¶
E-commerce¶
Visual search ("find similar products")
Auto-tagging products
Quality inspection
Virtual try-on
Healthcare¶
Medical image analysis
Disease detection
Radiology assistance
Pathology classification
Autonomous Vehicles¶
Object detection (pedestrians, vehicles)
Lane detection
Traffic sign recognition
Depth estimation
Content Moderation¶
NSFW detection
Violence detection
Brand safety
Copyright detection
Creative Tools¶
AI art generation
Photo editing
Style transfer
Image restoration
What's Next?¶
Notebook 1: Image Classification¶
Build image classifiers with ResNet and Vision Transformers
Notebook 2: Object Detection¶
Detect and locate objects with YOLO
Notebook 3: CLIP Embeddings¶
Multimodal understanding with text and images
Notebook 4: Stable Diffusion¶
Generate images from text prompts
Notebook 5: Multimodal RAG¶
Build visual question answering systems
Notebook 6: Production¶
Deploy CV models at scale
Resources¶
Papers:
"Deep Residual Learning" (ResNet, 2015)
"Attention Is All You Need" (Transformers, 2017)
"Learning Transferable Visual Models From Natural Language Supervision" (CLIP, 2021)
"High-Resolution Image Synthesis with Latent Diffusion Models" (Stable Diffusion, 2022)
Tools:
Hugging Face Transformers
PyTorch Vision (torchvision)
Ultralytics YOLO
Diffusers library
Datasets:
ImageNet (1.2M images, 1000 classes)
COCO (object detection)
OpenImages (9M images)
LAION-5B (for CLIP/Stable Diffusion)
Ready to start? → 01_image_classification.ipynb