Computer Vision Specialization: Start Here

Overview of the Computer Vision track: image classification, object detection, CLIP embeddings, Stable Diffusion, and multimodal RAG.

# Install dependencies
# !pip install torch torchvision transformers pillow matplotlib

Quick Start: Image Classification

from PIL import Image
import requests
from io import BytesIO

# Load a sample image
def load_image_from_url(url: str) -> Image.Image:
    """Download an image from a URL and return it as an RGB PIL image."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    return Image.open(BytesIO(response.content)).convert("RGB")

# Sample image URLs (replace with your own)
sample_images = {
    "cat": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/400px-Cat03.jpg",
    "dog": "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a3/June_odd-eyed-cat.jpg/300px-June_odd-eyed-cat.jpg"
}

# For this demo, we'll create a placeholder
print("Image loading utility ready")
print("In production, load real images with PIL or OpenCV")

Simple Classification with Transformers

# Example with Hugging Face Transformers (requires installation)
'''
from transformers import pipeline

# Load image classification model
classifier = pipeline("image-classification", model="microsoft/resnet-50")

# Classify an image
image = load_image_from_url(sample_images["cat"])
results = classifier(image)

print("Top predictions:")
for result in results[:3]:
    print(f"  {result['label']}: {result['score']:.2%}")
'''

print("Image classification example (commented - requires transformers)")
print("\nTypical output:")
print("  Egyptian cat: 87.32%")
print("  Tabby cat: 8.45%")
print("  Tiger cat: 2.11%")

Understanding Image Preprocessing

import numpy as np
from typing import Tuple

class ImagePreprocessor:
    """Preprocess images for model input"""
    
    def __init__(self, target_size: Tuple[int, int] = (224, 224)):
        self.target_size = target_size
    
    def resize(self, image: Image.Image) -> Image.Image:
        """Resize to target size"""
        return image.resize(self.target_size, Image.Resampling.LANCZOS)
    
    def normalize(self, image_array: np.ndarray) -> np.ndarray:
        """Normalize pixel values to [0, 1]"""
        return image_array.astype(np.float32) / 255.0
    
    def standardize(self, image_array: np.ndarray, 
                    mean: Tuple[float, float, float] = (0.485, 0.456, 0.406),
                    std: Tuple[float, float, float] = (0.229, 0.224, 0.225)) -> np.ndarray:
        """Standardize using ImageNet statistics"""
        mean_array = np.array(mean).reshape(1, 1, 3)
        std_array = np.array(std).reshape(1, 1, 3)
        return (image_array - mean_array) / std_array
    
    def to_tensor(self, image_array: np.ndarray) -> np.ndarray:
        """Convert to CHW format (channels first)"""
        # From HWC (Height, Width, Channels) to CHW
        return np.transpose(image_array, (2, 0, 1))
    
    def preprocess(self, image: Image.Image) -> np.ndarray:
        """Full preprocessing pipeline"""
        # 1. Resize
        img_resized = self.resize(image)
        
        # 2. Convert to array
        img_array = np.array(img_resized)
        
        # 3. Normalize to [0, 1]
        img_normalized = self.normalize(img_array)
        
        # 4. Standardize (ImageNet stats)
        img_standardized = self.standardize(img_normalized)
        
        # 5. Convert to CHW format
        img_tensor = self.to_tensor(img_standardized)
        
        return img_tensor

# Test preprocessor
preprocessor = ImagePreprocessor(target_size=(224, 224))

# Create a dummy image
dummy_image = Image.new('RGB', (500, 500), color='red')
processed = preprocessor.preprocess(dummy_image)

print(f"Original image size: {dummy_image.size}")
print(f"Processed tensor shape: {processed.shape}")  # Should be (3, 224, 224)
print(f"Value range: [{processed.min():.2f}, {processed.max():.2f}]")

Computer Vision Pipeline

A typical CV application follows this flow:

┌─────────────┐
│ Input Image │
└──────┬──────┘
       ↓
┌─────────────────┐
│ Preprocessing   │ ← Resize, normalize, augment
└──────┬──────────┘
       ↓
┌─────────────────┐
│ Feature Extract │ ← CNN or ViT backbone
└──────┬──────────┘
       ↓
┌─────────────────┐
│ Task-Specific   │ ← Classification/Detection/etc.
│ Head            │
└──────┬──────────┘
       ↓
┌─────────────────┐
│ Post-processing │ ← NMS, thresholding
└──────┬──────────┘
       ↓
┌─────────────────┐
│ Output          │ ← Labels, boxes, masks
└─────────────────┘
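The post-processing stage above mentions NMS. A minimal sketch of IoU-based non-maximum suppression with plain NumPy, assuming boxes in (x1, y1, x2, y2) format and a 0.5 overlap threshold (both are illustrative choices, not a fixed standard):

```python
import numpy as np

def iou(box: np.ndarray, boxes: np.ndarray) -> np.ndarray:
    """IoU between one box and an array of boxes, all (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes: np.ndarray, scores: np.ndarray, iou_threshold: float = 0.5) -> list:
    """Keep the highest-scoring box, drop boxes that overlap it, repeat."""
    order = scores.argsort()[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # → [0, 2]: the two overlapping boxes collapse to one
```

Production detectors use optimized versions of this (e.g. `torchvision.ops.nms`), but the logic is the same greedy loop.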

Modern CV Architectures

1. Convolutional Neural Networks (CNNs)

  • ResNet: Skip connections, 50-152 layers

  • EfficientNet: Balanced scaling (width, depth, resolution)

  • MobileNet: Lightweight for mobile devices

2. Vision Transformers (ViT)

  • Split image into patches

  • Apply self-attention

  • Better for large datasets

3. Hybrid Models

  • CLIP: Text + Vision understanding

  • DINO: Self-supervised learning

  • SAM: Segment Anything Model

# Popular model choices for different tasks
model_recommendations = {
    "Classification": [
        "ResNet-50 (fast, accurate)",
        "EfficientNet-B0 (efficient)",
        "ViT-Base (transformer-based)"
    ],
    "Object Detection": [
        "YOLOv8 (real-time)",
        "DETR (transformer-based)",
        "Faster R-CNN (accurate)"
    ],
    "Segmentation": [
        "SAM (Segment Anything)",
        "Mask R-CNN",
        "U-Net (medical images)"
    ],
    "Multimodal": [
        "CLIP (OpenAI)",
        "BLIP (Salesforce)",
        "LLaVA (visual chatbot)"
    ],
    "Generation": [
        "Stable Diffusion",
        "DALL-E 3",
        "Midjourney"
    ]
}

print("📊 MODEL RECOMMENDATIONS BY TASK\n")
for task, models in model_recommendations.items():
    print(f"\n{task}:")
    for model in models:
        print(f"  • {model}")

Series Roadmap

Module 1: Image Classification

  • CNN architectures (ResNet, EfficientNet)

  • Transfer learning

  • Fine-tuning strategies

  • Data augmentation

Module 2: Object Detection

  • YOLO architecture

  • Bounding box prediction

  • Non-Maximum Suppression (NMS)

  • Real-time detection

Module 3: Image Embeddings (CLIP)

  • Multimodal embeddings

  • Zero-shot classification

  • Visual search

  • Text-to-image retrieval

Module 4: Stable Diffusion

  • Diffusion models

  • Text-to-image generation

  • Image-to-image translation

  • ControlNet and LoRA

Module 5: Multimodal RAG

  • Visual question answering

  • Image + text retrieval

  • Document understanding

  • Multimodal chatbots

Module 6: Production Deployment

  • Model optimization (ONNX, TensorRT)

  • Batch processing

  • API deployment

  • Monitoring and scaling

Key Concepts

1. Transfer Learning

Using pre-trained models and fine-tuning for your task:

# Load a pre-trained model (torchvision)
from torchvision import models
import torch.nn as nn

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# Freeze base layers
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer for your task
model.fc = nn.Linear(model.fc.in_features, num_classes)

2. Data Augmentation

Increase training data diversity:

  • Random crops and flips

  • Color jittering

  • Rotation and scaling

  • Cutout and mixup

3. Embeddings

Fixed-size vector representations:

# Extract features
features = model.encode_image(image)  # Shape: (1, 512)

# Compare images
similarity = cosine_similarity(features1, features2)
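The comparison step can be implemented directly with NumPy; a sketch of cosine similarity over two embedding vectors (the toy 3-d vectors stand in for real 512-d model outputs):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 0.0, 1.0])
c = np.array([0.0, 1.0, 0.0])
print(cosine_similarity(a, b))  # 1.0: identical direction
print(cosine_similarity(a, c))  # 0.0: orthogonal, nothing in common
```

Because cosine similarity ignores vector magnitude, embeddings are often L2-normalized once up front so that similarity reduces to a plain dot product.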

4. Zero-Shot Learning

Classify without training examples:

# CLIP can classify ANY category
labels = ["cat", "dog", "car", "airplane"]
predictions = model(image, labels)  # No training needed!
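Under the hood, CLIP-style zero-shot classification is just cosine similarity between the image embedding and one text embedding per label, followed by a softmax. A sketch with made-up embeddings (real ones would come from the model's image and text encoders; the 100x logit scale mirrors CLIP's learned temperature):

```python
import numpy as np

def zero_shot_scores(image_emb: np.ndarray, text_embs: np.ndarray) -> np.ndarray:
    """Softmax over cosine similarities between one image and N label embeddings."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = 100.0 * text_embs @ image_emb   # scaled cosine similarities
    exp = np.exp(logits - logits.max())      # numerically stable softmax
    return exp / exp.sum()

labels = ["cat", "dog", "car"]
image_emb = np.array([0.9, 0.1, 0.0])        # pretend output of encode_image
text_embs = np.array([[1.0, 0.0, 0.0],       # pretend embedding of "cat"
                      [0.0, 1.0, 0.0],       # "dog"
                      [0.0, 0.0, 1.0]])      # "car"
probs = zero_shot_scores(image_emb, text_embs)
print(labels[int(probs.argmax())])  # cat
```

Changing the label list changes the classifier instantly, which is exactly why no retraining is needed.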

Real-World Applications

E-commerce

  • Visual search (“find similar products”)

  • Auto-tagging products

  • Quality inspection

  • Virtual try-on

Healthcare

  • Medical image analysis

  • Disease detection

  • Radiology assistance

  • Pathology classification

Autonomous Vehicles

  • Object detection (pedestrians, vehicles)

  • Lane detection

  • Traffic sign recognition

  • Depth estimation

Content Moderation

  • NSFW detection

  • Violence detection

  • Brand safety

  • Copyright detection

Creative Tools

  • AI art generation

  • Photo editing

  • Style transfer

  • Image restoration

What's Next?

Notebook 1: Image Classification

Build image classifiers with ResNet and Vision Transformers

Notebook 2: Object Detection

Detect and locate objects with YOLO

Notebook 3: CLIP Embeddings

Multimodal understanding with text and images

Notebook 4: Stable Diffusion

Generate images from text prompts

Notebook 5: Multimodal RAG

Build visual question answering systems

Notebook 6: Production

Deploy CV models at scale

Resources

Papers:

  • “Deep Residual Learning” (ResNet, 2015)

  • “Attention Is All You Need” (Transformers, 2017)

  • “Learning Transferable Visual Models From Natural Language Supervision” (CLIP, 2021)

  • “High-Resolution Image Synthesis with Latent Diffusion Models” (Stable Diffusion, 2022)

Tools:

  • Hugging Face Transformers

  • PyTorch Vision (torchvision)

  • Ultralytics YOLO

  • Diffusers library

Datasets:

  • ImageNet (1.2M images, 1000 classes)

  • COCO (object detection)

  • OpenImages (9M images)

  • LAION-5B (for CLIP/Stable Diffusion)

Ready to start? → 01_image_classification.ipynb