Computer Vision Specialization: Start Here

Overview of the Computer Vision track: image classification, object detection, CLIP embeddings, Stable Diffusion, and multimodal RAG.

# Install dependencies
# !pip install torch torchvision transformers pillow matplotlib

Quick Start: Image Classification

from PIL import Image
import requests
from io import BytesIO

# Load a sample image
def load_image_from_url(url: str) -> Image.Image:
    """Download an image from a URL and return it as an RGB PIL image."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    return Image.open(BytesIO(response.content)).convert("RGB")

# Sample image URLs (replace with your own)
sample_images = {
    "cat": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/400px-Cat03.jpg",
    "dog": "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a3/June_odd-eyed-cat.jpg/300px-June_odd-eyed-cat.jpg"
}

# For this demo, we'll create a placeholder
print("Image loading utility ready")
print("In production, load real images with PIL or OpenCV")

Simple Classification with Transformers

# Example with Hugging Face Transformers (requires installation)
'''
from transformers import pipeline

# Load image classification model
classifier = pipeline("image-classification", model="microsoft/resnet-50")

# Classify an image
image = load_image_from_url(sample_images["cat"])
results = classifier(image)

print("Top predictions:")
for result in results[:3]:
    print(f"  {result['label']}: {result['score']:.2%}")
'''

print("Image classification example (commented - requires transformers)")
print("\nTypical output:")
print("  Egyptian cat: 87.32%")
print("  Tabby cat: 8.45%")
print("  Tiger cat: 2.11%")

Understanding Image Preprocessing

import numpy as np
from typing import Tuple

class ImagePreprocessor:
    """Preprocess images for model input"""
    
    def __init__(self, target_size: Tuple[int, int] = (224, 224)):
        self.target_size = target_size
    
    def resize(self, image: Image.Image) -> Image.Image:
        """Resize to target size"""
        return image.resize(self.target_size, Image.Resampling.LANCZOS)
    
    def normalize(self, image_array: np.ndarray) -> np.ndarray:
        """Normalize pixel values to [0, 1]"""
        return image_array.astype(np.float32) / 255.0
    
    def standardize(self, image_array: np.ndarray, 
                    mean: Tuple[float, float, float] = (0.485, 0.456, 0.406),
                    std: Tuple[float, float, float] = (0.229, 0.224, 0.225)) -> np.ndarray:
        """Standardize using ImageNet statistics"""
        mean_array = np.array(mean).reshape(1, 1, 3)
        std_array = np.array(std).reshape(1, 1, 3)
        return (image_array - mean_array) / std_array
    
    def to_tensor(self, image_array: np.ndarray) -> np.ndarray:
        """Convert to CHW format (channels first)"""
        # From HWC (Height, Width, Channels) to CHW
        return np.transpose(image_array, (2, 0, 1))
    
    def preprocess(self, image: Image.Image) -> np.ndarray:
        """Full preprocessing pipeline"""
        # 1. Resize
        img_resized = self.resize(image)
        
        # 2. Convert to array
        img_array = np.array(img_resized)
        
        # 3. Normalize to [0, 1]
        img_normalized = self.normalize(img_array)
        
        # 4. Standardize (ImageNet stats)
        img_standardized = self.standardize(img_normalized)
        
        # 5. Convert to CHW format
        img_tensor = self.to_tensor(img_standardized)
        
        return img_tensor

# Test preprocessor
preprocessor = ImagePreprocessor(target_size=(224, 224))

# Create a dummy image
dummy_image = Image.new('RGB', (500, 500), color='red')
processed = preprocessor.preprocess(dummy_image)

print(f"Original image size: {dummy_image.size}")
print(f"Processed tensor shape: {processed.shape}")  # Should be (3, 224, 224)
print(f"Value range: [{processed.min():.2f}, {processed.max():.2f}]")

Computer Vision Pipeline

A typical CV application follows this flow:

┌─────────────┐
│ Input Image │
└──────┬──────┘
       ↓
┌─────────────────┐
│ Preprocessing   │ ← Resize, normalize, augment
└──────┬──────────┘
       ↓
┌─────────────────┐
│ Feature Extract │ ← CNN or ViT backbone
└──────┬──────────┘
       ↓
┌─────────────────┐
│ Task-Specific   │ ← Classification/Detection/etc.
│ Head            │
└──────┬──────────┘
       ↓
┌─────────────────┐
│ Post-processing │ ← NMS, thresholding
└──────┬──────────┘
       ↓
┌─────────────────┐
│ Output          │ ← Labels, boxes, masks
└─────────────────┘
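The post-processing stage above mentions NMS. A minimal sketch of IoU-based non-maximum suppression with plain NumPy, assuming boxes in (x1, y1, x2, y2) format and a 0.5 overlap threshold (both are illustrative choices, not a fixed standard):

```python
import numpy as np

def iou(box: np.ndarray, boxes: np.ndarray) -> np.ndarray:
    """IoU between one box and an array of boxes, all (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes: np.ndarray, scores: np.ndarray, iou_threshold: float = 0.5) -> list:
    """Keep the highest-scoring box, drop boxes that overlap it, repeat."""
    order = scores.argsort()[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # → [0, 2]: the two overlapping boxes collapse to one
```

Production detectors use optimized versions of this (e.g. `torchvision.ops.nms`), but the logic is the same greedy loop.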

Modern CV Architectures

1. Convolutional Neural Networks (CNNs)

  • ResNet: Skip connections, 50-152 layers

  • EfficientNet: Balanced scaling (width, depth, resolution)

  • MobileNet: Lightweight for mobile devices

2. Vision Transformers (ViT)

  • Split image into patches

  • Apply self-attention

  • Better for large datasets

3. Hybrid Models

  • CLIP: Text + Vision understanding

  • DINO: Self-supervised learning

  • SAM: Segment Anything Model

# Popular model choices for different tasks
model_recommendations = {
    "Classification": [
        "ResNet-50 (fast, accurate)",
        "EfficientNet-B0 (efficient)",
        "ViT-Base (transformer-based)"
    ],
    "Object Detection": [
        "YOLOv8 (real-time)",
        "DETR (transformer-based)",
        "Faster R-CNN (accurate)"
    ],
    "Segmentation": [
        "SAM (Segment Anything)",
        "Mask R-CNN",
        "U-Net (medical images)"
    ],
    "Multimodal": [
        "CLIP (OpenAI)",
        "BLIP (Salesforce)",
        "LLaVA (visual chatbot)"
    ],
    "Generation": [
        "Stable Diffusion",
        "DALL-E 3",
        "Midjourney"
    ]
}

print("📊 MODEL RECOMMENDATIONS BY TASK\n")
for task, models in model_recommendations.items():
    print(f"\n{task}:")
    for model in models:
        print(f"  • {model}")

Series Roadmap

Module 1: Image Classification

  • CNN architectures (ResNet, EfficientNet)

  • Transfer learning

  • Fine-tuning strategies

  • Data augmentation

Module 2: Object Detection

  • YOLO architecture

  • Bounding box prediction

  • Non-Maximum Suppression (NMS)

  • Real-time detection

Module 3: Image Embeddings (CLIP)

  • Multimodal embeddings

  • Zero-shot classification

  • Visual search

  • Text-to-image retrieval

Module 4: Stable Diffusion

  • Diffusion models

  • Text-to-image generation

  • Image-to-image translation

  • ControlNet and LoRA

Module 5: Multimodal RAG

  • Visual question answering

  • Image + text retrieval

  • Document understanding

  • Multimodal chatbots

Module 6: Production Deployment

  • Model optimization (ONNX, TensorRT)

  • Batch processing

  • API deployment

  • Monitoring and scaling

Key Concepts

1. Transfer Learning

Using pre-trained models and fine-tuning for your task:

# Load a pre-trained model (torchvision)
from torchvision import models
import torch.nn as nn

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# Freeze base layers
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer for your task
model.fc = nn.Linear(model.fc.in_features, num_classes)

2. Data Augmentation

Increase training data diversity:

  • Random crops and flips

  • Color jittering

  • Rotation and scaling

  • Cutout and mixup

3. Embeddings

Fixed-size vector representations:

# Extract features
features = model.encode_image(image)  # Shape: (1, 512)

# Compare images
similarity = cosine_similarity(features1, features2)
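The comparison step can be implemented directly with NumPy; a sketch of cosine similarity over two embedding vectors (the toy 3-d vectors stand in for real 512-d model outputs):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 0.0, 1.0])
c = np.array([0.0, 1.0, 0.0])
print(cosine_similarity(a, b))  # 1.0: identical direction
print(cosine_similarity(a, c))  # 0.0: orthogonal, nothing in common
```

Because cosine similarity ignores vector magnitude, embeddings are often L2-normalized once up front so that similarity reduces to a plain dot product.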

4. Zero-Shot Learning

Classify without training examples:

# CLIP can classify ANY category
labels = ["cat", "dog", "car", "airplane"]
predictions = model(image, labels)  # No training needed!
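Under the hood, CLIP-style zero-shot classification is just cosine similarity between the image embedding and one text embedding per label, followed by a softmax. A sketch with made-up embeddings (real ones would come from the model's image and text encoders; the 100x logit scale mirrors CLIP's learned temperature):

```python
import numpy as np

def zero_shot_scores(image_emb: np.ndarray, text_embs: np.ndarray) -> np.ndarray:
    """Softmax over cosine similarities between one image and N label embeddings."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = 100.0 * text_embs @ image_emb   # scaled cosine similarities
    exp = np.exp(logits - logits.max())      # numerically stable softmax
    return exp / exp.sum()

labels = ["cat", "dog", "car"]
image_emb = np.array([0.9, 0.1, 0.0])        # pretend output of encode_image
text_embs = np.array([[1.0, 0.0, 0.0],       # pretend embedding of "cat"
                      [0.0, 1.0, 0.0],       # "dog"
                      [0.0, 0.0, 1.0]])      # "car"
probs = zero_shot_scores(image_emb, text_embs)
print(labels[int(probs.argmax())])  # cat
```

Changing the label list changes the classifier instantly, which is exactly why no retraining is needed.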

Real-World Applications

E-commerce

  • Visual search (“find similar products”)

  • Auto-tagging products

  • Quality inspection

  • Virtual try-on

Healthcare

  • Medical image analysis

  • Disease detection

  • Radiology assistance

  • Pathology classification

Autonomous Vehicles

  • Object detection (pedestrians, vehicles)

  • Lane detection

  • Traffic sign recognition

  • Depth estimation

Content Moderation

  • NSFW detection

  • Violence detection

  • Brand safety

  • Copyright detection

Creative Tools

  • AI art generation

  • Photo editing

  • Style transfer

  • Image restoration

What's Next?

Notebook 1: Image Classification

Build image classifiers with ResNet and Vision Transformers

Notebook 2: Object Detection

Detect and locate objects with YOLO

Notebook 3: CLIP Embeddings

Multimodal understanding with text and images

Notebook 4: Stable Diffusion

Generate images from text prompts

Notebook 5: Multimodal RAG

Build visual question answering systems

Notebook 6: Production

Deploy CV models at scale

Resources

Papers:

  • “Deep Residual Learning” (ResNet, 2015)

  • “Attention Is All You Need” (Transformers, 2017)

  • “Learning Transferable Visual Models From Natural Language Supervision” (CLIP, 2021)

  • “High-Resolution Image Synthesis with Latent Diffusion Models” (Stable Diffusion, 2022)

Tools:

  • Hugging Face Transformers

  • PyTorch Vision (torchvision)

  • Ultralytics YOLO

  • Diffusers library

Datasets:

  • ImageNet (1.2M images, 1000 classes)

  • COCO (object detection)

  • OpenImages (9M images)

  • LAION-5B (for CLIP/Stable Diffusion)

Ready to start? → 01_image_classification.ipynb