Phase 13: Multimodal AI
Overview
Go beyond text! Learn to work with Vision-Language Models, Audio AI, and multimodal systems that combine text, images, audio, and video.
Prerequisites:
Neural Networks & Transformers (Phase 5)
LLMs & Prompt Engineering (Phase 10)
Python & PyTorch
Time: 3-4 weeks | 60-80 hours
Outcome: Build AI systems that understand and generate across multiple modalities
What You'll Learn

Vision-Language Models (VLMs)
CLIP (Contrastive Language-Image Pretraining)
LLaVA (Large Language and Vision Assistant)
GPT-4V capabilities and API
Gemini Pro Vision
Image captioning and VQA (Visual Question Answering)
Zero-shot image classification
Image Generation
Stable Diffusion architecture
DALL-E 3 API
Midjourney concepts
ControlNet for guided generation
LoRA for Stable Diffusion
Prompt engineering for images
Audio & Speech
Whisper (speech-to-text)
Text-to-Speech models (Bark, XTTS)
Audio classification
Music generation (MusicGen)
Voice cloning
Audio embeddings
Video Understanding
Video captioning
Action recognition
Temporal understanding
Video generation (emerging)
Multimodal RAG
Image + text search
Document understanding (OCR + LLM)
Multimodal embeddings
Cross-modal retrieval
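The cross-modal retrieval idea above reduces to cosine similarity in a shared embedding space. A minimal sketch, using toy 3-dimensional vectors as stand-ins for real CLIP embeddings (the file names and numbers below are illustrative, not model outputs):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy stand-ins for CLIP image embeddings (real ones have 512+ dimensions)
image_index = {
    "cat.jpg":    [0.9, 0.1, 0.0],
    "car.jpg":    [0.1, 0.9, 0.1],
    "sunset.jpg": [0.0, 0.2, 0.9],
}

# Toy stand-in for the text embedding of the query "a red car"
query_embedding = [0.2, 0.8, 0.1]

# Rank images by similarity to the text query
ranked = sorted(image_index.items(),
                key=lambda kv: cosine_similarity(query_embedding, kv[1]),
                reverse=True)
print(ranked[0][0])  # best match: "car.jpg"
```

The same ranking loop works unchanged whether the query embedding comes from text and the index from images, or vice versa — that symmetry is what makes the retrieval cross-modal.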
Module Structure

```
12-multimodal/
├── 00_START_HERE.ipynb                   # Overview & capabilities
├── vision-language/
│   ├── 01_clip_basics.ipynb              # CLIP fundamentals
│   ├── 02_llava.ipynb                    # Open-source VLM
│   ├── 03_gpt4v.ipynb                    # GPT-4 Vision
│   ├── 04_image_captioning.ipynb         # Generate descriptions
│   ├── 05_visual_qa.ipynb                # Answer image questions
│   └── 06_zero_shot_classification.ipynb
├── image-generation/
│   ├── 01_stable_diffusion_basics.ipynb
│   ├── 02_prompt_engineering.ipynb       # Image prompts
│   ├── 03_controlnet.ipynb               # Guided generation
│   ├── 04_lora_training.ipynb            # Custom styles
│   ├── 05_dalle3_api.ipynb               # OpenAI API
│   └── 06_image_editing.ipynb            # Inpainting, etc.
├── audio/
│   ├── 01_whisper_speech_to_text.ipynb
│   ├── 02_text_to_speech.ipynb
│   ├── 03_audio_classification.ipynb
│   ├── 04_music_generation.ipynb
│   └── 05_voice_cloning.ipynb
├── video/
│   ├── 01_video_understanding.ipynb
│   ├── 02_action_recognition.ipynb
│   └── 03_video_captioning.ipynb
├── multimodal-rag/
│   ├── 01_image_text_search.ipynb
│   ├── 02_document_understanding.ipynb
│   ├── 03_multimodal_embeddings.ipynb
│   └── 04_cross_modal_retrieval.ipynb
└── projects/
    ├── image_analyzer.py                 # Analyze and caption images
    ├── visual_chatbot.py                 # Chat about images
    ├── audio_transcriber.py              # Full transcription system
    ├── image_generator.py                # Custom image generation
    └── multimodal_search.py              # Search images by text
```
Quick Start

Example 1: CLIP - Zero-Shot Classification

```python
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

# Load CLIP
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Load image
image = Image.open("photo.jpg")

# Define candidate categories
labels = ["a cat", "a dog", "a bird", "a car"]

# Process text and image together
inputs = processor(
    text=labels,
    images=image,
    return_tensors="pt",
    padding=True,
)

# Get image-text similarity scores
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)

# Results
for label, prob in zip(labels, probs[0]):
    print(f"{label}: {prob:.2%}")
```
Example 2: GPT-4 Vision API

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # deprecated; newer models such as gpt-4o also accept images
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image? Describe in detail."},
            {
                "type": "image_url",
                "image_url": {"url": "https://example.com/image.jpg"},
            },
        ],
    }],
    max_tokens=300,
)
print(response.choices[0].message.content)
```
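The API example passes the image by URL. Local files are usually sent as base64-encoded data URLs instead; a minimal sketch of that encoding (the file path and helper name are placeholders, not part of the OpenAI SDK):

```python
import base64

def to_data_url(path, mime="image/jpeg"):
    """Encode a local image file as a base64 data URL
    suitable for the API's image_url field."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{b64}"

# Usage (placeholder path): pass the result in place of the https URL, e.g.
# {"type": "image_url", "image_url": {"url": to_data_url("photo.jpg")}}
```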
Example 3: Whisper - Speech to Text

```python
import whisper

# Load model (sizes: tiny, base, small, medium, large)
model = whisper.load_model("base")

# Transcribe; word-level timestamps and language detection are also available
result = model.transcribe("audio.mp3")
print(result["text"])
```
Example 4: Stable Diffusion

```python
from diffusers import StableDiffusionPipeline
import torch

# Load model (fp16 weights require a CUDA GPU)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# Generate
prompt = "A beautiful sunset over mountains, oil painting style"
image = pipe(
    prompt,
    negative_prompt="blurry, low quality",
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("output.png")
```
Learning Path

Week 1: Vision-Language Basics
Complete 00_START_HERE.ipynb
CLIP fundamentals in vision-language/01_clip_basics.ipynb
Try GPT-4V in vision-language/03_gpt4v.ipynb
Project: Build an image classifier

Week 2: Image Generation
Stable Diffusion in image-generation/01_stable_diffusion_basics.ipynb
Prompt engineering in image-generation/02_prompt_engineering.ipynb
ControlNet in image-generation/03_controlnet.ipynb
Project: Custom image generator

Week 3: Audio & Video
Whisper in audio/01_whisper_speech_to_text.ipynb
TTS in audio/02_text_to_speech.ipynb
Video understanding in video/
Project: Audio transcription system

Week 4: Multimodal RAG
Image + text search in multimodal-rag/01_image_text_search.ipynb
Document understanding in multimodal-rag/02_document_understanding.ipynb
Build the complete system
Capstone: Multimodal search engine
Technologies You'll Use
Vision-Language Models:
CLIP (OpenAI)
LLaVA (open-source)
GPT-4V (OpenAI)
Gemini Pro Vision (Google)
BLIP-2, InstructBLIP
Image Generation:
Stable Diffusion (open-source)
DALL-E 3 (OpenAI)
Midjourney (via API)
ControlNet, T2I-Adapter
IP-Adapter
Audio Models:
Whisper (OpenAI)
Bark (Suno AI)
XTTS (Coqui)
MusicGen (Meta)
AudioCraft
Frameworks:
Hugging Face Transformers
Diffusers
OpenCV
torchaudio
librosa
Key Concepts

CLIP Architecture

```
Image → Vision Transformer → Image Embedding
Text  → Text Transformer   → Text Embedding

Training: maximize similarity of matching pairs,
minimize similarity of non-matching pairs
```
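That contrastive training objective can be illustrated with a tiny symmetric cross-entropy loss in plain Python. The 2-D embeddings below are toy values, not real CLIP outputs:

```python
import math

def softmax_ce(logits, target):
    """Cross-entropy of a softmax over logits against the target index."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return -math.log(exps[target] / sum(exps))

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: matching (i, i) pairs should score highest."""
    n = len(image_embs)
    # Similarity matrix of all image-text pairs, scaled by temperature
    sims = [[sum(a * b for a, b in zip(img, txt)) / temperature
             for txt in text_embs] for img in image_embs]
    # Image-to-text direction: row i should pick column i
    loss_i = sum(softmax_ce(row, i) for i, row in enumerate(sims)) / n
    # Text-to-image direction: column j should pick row j
    cols = [[sims[i][j] for i in range(n)] for j in range(n)]
    loss_t = sum(softmax_ce(col, j) for j, col in enumerate(cols)) / n
    return (loss_i + loss_t) / 2

aligned  = clip_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])  # matched pairs
shuffled = clip_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])  # mismatched pairs
print(aligned < shuffled)  # True: matched pairs give lower loss
```

Real CLIP uses the same structure at batch scale, with learned encoders producing the embeddings and a learnable temperature.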
Applications:
Zero-shot classification
Image search by text
Content moderation
Feature extraction
Stable Diffusion Pipeline

```
Text → CLIP Text Encoder → Text Embedding
               ↓
       U-Net (iterative denoising)
               ↓
       VAE Decoder → Image
```

Key Parameters:
num_inference_steps: quality vs. speed (20-50)
guidance_scale: prompt adherence (7-15)
negative_prompt: what to avoid
seed: reproducibility
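guidance_scale drives classifier-free guidance: at each denoising step, the model's unconditional noise prediction is extrapolated toward the text-conditioned one. A minimal sketch of just that combination step, with toy numbers standing in for real U-Net outputs:

```python
def apply_cfg(uncond_pred, cond_pred, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the text-conditioned one."""
    return [u + guidance_scale * (c - u)
            for u, c in zip(uncond_pred, cond_pred)]

# Toy per-element noise predictions
uncond = [0.0, 0.5, 1.0]
cond   = [0.2, 0.4, 1.4]

print(apply_cfg(uncond, cond, 1.0))  # scale 1.0 reproduces the conditional prediction
print(apply_cfg(uncond, cond, 7.5))  # higher scale exaggerates the text's influence
```

This is why very high guidance_scale values can over-saturate or distort images: the extrapolation pushes predictions far outside either branch's output.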
Multimodal Embeddings

```python
# Text and images share one embedding space (schematic API, not a real library)
text_embedding = clip.encode_text("a red car")
image_embedding = clip.encode_image(car_image)

# So similarity can be computed directly across modalities
similarity = cosine_similarity(text_embedding, image_embedding)
```
Projects

1. Visual Chatbot
Chat with images using GPT-4V or LLaVA.
Skills: VLM integration, conversation memory

2. Image Generator App
Stable Diffusion with custom UI and parameters.
Skills: Diffusion models, prompt engineering, UI

3. Meeting Transcriber
Record, transcribe, and summarize with Whisper + an LLM.
Skills: Audio processing, LLM integration

4. Visual Search Engine
Search an image library by text description.
Skills: CLIP embeddings, vector search, multimodal RAG

5. Document QA System
Answer questions about PDFs with images/charts.
Skills: OCR, vision models, RAG
Best Practices

Vision-Language

DO:
Use specific, detailed prompts
Provide image context
Chain vision → reasoning → action
Handle image quality issues
Validate outputs

DON'T:
Assume perfect OCR
Ignore image resolution
Skip error handling
Trust all outputs blindly

Image Generation

DO:
Use negative prompts
Iterate on prompts
Control with ControlNet
Use appropriate steps (30-50)
Set a random seed for consistency

DON'T:
Use default prompts only
Expect perfection on the first try
Ignore quality settings
Always generate at max resolution (slow!)

Audio Processing

DO:
Preprocess audio (denoise)
Use an appropriate model size
Check language detection
Validate transcriptions
Handle silence/noise

DON'T:
Process very long files without chunking
Ignore audio quality
Skip timestamp alignment
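The chunking point above matters in practice: long recordings are usually split into overlapping windows before transcription, so words that straddle a boundary land fully inside at least one chunk. A minimal boundary calculator (the 30 s window and 2 s overlap are illustrative defaults, not Whisper settings):

```python
def chunk_spans(total_seconds, window=30.0, overlap=2.0):
    """Split a recording into overlapping (start, end) windows, in seconds."""
    spans = []
    start = 0.0
    step = window - overlap  # advance less than a full window to overlap chunks
    while start < total_seconds:
        spans.append((start, min(start + window, total_seconds)))
        start += step
    return spans

# A 70-second recording with 30 s windows and 2 s overlap
print(chunk_spans(70))  # [(0.0, 30.0), (28.0, 58.0), (56.0, 70.0)]
```

Each span can then be sliced out of the audio and transcribed separately, with duplicate words in the overlap regions deduplicated afterward.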
Resources

Courses
Papers
Tools & APIs
Models
Completion Checklist
Before moving forward, you should be able to:
Use CLIP for zero-shot classification
Build image captioning systems
Generate images with Stable Diffusion
Optimize image prompts
Transcribe audio with Whisper
Understand VLM architectures
Build multimodal RAG systems
Combine text and visual search
Deploy multimodal applications
Handle edge cases (quality, errors)
What's Next?

Phase 9: AI Agents →
Agents with vision capabilities
Tool use with multimodal inputs
Autonomous systems

Phase 11: LLM Fine-tuning →
Fine-tune vision-language models
Custom image generation models
Specialized multimodal systems

Real-World Applications →
Accessibility tools
Content moderation
Visual search
Creative tools

Ready to go multimodal? → Start with 00_START_HERE.ipynb
Questions? → Check the projects/ folder for complete examples

Remember: a picture is worth a thousand tokens!