Multimodal AI: Start Here

AI that understands and generates text, images, audio, and video together.

What You’ll Learn

Track             Notebooks          Topics
Vision-Language   vision-language/   CLIP, GPT-4V, LLaVA, multimodal RAG
Image Generation  image-generation/  Stable Diffusion, ControlNet, DALL-E
Audio & Speech    audio/             Whisper ASR, TTS, voice cloning

Prerequisites

  • Neural Networks (Phase 06)

  • Embeddings (Phase 05)

  • RAG Systems (Phase 08): helpful for multimodal RAG

Learning Path

vision-language/01_clip_basics.ipynb          ← Start here
vision-language/02_vision_language_models.ipynb
vision-language/03_multimodal_rag.ipynb
image-generation/01_stable_diffusion.ipynb
image-generation/02_controlnet.ipynb
audio/01_whisper_speech_recognition.ipynb
audio/02_text_to_speech.ipynb