Multimodal AI: Start Here
AI that understands and generates text, images, audio, and video together.
What You'll Learn
| Track | Notebooks | Topics |
|---|---|---|
| Vision-Language | | CLIP, GPT-4V, LLaVA, multimodal RAG |
| Image Generation | | Stable Diffusion, ControlNet, DALL-E |
| Audio & Speech | | Whisper ASR, TTS, voice cloning |
Prerequisites
Neural Networks (Phase 06)
Embeddings (Phase 05)
RAG Systems (Phase 08): helpful for multimodal RAG
Learning Path
vision-language/01_clip_basics.ipynb (start here)
vision-language/02_vision_language_models.ipynb
vision-language/03_multimodal_rag.ipynb
image-generation/01_stable_diffusion.ipynb
image-generation/02_controlnet.ipynb
audio/01_whisper_speech_recognition.ipynb
audio/02_text_to_speech.ipynb