Phase 5: Embeddings

Embeddings are the bridge between raw text and everything that follows in this repo: semantic search, vector databases, RAG (retrieval-augmented generation), clustering, retrieval evaluation, and recommendation-style systems.

What To Learn Here

  • How text is mapped into dense vectors

  • Why cosine similarity is the default comparison metric

  • The difference between word, token, sentence, and sparse embeddings

  • When to use local models vs hosted APIs

  • How embeddings become a practical search pipeline
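The core comparison behind most of these ideas fits in a few lines. Below is a minimal sketch of cosine similarity over dense vectors, using tiny hand-made vectors in place of real model output (real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Compare two dense vectors by the angle between them,
    ignoring their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional vectors standing in for model output.
cat    = [0.9, 0.8, 0.1, 0.0]
kitten = [0.85, 0.75, 0.2, 0.05]
car    = [0.1, 0.0, 0.9, 0.8]

print(cosine_similarity(cat, kitten))  # close to 1.0: semantically similar
print(cosine_similarity(cat, car))     # close to 0.0: unrelated
```

Cosine similarity is the usual default because a vector's magnitude often reflects incidental properties like text length, so comparing direction alone tends to be more robust.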

Learning Goals

By the end of this phase, you should be able to:

  • Explain why embeddings make semantic retrieval possible

  • Generate embeddings with both local and API-based workflows

  • Compare pooling strategies at a high level

  • Build a minimal semantic search flow

  • Choose an embedding approach based on quality, latency, and cost constraints
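To see how these goals fit together, here is a toy end-to-end search flow. The `toy_embed` function is a hypothetical stand-in (a fixed-vocabulary count vector) so the example stays self-contained; a real pipeline would swap in a local embedding model or a hosted API at that point:

```python
import math

# Tiny fixed vocabulary for the stand-in embedder below.
VOCAB = ["password", "reset", "recovery", "shipping", "delivery", "forgot", "steps"]

def toy_embed(text):
    """Stand-in embedder: counts words from a tiny fixed vocabulary.
    A real pipeline would call a trained embedding model here."""
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query, docs):
    """Embed the corpus once, embed the query, rank by similarity."""
    index = [(doc, toy_embed(doc)) for doc in docs]
    q = toy_embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, vec in ranked]

docs = [
    "how to reset your password",
    "shipping times and delivery",
    "password recovery steps",
]
print(search("forgot my password", docs))
# The two password documents rank above the shipping one.
```

The shape of the flow (embed corpus, embed query, rank by similarity) is the same regardless of which embedder you plug in; only the quality of the vectors changes.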

Prerequisites

  • Tokenization fundamentals from 04-token/

  • Basic linear algebra intuition from 03-maths/

  • Enough Python to run notebooks and inspect arrays

Good Study Strategy

  • Do not treat every notebook as mandatory on the first pass.

  • Focus first on concept transfer: similarity, search, and trade-offs.

  • Return later for sparse retrieval and model-comparison detail when you start Phase 6 and Phase 7.

What To Build After This

  • A semantic FAQ search system

  • A duplicate-detection tool for documents

  • A chunk-and-retrieve pipeline that feeds Phase 7 RAG work
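A chunk-and-retrieve pipeline starts by splitting documents into windows before embedding them. The sketch below shows one common baseline, fixed-size character chunks with overlap; the sizes are illustrative defaults, not a recommendation:

```python
def chunk_text(text, size=200, overlap=50):
    """Split text into overlapping character windows so that
    sentences straddling a boundary appear in at least one chunk."""
    step = size - overlap
    chunks = []
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + size])
    return chunks

doc = "embedding " * 60            # ~600-character toy document
pieces = chunk_text(doc)
print(len(pieces), [len(p) for p in pieces])
```

Overlap trades index size for recall: more overlap duplicates more text in the index, but lowers the chance that a relevant sentence gets split across a chunk boundary and retrieved by neither side.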

Companion Files