Phase 14: Local LLMs
This module should help you answer a practical question: when does running models locally make sense, and what trade-offs do you accept in exchange for privacy, cost control, and deployment flexibility?
Module Contents
Recommended Order
Start with Ollama and the model overview
Then build a local RAG workflow
Then study serving and API patterns
Finish with speculative decoding and performance considerations
What To Learn Here
The difference between hosted APIs and local inference
How quantization and model size affect usability
What Ollama is good at and where it falls short
How to expose a local model behind an API
Why latency and throughput tuning matter once a prototype works
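The quantization point above becomes concrete with simple arithmetic: weight memory is roughly parameter count times bytes per parameter. The sketch below is a back-of-the-envelope estimate only; real runtimes add KV cache and activation overhead on top of these numbers.

```python
# Back-of-the-envelope weight memory at common precisions. Treat these as
# lower bounds: real runtimes add KV cache and activation overhead.

BYTES_PER_PARAM = {
    "fp16": 2.0,  # 16-bit floats
    "q8": 1.0,    # 8-bit quantization
    "q4": 0.5,    # 4-bit quantization
}

def weight_gib(params_billions, precision):
    """Approximate weight memory in GiB for a given parameter count."""
    total_bytes = params_billions * 1e9 * BYTES_PER_PARAM[precision]
    return total_bytes / (1024 ** 3)

for precision in ("fp16", "q8", "q4"):
    print(f"7B at {precision}: ~{weight_gib(7, precision):.1f} GiB")
```

This is why a 7B model that will not fit on a laptop at fp16 often runs comfortably at 4-bit: the weights shrink by 4x, at some cost in quality.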
Study Advice
Keep the first pass practical: install one tool, run one model, ship one API.
Do not optimize before measuring.
Compare local quality against your hosted baseline before committing to an on-device stack.
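When you get to the "ship one API" step, the usual move is to mimic the OpenAI chat-completions response shape so existing clients work against your local server unchanged. The sketch below only shapes the response body; `run_local_model` is a placeholder for whatever backend you wire in (Ollama, llama.cpp, or similar), not a real call.

```python
import time
import uuid

def run_local_model(messages):
    # Placeholder for a real backend call (e.g. Ollama's HTTP API).
    # It echoes the last user message so the example runs standalone.
    return "echo: " + messages[-1]["content"]

def chat_completion(messages, model="local-model"):
    """Shape a local model's reply like an OpenAI chat-completions
    response, so off-the-shelf clients can point at the local server."""
    reply = run_local_model(messages)
    return {
        "id": "chatcmpl-" + uuid.uuid4().hex[:12],
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": reply},
            "finish_reason": "stop",
        }],
    }

resp = chat_completion([{"role": "user", "content": "hello"}])
print(resp["choices"][0]["message"]["content"])
```

Wrap this in any HTTP server and point a client's base URL at it; the response fields shown are the ones most clients actually read.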
Good Follow-On Projects
A private document assistant
A local coding helper with retrieval
A lightweight OpenAI-compatible local serving layer
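For the private document assistant, the retrieval half can start far simpler than a vector database: word-overlap scoring is enough to prototype the loop before swapping in real embeddings. Everything below is a toy sketch; the corpus and scoring function are placeholders.

```python
# Minimal retrieval step for a local RAG prototype: rank documents by
# word overlap with the query. Swap in embeddings once the loop works.

def tokenize(text):
    return set(text.lower().split())

def retrieve(query, docs, top_k=1):
    """Return the top_k documents sharing the most words with the query."""
    q = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:top_k]

docs = [
    "Ollama runs models locally on your machine",
    "Quantization trades accuracy for memory",
    "Speculative decoding speeds up generation",
]
best = retrieve("how do I run a model locally", docs)
print(best[0])
```

A real assistant would then paste the retrieved text into the local model's prompt as context; the retrieval interface stays the same when you upgrade the scoring.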