Local RAG with OllamaΒΆ

Build a fully local Retrieval-Augmented Generation (RAG) system β€” no cloud APIs, no data leaving your machine.

Why Local RAG?ΒΆ

Reason      | Details
----------- | ---------------------------------------------
Privacy     | Sensitive documents never leave your machine
Cost        | Zero per-token charges after setup
Offline     | Works without an internet connection
Compliance  | Meet HIPAA, GDPR, and SOC 2 requirements
Latency     | No network round-trip to a remote API

ArchitectureΒΆ

┌──────────────────────────────────────────────────────────┐
│                    Local RAG Pipeline                    │
│                                                          │
│  Documents ──► Chunker ──► nomic-embed-text ──► ChromaDB │
│                            (embeddings)                  │
│                                                          │
│  User Query ──► nomic-embed-text ──► ChromaDB retrieval  │
│                                          │               │
│                        Context + Query   ▼               │
│                        llama3.2 / mistral / phi4         │
│                                          │               │
│                                       Answer             │
└──────────────────────────────────────────────────────────┘

Components:

  • Ollama β€” runs LLMs and embedding models locally

  • nomic-embed-text β€” local embedding model (768-dim)

  • ChromaDB β€” local vector store (persists to disk)

  • LangChain β€” orchestration layer

  • pypdf β€” PDF ingestion

SetupΒΆ

1. Install Ollama and pull modelsΒΆ

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull LLMs (choose based on your RAM)
ollama pull llama3.2        # 3B β€” needs ~4 GB RAM
ollama pull mistral         # 7B β€” needs ~8 GB RAM
ollama pull phi4            # 14B β€” needs ~16 GB RAM
ollama pull qwen2.5:7b      # 7B β€” needs ~8 GB RAM

# Pull embedding model (mandatory for local embeddings)
ollama pull nomic-embed-text

2. Install Python packagesΒΆ

# Install all required packages
!pip install -q ollama langchain langchain-ollama langchain-community chromadb pypdf requests numpy
import os
import json
import time
import requests
import numpy as np
from pathlib import Path
from typing import List, Dict, Optional

import ollama
import chromadb
from chromadb.config import Settings

print("All imports successful!")
print(f"ChromaDB version: {chromadb.__version__}")

1. Verify Ollama is RunningΒΆ

def check_ollama():
    """Check that the Ollama server is reachable and required models are present."""
    try:
        requests.get("http://localhost:11434", timeout=3)
        print("Ollama server: RUNNING")
    except requests.exceptions.ConnectionError:
        print("Ollama server: NOT RUNNING — start it with: ollama serve")
        return False

    # List available models ('model' in newer ollama clients, 'name' in older ones)
    models = ollama.list()
    available = [m.get("model") or m.get("name") or "" for m in models.get("models", [])]
    print(f"Available models: {available}")

    required = ['nomic-embed-text']
    missing = [m for m in required if not any(m in a for a in available)]
    if missing:
        print(f"\nMissing required models: {missing}")
        print("Run: ollama pull nomic-embed-text")
    else:
        print("\nAll required models available!")
    return True

check_ollama()

2. Ollama Embedding APIΒΆ

Ollama exposes an /api/embeddings endpoint. The nomic-embed-text model produces 768-dimensional vectors and is optimised for retrieval tasks.
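The same endpoint can also be called over raw HTTP, which is useful for non-Python clients. A minimal sketch using requests (the URL assumes Ollama's default port; `embed_via_rest` is an illustrative helper, not part of the Ollama client):

```python
import requests

def embed_via_rest(text: str, model: str = "nomic-embed-text",
                   base_url: str = "http://localhost:11434") -> list:
    """POST directly to Ollama's /api/embeddings endpoint."""
    resp = requests.post(
        f"{base_url}/api/embeddings",
        json={"model": model, "prompt": text},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]
```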

def embed(text: str, model: str = "nomic-embed-text") -> List[float]:
    """Generate an embedding vector for a piece of text."""
    response = ollama.embeddings(model=model, prompt=text)
    return response["embedding"]

def embed_batch(texts: List[str], model: str = "nomic-embed-text") -> List[List[float]]:
    """Embed a list of texts."""
    return [embed(t, model) for t in texts]

# Test it
sample_embedding = embed("Retrieval-Augmented Generation combines search with generation.")
print(f"Embedding dimensions: {len(sample_embedding)}")
print(f"First 5 values: {sample_embedding[:5]}")
print(f"Vector norm: {np.linalg.norm(sample_embedding):.4f}")
# Verify semantic similarity works as expected
def cosine_similarity(a: List[float], b: List[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sentences = [
    "The cat sat on the mat.",
    "A feline rested on a rug.",       # Semantically similar
    "The stock market crashed today.",  # Unrelated
]

embeddings = embed_batch(sentences)

for i, s in enumerate(sentences[1:], 1):
    sim = cosine_similarity(embeddings[0], embeddings[i])
    print(f"Similarity: '{sentences[0]}' vs '{s}'")
    print(f"  Score: {sim:.4f}\n")

3. ChromaDB β€” Local Vector StoreΒΆ

ChromaDB stores document chunks and their embeddings on disk. It supports metadata filtering, persistent storage, and fast approximate nearest-neighbour search.

# Persistent ChromaDB β€” data survives notebook restarts
CHROMA_PATH = "./chroma_db"

chroma_client = chromadb.PersistentClient(
    path=CHROMA_PATH,
    settings=Settings(anonymized_telemetry=False)
)

# Create (or open existing) collection
COLLECTION_NAME = "local_rag_docs"

try:
    collection = chroma_client.get_collection(name=COLLECTION_NAME)
    print(f"Opened existing collection '{COLLECTION_NAME}' ({collection.count()} docs)")
except Exception:
    collection = chroma_client.create_collection(
        name=COLLECTION_NAME,
        metadata={"hnsw:space": "cosine"}  # Use cosine distance
    )
    print(f"Created new collection '{COLLECTION_NAME}'")

4. Document Ingestion PipelineΒΆ

Supports plain text, Markdown, and PDF files.

from pypdf import PdfReader

def load_text_file(path: str) -> str:
    """Load plain text or Markdown."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

def load_pdf(path: str) -> str:
    """Extract text from a PDF."""
    reader = PdfReader(path)
    pages = [page.extract_text() or "" for page in reader.pages]
    return "\n\n".join(pages)

def load_document(path: str) -> str:
    """Auto-detect file type and load content."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return load_pdf(path)
    elif suffix in (".txt", ".md", ".rst"):
        return load_text_file(path)
    else:
        raise ValueError(f"Unsupported file type: {suffix}")

print("Document loaders ready.")

5. Chunking StrategiesΒΆ

Chunking is critical for RAG quality. Chunks must be:

  • Large enough to contain a full thought

  • Small enough to be semantically focused

  • Overlapping so context is not cut off at boundaries

Rule of thumb for local 4K-context models: 400–600 characters per chunk with roughly 20% overlap (the chunker below defaults to 500 characters with a 100-character overlap).
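To sanity-check these numbers against a model's context window, a rough 4-characters-per-token heuristic is enough (an approximation, not a real tokenizer; `max_chunks_in_context` is an illustrative helper):

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def max_chunks_in_context(chunk_size_chars: int, context_tokens: int = 4096,
                          reserve_tokens: int = 1024) -> int:
    """How many chunks of a given size fit in the context window,
    reserving room for the question, prompt template, and answer."""
    chunk_tokens = max(1, chunk_size_chars // 4)
    return (context_tokens - reserve_tokens) // chunk_tokens

print(max_chunks_in_context(500))  # 24 — far more than the k=3..4 retrieved later
```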

def chunk_by_characters(
    text: str,
    chunk_size: int = 500,
    overlap: int = 100,
    separators: List[str] = ["\n\n", "\n", ". ", " "]
) -> List[str]:
    """
    Split text into overlapping chunks.
    Respects natural separators (paragraphs > sentences > words).
    """
    chunks = []
    start = 0

    while start < len(text):
        end = min(start + chunk_size, len(text))

        # Try to end at a natural boundary
        if end < len(text):
            for sep in separators:
                idx = text.rfind(sep, start, end)
                if idx != -1:
                    end = idx + len(sep)
                    break

        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)

        start = max(start + 1, end - overlap)

    return chunks


def chunk_by_paragraphs(text: str, min_length: int = 100) -> List[str]:
    """
    Split by paragraph boundaries.
    Good for well-structured documents.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [p for p in paragraphs if len(p) >= min_length]


# Demo
sample_text = """
Retrieval-Augmented Generation (RAG) is a technique that improves LLM responses by
grounding them in external knowledge retrieved at inference time.

The process has two phases: indexing and retrieval. During indexing, documents are
split into chunks, converted into embedding vectors, and stored in a vector database.

During retrieval, the user query is embedded and the nearest document chunks are
fetched. These chunks are concatenated with the query and passed to the LLM as context,
enabling the model to answer questions about content it was not trained on.
""".strip()

char_chunks = chunk_by_characters(sample_text, chunk_size=300, overlap=60)
para_chunks = chunk_by_paragraphs(sample_text)

print(f"Character-based chunks: {len(char_chunks)}")
print(f"Paragraph-based chunks: {len(para_chunks)}")
print(f"\nFirst character chunk:\n{char_chunks[0]}")

6. Indexing Documents into ChromaDBΒΆ

def index_text(
    text: str,
    source: str,
    collection: chromadb.Collection,
    chunk_size: int = 500,
    overlap: int = 100,
    embed_model: str = "nomic-embed-text",
    batch_size: int = 10,
) -> int:
    """
    Chunk, embed, and store a document in ChromaDB.

    Returns: number of chunks stored.
    """
    chunks = chunk_by_characters(text, chunk_size=chunk_size, overlap=overlap)
    print(f"Document '{source}': {len(chunks)} chunks")

    # Process in batches to avoid memory spikes
    total_stored = 0
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i : i + batch_size]
        embeddings = embed_batch(batch, model=embed_model)
        ids = [f"{source}_chunk_{i + j}" for j in range(len(batch))]
        metadatas = [{"source": source, "chunk_index": i + j} for j in range(len(batch))]

        collection.upsert(
            ids=ids,
            embeddings=embeddings,
            documents=batch,
            metadatas=metadatas,
        )
        total_stored += len(batch)
        print(f"  Stored batch {i // batch_size + 1}: {total_stored}/{len(chunks)} chunks")

    return total_stored


def index_file(
    path: str,
    collection: chromadb.Collection,
    **kwargs
) -> int:
    """Load a file from disk and index it."""
    text = load_document(path)
    source = Path(path).name
    return index_text(text, source=source, collection=collection, **kwargs)


print("Indexing functions ready.")
# Index sample documents (using in-memory text for the demo)
sample_docs = [
    (
        "intro_to_llms.txt",
        """
Large Language Models (LLMs) are neural networks trained on massive text corpora to predict
the next token in a sequence. They learn grammar, facts, reasoning patterns, and code from
the training data.

Modern LLMs use the Transformer architecture, introduced by Vaswani et al. in the 2017 paper
"Attention Is All You Need". The key mechanism is self-attention, which allows each token to
attend to every other token in the context window.

Popular open-source LLMs include Llama 3 (Meta), Mistral (Mistral AI), Phi-4 (Microsoft),
Qwen 2.5 (Alibaba), and Gemma 2 (Google). These can all be run locally using Ollama.

Training LLMs requires enormous compute β€” GPT-3 training cost roughly $4.6 million in cloud
GPU time. However, inference with quantised models can run on consumer hardware.
        """,
    ),
    (
        "rag_overview.txt",
        """
Retrieval-Augmented Generation (RAG) was introduced by Lewis et al. in 2020. It grounds LLM
outputs in a retrieved document corpus, reducing hallucinations and enabling knowledge updates
without expensive model retraining.

RAG pipeline stages:
1. Indexing: Split documents into chunks, embed each chunk, store in a vector store.
2. Retrieval: Embed the query, find the k most similar chunks (cosine similarity).
3. Augmentation: Prepend retrieved chunks to the prompt as context.
4. Generation: The LLM generates an answer conditioned on the context.

Advanced RAG techniques include HyDE (Hypothetical Document Embeddings), re-ranking with
a cross-encoder, query expansion, and multi-hop retrieval for complex questions.
        """,
    ),
    (
        "ollama_guide.txt",
        """
Ollama is an open-source tool that makes it easy to run LLMs locally. It handles model
downloading, GPU acceleration, and provides a REST API compatible with OpenAI's API.

Installation: curl -fsSL https://ollama.com/install.sh | sh

Key commands:
- ollama pull llama3.2    β€” Download a model
- ollama run llama3.2     β€” Interactive chat
- ollama list             β€” Show downloaded models
- ollama serve            β€” Start the API server (port 11434)

Ollama supports hardware acceleration via Metal on Apple Silicon, CUDA on NVIDIA GPUs,
and ROCm on AMD GPUs. It falls back to CPU if no GPU is detected.

The nomic-embed-text model provides 768-dimensional embeddings optimised for retrieval.
Pull it with: ollama pull nomic-embed-text
        """,
    ),
]

# Index all sample documents
for source_name, content in sample_docs:
    index_text(
        text=content.strip(),
        source=source_name,
        collection=collection,
        chunk_size=400,
        overlap=80,
    )
    print()

print(f"Total documents in collection: {collection.count()}")

7. RetrievalΒΆ

Given a query, embed it and fetch the most relevant chunks.

def retrieve(
    query: str,
    collection: chromadb.Collection,
    k: int = 3,
    embed_model: str = "nomic-embed-text",
    source_filter: Optional[str] = None,
) -> List[Dict]:
    """
    Retrieve the top-k most relevant chunks for a query.

    Args:
        query: Natural language question.
        collection: ChromaDB collection to search.
        k: Number of chunks to return.
        embed_model: Ollama model for query embedding.
        source_filter: Optionally restrict to a single source file.

    Returns:
        List of dicts with 'text', 'source', 'distance' keys.
    """
    query_embedding = embed(query, model=embed_model)
    where = {"source": source_filter} if source_filter else None

    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=k,
        where=where,
        include=["documents", "metadatas", "distances"],
    )

    chunks = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    ):
        chunks.append({"text": doc, "source": meta["source"], "distance": dist})

    return chunks


# Test retrieval
query = "How does the attention mechanism work in Transformers?"
results = retrieve(query, collection, k=3)

print(f"Query: {query}\n")
for i, r in enumerate(results, 1):
    print(f"[{i}] Source: {r['source']} | Distance: {r['distance']:.4f}")
    print(f"    {r['text'][:200]}...\n")

8. Building the RAG ChainΒΆ

Combine retrieval with Ollama LLM generation.

def build_prompt(query: str, context_chunks: List[Dict]) -> str:
    """Assemble the RAG prompt."""
    context = "\n\n---\n\n".join(
        f"[Source: {c['source']}]\n{c['text']}" for c in context_chunks
    )
    return f"""You are a helpful assistant. Answer the question using ONLY the provided context.
If the context does not contain enough information, say so β€” do not make things up.

Context:
{context}

Question: {query}

Answer:"""


def rag_query(
    query: str,
    collection: chromadb.Collection,
    llm_model: str = "llama3.2",
    k: int = 3,
    stream: bool = False,
    temperature: float = 0.1,
) -> str:
    """
    Full RAG pipeline: retrieve -> augment -> generate.

    Args:
        query: User question.
        collection: ChromaDB collection.
        llm_model: Ollama model to use for generation.
        k: Number of context chunks to retrieve.
        stream: If True, print tokens as they are generated.
        temperature: LLM temperature (low = factual, high = creative).

    Returns:
        Generated answer string.
    """
    # Step 1: Retrieve
    chunks = retrieve(query, collection, k=k)

    # Step 2: Augment
    prompt = build_prompt(query, chunks)

    # Step 3: Generate
    if stream:
        print("Answer: ", end="", flush=True)
        full_response = ""
        for part in ollama.chat(
            model=llm_model,
            messages=[{"role": "user", "content": prompt}],
            stream=True,
            options={"temperature": temperature},
        ):
            token = part["message"]["content"]
            print(token, end="", flush=True)
            full_response += token
        print()
        return full_response
    else:
        response = ollama.chat(
            model=llm_model,
            messages=[{"role": "user", "content": prompt}],
            options={"temperature": temperature},
        )
        return response["message"]["content"]


print("RAG chain ready.")

9. Complete Local Q&A SystemΒΆ

# Run queries against your local document store
test_queries = [
    "What is the key mechanism in the Transformer architecture?",
    "How do I install Ollama and pull a model?",
    "What are the four stages of a RAG pipeline?",
    "Which hardware does Ollama support for acceleration?",
]

for query in test_queries:
    print(f"Q: {query}")
    answer = rag_query(query, collection, llm_model="llama3.2", k=3)
    print(f"A: {answer}")
    print("-" * 60)

10. Interactive Q&A LoopΒΆ

class LocalRAGSystem:
    """
    Fully local, privacy-preserving RAG system.
    No data leaves your machine.
    """

    def __init__(
        self,
        collection: chromadb.Collection,
        llm_model: str = "llama3.2",
        embed_model: str = "nomic-embed-text",
        k: int = 4,
    ):
        self.collection = collection
        self.llm_model = llm_model
        self.embed_model = embed_model
        self.k = k

    def add_document(self, text: str, source: str) -> int:
        """Add a document to the knowledge base."""
        return index_text(text, source=source, collection=self.collection)

    def add_file(self, path: str) -> int:
        """Load and index a file."""
        return index_file(path, collection=self.collection)

    def ask(self, question: str, stream: bool = True) -> str:
        """Ask a question and get an answer grounded in indexed documents."""
        return rag_query(
            query=question,
            collection=self.collection,
            llm_model=self.llm_model,
            k=self.k,
            stream=stream,
        )

    def sources(self) -> List[str]:
        """List all indexed source documents."""
        results = self.collection.get(include=["metadatas"])
        sources = list({m["source"] for m in results["metadatas"]})
        return sorted(sources)

    def stats(self):
        """Print collection stats."""
        print(f"Model (LLM): {self.llm_model}")
        print(f"Model (embeddings): {self.embed_model}")
        print(f"Total chunks: {self.collection.count()}")
        print(f"Sources: {self.sources()}")


# Create the RAG system
rag = LocalRAGSystem(
    collection=collection,
    llm_model="llama3.2",
    k=4,
)

rag.stats()
# Ask questions with streaming output
print("=" * 60)
print("Local RAG System β€” all processing on your machine")
print("=" * 60 + "\n")

rag.ask("Explain RAG and why it reduces hallucinations.", stream=True)

11. Performance Comparison β€” Models for RAGΒΆ

Different LLMs have different strengths for RAG tasks. Key factors:

  • Speed (tokens/sec)

  • Faithfulness (does the answer stay grounded in context?)

  • Context utilisation (does it use all retrieved chunks?)
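Speed is easy to time; faithfulness is harder. A crude lexical proxy, the fraction of an answer's content words that appear in the retrieved context, can flag obviously ungrounded answers, though a proper evaluation would use an LLM judge or an NLI model (this heuristic is an illustration and not part of the benchmark below):

```python
def faithfulness_score(answer: str, context_chunks: list) -> float:
    """Fraction of answer content words (>3 chars) that occur in the context."""
    def words(text):
        return {w.lower().strip(".,!?:;\"'()") for w in text.split()}
    context_words = words(" ".join(c["text"] for c in context_chunks))
    answer_words = [w for w in words(answer) if len(w) > 3]
    if not answer_words:
        return 0.0
    grounded = sum(1 for w in answer_words if w in context_words)
    return grounded / len(answer_words)

chunks = [{"text": "Ollama supports Metal and CUDA acceleration."}]
print(faithfulness_score("Ollama supports Metal acceleration.", chunks))  # 1.0
```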

def benchmark_model(
    model: str,
    queries: List[str],
    collection: chromadb.Collection,
    k: int = 3,
) -> Dict:
    """
    Benchmark a model on a set of queries.
    Returns timing and answer statistics.
    """
    times = []
    answers = []

    for q in queries:
        start = time.perf_counter()
        answer = rag_query(q, collection, llm_model=model, k=k, stream=False)
        elapsed = time.perf_counter() - start
        times.append(elapsed)
        answers.append(answer)

    # Estimate tokens per second (rough: avg word is ~1.3 tokens)
    avg_words = np.mean([len(a.split()) for a in answers])
    avg_time = np.mean(times)
    tps = (avg_words * 1.3) / avg_time

    return {
        "model": model,
        "avg_latency_s": round(avg_time, 2),
        "est_tokens_per_sec": round(tps, 1),
        "avg_answer_words": round(avg_words, 1),
    }


# Models to benchmark β€” comment out models you haven't pulled
MODELS_TO_TEST = [
    "llama3.2",
    # "mistral",   # Uncomment if pulled
    # "phi4",      # Uncomment if pulled
    # "qwen2.5:7b", # Uncomment if pulled
]

BENCHMARK_QUERIES = [
    "What hardware acceleration does Ollama support?",
    "What year was RAG introduced and by whom?",
    "How does self-attention work?",
]

results = []
for model in MODELS_TO_TEST:
    print(f"Benchmarking {model}...")
    try:
        stats = benchmark_model(model, BENCHMARK_QUERIES, collection)
        results.append(stats)
        print(f"  Latency: {stats['avg_latency_s']}s | ~{stats['est_tokens_per_sec']} tok/s")
    except Exception as e:
        print(f"  Skipped ({e})")

if results:
    print("\nBenchmark Summary:")
    for r in sorted(results, key=lambda x: x['est_tokens_per_sec'], reverse=True):
        print(f"  {r['model']:20} | {r['avg_latency_s']:5}s | {r['est_tokens_per_sec']:6} tok/s")

12. Quantisation Levels (q4 vs q8 vs Full)ΒΆ

Ollama downloads models in GGUF format. You can choose the quantisation level to trade quality for speed/memory.

# Pull specific quantisation levels
ollama pull llama3.2:3b           # default (usually Q4_K_M)
ollama pull llama3.2:3b-instruct-q4_K_M   # 4-bit, K-quant medium
ollama pull llama3.2:3b-instruct-q8_0     # 8-bit (higher quality, more RAM)
ollama pull llama3.2:3b-instruct-fp16     # full 16-bit (best quality, most RAM)

Quantisation | Memory | Quality   | Speed
------------ | ------ | --------- | --------
fp16         | 100%   | Best      | Slow
q8_0         | ~50%   | Very good | Medium
q4_K_M       | ~25%   | Good      | Fast
q4_K_S       | ~23%   | Decent    | Fastest

Recommendation for RAG: q4_K_M is the sweet spot β€” 4-bit medium K-quant gives near-fp16 quality at a quarter of the memory cost.

# Show model sizes and quantisation info for installed models
models_info = ollama.list()

print("Installed Ollama models:")
print(f"{'Model':<35} {'Size (GB)':<12} {'Quantisation'}")
print("-" * 65)

for m in models_info.get("models", []):
    # Field is 'model' in newer ollama clients, 'name' in older ones
    name = m.get("model") or m.get("name") or "?"
    size_gb = (m.get("size") or 0) / 1e9
    details = m.get("details") or {}
    quant = details.get("quantization_level", "unknown")
    print(f"{name:<35} {size_gb:<12.2f} {quant}")

13. Hardware AccelerationΒΆ

Ollama automatically detects and uses available hardware. No configuration needed.

Apple Silicon (M1/M2/M3/M4 — Metal)¶

# Check that Metal acceleration is active (run while a model is loaded)
ollama run llama3.2 "Say hi"
ollama ps
# The PROCESSOR column shows "100% GPU" when Metal is being used

Linux β€” NVIDIA (CUDA)ΒΆ

# Verify GPU detection
nvidia-smi
ollama run llama3.2  # Ollama uses CUDA automatically

CPU FallbackΒΆ

If no GPU is detected, Ollama falls back to CPU using llama.cpp’s optimised CPU kernels. Use smaller models (1B–3B) for acceptable speed on CPU.

import platform
import subprocess

def detect_hardware():
    """Detect available hardware and print recommendations."""
    system = platform.system()
    machine = platform.machine()
    print(f"OS: {system} | Architecture: {machine}")

    # macOS Apple Silicon
    if system == "Darwin" and machine == "arm64":
        print("Hardware: Apple Silicon (GPU via Metal)")
        # Get memory info on macOS
        try:
            result = subprocess.run(
                ["sysctl", "-n", "hw.memsize"], capture_output=True, text=True
            )
            ram_gb = int(result.stdout.strip()) / 1e9
            print(f"Unified Memory: {ram_gb:.0f} GB")
            if ram_gb >= 32:
                print("Recommendation: Use 14B–32B models (phi4, qwen2.5:14b)")
            elif ram_gb >= 16:
                print("Recommendation: Use 7B–14B models (mistral, qwen2.5:7b)")
            else:
                print("Recommendation: Use 1B–7B models (llama3.2:3b, llama3.2:1b)")
        except Exception:
            print("Could not read memory info.")

    # Linux / Windows β€” try nvidia-smi
    elif system in ("Linux", "Windows"):
        try:
            result = subprocess.run(
                ["nvidia-smi", "--query-gpu=name,memory.total",
                 "--format=csv,noheader"],
                capture_output=True, text=True, timeout=5,
            )
            if result.returncode == 0:
                print("Hardware: NVIDIA GPU detected")
                print(result.stdout.strip())
            else:
                print("Hardware: CPU only (no NVIDIA GPU found)")
        except FileNotFoundError:
            print("Hardware: CPU only (nvidia-smi not found)")

detect_hardware()

14. LangChain IntegrationΒΆ

LangChain’s OllamaEmbeddings and ChatOllama classes wrap Ollama for use in LangChain pipelines.

# LangChain + Ollama + ChromaDB
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Embeddings via Ollama
lc_embeddings = OllamaEmbeddings(model="nomic-embed-text")

# LLM via Ollama
lc_llm = ChatOllama(model="llama3.2", temperature=0.1)

# Build LangChain documents from our sample texts
lc_docs = [
    Document(page_content=content.strip(), metadata={"source": name})
    for name, content in sample_docs
]

# Text splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=80,
    separators=["\n\n", "\n", ". ", " "],
)
split_docs = splitter.split_documents(lc_docs)
print(f"Total chunks after splitting: {len(split_docs)}")

# Vector store (LangChain wrapper around ChromaDB)
LANGCHAIN_CHROMA_PATH = "./chroma_langchain"
lc_vectorstore = Chroma.from_documents(
    documents=split_docs,
    embedding=lc_embeddings,
    persist_directory=LANGCHAIN_CHROMA_PATH,
)
print(f"LangChain vector store created at {LANGCHAIN_CHROMA_PATH}")
# Build the RetrievalQA chain
RAG_PROMPT = PromptTemplate(
    template="""Use the following context to answer the question.
If you don't know the answer from the context, say you don't know.

Context:
{context}

Question: {question}
Answer:""",
    input_variables=["context", "question"],
)

qa_chain = RetrievalQA.from_chain_type(
    llm=lc_llm,
    chain_type="stuff",
    retriever=lc_vectorstore.as_retriever(search_kwargs={"k": 4}),
    chain_type_kwargs={"prompt": RAG_PROMPT},
    return_source_documents=True,
)

# Run a query
result = qa_chain.invoke({"query": "What models can I run with Ollama?"})
print("Answer:", result["result"])
print("\nSources used:")
for doc in result["source_documents"]:
    print(f"  - {doc.metadata['source']}")

15. Privacy-Preserving Chat Over Sensitive DocumentsΒΆ

Key guarantees of this setup:

  • All embeddings generated locally by nomic-embed-text via Ollama

  • All LLM inference by local Ollama model

  • ChromaDB stores data on local disk only

  • Zero outbound network requests during inference

# Demonstrate with a "confidential" document
confidential_content = """
CONFIDENTIAL β€” Q4 2025 Financial Summary

Revenue: $142.3M (up 23% YoY)
Gross Margin: 68.4%
Operating Expenses: $89.1M
EBITDA: $53.2M

Top Products by Revenue:
1. Enterprise SaaS Platform β€” $87.4M
2. Data Analytics Suite β€” $31.2M
3. Professional Services β€” $23.7M

Headcount: 1,247 employees (up from 984 in Q4 2024)
New Logos: 312 enterprise customers
Churn Rate: 4.2% (down from 6.1%)

Key Risks:
- Macro headwinds may compress SMB segment
- APAC expansion delayed to Q2 2026
""".strip()

# Create a private collection for sensitive data
try:
    private_collection = chroma_client.get_collection("private_docs")
except Exception:
    private_collection = chroma_client.create_collection(
        "private_docs", metadata={"hnsw:space": "cosine"}
    )

index_text(confidential_content, source="q4_financials.txt", collection=private_collection)

# Query sensitive data locally β€” nothing leaves your machine
print("\nAsking questions about confidential document (100% local):\n")

questions = [
    "What was the revenue in Q4 2025?",
    "What are the key business risks?",
    "How many new enterprise customers were added?",
]

for q in questions:
    answer = rag_query(q, private_collection, llm_model="llama3.2", k=2)
    print(f"Q: {q}")
    print(f"A: {answer}\n")

Key TakeawaysΒΆ

  1. Zero cloud dependency β€” Ollama + nomic-embed-text + ChromaDB gives a complete local stack

  2. nomic-embed-text is the go-to local embedding model β€” 768 dims, retrieval-optimised, Apache 2.0

  3. Chunk size matters β€” 400–600 chars with 80-100 char overlap works well for 4K-context models

  4. Quantisation sweet spot β€” q4_K_M gives ~fp16 quality at 25% memory cost

  5. Apple Silicon users get Metal acceleration automatically; NVIDIA users get CUDA

  6. LangChain wraps everything for more complex chains (agents, multi-hop retrieval)

Next StepsΒΆ

  • 04_llm_server_and_api.ipynb β€” Expose local LLMs as OpenAI-compatible API servers

  • Add re-ranking with a cross-encoder for better retrieval precision

  • Explore HyDE (generate a hypothetical answer, embed it for retrieval)

  • Add a conversational memory layer for multi-turn Q&A