Local RAG with Ollama¶
Build a fully local Retrieval-Augmented Generation (RAG) system — no cloud APIs, no data leaving your machine.
Why Local RAG?¶
| Reason | Details |
|---|---|
| Privacy | Sensitive documents never leave your machine |
| Cost | Zero per-token charges after setup |
| Offline | Works without an internet connection |
| Compliance | Helps meet HIPAA, GDPR, and SOC 2 requirements |
| Latency | No network round-trip to a remote API |
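The cost argument can be made concrete with a toy break-even calculation. All numbers below are placeholder assumptions for illustration, not real vendor rates:

```python
# Toy break-even: cloud per-token pricing vs zero marginal cost locally.
# Both figures below are placeholder assumptions, not real vendor prices.
cloud_cost_per_m_tokens = 0.50       # USD per 1M tokens (assumption)
monthly_tokens = 50_000_000          # workload: 50M tokens/month (assumption)

cloud_monthly = monthly_tokens / 1_000_000 * cloud_cost_per_m_tokens
print(f"Hypothetical cloud spend: ${cloud_monthly:.2f}/month; local marginal cost: $0")
```

Under these assumptions a local setup pays for itself quickly; plug in your own workload and pricing to get a real estimate.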
Architecture¶
┌──────────────────────────────────────────────────────────┐
│                   Local RAG Pipeline                     │
│                                                          │
│  Documents ──► Chunker ──► nomic-embed-text ──► ChromaDB │
│                             (embeddings)                 │
│                                                          │
│  User Query ──► nomic-embed-text ──► ChromaDB retrieval  │
│                                           │              │
│          Context + Query ◄────────────────┘              │
│                 │                                        │
│                 ▼                                        │
│     llama3.2 / mistral / phi4                            │
│                 │                                        │
│                 ▼                                        │
│              Answer                                      │
└──────────────────────────────────────────────────────────┘
Components:
- Ollama — runs LLMs and embedding models locally
- nomic-embed-text — local embedding model (768-dim)
- ChromaDB — local vector store (persists to disk)
- LangChain — orchestration layer
- pypdf — PDF ingestion
Setup¶
1. Install Ollama and pull models¶
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull LLMs (choose based on your RAM)
ollama pull llama3.2        # 3B — needs ~4 GB RAM
ollama pull mistral         # 7B — needs ~8 GB RAM
ollama pull phi4            # 14B — needs ~16 GB RAM
ollama pull qwen2.5:7b      # 7B — needs ~8 GB RAM
# Pull embedding model (mandatory for local embeddings)
ollama pull nomic-embed-text
2. Install Python packages¶
# Install all required packages
!pip install -q ollama langchain langchain-ollama langchain-community chromadb pypdf requests numpy
import os
import json
import time
import requests
import numpy as np
from pathlib import Path
from typing import List, Dict, Optional
import ollama
import chromadb
from chromadb.config import Settings
print("All imports successful!")
print(f"ChromaDB version: {chromadb.__version__}")
1. Verify Ollama is Running¶
def check_ollama():
"""Check that Ollama server is reachable."""
try:
response = requests.get("http://localhost:11434", timeout=3)
print("Ollama server: RUNNING")
except requests.exceptions.ConnectionError:
        print("Ollama server: NOT RUNNING — start it with: ollama serve")
return False
# List available models
models = ollama.list()
available = [m['name'] for m in models.get('models', [])]
print(f"Available models: {available}")
required = ['nomic-embed-text']
missing = [m for m in required if not any(m in a for a in available)]
if missing:
print(f"\nMissing required models: {missing}")
print("Run: ollama pull nomic-embed-text")
else:
print("\nAll required models available!")
return True
check_ollama()
2. Ollama Embedding API¶
Ollama exposes an /api/embeddings endpoint. The nomic-embed-text model produces 768-dimensional vectors and is optimised for retrieval tasks.
def embed(text: str, model: str = "nomic-embed-text") -> List[float]:
"""Generate an embedding vector for a piece of text."""
response = ollama.embeddings(model=model, prompt=text)
return response["embedding"]
def embed_batch(texts: List[str], model: str = "nomic-embed-text") -> List[List[float]]:
"""Embed a list of texts."""
return [embed(t, model) for t in texts]
# Test it
sample_embedding = embed("Retrieval-Augmented Generation combines search with generation.")
print(f"Embedding dimensions: {len(sample_embedding)}")
print(f"First 5 values: {sample_embedding[:5]}")
print(f"Vector norm: {np.linalg.norm(sample_embedding):.4f}")
# Verify semantic similarity works as expected
def cosine_similarity(a: List[float], b: List[float]) -> float:
a, b = np.array(a), np.array(b)
return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
sentences = [
"The cat sat on the mat.",
"A feline rested on a rug.", # Semantically similar
"The stock market crashed today.", # Unrelated
]
embeddings = embed_batch(sentences)
for i, s in enumerate(sentences[1:], 1):
sim = cosine_similarity(embeddings[0], embeddings[i])
print(f"Similarity: '{sentences[0]}' vs '{s}'")
print(f" Score: {sim:.4f}\n")
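The Python client used above wraps a plain REST call. As a sketch of what happens on the wire, here is the same request made directly against the `/api/embeddings` endpoint with `requests` (assumes Ollama's default port 11434; the call is wrapped so the cell degrades gracefully when the server is down):

```python
import requests

# Payload for Ollama's /api/embeddings endpoint; the Python client
# builds an equivalent request under the hood.
payload = {
    "model": "nomic-embed-text",
    "prompt": "Retrieval-Augmented Generation combines search with generation.",
}

try:
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json=payload,
        timeout=10,
    )
    resp.raise_for_status()
    vector = resp.json().get("embedding", [])
    print(f"Got {len(vector)}-dim embedding via raw REST")
except requests.exceptions.RequestException:
    print("Ollama server not reachable on localhost:11434")
```

Any HTTP client in any language can hit this endpoint the same way, which is handy when integrating local embeddings outside Python.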
3. ChromaDB — Local Vector Store¶
ChromaDB stores document chunks and their embeddings on disk. It supports metadata filtering, persistent storage, and fast approximate nearest-neighbour search.
# Persistent ChromaDB β data survives notebook restarts
CHROMA_PATH = "./chroma_db"
chroma_client = chromadb.PersistentClient(
path=CHROMA_PATH,
settings=Settings(anonymized_telemetry=False)
)
# Create (or open existing) collection
COLLECTION_NAME = "local_rag_docs"
try:
collection = chroma_client.get_collection(name=COLLECTION_NAME)
print(f"Opened existing collection '{COLLECTION_NAME}' ({collection.count()} docs)")
except Exception:
collection = chroma_client.create_collection(
name=COLLECTION_NAME,
metadata={"hnsw:space": "cosine"} # Use cosine distance
)
print(f"Created new collection '{COLLECTION_NAME}'")
4. Document Ingestion Pipeline¶
Supports plain text, Markdown, and PDF files.
from pypdf import PdfReader
def load_text_file(path: str) -> str:
"""Load plain text or Markdown."""
with open(path, "r", encoding="utf-8") as f:
return f.read()
def load_pdf(path: str) -> str:
"""Extract text from a PDF."""
reader = PdfReader(path)
pages = [page.extract_text() or "" for page in reader.pages]
return "\n\n".join(pages)
def load_document(path: str) -> str:
"""Auto-detect file type and load content."""
suffix = Path(path).suffix.lower()
if suffix == ".pdf":
return load_pdf(path)
elif suffix in (".txt", ".md", ".rst"):
return load_text_file(path)
else:
raise ValueError(f"Unsupported file type: {suffix}")
print("Document loaders ready.")
5. Chunking Strategies¶
Chunking is critical for RAG quality. Chunks must be:
- Large enough to contain a complete thought
- Small enough to stay semantically focused
- Overlapping, so context is not cut off at chunk boundaries
Rule of thumb for local models with a 4K context window: 400–600 characters per chunk (roughly 100–150 tokens) with ~20% overlap.
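As a quick sanity check on the rule of thumb above, the common (and very rough) ~4-characters-per-token heuristic for English shows that a handful of such chunks fits comfortably in a 4K context — the heuristic is an assumption, not an exact tokeniser:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate using the ~4-chars-per-token English heuristic."""
    return max(1, round(len(text) / chars_per_token))

# A 500-character chunk is roughly 125 tokens, so four retrieved chunks
# plus the prompt template still leave plenty of room in a 4K window.
chunk = "x" * 500
print(estimate_tokens(chunk))       # ~125 tokens per chunk
print(estimate_tokens(chunk) * 4)   # ~500 tokens for k=4 retrieved chunks
```

For precise budgets you would use the actual model tokeniser, but this estimate is good enough for choosing chunk sizes.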
def chunk_by_characters(
text: str,
chunk_size: int = 500,
overlap: int = 100,
separators: List[str] = ["\n\n", "\n", ". ", " "]
) -> List[str]:
"""
Split text into overlapping chunks.
Respects natural separators (paragraphs > sentences > words).
"""
chunks = []
start = 0
while start < len(text):
end = min(start + chunk_size, len(text))
# Try to end at a natural boundary
if end < len(text):
for sep in separators:
idx = text.rfind(sep, start, end)
if idx != -1:
end = idx + len(sep)
break
chunk = text[start:end].strip()
if chunk:
chunks.append(chunk)
start = max(start + 1, end - overlap)
return chunks
def chunk_by_paragraphs(text: str, min_length: int = 100) -> List[str]:
"""
Split by paragraph boundaries.
Good for well-structured documents.
"""
paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
return [p for p in paragraphs if len(p) >= min_length]
# Demo
sample_text = """
Retrieval-Augmented Generation (RAG) is a technique that improves LLM responses by
grounding them in external knowledge retrieved at inference time.
The process has two phases: indexing and retrieval. During indexing, documents are
split into chunks, converted into embedding vectors, and stored in a vector database.
During retrieval, the user query is embedded and the nearest document chunks are
fetched. These chunks are concatenated with the query and passed to the LLM as context,
enabling the model to answer questions about content it was not trained on.
""".strip()
char_chunks = chunk_by_characters(sample_text, chunk_size=300, overlap=60)
para_chunks = chunk_by_paragraphs(sample_text)
print(f"Character-based chunks: {len(char_chunks)}")
print(f"Paragraph-based chunks: {len(para_chunks)}")
print(f"\nFirst character chunk:\n{char_chunks[0]}")
6. Indexing Documents into ChromaDB¶
def index_text(
text: str,
source: str,
collection: chromadb.Collection,
chunk_size: int = 500,
overlap: int = 100,
embed_model: str = "nomic-embed-text",
batch_size: int = 10,
) -> int:
"""
Chunk, embed, and store a document in ChromaDB.
Returns: number of chunks stored.
"""
chunks = chunk_by_characters(text, chunk_size=chunk_size, overlap=overlap)
print(f"Document '{source}': {len(chunks)} chunks")
# Process in batches to avoid memory spikes
total_stored = 0
for i in range(0, len(chunks), batch_size):
batch = chunks[i : i + batch_size]
embeddings = embed_batch(batch, model=embed_model)
ids = [f"{source}_chunk_{i + j}" for j in range(len(batch))]
metadatas = [{"source": source, "chunk_index": i + j} for j in range(len(batch))]
collection.upsert(
ids=ids,
embeddings=embeddings,
documents=batch,
metadatas=metadatas,
)
total_stored += len(batch)
print(f" Stored batch {i // batch_size + 1}: {total_stored}/{len(chunks)} chunks")
return total_stored
def index_file(
path: str,
collection: chromadb.Collection,
**kwargs
) -> int:
"""Load a file from disk and index it."""
text = load_document(path)
source = Path(path).name
return index_text(text, source=source, collection=collection, **kwargs)
print("Indexing functions ready.")
# Index sample documents (using in-memory text for the demo)
sample_docs = [
(
"intro_to_llms.txt",
"""
Large Language Models (LLMs) are neural networks trained on massive text corpora to predict
the next token in a sequence. They learn grammar, facts, reasoning patterns, and code from
the training data.
Modern LLMs use the Transformer architecture, introduced by Vaswani et al. in the 2017 paper
"Attention Is All You Need". The key mechanism is self-attention, which allows each token to
attend to every other token in the context window.
Popular open-source LLMs include Llama 3 (Meta), Mistral (Mistral AI), Phi-4 (Microsoft),
Qwen 2.5 (Alibaba), and Gemma 2 (Google). These can all be run locally using Ollama.
Training LLMs requires enormous compute — GPT-3 training cost roughly $4.6 million in cloud
GPU time. However, inference with quantised models can run on consumer hardware.
""",
),
(
"rag_overview.txt",
"""
Retrieval-Augmented Generation (RAG) was introduced by Lewis et al. in 2020. It grounds LLM
outputs in a retrieved document corpus, reducing hallucinations and enabling knowledge updates
without expensive model retraining.
RAG pipeline stages:
1. Indexing: Split documents into chunks, embed each chunk, store in a vector store.
2. Retrieval: Embed the query, find the k most similar chunks (cosine similarity).
3. Augmentation: Prepend retrieved chunks to the prompt as context.
4. Generation: The LLM generates an answer conditioned on the context.
Advanced RAG techniques include HyDE (Hypothetical Document Embeddings), re-ranking with
a cross-encoder, query expansion, and multi-hop retrieval for complex questions.
""",
),
(
"ollama_guide.txt",
"""
Ollama is an open-source tool that makes it easy to run LLMs locally. It handles model
downloading, GPU acceleration, and provides a REST API compatible with OpenAI's API.
Installation: curl -fsSL https://ollama.com/install.sh | sh
Key commands:
- ollama pull llama3.2 — Download a model
- ollama run llama3.2 — Interactive chat
- ollama list — Show downloaded models
- ollama serve — Start the API server (port 11434)
Ollama supports hardware acceleration via Metal on Apple Silicon, CUDA on NVIDIA GPUs,
and ROCm on AMD GPUs. It falls back to CPU if no GPU is detected.
The nomic-embed-text model provides 768-dimensional embeddings optimised for retrieval.
Pull it with: ollama pull nomic-embed-text
""",
),
]
# Index all sample documents
for source_name, content in sample_docs:
index_text(
text=content.strip(),
source=source_name,
collection=collection,
chunk_size=400,
overlap=80,
)
print()
print(f"Total documents in collection: {collection.count()}")
7. Retrieval¶
Given a query, embed it and fetch the most relevant chunks.
def retrieve(
query: str,
collection: chromadb.Collection,
k: int = 3,
embed_model: str = "nomic-embed-text",
source_filter: Optional[str] = None,
) -> List[Dict]:
"""
Retrieve the top-k most relevant chunks for a query.
Args:
query: Natural language question.
collection: ChromaDB collection to search.
k: Number of chunks to return.
embed_model: Ollama model for query embedding.
source_filter: Optionally restrict to a single source file.
Returns:
List of dicts with 'text', 'source', 'distance' keys.
"""
query_embedding = embed(query, model=embed_model)
where = {"source": source_filter} if source_filter else None
results = collection.query(
query_embeddings=[query_embedding],
n_results=k,
where=where,
include=["documents", "metadatas", "distances"],
)
chunks = []
for doc, meta, dist in zip(
results["documents"][0],
results["metadatas"][0],
results["distances"][0],
):
chunks.append({"text": doc, "source": meta["source"], "distance": dist})
return chunks
# Test retrieval
query = "How does the attention mechanism work in Transformers?"
results = retrieve(query, collection, k=3)
print(f"Query: {query}\n")
for i, r in enumerate(results, 1):
print(f"[{i}] Source: {r['source']} | Distance: {r['distance']:.4f}")
print(f" {r['text'][:200]}...\n")
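Note that the scores printed above are distances, not similarities: because the collection was created with `{"hnsw:space": "cosine"}`, ChromaDB reports cosine distance, which is 1 minus cosine similarity. A tiny helper makes the relationship explicit:

```python
def cosine_distance_to_similarity(distance: float) -> float:
    """For a cosine-space ChromaDB collection, distance = 1 - cosine similarity."""
    return 1.0 - distance

# A distance of 0.25 corresponds to a similarity of 0.75;
# lower distance means a more relevant chunk.
print(cosine_distance_to_similarity(0.25))  # 0.75
```

This is useful when you want to apply a similarity threshold (e.g. drop chunks below 0.5) rather than rank by raw distance.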
8. Building the RAG Chain¶
Combine retrieval with Ollama LLM generation.
def build_prompt(query: str, context_chunks: List[Dict]) -> str:
"""Assemble the RAG prompt."""
context = "\n\n---\n\n".join(
f"[Source: {c['source']}]\n{c['text']}" for c in context_chunks
)
return f"""You are a helpful assistant. Answer the question using ONLY the provided context.
If the context does not contain enough information, say so — do not make things up.
Context:
{context}
Question: {query}
Answer:"""
def rag_query(
query: str,
collection: chromadb.Collection,
llm_model: str = "llama3.2",
k: int = 3,
stream: bool = False,
temperature: float = 0.1,
) -> str:
"""
Full RAG pipeline: retrieve -> augment -> generate.
Args:
query: User question.
collection: ChromaDB collection.
llm_model: Ollama model to use for generation.
k: Number of context chunks to retrieve.
stream: If True, print tokens as they are generated.
temperature: LLM temperature (low = factual, high = creative).
Returns:
Generated answer string.
"""
# Step 1: Retrieve
chunks = retrieve(query, collection, k=k)
# Step 2: Augment
prompt = build_prompt(query, chunks)
# Step 3: Generate
if stream:
print("Answer: ", end="", flush=True)
full_response = ""
for part in ollama.chat(
model=llm_model,
messages=[{"role": "user", "content": prompt}],
stream=True,
options={"temperature": temperature},
):
token = part["message"]["content"]
print(token, end="", flush=True)
full_response += token
print()
return full_response
else:
response = ollama.chat(
model=llm_model,
messages=[{"role": "user", "content": prompt}],
options={"temperature": temperature},
)
return response["message"]["content"]
print("RAG chain ready.")
9. Complete Local Q&A System¶
# Run queries against your local document store
test_queries = [
"What is the key mechanism in the Transformer architecture?",
"How do I install Ollama and pull a model?",
"What are the four stages of a RAG pipeline?",
"Which hardware does Ollama support for acceleration?",
]
for query in test_queries:
print(f"Q: {query}")
answer = rag_query(query, collection, llm_model="llama3.2", k=3)
print(f"A: {answer}")
print("-" * 60)
10. Interactive Q&A Loop¶
class LocalRAGSystem:
"""
Fully local, privacy-preserving RAG system.
No data leaves your machine.
"""
def __init__(
self,
collection: chromadb.Collection,
llm_model: str = "llama3.2",
embed_model: str = "nomic-embed-text",
k: int = 4,
):
self.collection = collection
self.llm_model = llm_model
self.embed_model = embed_model
self.k = k
def add_document(self, text: str, source: str) -> int:
"""Add a document to the knowledge base."""
return index_text(text, source=source, collection=self.collection)
def add_file(self, path: str) -> int:
"""Load and index a file."""
return index_file(path, collection=self.collection)
def ask(self, question: str, stream: bool = True) -> str:
"""Ask a question and get an answer grounded in indexed documents."""
return rag_query(
query=question,
collection=self.collection,
llm_model=self.llm_model,
k=self.k,
stream=stream,
)
def sources(self) -> List[str]:
"""List all indexed source documents."""
results = self.collection.get(include=["metadatas"])
sources = list({m["source"] for m in results["metadatas"]})
return sorted(sources)
def stats(self):
"""Print collection stats."""
print(f"Model (LLM): {self.llm_model}")
print(f"Model (embeddings): {self.embed_model}")
print(f"Total chunks: {self.collection.count()}")
print(f"Sources: {self.sources()}")
# Create the RAG system
rag = LocalRAGSystem(
collection=collection,
llm_model="llama3.2",
k=4,
)
rag.stats()
# Ask questions with streaming output
print("=" * 60)
print("Local RAG System — all processing on your machine")
print("=" * 60 + "\n")
rag.ask("Explain RAG and why it reduces hallucinations.", stream=True)
11. Performance Comparison — Models for RAG¶
Different LLMs have different strengths for RAG tasks. Key factors:
- Speed (tokens/sec)
- Faithfulness (does the answer stay grounded in the context?)
- Context utilisation (does it use all retrieved chunks?)
def benchmark_model(
model: str,
queries: List[str],
collection: chromadb.Collection,
k: int = 3,
) -> Dict:
"""
Benchmark a model on a set of queries.
Returns timing and answer statistics.
"""
times = []
answers = []
for q in queries:
start = time.perf_counter()
answer = rag_query(q, collection, llm_model=model, k=k, stream=False)
elapsed = time.perf_counter() - start
times.append(elapsed)
answers.append(answer)
# Estimate tokens per second (rough: avg word is ~1.3 tokens)
avg_words = np.mean([len(a.split()) for a in answers])
avg_time = np.mean(times)
tps = (avg_words * 1.3) / avg_time
return {
"model": model,
"avg_latency_s": round(avg_time, 2),
"est_tokens_per_sec": round(tps, 1),
"avg_answer_words": round(avg_words, 1),
}
# Models to benchmark β comment out models you haven't pulled
MODELS_TO_TEST = [
"llama3.2",
# "mistral", # Uncomment if pulled
# "phi4", # Uncomment if pulled
# "qwen2.5:7b", # Uncomment if pulled
]
BENCHMARK_QUERIES = [
"What hardware acceleration does Ollama support?",
"What year was RAG introduced and by whom?",
"How does self-attention work?",
]
results = []
for model in MODELS_TO_TEST:
print(f"Benchmarking {model}...")
try:
stats = benchmark_model(model, BENCHMARK_QUERIES, collection)
results.append(stats)
print(f" Latency: {stats['avg_latency_s']}s | ~{stats['est_tokens_per_sec']} tok/s")
except Exception as e:
print(f" Skipped ({e})")
if results:
print("\nBenchmark Summary:")
for r in sorted(results, key=lambda x: x['est_tokens_per_sec'], reverse=True):
print(f" {r['model']:20} | {r['avg_latency_s']:5}s | {r['est_tokens_per_sec']:6} tok/s")
12. Quantisation Levels (q4 vs q8 vs Full)¶
Ollama downloads models in GGUF format. You can choose the quantisation level to trade quality for speed/memory.
# Pull specific quantisation levels
ollama pull llama3.2:3b # default (usually Q4_K_M)
ollama pull llama3.2:3b-instruct-q4_K_M # 4-bit, K-quant medium
ollama pull llama3.2:3b-instruct-q8_0 # 8-bit (higher quality, more RAM)
ollama pull llama3.2:3b-instruct-fp16 # full 16-bit (best quality, most RAM)
| Quantisation | Memory | Quality | Speed |
|---|---|---|---|
| fp16 | 100% | Best | Slow |
| q8_0 | ~50% | Very good | Medium |
| q4_K_M | ~25% | Good | Fast |
| q4_0 | ~23% | Decent | Fastest |
Recommendation for RAG: q4_K_M is the sweet spot — 4-bit medium K-quant gives near-fp16 quality at roughly a quarter of the memory cost.
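These memory figures can be sanity-checked from first principles: weight memory is roughly parameter count × bits per weight ÷ 8, plus KV cache and runtime overhead. A back-of-the-envelope sketch — the ~20% overhead factor and the effective bits per weight for the quant formats are approximations, not measured values:

```python
def estimate_model_ram_gb(params_billions: float, bits_per_weight: float,
                          overhead: float = 1.2) -> float:
    """Rough RAM estimate: weight bytes plus ~20% for KV cache and runtime."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return round(weight_bytes * overhead / 1e9, 1)

# Approximate effective bits per weight per format (assumptions):
# fp16 = 16, q8_0 ~= 8.5, q4_K_M ~= 4.8 (K-quants store some metadata)
for name, bits in [("fp16", 16), ("q8_0", 8.5), ("q4_K_M", 4.8)]:
    print(f"7B model, {name:7} ~= {estimate_model_ram_gb(7, bits)} GB")
```

The output lines up with the rule of thumb that a q4 7B model fits in roughly 8 GB of RAM while fp16 needs closer to 16 GB.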
# Show model sizes and quantisation info for installed models
models_info = ollama.list()
print("Installed Ollama models:")
print(f"{'Model':<35} {'Size (GB)':<12} {'Quantisation'}")
print("-" * 65)
for m in models_info.get("models", []):
name = m.get("name", "?")
size_gb = m.get("size", 0) / 1e9
details = m.get("details", {})
quant = details.get("quantization_level", "unknown")
print(f"{name:<35} {size_gb:<12.2f} {quant}")
13. Hardware Acceleration¶
Ollama automatically detects and uses available hardware. No configuration needed.
Apple Silicon (M1/M2/M3/M4 — Metal)¶
# Run a model, then check that it landed on the GPU
ollama run llama3.2 "hello" > /dev/null
ollama ps   # The PROCESSOR column shows "100% GPU" when Metal is active
Linux — NVIDIA (CUDA)¶
# Verify GPU detection
nvidia-smi
ollama run llama3.2 # Ollama uses CUDA automatically
CPU Fallback¶
If no GPU is detected, Ollama falls back to CPU using llama.cpp's optimised CPU kernels. Use smaller models (1B–3B) for acceptable speed on CPU.
import platform
import subprocess
def detect_hardware():
"""Detect available hardware and print recommendations."""
system = platform.system()
machine = platform.machine()
print(f"OS: {system} | Architecture: {machine}")
# macOS Apple Silicon
if system == "Darwin" and machine == "arm64":
        print("Hardware: Apple Silicon (Metal acceleration)")
# Get memory info on macOS
try:
result = subprocess.run(
["sysctl", "-n", "hw.memsize"], capture_output=True, text=True
)
ram_gb = int(result.stdout.strip()) / 1e9
print(f"Unified Memory: {ram_gb:.0f} GB")
if ram_gb >= 32:
                print("Recommendation: Use 14B–32B models (phi4, qwen2.5:14b)")
elif ram_gb >= 16:
                print("Recommendation: Use 7B–14B models (mistral, qwen2.5:7b)")
else:
                print("Recommendation: Use 1B–7B models (llama3.2:3b, llama3.2:1b)")
except Exception:
print("Could not read memory info.")
# Linux / Windows β try nvidia-smi
elif system in ("Linux", "Windows"):
try:
result = subprocess.run(
["nvidia-smi", "--query-gpu=name,memory.total",
"--format=csv,noheader"],
capture_output=True, text=True, timeout=5,
)
if result.returncode == 0:
print(f"Hardware: NVIDIA GPU detected")
print(result.stdout.strip())
else:
print("Hardware: CPU only (no NVIDIA GPU found)")
except FileNotFoundError:
print("Hardware: CPU only (nvidia-smi not found)")
detect_hardware()
14. LangChain Integration¶
LangChain's OllamaEmbeddings and ChatOllama classes wrap Ollama for use in LangChain pipelines.
# LangChain + Ollama + ChromaDB
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
# Embeddings via Ollama
lc_embeddings = OllamaEmbeddings(model="nomic-embed-text")
# LLM via Ollama
lc_llm = ChatOllama(model="llama3.2", temperature=0.1)
# Build LangChain documents from our sample texts
lc_docs = [
Document(page_content=content.strip(), metadata={"source": name})
for name, content in sample_docs
]
# Text splitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=400,
chunk_overlap=80,
separators=["\n\n", "\n", ". ", " "],
)
split_docs = splitter.split_documents(lc_docs)
print(f"Total chunks after splitting: {len(split_docs)}")
# Vector store (LangChain wrapper around ChromaDB)
LANGCHAIN_CHROMA_PATH = "./chroma_langchain"
lc_vectorstore = Chroma.from_documents(
documents=split_docs,
embedding=lc_embeddings,
persist_directory=LANGCHAIN_CHROMA_PATH,
)
print(f"LangChain vector store created at {LANGCHAIN_CHROMA_PATH}")
# Build the RetrievalQA chain
RAG_PROMPT = PromptTemplate(
template="""Use the following context to answer the question.
If you don't know the answer from the context, say you don't know.
Context:
{context}
Question: {question}
Answer:""",
input_variables=["context", "question"],
)
qa_chain = RetrievalQA.from_chain_type(
llm=lc_llm,
chain_type="stuff",
retriever=lc_vectorstore.as_retriever(search_kwargs={"k": 4}),
chain_type_kwargs={"prompt": RAG_PROMPT},
return_source_documents=True,
)
# Run a query
result = qa_chain.invoke({"query": "What models can I run with Ollama?"})
print("Answer:", result["result"])
print("\nSources used:")
for doc in result["source_documents"]:
print(f" - {doc.metadata['source']}")
15. Privacy-Preserving Chat Over Sensitive Documents¶
Key guarantees of this setup:
- All embeddings are generated locally by nomic-embed-text via Ollama
- All LLM inference runs on a local Ollama model
- ChromaDB stores data on local disk only
- Zero outbound network requests during inference
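The zero-outbound-requests guarantee can be enforced rather than just asserted. Here is a sketch of a guard that rejects any socket connection to a non-loopback host while a block runs — a hypothetical helper, not part of Ollama or ChromaDB:

```python
import socket
from contextlib import contextmanager

@contextmanager
def localhost_only():
    """Fail loudly if anything tries to open a socket to a non-loopback host."""
    original_connect = socket.socket.connect

    def guarded_connect(self, address):
        if isinstance(address, tuple):  # AF_INET / AF_INET6 addresses
            host = address[0]
            if host not in ("127.0.0.1", "::1", "localhost"):
                raise RuntimeError(f"Blocked outbound connection to {host!r}")
        return original_connect(self, address)

    socket.socket.connect = guarded_connect
    try:
        yield
    finally:
        socket.socket.connect = original_connect

# Hypothetical usage — run a full RAG query under the guard; any attempt
# to phone home fails loudly instead of silently leaking data:
# with localhost_only():
#     rag_query("What was the revenue?", private_collection)
```

Calls to Ollama and ChromaDB on localhost pass through untouched; anything else raises before a single byte leaves the machine.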
# Demonstrate with a "confidential" document
confidential_content = """
CONFIDENTIAL — Q4 2025 Financial Summary
Revenue: $142.3M (up 23% YoY)
Gross Margin: 68.4%
Operating Expenses: $89.1M
EBITDA: $53.2M
Top Products by Revenue:
1. Enterprise SaaS Platform — $87.4M
2. Data Analytics Suite — $31.2M
3. Professional Services — $23.7M
Headcount: 1,247 employees (up from 984 in Q4 2024)
New Logos: 312 enterprise customers
Churn Rate: 4.2% (down from 6.1%)
Key Risks:
- Macro headwinds may compress SMB segment
- APAC expansion delayed to Q2 2026
""".strip()
# Create a private collection for sensitive data
try:
private_collection = chroma_client.get_collection("private_docs")
except Exception:
private_collection = chroma_client.create_collection(
"private_docs", metadata={"hnsw:space": "cosine"}
)
index_text(confidential_content, source="q4_financials.txt", collection=private_collection)
# Query sensitive data locally β nothing leaves your machine
print("\nAsking questions about confidential document (100% local):\n")
questions = [
"What was the revenue in Q4 2025?",
"What are the key business risks?",
"How many new enterprise customers were added?",
]
for q in questions:
answer = rag_query(q, private_collection, llm_model="llama3.2", k=2)
print(f"Q: {q}")
print(f"A: {answer}\n")
Key Takeaways¶
- Zero cloud dependency — Ollama + nomic-embed-text + ChromaDB gives a complete local stack
- nomic-embed-text is the go-to local embedding model — 768 dims, retrieval-optimised, Apache 2.0
- Chunk size matters — 400–600 characters with 80–100 character overlap works well for 4K-context models
- Quantisation sweet spot — q4_K_M gives near-fp16 quality at ~25% of the memory cost
- Apple Silicon users get Metal acceleration automatically; NVIDIA users get CUDA
- LangChain wraps everything for more complex chains (agents, multi-hop retrieval)
Next Steps¶
- 04_llm_server_and_api.ipynb — Expose local LLMs as OpenAI-compatible API servers
- Add re-ranking with a cross-encoder for better retrieval precision
- Explore HyDE (generate a hypothetical answer, embed it for retrieval)
- Add a conversational memory layer for multi-turn Q&A