Setup¶
1. Install Ollama¶
macOS/Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows: Download from https://ollama.com/download
2. Install Python Client¶
# Install Ollama Python library
# !pip install ollama
import ollama
import json
from pprint import pprint
print("Ollama client ready!")
1. Download and Run a Model¶
Available Models (December 2025)¶
| Model | Size | Best For | RAM Needed |
|---|---|---|---|
| llama3.3 | 70B | Best quality | 48GB |
| llama3.2 | 3B | Fast, general | 4GB |
| llama3.2 | 1B | Tiny, fast | 2GB |
| qwen2.5 | 72B | Multilingual, best | 48GB |
| qwen2.5 | 14B | Multilingual, balanced | 16GB |
| qwen2.5 | 7B | Multilingual | 8GB |
| qwen2.5-coder | 32B | Code generation | 24GB |
| qwen2.5-coder | 7B | Code | 8GB |
| deepseek-r1 | 70B | Reasoning, math | 48GB |
| deepseek-r1 | 14B | Reasoning | 16GB |
| phi-4 | 14B | Coding, math, reasoning | 16GB |
| mistral | 7B | Balanced | 8GB |
| gemma2 | 27B | Google's best | 24GB |
| gemma2 | 9B | Google, balanced | 12GB |
# Pull a model (first time only - downloads to ~/.ollama)
# This happens automatically on first use, but you can pre-download:
# ollama.pull('llama3.2') # Fast 3B model
# ollama.pull('phi-4') # Best for coding & reasoning (Dec 2025)
# ollama.pull('qwen2.5-coder') # Excellent for code
# ollama.pull('deepseek-r1:14b') # Best for reasoning tasks
# ollama.pull('gemma2:9b') # Google's balanced model
# List downloaded models
models = ollama.list()
print("Available models:")
for model in models.get('models', []):
    name = model['name']
    size = model.get('size', 0) / 1e9  # Convert bytes to GB
    print(f"  - {name} ({size:.1f} GB)")
2. Basic Chat¶
The ollama.chat() function sends a prompt to a locally running model and returns the complete response. Unlike cloud APIs, there is no network latency or rate limiting – the bottleneck is your hardware’s inference speed. The streaming variant (stream=True) yields tokens as they are generated, providing a responsive user experience similar to ChatGPT. The messages parameter uses the same format as the OpenAI API (role/content pairs), making it easy to migrate code between local and cloud models.
def chat(prompt, model='llama3.2'):
    """Simple chat interface."""
    response = ollama.chat(
        model=model,
        messages=[{
            'role': 'user',
            'content': prompt
        }]
    )
    return response['message']['content']
# Test
answer = chat("Explain what a neural network is in one sentence.")
print(answer)
# With streaming for real-time output
def chat_stream(prompt, model='llama3.2'):
    """Chat with streaming response."""
    stream = ollama.chat(
        model=model,
        messages=[{'role': 'user', 'content': prompt}],
        stream=True
    )
    full_response = ""
    for chunk in stream:
        content = chunk['message']['content']
        print(content, end='', flush=True)
        full_response += content
    print()  # Newline
    return full_response
# Test streaming
response = chat_stream("Write a haiku about AI.")
3. Conversation with Context¶
LLMs are stateless – each API call is independent. To maintain a multi-turn conversation, you must send the full message history with every request. The Conversation class below manages this history automatically, appending each user message and assistant response to a growing list. An optional system prompt sets the model’s persona and behavior guidelines. Keep in mind that longer histories consume more context tokens and slow down generation, so for extended conversations you may need to implement summarization or sliding-window truncation.
# Multi-turn conversation
messages = [
{'role': 'user', 'content': 'What is the capital of France?'},
{'role': 'assistant', 'content': 'The capital of France is Paris.'},
{'role': 'user', 'content': 'What is its population?'}
]
response = ollama.chat(
model='llama3.2',
messages=messages
)
print(response['message']['content'])
# Conversation class
class Conversation:
    def __init__(self, model='llama3.2', system_prompt=None):
        self.model = model
        self.messages = []
        if system_prompt:
            self.messages.append({
                'role': 'system',
                'content': system_prompt
            })

    def send(self, message):
        """Send message and get response."""
        self.messages.append({
            'role': 'user',
            'content': message
        })
        response = ollama.chat(
            model=self.model,
            messages=self.messages
        )
        assistant_message = response['message']['content']
        self.messages.append({
            'role': 'assistant',
            'content': assistant_message
        })
        return assistant_message

    def reset(self):
        """Clear conversation history."""
        self.messages = []
# Use it
conv = Conversation(
model='llama3.2',
system_prompt="You are a helpful Python programming assistant."
)
print(conv.send("What is a list comprehension?"))
print("\n" + "="*50 + "\n")
print(conv.send("Show me an example."))
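The sliding-window truncation mentioned above can be sketched in a few lines: keep the system prompt, drop the oldest turns. This is an illustrative helper, not part of the ollama library; `trim_history` and `keep_last` are hypothetical names.

```python
# Sliding-window truncation: keep the system prompt (if any) plus the
# N most recent messages so long conversations stay within the context
# window. trim_history is an illustrative helper, not an ollama API.

def trim_history(messages, keep_last=6):
    """Return the system message(s) plus the last keep_last other messages."""
    system = [m for m in messages if m['role'] == 'system']
    rest = [m for m in messages if m['role'] != 'system']
    return system + rest[-keep_last:]

# Example: a 10-turn history trimmed to the last 4 messages
history = [{'role': 'system', 'content': 'You are helpful.'}]
for i in range(10):
    history.append({'role': 'user', 'content': f'question {i}'})
    history.append({'role': 'assistant', 'content': f'answer {i}'})

trimmed = trim_history(history, keep_last=4)
print(len(trimmed))           # 5: system prompt + 4 recent messages
print(trimmed[1]['content'])  # 'question 8'
```

Calling `trim_history(self.messages)` before each `ollama.chat()` call inside a class like `Conversation` keeps request size bounded at the cost of forgetting older turns.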
4. Code Generation¶
Specialized code models like CodeLlama, DeepSeek-Coder, and Phi-4 are trained on large code corpora and significantly outperform general-purpose models on programming tasks. The generate_code() function below uses a structured prompt that specifies the programming language and requests only code output (no explanations), which produces cleaner results. Running code generation locally is especially valuable for enterprise environments where proprietary code cannot be sent to external APIs.
# Use code-specialized model
def generate_code(task, language='python'):
    """Generate code for a task."""
    prompt = f"""Write {language} code to {task}.
Only output the code, no explanations.
Code:"""
    response = ollama.chat(
        model='codellama',  # or 'qwen2.5-coder', 'deepseek-coder', 'phi-4'
        messages=[{'role': 'user', 'content': prompt}]
    )
    return response['message']['content']
# Test
code = generate_code("calculate fibonacci numbers using dynamic programming")
print(code)
5. Structured Output (JSON)¶
Extracting structured data from unstructured text is one of the most practical LLM applications. Ollama’s format='json' parameter constrains the model’s output to valid JSON, eliminating parsing errors caused by stray text or markdown formatting. The extract_json() function takes a schema description and input text, then returns a Python dictionary. This approach is the local equivalent of OpenAI’s structured output mode and works well for information extraction, form filling, and data normalization tasks.
# Extract structured data
def extract_json(text, schema_description):
    """Extract structured information as JSON."""
    prompt = f"""Extract the following information from the text and return as JSON:
{schema_description}

Text: {text}

JSON:"""
    response = ollama.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': prompt}],
        format='json'  # Force JSON output
    )
    return json.loads(response['message']['content'])
# Test
text = """John Smith works at Acme Corp as a Senior Engineer.
He can be reached at john.smith@acme.com or 555-1234."""
schema = """{
"name": "full name",
"company": "company name",
"title": "job title",
"email": "email address",
"phone": "phone number"
}"""
result = extract_json(text, schema)
pprint(result)
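Even with `format='json'`, a model can omit fields or return empty values, so it is worth checking the parsed dictionary before using it downstream. A minimal sketch, where `validate_extraction` is an illustrative helper rather than part of the ollama library:

```python
# Validate that an extracted record contains every expected field.
# validate_extraction is an illustrative helper, not an ollama API.

def validate_extraction(record, required_keys):
    """Return the list of keys that are missing or empty in record."""
    return [k for k in required_keys
            if k not in record or record[k] in (None, '')]

required = ['name', 'company', 'title', 'email', 'phone']

# A complete extraction reports no missing keys:
good = {'name': 'John Smith', 'company': 'Acme Corp',
        'title': 'Senior Engineer', 'email': 'john.smith@acme.com',
        'phone': '555-1234'}
print(validate_extraction(good, required))     # []

# A partial extraction reports exactly what is missing:
partial = {'name': 'John Smith', 'email': 'john.smith@acme.com'}
print(validate_extraction(partial, required))  # ['company', 'title', 'phone']
```

When validation fails, a common recovery strategy is to re-prompt the model with the list of missing fields.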
6. Embeddings for RAG¶
Ollama can also run embedding models locally, converting text into dense vector representations for similarity search. The nomic-embed-text model is a compact embedding model optimized for retrieval tasks. By generating embeddings locally, you can build a fully private RAG pipeline where neither the documents nor the queries ever leave your machine. The cosine similarity search below demonstrates the core retrieval step: encode the query and all documents, then find the document whose embedding is closest to the query embedding in vector space.
# Generate embeddings locally
def get_embedding(text):
    """Get text embedding."""
    response = ollama.embeddings(
        model='nomic-embed-text',  # Specialized embedding model
        prompt=text
    )
    return response['embedding']
# Test
embedding = get_embedding("Machine learning is awesome!")
print(f"Embedding dimension: {len(embedding)}")
print(f"First 5 values: {embedding[:5]}")
# Simple similarity search
import numpy as np
def cosine_similarity(a, b):
    """Compute cosine similarity."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Documents
docs = [
"Python is a programming language.",
"Machine learning uses neural networks.",
"The weather is sunny today.",
"Deep learning is a subset of AI."
]
# Get embeddings
doc_embeddings = [get_embedding(doc) for doc in docs]
# Query
query = "Tell me about artificial intelligence"
query_embedding = get_embedding(query)
# Find most similar
similarities = [cosine_similarity(query_embedding, doc_emb) for doc_emb in doc_embeddings]
best_idx = np.argmax(similarities)
print(f"Query: {query}")
print(f"\nMost relevant document:")
print(f" {docs[best_idx]}")
print(f" Similarity: {similarities[best_idx]:.4f}")
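Real RAG pipelines usually return several candidate passages rather than the single `argmax`. The ranking step can be sketched with toy 3-dimensional vectors standing in for actual embeddings; `top_k_docs` is an illustrative helper, not part of the ollama library:

```python
import numpy as np

# Rank all documents by cosine similarity and return the k best matches.
# Toy 3-d vectors stand in for real embedding vectors here;
# top_k_docs is an illustrative helper, not an ollama API.

def top_k_docs(query_vec, doc_vecs, docs, k=2):
    sims = [np.dot(query_vec, d) / (np.linalg.norm(query_vec) * np.linalg.norm(d))
            for d in doc_vecs]
    order = np.argsort(sims)[::-1][:k]  # indices of the k highest similarities
    return [(docs[i], sims[i]) for i in order]

docs = ["AI overview", "cooking tips", "neural nets"]
doc_vecs = [np.array([1.0, 0.2, 0.0]),
            np.array([0.0, 0.1, 1.0]),
            np.array([0.9, 0.4, 0.1])]
query_vec = np.array([1.0, 0.3, 0.0])

for doc, score in top_k_docs(query_vec, doc_vecs, docs, k=2):
    print(f"{doc}: {score:.3f}")
```

With real data, replace the toy vectors with `get_embedding(doc)` calls; the retrieved passages are then concatenated into the prompt for the generation step.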
7. Model Parameters¶
Generation behavior can be fine-tuned through sampling parameters passed via the options dictionary. Temperature controls randomness: 0.0 produces deterministic, greedy output while values above 1.0 increase creativity at the risk of incoherence. top_p (nucleus sampling) and top_k further control the token selection distribution. repeat_penalty discourages the model from repeating phrases, which is a common issue in local models. These parameters are passed directly to the model’s inference engine and take effect immediately – no reloading required.
# Custom parameters
response = ollama.chat(
model='llama3.2',
messages=[{'role': 'user', 'content': 'Write a creative story opening.'}],
options={
'temperature': 0.8, # Higher = more creative (0-2)
'top_p': 0.9, # Nucleus sampling
'top_k': 40, # Top-k sampling
'num_predict': 100, # Max tokens to generate
'repeat_penalty': 1.1, # Penalize repetition
}
)
print(response['message']['content'])
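The effect of temperature is easy to see in isolation: logits are divided by the temperature before the softmax, so lower values sharpen the next-token distribution and higher values flatten it. A small numpy illustration of the idea (this mirrors the concept, not Ollama's actual sampling implementation):

```python
import numpy as np

# Temperature rescales logits before softmax: lower temperature sharpens
# the distribution, higher temperature flattens it. Illustrative only;
# not Ollama's actual sampling code.

def softmax_with_temperature(logits, temperature):
    scaled = np.array(logits) / temperature
    exp = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return exp / exp.sum()

logits = [2.0, 1.0, 0.5]  # hypothetical next-token scores

for t in (0.2, 0.8, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: max prob = {probs.max():.3f}")
```

As temperature rises, the probability mass spreads across more tokens, which is why high temperatures produce more varied (and eventually incoherent) text.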
8. CLI Usage (from terminal)¶
Ollama also has a powerful CLI:
# Interactive chat
ollama run llama3.2
# One-off prompt
ollama run llama3.2 "Explain quantum computing"
# Set parameters inside an interactive session
ollama run llama3.2
>>> /set parameter temperature 0.8
# List models
ollama list
# Pull model
ollama pull mistral
# Delete model
ollama rm mistral
# Show model info
ollama show llama3.2
Tips & Best Practices¶
Choosing a Model (December 2025)¶
For speed (< 4GB RAM):
llama3.2:1b - Fastest
llama3.2:3b - Better quality, still fast
For quality (8-16GB RAM):
qwen2.5:14b - Best multilingual
phi-4 - Excellent reasoning & coding
deepseek-r1:14b - Best for math & reasoning
mistral - Solid general purpose
For best quality (32GB+ RAM):
llama3.3:70b - Meta's best
qwen2.5:72b - Best multilingual model
deepseek-r1:70b - Best reasoning model
gemma2:27b - Google's flagship
For coding:
qwen2.5-coder:32b - Best code model (Dec 2025)
qwen2.5-coder:7b - Fast, good quality
phi-4 - Excellent for coding + reasoning
deepseek-coder - Alternative
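The RAM tiers above can be encoded as a small lookup so a script degrades gracefully on weaker machines. A sketch under the assumption that these tiers are rough guidelines rather than hard limits; `pick_model` is a hypothetical helper:

```python
# Pick a reasonable default model for the available RAM, following the
# tiers above. pick_model is a hypothetical helper, and the thresholds
# are rough guidelines from this guide, not hard requirements.

def pick_model(ram_gb):
    if ram_gb >= 48:
        return 'llama3.3:70b'
    if ram_gb >= 16:
        return 'qwen2.5:14b'
    if ram_gb >= 8:
        return 'mistral'
    return 'llama3.2:1b'

print(pick_model(64))  # llama3.3:70b
print(pick_model(16))  # qwen2.5:14b
print(pick_model(4))   # llama3.2:1b
```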
Performance¶
# Reduce memory usage
options = {
'num_ctx': 2048, # Smaller context window (default 4096)
}
# Faster generation (less quality)
options = {
'temperature': 0.0, # Greedy decoding
'top_k': 1,
'num_predict': 50, # Limit length
}
# Better quality (slower)
options = {
'temperature': 0.7,
'top_p': 0.95,
'repeat_penalty': 1.2,
}
Common Issues¶
Slow first response?
Model loads into memory on first use
Keep Ollama running:
ollama serve
Out of memory?
Use smaller model (1B, 3B, or 7B)
Reduce num_ctx
Close other applications
Try quantized versions (Q4 or Q5)
Poor quality?
Try larger model (14B, 32B, or 70B)
Use newer models (qwen2.5, phi-4, deepseek-r1)
Improve prompts
Adjust temperature
Exercise: Build a Local Chatbot¶
Combine the concepts above into a complete local chatbot with a configurable personality and conversation memory. The LocalChatbot class wraps Ollama’s chat API with a system prompt that defines the bot’s name and behavior, and maintains conversation history across turns. Try creating bots with different personas (code reviewer, writing coach, domain expert) and compare how the system prompt affects response style and content.
class LocalChatbot:
    def __init__(self, name="Assistant", personality="helpful and friendly", model='qwen2.5:7b'):
        self.name = name
        self.model = model
        self.messages = [{
            'role': 'system',
            'content': f"You are {name}, a {personality} assistant."
        }]

    def chat(self, user_input):
        """Send message and get response."""
        self.messages.append({
            'role': 'user',
            'content': user_input
        })
        response = ollama.chat(
            model=self.model,
            messages=self.messages
        )
        reply = response['message']['content']
        self.messages.append({
            'role': 'assistant',
            'content': reply
        })
        return reply
# Create bot with latest model
bot = LocalChatbot(
name="CodeHelper",
personality="expert Python programmer who explains concepts clearly",
model="phi-4" # Best for coding (Dec 2025)
)
# Test conversation
print(bot.chat("What is a decorator in Python?"))
print("\n" + "="*50 + "\n")
print(bot.chat("Can you show me an example?"))
Key Takeaways¶
Privacy: All processing happens locally
Cost: Zero API fees
Easy: Download and run in minutes
Fast: Low latency for real-time apps
Offline: Works without internet
Latest Models: Access to Llama 3.3, Qwen 2.5, Phi-4, DeepSeek-R1 (Dec 2025)
Limitations¶
Requires good hardware (8GB+ RAM recommended, 16GB+ for best models)
Smaller models are less capable than GPT-4 or Claude 3.5
Slower than cloud APIs on weak hardware
Limited to model’s knowledge cutoff
Next Steps¶
02_model_comparison.ipynb - Compare different models
03_quantization.ipynb - Understand model formats
04_custom_models.ipynb - Create custom Modelfiles
05_rag_local.ipynb - Build RAG with local embeddings
07_production_api.ipynb - Deploy as API server