Setup

1. Install Ollama

macOS/Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download from https://ollama.com/download

2. Install Python Client
# Install Ollama Python library
# !pip install ollama
import ollama
import json
from pprint import pprint

print("Ollama client ready!")

1. Download and Run a Model

Available Models (December 2025)

| Model         | Size | Best For                | RAM Needed |
|---------------|------|-------------------------|------------|
| llama3.3      | 70B  | Best quality            | 48GB       |
| llama3.2      | 3B   | Fast, general           | 4GB        |
| llama3.2      | 1B   | Tiny, fast              | 2GB        |
| qwen2.5       | 72B  | Multilingual, best      | 48GB       |
| qwen2.5       | 14B  | Multilingual, balanced  | 16GB       |
| qwen2.5       | 7B   | Multilingual            | 8GB        |
| qwen2.5-coder | 32B  | Code generation         | 24GB       |
| qwen2.5-coder | 7B   | Code                    | 8GB        |
| deepseek-r1   | 70B  | Reasoning, math         | 48GB       |
| deepseek-r1   | 14B  | Reasoning               | 16GB       |
| phi4          | 14B  | Coding, math, reasoning | 16GB       |
| mistral       | 7B   | Balanced                | 8GB        |
| gemma2        | 27B  | Google's best           | 24GB       |
| gemma2        | 9B   | Google, balanced        | 12GB       |

# Pull a model (first time only - downloads to ~/.ollama)
# This happens automatically on first use, but you can pre-download:

# ollama.pull('llama3.2')         # Fast 3B model
# ollama.pull('phi4')             # Best for coding & reasoning (Dec 2025)
# ollama.pull('qwen2.5-coder')    # Excellent for code
# ollama.pull('deepseek-r1:14b')  # Best for reasoning tasks
# ollama.pull('gemma2:9b')        # Google's balanced model
# List downloaded models
models = ollama.list()
print("Available models:")
for model in models.get('models', []):
    # Older clients return a 'name' key; newer ones use 'model'
    name = model.get('name') or model.get('model')
    size = model.get('size', 0) / 1e9  # bytes -> GB
    print(f"  - {name} ({size:.1f} GB)")

2. Basic Chat

The ollama.chat() function sends a prompt to a locally running model and returns the complete response. Unlike cloud APIs, there is no network latency or rate limiting – the bottleneck is your hardware’s inference speed. The streaming variant (stream=True) yields tokens as they are generated, providing a responsive user experience similar to ChatGPT. The messages parameter uses the same format as the OpenAI API (role/content pairs), making it easy to migrate code between local and cloud models.

def chat(prompt, model='llama3.2'):
    """Simple chat interface."""
    response = ollama.chat(
        model=model,
        messages=[{
            'role': 'user',
            'content': prompt
        }]
    )
    return response['message']['content']

# Test
answer = chat("Explain what a neural network is in one sentence.")
print(answer)
# With streaming for real-time output
def chat_stream(prompt, model='llama3.2'):
    """Chat with streaming response."""
    stream = ollama.chat(
        model=model,
        messages=[{'role': 'user', 'content': prompt}],
        stream=True
    )
    
    full_response = ""
    for chunk in stream:
        content = chunk['message']['content']
        print(content, end='', flush=True)
        full_response += content
    print()  # Newline
    return full_response

# Test streaming
response = chat_stream("Write a haiku about AI.")

3. Conversation with Context

LLMs are stateless – each API call is independent. To maintain a multi-turn conversation, you must send the full message history with every request. The Conversation class below manages this history automatically, appending each user message and assistant response to a growing list. An optional system prompt sets the model’s persona and behavior guidelines. Keep in mind that longer histories consume more context tokens and slow down generation, so for extended conversations you may need to implement summarization or sliding-window truncation.

# Multi-turn conversation
messages = [
    {'role': 'user', 'content': 'What is the capital of France?'},
    {'role': 'assistant', 'content': 'The capital of France is Paris.'},
    {'role': 'user', 'content': 'What is its population?'}
]

response = ollama.chat(
    model='llama3.2',
    messages=messages
)

print(response['message']['content'])
# Conversation class
class Conversation:
    def __init__(self, model='llama3.2', system_prompt=None):
        self.model = model
        self.messages = []
        
        if system_prompt:
            self.messages.append({
                'role': 'system',
                'content': system_prompt
            })
    
    def send(self, message):
        """Send message and get response."""
        self.messages.append({
            'role': 'user',
            'content': message
        })
        
        response = ollama.chat(
            model=self.model,
            messages=self.messages
        )
        
        assistant_message = response['message']['content']
        self.messages.append({
            'role': 'assistant',
            'content': assistant_message
        })
        
        return assistant_message
    
    def reset(self):
        """Clear conversation history."""
        self.messages = []

# Use it
conv = Conversation(
    model='llama3.2',
    system_prompt="You are a helpful Python programming assistant."
)

print(conv.send("What is a list comprehension?"))
print("\n" + "="*50 + "\n")
print(conv.send("Show me an example."))
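As noted above, an unbounded history eventually slows generation and overflows the context window. One simple remedy is sliding-window truncation. The helper below is a minimal sketch (the name `trim_history` and the `max_turns` parameter are illustrative, not part of the ollama API): it preserves the system prompt and keeps only the most recent messages.

```python
def trim_history(messages, max_turns=6):
    """Keep the system prompt (if any) plus the last `max_turns`
    user/assistant messages, dropping older turns."""
    system = [m for m in messages if m['role'] == 'system']
    rest = [m for m in messages if m['role'] != 'system']
    return system + rest[-max_turns:]

# Example: a history with a system prompt and 5 full exchanges
history = [{'role': 'system', 'content': 'Be helpful.'}]
for i in range(5):
    history.append({'role': 'user', 'content': f'question {i}'})
    history.append({'role': 'assistant', 'content': f'answer {i}'})

trimmed = trim_history(history, max_turns=4)
print(len(trimmed))  # 5: system prompt + last 4 messages
```

To use it with the `Conversation` class, call `trim_history(self.messages)` just before passing the list to `ollama.chat()`. Summarizing the dropped turns into a single message is a more faithful (but more expensive) alternative.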

4. Code Generation

Specialized code models like CodeLlama, DeepSeek-Coder, and Phi-4 are trained on large code corpora and significantly outperform general-purpose models on programming tasks. The generate_code() function below uses a structured prompt that specifies the programming language and requests only code output (no explanations), which produces cleaner results. Running code generation locally is especially valuable for enterprise environments where proprietary code cannot be sent to external APIs.

# Use code-specialized model
def generate_code(task, language='python'):
    """Generate code for a task."""
    prompt = f"""Write {language} code to {task}.
Only output the code, no explanations.

Code:"""
    
    response = ollama.chat(
        model='codellama',  # or 'qwen2.5-coder', 'deepseek-coder'
        messages=[{'role': 'user', 'content': prompt}]
    )
    
    return response['message']['content']

# Test
code = generate_code("calculate fibonacci numbers using dynamic programming")
print(code)

5. Structured Output (JSON)

Extracting structured data from unstructured text is one of the most practical LLM applications. Ollama’s format='json' parameter constrains the model’s output to valid JSON, eliminating parsing errors caused by stray text or markdown formatting. The extract_json() function takes a schema description and input text, then returns a Python dictionary. This approach is the local equivalent of OpenAI’s structured output mode and works well for information extraction, form filling, and data normalization tasks.

# Extract structured data
def extract_json(text, schema_description):
    """Extract structured information as JSON."""
    prompt = f"""Extract the following information from the text and return as JSON:
{schema_description}

Text: {text}

JSON:"""
    
    response = ollama.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': prompt}],
        format='json'  # Force JSON output
    )
    
    return json.loads(response['message']['content'])

# Test
text = """John Smith works at Acme Corp as a Senior Engineer. 
He can be reached at john.smith@acme.com or 555-1234."""

schema = """{
  "name": "full name",
  "company": "company name",
  "title": "job title",
  "email": "email address",
  "phone": "phone number"
}"""

result = extract_json(text, schema)
pprint(result)
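Even with `format='json'`, a model can occasionally emit output that fails to parse, so production code usually wraps the call in a retry loop. Below is a sketch of that defensive pattern; `parse_json_with_retry` is an illustrative helper (not an ollama function) that takes any zero-argument callable returning a string, so it works with any model backend.

```python
import json

def parse_json_with_retry(generate, max_retries=3):
    """Call `generate()` and parse its output as JSON,
    retrying on parse failure up to `max_retries` times."""
    last_error = None
    for _ in range(max_retries):
        raw = generate()
        try:
            return json.loads(raw)
        except json.JSONDecodeError as e:
            last_error = e  # remember the failure and try again
    raise ValueError(f"No valid JSON after {max_retries} attempts: {last_error}")

# Demo with a stub that fails once before succeeding
attempts = iter(['not json', '{"name": "John Smith"}'])
result = parse_json_with_retry(lambda: next(attempts))
print(result['name'])  # John Smith
```

In practice you would pass a lambda wrapping the real call, e.g. `parse_json_with_retry(lambda: ollama.chat(...)['message']['content'])`.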

6. Embeddings for RAG

Ollama can also run embedding models locally, converting text into dense vector representations for similarity search. The nomic-embed-text model is a compact embedding model optimized for retrieval tasks. By generating embeddings locally, you can build a fully private RAG pipeline where neither the documents nor the queries ever leave your machine. The cosine similarity search below demonstrates the core retrieval step: encode the query and all documents, then find the document whose embedding is closest to the query embedding in vector space.

# Generate embeddings locally
def get_embedding(text):
    """Get text embedding."""
    response = ollama.embeddings(
        model='nomic-embed-text',  # Specialized embedding model
        prompt=text
    )
    return response['embedding']

# Test
embedding = get_embedding("Machine learning is awesome!")
print(f"Embedding dimension: {len(embedding)}")
print(f"First 5 values: {embedding[:5]}")
# Simple similarity search
import numpy as np

def cosine_similarity(a, b):
    """Compute cosine similarity."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Documents
docs = [
    "Python is a programming language.",
    "Machine learning uses neural networks.",
    "The weather is sunny today.",
    "Deep learning is a subset of AI."
]

# Get embeddings
doc_embeddings = [get_embedding(doc) for doc in docs]

# Query
query = "Tell me about artificial intelligence"
query_embedding = get_embedding(query)

# Find most similar
similarities = [cosine_similarity(query_embedding, doc_emb) for doc_emb in doc_embeddings]
best_idx = np.argmax(similarities)

print(f"Query: {query}")
print(f"\nMost relevant document:")
print(f"  {docs[best_idx]}")
print(f"  Similarity: {similarities[best_idx]:.4f}")
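A real RAG pipeline usually retrieves several candidate passages rather than a single best match. The loop above can be vectorized into a top-k search with one matrix product; `top_k` is an illustrative helper, shown here on hand-made 3-d vectors so the math is easy to check.

```python
import numpy as np

def top_k(query_emb, doc_embs, k=2):
    """Return (index, cosine similarity) pairs for the k most similar documents."""
    q = np.asarray(query_emb)
    d = np.asarray(doc_embs)
    # Cosine similarity of the query against every document row at once
    sims = d @ q / (np.linalg.norm(d, axis=1) * np.linalg.norm(q))
    order = np.argsort(sims)[::-1][:k]  # indices sorted by descending similarity
    return [(int(i), float(sims[i])) for i in order]

# Toy 3-d "embeddings": docs 0 and 1 point roughly the same way as the query
docs_emb = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
print(top_k([1.0, 0.0, 0.0], docs_emb, k=2))  # doc 0 first, then doc 1
```

In the pipeline above, you would replace the toy vectors with `get_embedding()` outputs and feed the top-k documents into the prompt as context.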

7. Model Parameters

Generation behavior can be fine-tuned through sampling parameters passed via the options dictionary. Temperature controls randomness: 0.0 produces deterministic, greedy output while values above 1.0 increase creativity at the risk of incoherence. top_p (nucleus sampling) and top_k further control the token selection distribution. repeat_penalty discourages the model from repeating phrases, which is a common issue in local models. These parameters are passed directly to the model’s inference engine and take effect immediately – no reloading required.

# Custom parameters
response = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Write a creative story opening.'}],
    options={
        'temperature': 0.8,      # Higher = more creative (0-2)
        'top_p': 0.9,           # Nucleus sampling
        'top_k': 40,            # Top-k sampling
        'num_predict': 100,     # Max tokens to generate
        'repeat_penalty': 1.1,  # Penalize repetition
    }
)

print(response['message']['content'])

8. CLI Usage (from terminal)

Ollama also has a powerful CLI:

# Interactive chat
ollama run llama3.2

# One-off prompt
ollama run llama3.2 "Explain quantum computing"

# With parameters
# Set parameters inside an interactive session
# (ollama run has no --temperature flag; use the /set command instead)
ollama run llama3.2
>>> /set parameter temperature 0.8

# List models
ollama list

# Pull model
ollama pull mistral

# Delete model
ollama rm mistral

# Show model info
ollama show llama3.2

Tips & Best Practices

Choosing a Model (December 2025)

For speed (< 4GB RAM):

  • llama3.2:1b - Fastest

  • llama3.2:3b - Better quality, still fast

For quality (8-16GB RAM):

  • qwen2.5:14b - Best multilingual

  • phi4 - Excellent reasoning & coding

  • deepseek-r1:14b - Best for math & reasoning

  • mistral - Solid general purpose

For best quality (32GB+ RAM):

  • llama3.3:70b - Meta’s best

  • qwen2.5:72b - Best multilingual model

  • deepseek-r1:70b - Best reasoning model

  • gemma2:27b - Google’s flagship

For coding:

  • qwen2.5-coder:32b - Best code model (Dec 2025)

  • qwen2.5-coder:7b - Fast, good quality

  • phi4 - Excellent for coding + reasoning

  • deepseek-coder - Alternative
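The recommendations above can be encoded as a small lookup helper. This is just a sketch that mirrors the lists in this section; the function name, task labels, and RAM thresholds are illustrative, and the right choice also depends on quantization and what else is running.

```python
def suggest_model(ram_gb, task='general'):
    """Suggest an Ollama model tag from available RAM (GB) and task,
    following the December 2025 recommendations above."""
    if task == 'coding':
        if ram_gb >= 24:
            return 'qwen2.5-coder:32b'   # best code model
        if ram_gb >= 16:
            return 'phi4'                # coding + reasoning
        return 'qwen2.5-coder:7b'        # fast, good quality
    # General-purpose picks by RAM tier
    if ram_gb >= 48:
        return 'llama3.3:70b'
    if ram_gb >= 16:
        return 'qwen2.5:14b'
    if ram_gb >= 4:
        return 'llama3.2:3b'
    return 'llama3.2:1b'

print(suggest_model(16, task='coding'))  # phi4
print(suggest_model(8))                  # llama3.2:3b
```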

Performance

# Reduce memory usage
options = {
    'num_ctx': 2048,  # Smaller context window (default 4096)
}

# Faster generation (less quality)
options = {
    'temperature': 0.0,  # Greedy decoding
    'top_k': 1,
    'num_predict': 50,   # Limit length
}

# Better quality (slower)
options = {
    'temperature': 0.7,
    'top_p': 0.95,
    'repeat_penalty': 1.2,
}

Common Issues

Slow first response?

  • Model loads into memory on first use

  • Keep Ollama running: ollama serve

Out of memory?

  • Use smaller model (1B, 3B, or 7B)

  • Reduce num_ctx

  • Close other applications

  • Try quantized versions (Q4 or Q5)

Poor quality?

  • Try larger model (14B, 32B, or 70B)

  • Use newer models (qwen2.5, phi4, deepseek-r1)

  • Improve prompts

  • Adjust temperature

Exercise: Build a Local Chatbot

Combine the concepts above into a complete local chatbot with a configurable personality and conversation memory. The LocalChatbot class wraps Ollama’s chat API with a system prompt that defines the bot’s name and behavior, and maintains conversation history across turns. Try creating bots with different personas (code reviewer, writing coach, domain expert) and compare how the system prompt affects response style and content.

class LocalChatbot:
    def __init__(self, name="Assistant", personality="helpful and friendly", model='qwen2.5:7b'):
        self.name = name
        self.model = model
        self.messages = [{
            'role': 'system',
            'content': f"You are {name}, a {personality} assistant."
        }]
    
    def chat(self, user_input):
        """Send message and get response."""
        self.messages.append({
            'role': 'user',
            'content': user_input
        })
        
        response = ollama.chat(
            model=self.model,
            messages=self.messages
        )
        
        reply = response['message']['content']
        self.messages.append({
            'role': 'assistant',
            'content': reply
        })
        
        return reply

# Create bot with latest model
bot = LocalChatbot(
    name="CodeHelper",
    personality="expert Python programmer who explains concepts clearly",
    model="phi4"  # Best for coding (Dec 2025)
)

# Test conversation
print(bot.chat("What is a decorator in Python?"))
print("\n" + "="*50 + "\n")
print(bot.chat("Can you show me an example?"))

Key Takeaways

  1. Privacy: All processing happens locally

  2. Cost: Zero API fees

  3. Easy: Download and run in minutes

  4. Fast: Low latency for real-time apps

  5. Offline: Works without internet

  6. Latest Models: Access to Llama 3.3, Qwen 2.5, Phi-4, DeepSeek-R1 (Dec 2025)

Limitations

  • Requires good hardware (8GB+ RAM recommended, 16GB+ for best models)

  • Smaller models are less capable than GPT-4 or Claude 3.5

  • Slower than cloud APIs on weak hardware

  • Limited to model’s knowledge cutoff

Next Steps

  • 02_model_comparison.ipynb - Compare different models

  • 03_quantization.ipynb - Understand model formats

  • 04_custom_models.ipynb - Create custom Modelfiles

  • 05_rag_local.ipynb - Build RAG with local embeddings

  • 07_production_api.ipynb - Deploy as API server