Notebook 08: Working with Reasoning Models
o3, DeepSeek R1, and Claude Extended Thinking
What You'll Learn
- What are reasoning models? - Fast vs. slow thinking, test-time compute scaling
- OpenAI o-series - o1, o3, o3-mini, o4-mini with `reasoning_effort`
- DeepSeek R1 - Open-source reasoning via API and locally with Ollama
- Claude Extended Thinking - Budget tokens and thinking blocks
- When to use reasoning models - Cost-benefit analysis
- Practical comparison - Same problem across models
- Prompt engineering for reasoning models
- Benchmark comparison table - AIME, MATH-500, HumanEval, SWE-bench
Prerequisites: OpenAI API key, Anthropic API key (optional), Ollama installed (optional)
# Install required packages
!pip install openai anthropic python-dotenv ollama -q
import os
import time
import json
import re
from dotenv import load_dotenv
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
print("OpenAI API key:", "SET" if OPENAI_API_KEY else "NOT SET - some cells will be skipped")
print("Anthropic API key:", "SET" if ANTHROPIC_API_KEY else "NOT SET - some cells will be skipped")
Part 1: What Are Reasoning Models?
System 1 vs System 2 Thinking
Psychologist Daniel Kahneman's framework maps neatly onto how LLMs behave:
| | System 1 (Fast Thinking) | System 2 (Slow Thinking) |
|---|---|---|
| Speed | Instant | Deliberate |
| Effort | Automatic | Effortful |
| LLM Example | GPT-4o, Claude 3.5 Sonnet | o3, DeepSeek R1, Claude Extended Thinking |
| Good For | Chat, summarization, classification | Math, code, logic, planning |
| Failure Mode | Wrong on hard problems | Slow and expensive |
Test-Time Compute Scaling
Traditional scaling: train bigger models with more data and parameters.
Test-time compute scaling: spend more compute at inference time to improve accuracy.
Standard Model: [User Prompt] --> [Single Forward Pass] --> [Answer]
Reasoning Model: [User Prompt] --> [Think: step 1... step 2... step 3...] --> [Answer]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Thinking tokens (billed but not always shown)
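A crude form of test-time compute you can implement yourself is self-consistency: sample several answers to the same prompt and take a majority vote. Reasoning models internalize a far richer version of this idea, but the sketch below (illustrative only; the sampled answers are made up) shows the basic principle.
# Self-consistency: the simplest way to trade extra inference compute for accuracy
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most common answer among several sampled completions."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical answers from 5 samples of the same math prompt
samples = ["55", "55", "54", "55", "55"]
print(majority_vote(samples))  # -> "55"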
Chain-of-Thought in Latent Space
Classic chain-of-thought (CoT) prompting asks models to "think step by step" in the output. Reasoning models do this internally via special "thinking tokens" before producing the final answer.
- OpenAI o-series: Thinking tokens are hidden (you pay for them but cannot read them)
- DeepSeek R1: Thinking is exposed inside `<think>...</think>` XML tags
- Claude Extended Thinking: Thinking is returned as separate `thinking` content blocks
Why They Outperform Standard Models
Self-verification: The model checks its own work before committing to an answer
Backtracking: If a reasoning path fails, the model tries a different approach
Deeper decomposition: Complex problems are broken into sub-problems
Reduced hallucination: Structured reasoning catches logical errors
# Demonstration: Standard model vs reasoning model on a hard problem
# This cell illustrates the concept even without an API key
HARD_MATH_PROBLEM = """
A farmer has 17 sheep. All but 9 die. How many sheep are left?
"""
# Classic "trick" problem - standard models often answer 8 (wrong)
# Reasoning models parse "all but 9" correctly = 9 remain
print("Problem:", HARD_MATH_PROBLEM.strip())
print()
print("Common wrong answer: 8 (model calculates 17 - 9 = 8)")
print("Correct answer: 9 ('all but 9' means 9 survive)")
print()
print("Reasoning model approach:")
print(" <think>")
print(" The phrase 'all but 9 die' means 9 sheep do NOT die.")
print(" Therefore 9 sheep are left alive.")
print(" The total count of 17 is a distractor.")
print(" </think>")
print(" Answer: 9")
Part 2: OpenAI Reasoning Models (o1, o3, o3-mini, o4-mini)
Model Lineup
| Model | Release | Strengths | Cost (input / output per M tokens) |
|---|---|---|---|
| o1 | Sep 2024 | Best accuracy, frontier | $15 / $60 |
| o1-mini | Sep 2024 | Faster, cheaper | $3 / $12 |
| o3 | Apr 2025 | Successor to o1, top AIME | $10 / $40 |
| o3-mini | Jan 2025 | Best value reasoning | $1.10 / $4.40 |
| o4-mini | Apr 2025 | Latest small model, vision | $1.10 / $4.40 |
The `reasoning_effort` Parameter
Unlike standard models, o-series models expose a `reasoning_effort` knob:
- `low` - Fastest, least thinking. Good for simple reasoning tasks.
- `medium` - Balanced. The default for most use cases.
- `high` - Maximum reasoning. Best accuracy on hard problems, most expensive.
Key Limitations
- No streaming of thinking tokens (only the final answer can be streamed)
- System prompt restrictions on some older o-series models (use the `developer` role instead)
- Reasoning tokens are billed in addition to input/output tokens
- No `temperature` support (the model manages its own exploration) - see the demonstration below
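A quick demonstration of the last limitation. This is a sketch: it assumes `OPENAI_API_KEY` from the setup cell, and the exact error message may vary by API version.
# o-series models reject sampling parameters such as temperature
from openai import OpenAI

if OPENAI_API_KEY:
    try:
        OpenAI(api_key=OPENAI_API_KEY).chat.completions.create(
            model="o3-mini",
            temperature=0.7,  # not supported on reasoning models
            messages=[{"role": "user", "content": "ping"}],
        )
    except Exception as e:
        print("Expected error:", e)
else:
    print("[SKIP] OPENAI_API_KEY not set.")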
from openai import OpenAI
def call_o3_mini(problem: str, effort: str = "medium") -> dict:
"""
Call o3-mini with a specified reasoning_effort level.
Returns response text, token usage, and latency.
"""
if not OPENAI_API_KEY:
print("[SKIP] OPENAI_API_KEY not set.")
return {"text": None, "usage": None, "latency": None}
client = OpenAI(api_key=OPENAI_API_KEY)
start = time.time()
response = client.chat.completions.create(
model="o3-mini",
reasoning_effort=effort, # "low" | "medium" | "high"
messages=[
{"role": "user", "content": problem}
]
)
latency = round(time.time() - start, 2)
message = response.choices[0].message.content
usage = response.usage
return {
"text": message,
"latency": latency,
"input_tokens": usage.prompt_tokens,
"output_tokens": usage.completion_tokens,
"reasoning_tokens": getattr(usage.completion_tokens_details, "reasoning_tokens", "N/A"),
}
# --- Example problem ---
LOGIC_PROBLEM = """
There are 5 houses in a row. Each house is painted a different color and
inhabited by a person of a different nationality, who drinks a different
beverage, smokes a different brand of cigars, and keeps a different pet.
Clues:
1. The Brit lives in the red house.
2. The Swede keeps dogs as pets.
3. The Dane drinks tea.
4. The green house is on the left of the white house.
5. The green house owner drinks coffee.
6. The person who smokes Pall Mall rears birds.
7. The owner of the yellow house smokes Dunhill.
8. The man living in the center house drinks milk.
9. The Norwegian lives in the first house.
10. The man who smokes Blends lives next to the one who keeps cats.
11. The man who keeps horses lives next to the man who smokes Dunhill.
12. The owner who smokes BlueMaster drinks beer.
13. The German smokes Prince.
14. The Norwegian lives next to the blue house.
15. The man who smokes Blends has a neighbor who drinks water.
Who owns the fish?
"""
print("Einstein's Riddle (Zebra Puzzle) loaded.")
print("This is a classic logic puzzle that requires systematic reasoning.")
print("We will test it with different reasoning_effort levels.")
# Test with low reasoning effort
print("=" * 60)
print("o3-mini with reasoning_effort='low'")
print("=" * 60)
result_low = call_o3_mini(LOGIC_PROBLEM, effort="low")
if result_low["text"]:
print(f"Answer: {result_low['text'][:500]}...")
print(f"\nLatency: {result_low['latency']}s")
print(f"Input tokens: {result_low['input_tokens']}")
print(f"Output tokens: {result_low['output_tokens']}")
print(f"Reasoning tokens: {result_low['reasoning_tokens']}")
# Test with high reasoning effort
print("=" * 60)
print("o3-mini with reasoning_effort='high'")
print("=" * 60)
result_high = call_o3_mini(LOGIC_PROBLEM, effort="high")
if result_high["text"]:
print(f"Answer: {result_high['text'][:500]}...")
print(f"\nLatency: {result_high['latency']}s")
print(f"Input tokens: {result_high['input_tokens']}")
print(f"Output tokens: {result_high['output_tokens']}")
print(f"Reasoning tokens: {result_high['reasoning_tokens']}")
# Compare
if result_low["latency"]:
print(f"\n--- Effort Comparison ---")
print(f"low reasoning_tokens: {result_low['reasoning_tokens']} | latency: {result_low['latency']}s")
print(f"high reasoning_tokens: {result_high['reasoning_tokens']} | latency: {result_high['latency']}s")
# Cost calculator for reasoning tokens
# o3-mini pricing (as of 2025): $1.10 per M input, $4.40 per M output/reasoning
def estimate_cost_o3_mini(input_tokens: int, output_tokens: int, reasoning_tokens: int) -> float:
"""
Estimate cost for o3-mini.
Pricing: $1.10/M input, $4.40/M output (reasoning tokens billed as output).
"""
input_cost = (input_tokens / 1_000_000) * 1.10
output_cost = ((output_tokens + reasoning_tokens) / 1_000_000) * 4.40
return input_cost + output_cost
# Simulate a batch of 1000 requests
print("Cost simulation: 1000 identical o3-mini requests")
print("=" * 50)
scenarios = [
("low", 300, 100, 200),
("medium", 300, 150, 800),
("high", 300, 200, 3000),
]
for effort, inp, out, reason in scenarios:
cost_per_req = estimate_cost_o3_mini(inp, out, reason)
cost_1000 = cost_per_req * 1000
print(f"reasoning_effort='{effort}':")
print(f" Estimated reasoning tokens per req : {reason}")
print(f" Cost per request : ${cost_per_req:.5f}")
print(f" Cost for 1000 requests : ${cost_1000:.2f}")
print()
Best Use Cases for OpenAI o-series

| Use Case | Reasoning Effort | Why |
|---|---|---|
| Math olympiad / AIME | high | Requires deep multi-step proofs |
| LeetCode hard / competitive programming | high | Subtle edge cases |
| Complex debugging (multi-file) | medium-high | Root cause analysis |
| SQL query optimization | medium | Logical planning |
| Scientific hypothesis checking | medium | Literature reasoning |
| Simple arithmetic / classification | Skip o-series | Standard GPT-4o is cheaper |
| Creative writing | Skip o-series | Reasoning doesn't help creativity |
Part 3: DeepSeek R1 - Open-Source Reasoning
Why DeepSeek R1 Matters
DeepSeek R1 was released in January 2025 and shocked the AI community:
- Matches o1 on benchmarks at a fraction of the training cost
- Fully open weights (MIT license) - you can run it locally
- Thinking is visible - `<think>` tags expose the full reasoning trace
- GRPO training - uses Group Relative Policy Optimization instead of PPO
Model Sizes (Distilled)
R1's reasoning was distilled into smaller Qwen/Llama base models:

| Model | Parameters | VRAM Required | Use Case |
|---|---|---|---|
| deepseek-r1:1.5b | 1.5B | ~2 GB | Embedded, mobile |
| deepseek-r1:7b | 7B | ~6 GB | Laptop GPU |
| deepseek-r1:8b | 8B | ~8 GB | Laptop GPU |
| deepseek-r1:14b | 14B | ~12 GB | Desktop GPU |
| deepseek-r1:32b | 32B | ~24 GB | High-end GPU |
| deepseek-r1:70b | 70B | ~48 GB | Multi-GPU |
| deepseek-r1 (full) | 671B | ~400 GB | Cluster |
GRPO Training (Brief Overview)
Standard RLHF uses PPO (Proximal Policy Optimization) which requires a separate critic model. DeepSeek R1 used GRPO (Group Relative Policy Optimization):
PPO: Model --> Critic (separate) --> Reward --> Update
GRPO: Model --> Sample group of answers --> Rank them --> Update using relative rewards
GRPO eliminates the critic model, cutting memory and compute requirements by ~50%. Full R1 training details are covered in Phase 12 (RLHF).
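To make the idea concrete, here is a toy sketch of the group-relative advantage computation. This is illustrative only, not DeepSeek's training code; the binary rewards stand in for a verifier's scores.
# Toy GRPO-style advantages: normalize rewards within a sampled group
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Score each sampled answer against the group mean/std,
    replacing the learned critic that PPO would require."""
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [round((r - mean_r) / std_r, 3) for r in rewards]

# Example: 4 sampled answers to one math prompt, scored correct/incorrect
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # -> [1.0, -1.0, 1.0, -1.0]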
# --- Option A: DeepSeek R1 via OpenRouter (cloud API) ---
# OpenRouter provides unified access to many models including DeepSeek R1.
# API is OpenAI-compatible - just change the base_url and model name.
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
def call_deepseek_r1_openrouter(problem: str) -> dict:
"""
Call DeepSeek R1 via OpenRouter.
OpenRouter API is OpenAI-compatible.
Sign up at https://openrouter.ai to get a free API key.
"""
if not OPENROUTER_API_KEY:
print("[SKIP] OPENROUTER_API_KEY not set. Get one free at https://openrouter.ai")
return {"text": None, "think": None, "latency": None}
client = OpenAI(
api_key=OPENROUTER_API_KEY,
base_url="https://openrouter.ai/api/v1"
)
start = time.time()
response = client.chat.completions.create(
model="deepseek/deepseek-r1",
messages=[{"role": "user", "content": problem}]
)
latency = round(time.time() - start, 2)
full_text = response.choices[0].message.content
think_text, answer_text = parse_think_tags(full_text)
return {
"text": answer_text,
"think": think_text,
"latency": latency
}
def parse_think_tags(text: str) -> tuple[str, str]:
"""
Parse <think>...</think> tags from DeepSeek R1 output.
Returns (thinking_content, final_answer).
"""
if text is None:
return "", ""
think_match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
think_content = think_match.group(1).strip() if think_match else ""
# Answer is everything after the closing </think> tag
answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
return think_content, answer
# Test the parser on a synthetic example
sample_r1_output = """
<think>
Let me think about this step by step.
The problem asks for the sum of the first 10 natural numbers.
Using the formula n*(n+1)/2 where n=10:
10 * 11 / 2 = 55
</think>
The sum of the first 10 natural numbers is **55**.
"""
think, answer = parse_think_tags(sample_r1_output)
print("Parsed <think> content:")
print(think)
print("\nParsed final answer:")
print(answer)
# Call DeepSeek R1 via OpenRouter on our logic problem
MATH_PROBLEM_SIMPLE = """
A train leaves City A at 9:00 AM traveling at 60 mph toward City B.
Another train leaves City B at 10:00 AM traveling at 80 mph toward City A.
The cities are 280 miles apart. At what time will the trains meet?
"""
print("Problem:", MATH_PROBLEM_SIMPLE.strip())
print()
result_r1_cloud = call_deepseek_r1_openrouter(MATH_PROBLEM_SIMPLE)
if result_r1_cloud["text"]:
print("=" * 60)
print("DeepSeek R1 Thinking Process:")
print("=" * 60)
print(result_r1_cloud["think"][:1000] if result_r1_cloud["think"] else "(thinking hidden)")
print()
print("=" * 60)
print("Final Answer:")
print("=" * 60)
print(result_r1_cloud["text"])
print(f"\nLatency: {result_r1_cloud['latency']}s")
# --- Option B: DeepSeek R1 via Ollama (local, fully private) ---
# Requirements:
# 1. Install Ollama: https://ollama.ai
# 2. Pull the model: ollama pull deepseek-r1:7b
# 3. Ollama must be running (it starts automatically on install)
try:
import ollama
OLLAMA_AVAILABLE = True
except ImportError:
OLLAMA_AVAILABLE = False
print("ollama package not installed. Run: pip install ollama")
def call_deepseek_r1_ollama(problem: str, model: str = "deepseek-r1:7b") -> dict:
"""
Call DeepSeek R1 locally via Ollama.
First time setup:
brew install ollama # macOS
ollama pull deepseek-r1:7b # ~5 GB download
Available sizes: 1.5b, 7b, 8b, 14b, 32b, 70b
"""
if not OLLAMA_AVAILABLE:
print("[SKIP] ollama not installed.")
return {"text": None, "think": None, "latency": None}
# Check if Ollama daemon is running
    try:
        models = ollama.list()
        # Support both dict-style and object-style responses across ollama versions
        model_list = models.get("models", []) if isinstance(models, dict) else getattr(models, "models", [])
        available_models = [m["name"] if isinstance(m, dict) else getattr(m, "model", "") for m in model_list]
except Exception:
print("[SKIP] Ollama daemon not running. Start with: ollama serve")
return {"text": None, "think": None, "latency": None}
if not any(model.split(":")[0] in m for m in available_models):
print(f"[SKIP] Model '{model}' not found locally.")
print(f" Pull it with: ollama pull {model}")
print(f" Available models: {available_models}")
return {"text": None, "think": None, "latency": None}
start = time.time()
response = ollama.chat(
model=model,
messages=[{"role": "user", "content": problem}]
)
latency = round(time.time() - start, 2)
full_text = response["message"]["content"]
think_content, answer = parse_think_tags(full_text)
return {
"text": answer,
"think": think_content,
"full_response": full_text,
"latency": latency
}
# Run locally
result_r1_local = call_deepseek_r1_ollama(MATH_PROBLEM_SIMPLE, model="deepseek-r1:7b")
if result_r1_local["text"]:
print("DeepSeek R1 7B (local via Ollama)")
print("=" * 60)
print("Thinking:")
print(result_r1_local["think"][:800])
print("\nAnswer:")
print(result_r1_local["text"])
print(f"\nLatency: {result_r1_local['latency']}s (local inference)")
# Advanced: Streaming DeepSeek R1 from Ollama
# Streaming lets you see the <think> tokens as they arrive
def stream_deepseek_r1_ollama(problem: str, model: str = "deepseek-r1:7b"):
"""
Stream DeepSeek R1 output from Ollama.
Prints thinking tokens in real-time.
"""
if not OLLAMA_AVAILABLE:
print("[SKIP] ollama not installed.")
return
try:
stream = ollama.chat(
model=model,
messages=[{"role": "user", "content": problem}],
stream=True
)
print("Streaming response (including <think> tokens):")
print("-" * 40)
        in_think = False
        buffer = ""
        for chunk in stream:
            token = chunk["message"]["content"]
            buffer += token
            # Print a marker when crossing into or out of the <think> block
            if not in_think and "<think>" in buffer:
                in_think = True
                print("\n[THINK]", flush=True)
            elif in_think and "</think>" in buffer:
                in_think = False
                print("\n[ANS]", flush=True)
            print(token, end="", flush=True)
print("\n" + "-" * 40)
except Exception as e:
print(f"[SKIP] Streaming failed: {e}")
# Uncomment to stream (requires Ollama + model)
# stream_deepseek_r1_ollama("What is 15 factorial?")
print("Streaming function defined. Uncomment the last line to test with a running Ollama instance.")
Part 4: Anthropic Claude Extended Thinking (Claude Opus 4.6)
How Claude Extended Thinking Works
Claude's extended thinking is activated by passing a thinking configuration block:
thinking={"type": "enabled", "budget_tokens": N}
The response then contains a list of content blocks, some of type "thinking" and others of type "text".
Budget Tokens
- Minimum: 1,024 tokens
- Maximum: 100,000 tokens (`max_tokens` must exceed `budget_tokens` to leave room for the final answer)
- Recommended starting point: 5,000-10,000 for most hard problems
- When to increase: if the model says "I need more space to think" or gives incorrect answers

Thinking tokens are billed at the same rate as output tokens.
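Because thinking tokens are billed as output, cost scales directly with the budget. Here is a rough estimator (a sketch assuming the Claude Opus pricing used later in this notebook: $15/M input, $75/M output):
# Rough cost estimate for an extended-thinking call
def estimate_cost_claude_thinking(input_tokens: int, output_tokens: int, thinking_tokens: int) -> float:
    """Thinking tokens are billed at the output rate."""
    input_cost = (input_tokens / 1_000_000) * 15.00
    output_cost = ((output_tokens + thinking_tokens) / 1_000_000) * 75.00
    return input_cost + output_cost

# 500 input + 800 visible output + 8,000 thinking tokens
print(f"${estimate_cost_claude_thinking(500, 800, 8000):.4f}")  # -> $0.6675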
Key Differences from o-series

| Feature | OpenAI o-series | Claude Extended Thinking |
|---|---|---|
| Thinking visible? | No (hidden) | Yes (thinking blocks) |
| Control granularity | `reasoning_effort` (3 discrete levels) | `budget_tokens` (exact token count) |
| Streaming | Final answer only | Thinking + answer streamable |
| System prompt | Restricted on older models | Normal support |
import anthropic
def call_claude_extended_thinking(
problem: str,
budget_tokens: int = 10000,
model: str = "claude-opus-4-6"
) -> dict:
"""
Call Claude with extended thinking enabled.
Returns thinking blocks and final answer separately.
Args:
problem: The problem to solve
budget_tokens: Max tokens Claude can use for thinking (min 1024)
model: Claude model - must support extended thinking (claude-opus-4-6)
"""
if not ANTHROPIC_API_KEY:
print("[SKIP] ANTHROPIC_API_KEY not set.")
return {"thinking": None, "answer": None, "latency": None}
client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)
# max_tokens must be > budget_tokens to leave room for the final answer
max_tokens = budget_tokens + 4000
start = time.time()
response = client.messages.create(
model=model,
max_tokens=max_tokens,
thinking={
"type": "enabled",
"budget_tokens": budget_tokens
},
messages=[
{"role": "user", "content": problem}
]
)
latency = round(time.time() - start, 2)
# Separate thinking blocks from text blocks
thinking_blocks = []
text_blocks = []
for block in response.content:
if block.type == "thinking":
thinking_blocks.append(block.thinking)
elif block.type == "text":
text_blocks.append(block.text)
return {
"thinking": "\n\n".join(thinking_blocks),
"answer": "\n\n".join(text_blocks),
"latency": latency,
"input_tokens": response.usage.input_tokens,
"output_tokens": response.usage.output_tokens,
# cache_read_input_tokens available in newer SDK versions
"stop_reason": response.stop_reason
}
print("Claude extended thinking function defined.")
# Test Claude extended thinking on a hard math problem
HARD_MATH = """
Find all integer solutions (x, y) to the equation:
x^2 - y^2 = 2024
Show all solutions and prove there are no others.
"""
print("Problem:", HARD_MATH.strip())
print()
result_claude = call_claude_extended_thinking(HARD_MATH, budget_tokens=8000)
if result_claude["thinking"]:
print("=" * 60)
print("Claude's Thinking (first 800 chars):")
print("=" * 60)
print(result_claude["thinking"][:800])
print()
print("=" * 60)
print("Final Answer:")
print("=" * 60)
print(result_claude["answer"])
print(f"\nLatency: {result_claude['latency']}s")
print(f"Input tokens: {result_claude['input_tokens']}")
print(f"Output tokens: {result_claude['output_tokens']} (includes thinking)")
# Streaming Claude Extended Thinking
# With streaming, thinking blocks arrive as they are generated
def stream_claude_extended_thinking(problem: str, budget_tokens: int = 5000):
"""
Stream Claude extended thinking response.
Thinking blocks stream as they are generated.
"""
if not ANTHROPIC_API_KEY:
print("[SKIP] ANTHROPIC_API_KEY not set.")
return
client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)
print("Streaming extended thinking response...")
print("-" * 40)
with client.messages.stream(
model="claude-opus-4-6",
max_tokens=budget_tokens + 4000,
thinking={
"type": "enabled",
"budget_tokens": budget_tokens
},
messages=[{"role": "user", "content": problem}]
) as stream:
        current_block_type = None
        for event in stream:
            # The SDK exposes a snake_case type field on each stream event
            if event.type == "content_block_start":
                block = event.content_block
                current_block_type = block.type
                if block.type == "thinking":
                    print("\n[THINKING BLOCK STARTED]")
                elif block.type == "text":
                    print("\n[ANSWER BLOCK STARTED]")
            elif event.type == "content_block_delta":
                delta = event.delta
                if hasattr(delta, "thinking"):
                    print(delta.thinking, end="", flush=True)
                elif hasattr(delta, "text"):
                    print(delta.text, end="", flush=True)
            elif event.type == "content_block_stop":
                if current_block_type == "thinking":
                    print("\n[THINKING BLOCK ENDED]")
                elif current_block_type == "text":
                    print("\n[ANSWER BLOCK ENDED]")
print("-" * 40)
# Test streaming (requires Anthropic API key)
SIMPLE_REASONING = "What is the 20th Fibonacci number? Show your work."
stream_claude_extended_thinking(SIMPLE_REASONING, budget_tokens=3000)
Part 5: When to Use Reasoning Models
Decision Framework
Is the problem well-defined with a verifiable answer?
NO --> Standard model (GPT-4o, Claude 3.5 Sonnet)
YES -->
Does it require multiple logical steps?
NO --> Standard model
YES -->
Is latency critical (< 2s)?
YES --> Standard model + chain-of-thought prompt
NO -->
Is cost critical?
YES --> DeepSeek R1 7B (local) or o3-mini low
NO --> o3-mini high or Claude extended thinking
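The same tree, encoded as a toy helper. This is a heuristic sketch: the branches and model names simply mirror the framework above, not any official guidance.
# Encode the decision framework as a function
def choose_model(verifiable: bool, multi_step: bool,
                 latency_critical: bool, cost_critical: bool) -> str:
    """Walk the decision tree above and return a recommendation."""
    if not verifiable or not multi_step:
        return "Standard model (GPT-4o, Claude 3.5 Sonnet)"
    if latency_critical:
        return "Standard model + chain-of-thought prompt"
    if cost_critical:
        return "DeepSeek R1 7B (local) or o3-mini with reasoning_effort='low'"
    return "o3-mini with reasoning_effort='high' or Claude extended thinking"

print(choose_model(verifiable=True, multi_step=True,
                   latency_critical=False, cost_critical=True))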
Problem Taxonomy

| Problem Type | Standard | Reasoning | Notes |
|---|---|---|---|
| Math olympiad (AIME) | Poor | Excellent | Deep multi-step proofs |
| LeetCode hard | Poor | Excellent | Algorithm design |
| Simple arithmetic | Good | Overkill | Waste of money |
| Multi-step debugging | Fair | Good | Root cause analysis |
| Simple bug fix | Good | Overkill | |
| Complex planning | Fair | Good | Constraint satisfaction |
| Simple Q&A | Excellent | Overkill | |
| Code generation (hard) | Fair | Good | Edge cases |
| Creative writing | Excellent | Worse | Reasoning hurts creativity |
| Summarization | Excellent | Overkill | |
# Cost-benefit calculator for reasoning models
PRICING = {
"gpt-4o": {
"input_per_M": 5.00,
"output_per_M": 15.00,
"reasoning_per_M": 0,
},
"o3-mini-low": {
"input_per_M": 1.10,
"output_per_M": 4.40,
"reasoning_per_M": 4.40,
},
"o3-mini-high": {
"input_per_M": 1.10,
"output_per_M": 4.40,
"reasoning_per_M": 4.40,
},
"claude-opus-4-6": {
"input_per_M": 15.00,
"output_per_M": 75.00,
"reasoning_per_M": 75.00, # thinking tokens billed as output
},
"deepseek-r1-api": {
"input_per_M": 0.55,
"output_per_M": 2.19,
"reasoning_per_M": 0, # thinking visible but not billed separately
},
"deepseek-r1-local": {
"input_per_M": 0,
"output_per_M": 0,
"reasoning_per_M": 0, # electricity cost only
},
}
# Typical token estimates per task type
TASK_PROFILES = {
"math_olympiad": {
"description": "AIME-level competition problem",
"input": 500,
"output": 800,
"gpt4o_accuracy": "20%",
"o3_reasoning_tokens": {"o3-mini-low": 3000, "o3-mini-high": 15000},
"claude_thinking": 12000,
},
"leetcode_hard": {
"description": "LeetCode hard coding challenge",
"input": 400,
"output": 600,
"gpt4o_accuracy": "45%",
"o3_reasoning_tokens": {"o3-mini-low": 1500, "o3-mini-high": 8000},
"claude_thinking": 6000,
},
"simple_qa": {
"description": "Simple factual question",
"input": 100,
"output": 100,
"gpt4o_accuracy": "98%",
"o3_reasoning_tokens": {"o3-mini-low": 200, "o3-mini-high": 500},
"claude_thinking": 1024,
},
}
def calculate_cost(model: str, input_tokens: int, output_tokens: int, reasoning_tokens: int = 0) -> float:
p = PRICING[model]
return (
(input_tokens / 1_000_000) * p["input_per_M"]
+ (output_tokens / 1_000_000) * p["output_per_M"]
+ (reasoning_tokens / 1_000_000) * p["reasoning_per_M"]
)
print("Cost-Benefit Analysis: Reasoning vs Standard Models")
print("=" * 65)
for task_name, task in TASK_PROFILES.items():
print(f"\nTask: {task['description']}")
print(f" GPT-4o accuracy (baseline): {task['gpt4o_accuracy']}")
print(f" {'Model':<22} {'$/request':>10} {'$1000 reqs':>12}")
print(f" {'-'*22} {'-'*10} {'-'*12}")
for model in ["gpt-4o", "o3-mini-low", "o3-mini-high", "deepseek-r1-api", "claude-opus-4-6"]:
r_tokens = 0
if model in task["o3_reasoning_tokens"]:
r_tokens = task["o3_reasoning_tokens"][model]
elif model == "claude-opus-4-6":
r_tokens = task["claude_thinking"]
cost = calculate_cost(model, task["input"], task["output"], r_tokens)
print(f" {model:<22} ${cost:>9.5f} ${cost*1000:>11.2f}")
Part 6: Practical Comparison - Same Problem Across Models
We will run the same challenging problem through:
Standard GPT-4o (fast thinking)
o3-mini with high reasoning (slow thinking)
DeepSeek R1 via cloud API (open source reasoning)
Comparison dimensions: accuracy, latency, reasoning trace quality, token cost
COMPARISON_PROBLEM = """
A snail is at the bottom of a 30-foot well.
Each day it climbs up 3 feet, but each night it slides back 2 feet.
On what day does the snail reach or pass the top of the well?
Also: if the well were 100 feet deep instead of 30, what day would it escape?
Provide a general formula for a well of depth D, day-climb C, night-slide S.
"""
results = {}
# --- Model 1: Standard GPT-4o ---
def call_gpt4o(problem: str) -> dict:
if not OPENAI_API_KEY:
return {"text": None, "latency": None}
client = OpenAI(api_key=OPENAI_API_KEY)
start = time.time()
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": problem}]
)
return {
"text": response.choices[0].message.content,
"latency": round(time.time() - start, 2),
"input_tokens": response.usage.prompt_tokens,
"output_tokens": response.usage.completion_tokens,
"reasoning_tokens": 0
}
print("Running comparison across models...")
print("(Models will be skipped if API keys are not set)")
print()
# GPT-4o
print("[1/3] Testing GPT-4o...")
results["gpt-4o"] = call_gpt4o(COMPARISON_PROBLEM)
if results["gpt-4o"]["text"]:
print(f" Done in {results['gpt-4o']['latency']}s")
# o3-mini high
print("[2/3] Testing o3-mini (high reasoning)...")
results["o3-mini-high"] = call_o3_mini(COMPARISON_PROBLEM, effort="high")
if results["o3-mini-high"]["text"]:
print(f" Done in {results['o3-mini-high']['latency']}s")
# DeepSeek R1 cloud
print("[3/3] Testing DeepSeek R1 (OpenRouter)...")
results["deepseek-r1"] = call_deepseek_r1_openrouter(COMPARISON_PROBLEM)
if results["deepseek-r1"]["text"]:
print(f" Done in {results['deepseek-r1']['latency']}s")
print("\nAll models tested.")
# Display comparison results
print("=" * 65)
print("COMPARISON RESULTS")
print("=" * 65)
for model_name, result in results.items():
print(f"\nModel: {model_name}")
print("-" * 40)
if result.get("text") is None:
print(" [Not run - API key not available]")
continue
print(f" Latency : {result.get('latency')}s")
print(f" Input tokens : {result.get('input_tokens', 'N/A')}")
print(f" Output tokens : {result.get('output_tokens', 'N/A')}")
print(f" Reasoning tkns : {result.get('reasoning_tokens', 'N/A')}")
# Show thinking trace if available (DeepSeek)
if result.get("think"):
think_preview = result["think"][:300].replace("\n", " ")
print(f" Thinking trace : {think_preview}...")
# Show answer
answer_preview = result["text"][:400]
print(f" Answer preview : {answer_preview}...")
print()
print("Note: The correct answers are:")
print(" 30-foot well : Day 28 (snail reaches top on day 28)")
print(" 100-foot well: Day 98")
print(" General formula: ceil((D - C) / (C - S)) + 1 when C > S")
Part 7: Prompt Engineering for Reasoning Models
How Reasoning Models Differ
Standard models need prompts that guide them step by step. Reasoning models have internalized this.
Key differences:

| Technique | Standard Model | Reasoning Model |
|---|---|---|
| Chain-of-thought examples | Very helpful | Redundant (can hurt) |
| "Think step by step" | Essential | Unnecessary |
| Verbose system prompts | OK | Worse (can constrain reasoning) |
| Specific output format | Required | State clearly once |
| Few-shot examples | Very helpful | Minimal needed |
| Constraints | Enumerate all | Trust model judgment |
Best Practices
1. Be concise in the system prompt - long system prompts can anchor the reasoning in wrong directions
2. Don't demonstrate reasoning - the model does it internally; examples can confuse it
3. State the goal, not the method - let the model find the best path
4. Avoid over-constraining - don't say "first do X, then do Y" unless required
5. Specify the output format clearly, once - the model will follow formatting without being told how to think
# Prompt engineering: bad vs good prompts for reasoning models
CODING_PROBLEM = """
Given an array of integers, find the length of the longest subsequence
where the absolute difference between any two elements is at most 1.
Example: [1, 3, 2, 2, 5, 2, 3, 7] -> 5 (subsequence [3,2,2,2,3])
"""
# BAD prompt for reasoning model - over-constrains the thinking
BAD_PROMPT_SYSTEM = """
You are a coding assistant. When solving coding problems:
Step 1: First read the problem carefully.
Step 2: Think about brute force solutions.
Step 3: Then think about optimizations.
Step 4: Consider time complexity.
Step 5: Consider edge cases.
Step 6: Write the final solution.
Always show your reasoning. Use chain of thought.
Think out loud before writing code.
"""
# GOOD prompt for reasoning model - states the goal, trusts the model
GOOD_PROMPT_SYSTEM = """
You are an expert Python programmer.
Provide the optimal solution with time and space complexity analysis.
"""
print("BAD system prompt (over-constrains reasoning):")
print("-" * 40)
print(BAD_PROMPT_SYSTEM)
print("Problems:")
print(" - Forces a specific reasoning order that may not be optimal")
print(" - 'Think out loud' is redundant (reasoning model already does this internally)")
print(" - Over-verbose instructions consume input token budget")
print()
print("GOOD system prompt (concise, goal-focused):")
print("-" * 40)
print(GOOD_PROMPT_SYSTEM)
print("Why it's better:")
print(" - Concise: doesn't interfere with internal reasoning")
print(" - States the goal (optimal solution + analysis)")
print(" - Trusts the model to figure out HOW to reason")
# Additional prompt engineering tips with examples
prompt_tips = [
{
"tip": "Avoid few-shot examples that show reasoning steps",
"bad": """Example 1:
Q: What is 5+3?
A: Let me think step by step. First I have 5. Then I add 3. So 5+3=8.
Example 2:
Q: What is 7*6?
A: Let me think step by step. I know 7*6... [etc]""",
"good": """Q: What is 5+3? A: 8
Q: What is 7*6? A: 42""",
"reason": "Showing reasoning examples trains the model to mimic your style, "
"overriding its more capable internal reasoning."
},
{
"tip": "State constraints in the problem, not as process instructions",
"bad": "First check if x > 0, then check if x < 100, then compute log(x)",
"good": "Compute log(x) for x in range (0, 100), exclusive. Handle edge cases.",
"reason": "Let the model decide HOW to validate. State WHAT is needed."
},
{
"tip": "For multi-part problems, number parts clearly but don't prescribe order",
"bad": "Part 1: Do A. Then for Part 2: build on Part 1 to do B. Then Part 3...",
"good": "Answer all three parts:\n1. [Part A description]\n2. [Part B description]\n3. [Part C description]",
"reason": "Clear structure without imposing sequential dependency the model may not need."
}
]
for i, tip in enumerate(prompt_tips, 1):
print(f"Tip {i}: {tip['tip']}")
print(f" Why: {tip['reason']}")
print()
# Live test: concise prompt with o3-mini on a coding problem
def call_o3_mini_with_system(system: str, user: str, effort: str = "medium") -> dict:
if not OPENAI_API_KEY:
return {"text": None}
client = OpenAI(api_key=OPENAI_API_KEY)
start = time.time()
response = client.chat.completions.create(
model="o3-mini",
reasoning_effort=effort,
messages=[
{"role": "developer", "content": system}, # o-series uses 'developer' role
{"role": "user", "content": user}
]
)
return {
"text": response.choices[0].message.content,
"latency": round(time.time() - start, 2),
"reasoning_tokens": getattr(
response.usage.completion_tokens_details, "reasoning_tokens", "N/A"
)
}
print("Testing good prompt vs bad prompt on o3-mini:")
print()
print("[1/2] Good prompt (concise system):")
r_good = call_o3_mini_with_system(GOOD_PROMPT_SYSTEM, CODING_PROBLEM, effort="medium")
if r_good["text"]:
print(f" Latency: {r_good['latency']}s | Reasoning tokens: {r_good['reasoning_tokens']}")
print(f" Answer preview: {r_good['text'][:400]}...")
print()
print("[2/2] Bad prompt (over-constrained system):")
r_bad = call_o3_mini_with_system(BAD_PROMPT_SYSTEM, CODING_PROBLEM, effort="medium")
if r_bad["text"]:
print(f" Latency: {r_bad['latency']}s | Reasoning tokens: {r_bad['reasoning_tokens']}")
print(f" Answer preview: {r_bad['text'][:400]}...")
Part 8: Benchmark Comparison Table
Key Benchmarks Explained
| Benchmark | What It Tests | Difficulty |
|---|---|---|
| AIME 2024/2025 | American Invitational Mathematics Examination problems | Extreme (top 5% of math competitors) |
| MATH-500 | Hendrycks MATH dataset (500 problems) | Hard (grad-school math) |
| HumanEval | Python function generation (164 problems) | Medium (interview-level coding) |
| SWE-bench Verified | Real GitHub issue resolution | Hard (professional software engineering) |
| GPQA Diamond | PhD-level science questions | Extreme |
Performance Scores (as of early 2025)
# Benchmark comparison table
# Sources: OpenAI, Anthropic, DeepSeek technical reports (Jan-Apr 2025)
benchmarks = {
"Model": [
"GPT-4o (standard)",
"o1 (2024)",
"o3",
"o3-mini (high)",
"o4-mini",
"DeepSeek R1 (671B)",
"DeepSeek R1-Distill-7B",
"Claude 3.7 Sonnet (extended)",
"Claude Opus 4.6 (extended)",
],
"Type": [
"Standard",
"Reasoning",
"Reasoning",
"Reasoning",
"Reasoning",
"Reasoning (OSS)",
"Reasoning (OSS, small)",
"Extended Thinking",
"Extended Thinking",
],
"AIME 2025 (%)": [
"9.3",
"74.3",
"86.7",
"79.6",
"92.7",
"70.0",
"52.8",
"80.0",
"~85 (est.)",
],
"MATH-500 (%)": [
"74.6",
"96.4",
"97.9",
"97.1",
"97.6",
"97.3",
"89.1",
"96.2",
"~97 (est.)",
],
"HumanEval (%)": [
"90.2",
"92.4",
"~95",
"94.0",
"95.2",
"92.6",
"79.3",
"93.7",
"~95 (est.)",
],
"SWE-bench (%)": [
"38.5",
"48.9",
"71.7",
"49.3",
"68.1",
"49.2",
"N/A",
"70.3",
"~72 (est.)",
],
"GPQA Diamond (%)": [
"53.6",
"78.3",
"87.7",
"79.7",
"~81",
"71.5",
"49.1",
"84.8",
"~86 (est.)",
],
"Open Source": [
"No",
"No",
"No",
"No",
"No",
"Yes (MIT)",
"Yes (MIT)",
"No",
"No",
],
"Run Locally": [
"No",
"No",
"No",
"No",
"No",
"Needs cluster",
"Yes (7B via Ollama)",
"No",
"No",
],
}
# Print as a formatted table
col_widths = {k: max(len(k), max(len(str(v)) for v in vals))
for k, vals in benchmarks.items()}
header = " ".join(k.ljust(col_widths[k]) for k in benchmarks)
print(header)
print("-" * len(header))
n_rows = len(list(benchmarks.values())[0])
for i in range(n_rows):
row = " ".join(str(benchmarks[k][i]).ljust(col_widths[k]) for k in benchmarks)
print(row)
# Key insights from the benchmark table
print("Key Insights from Benchmark Comparison")
print("=" * 55)
print()
insights = [
{
"finding": "Reasoning models dominate on AIME",
"detail": "GPT-4o scores 9.3% vs o4-mini at 92.7% - a 10x improvement."
" Standard models fail at competition math."
},
{
"finding": "DeepSeek R1 matches o1 at fraction of cost",
"detail": "R1 671B scores 70% AIME vs o1 74.3%, but costs 10-20x less per token."
" Remarkable given open-source nature."
},
{
"finding": "Small distilled models are surprisingly capable",
"detail": "DeepSeek R1-Distill-7B scores 52.8% on AIME - beating GPT-4o (9.3%)"
" despite being 100x smaller."
},
{
"finding": "SWE-bench is where reasoning shines for code",
"detail": "o3 and Claude 3.7 both exceed 70% on SWE-bench Verified (real bug fixes)."
" GPT-4o is at 38.5%."
},
{
"finding": "MATH-500 saturating - AIME is the harder discriminator",
"detail": "Most reasoning models exceed 96% on MATH-500."
" AIME 2025 remains the hard discriminator."
},
]
for i, insight in enumerate(insights, 1):
print(f"{i}. {insight['finding']}")
print(f" {insight['detail']}")
print()
Summary
What We Covered
- Reasoning models use test-time compute scaling to "think" before answering
- OpenAI o-series (o3, o4-mini) use `reasoning_effort` to control the compute budget
- DeepSeek R1 is open-source reasoning: run it via API or locally with Ollama; it exposes `<think>` tags
- Claude Extended Thinking uses `budget_tokens` to control thinking depth; thinking blocks are readable
- Prompt engineering for reasoning models means being concise and trusting the model's reasoning
- Cost-benefit analysis shows reasoning models are only worth using for problems where accuracy matters
Quick Reference

| Goal | Recommended Model | Key Parameter |
|---|---|---|
| Hard math / olympiad | o3 or o4-mini | `reasoning_effort="high"` |
| Balanced reasoning + cost | o3-mini | `reasoning_effort="medium"` |
| Open-source reasoning (cloud) | DeepSeek R1 via OpenRouter | Standard chat API |
| Open-source reasoning (local) | deepseek-r1:7b via Ollama | `model="deepseek-r1:7b"` |
| Transparent reasoning trace | Claude Opus 4.6 extended | `budget_tokens` |
| Fast, cheap, good enough | GPT-4o | Standard completion |
Next Steps
- Phase 12: RLHF and GRPO training - learn how R1 was built
- Notebook 09: Agentic workflows with reasoning models (o3 as the "brain")
- Advanced: Combining reasoning models with tool use for complex agent tasks
# Final exercise: choose the right model for each task
print("Exercise: Match the task to the right model")
print("=" * 55)
tasks = [
{"task": "Summarize a 5-page document", "answer": "GPT-4o or Claude 3.5 Sonnet (no reasoning needed)"},
{"task": "Solve a 3-variable system of equations", "answer": "GPT-4o is usually enough; o3-mini if accuracy critical"},
{"task": "Debug why a distributed system deadlocks", "answer": "o3-mini high or Claude extended thinking"},
{"task": "Write a product description", "answer": "Standard model - creativity, not reasoning"},
{"task": "Prove a number theory theorem", "answer": "o3 high or Claude extended (budget_tokens=20000)"},
{"task": "Generate 10 tweet variations", "answer": "Standard model (GPT-4o, Claude 3.5 Sonnet)"},
{"task": "Optimize a complex SQL query with 8 joins", "answer": "o3-mini medium or DeepSeek R1"},
{"task": "Private: analyze confidential medical data", "answer": "DeepSeek R1:7B locally via Ollama (no data leaves machine)"},
]
for i, item in enumerate(tasks, 1):
print(f"{i}. Task: {item['task']}")
print(f" Best choice: {item['answer']}")
print()
print("Remember: reasoning models are more expensive and slower.")
print("Only use them when the accuracy boost justifies the cost.")