Notebook 08: Working with Reasoning Models

o3, DeepSeek R1, and Claude Extended Thinking

What You’ll Learn

  1. What are reasoning models? - Fast vs slow thinking, test-time compute scaling

  2. OpenAI o-series - o1, o3, o3-mini, o4-mini with reasoning_effort

  3. DeepSeek R1 - Open-source reasoning via API and locally with Ollama

  4. Claude Extended Thinking - Budget tokens and thinking blocks

  5. When to use reasoning models - Cost-benefit analysis

  6. Practical comparison - Same problem across models

  7. Prompt engineering for reasoning models

  8. Benchmark comparison table - AIME, MATH-500, HumanEval, SWE-bench

Prerequisites: OpenAI API key, Anthropic API key (optional), Ollama installed (optional)

# Install required packages
!pip install openai anthropic python-dotenv ollama -q

import os
import time
import json
import re
from dotenv import load_dotenv

load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")

print("OpenAI API key:", "SET" if OPENAI_API_KEY else "NOT SET - some cells will be skipped")
print("Anthropic API key:", "SET" if ANTHROPIC_API_KEY else "NOT SET - some cells will be skipped")

Part 1: What Are Reasoning Models?

System 1 vs System 2 Thinking

Psychologist Daniel Kahneman’s framework maps neatly onto LLM behavior:

| Dimension | System 1 (Fast Thinking) | System 2 (Slow Thinking) |
|-----------|--------------------------|--------------------------|
| Speed | Instant | Deliberate |
| Effort | Automatic | Effortful |
| LLM Example | GPT-4o, Claude 3.5 Sonnet | o3, DeepSeek R1, Claude Extended Thinking |
| Good For | Chat, summarization, classification | Math, code, logic, planning |
| Failure Mode | Wrong on hard problems | Slow and expensive |

Test-Time Compute Scaling

Traditional scaling: train bigger models with more data and parameters.

Test-time compute scaling: spend more compute at inference time to improve accuracy.

Standard Model:   [User Prompt] --> [Single Forward Pass] --> [Answer]

Reasoning Model:  [User Prompt] --> [Think: step 1... step 2... step 3...] --> [Answer]
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                        Thinking tokens (billed but not always shown)
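A toy way to see why extra inference-time compute helps: sample several independent answers and keep the majority vote. The `sample_answer` function below is a simulated stand-in for one model call (an illustrative assumption, not a real API):

```python
import random
from collections import Counter

random.seed(0)

def sample_answer(p_correct: float = 0.6) -> int:
    """Stand-in for one model sample: returns the right answer (42)
    with probability p_correct, otherwise a random wrong one."""
    return 42 if random.random() < p_correct else random.choice([41, 43, 44])

def majority_vote(k: int) -> int:
    """Spend k samples of test-time compute, keep the most common answer."""
    votes = Counter(sample_answer() for _ in range(k))
    return votes.most_common(1)[0][0]

for k in (1, 5, 25):
    accuracy = sum(majority_vote(k) == 42 for _ in range(200)) / 200
    print(f"k={k:>2} samples -> accuracy ~{accuracy:.2f}")
```

Accuracy climbs with k even though the per-sample model never changes; reasoning models internalize a more sophisticated version of this trade-off.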

Chain-of-Thought in Latent Space

Classic chain-of-thought (CoT) prompts models to β€œthink step by step” in the output. Reasoning models do this internally via special β€œthinking tokens” before producing the final answer.

  • OpenAI o-series: Thinking tokens are hidden (you pay for them but cannot read them)

  • DeepSeek R1: Thinking is exposed inside <think>...</think> XML tags

  • Claude Extended Thinking: Thinking is returned as separate thinking content blocks

Why They Outperform Standard Models

  1. Self-verification: The model checks its own work before committing to an answer

  2. Backtracking: If a reasoning path fails, the model tries a different approach

  3. Deeper decomposition: Complex problems are broken into sub-problems

  4. Reduced hallucination: Structured reasoning catches logical errors

# Demonstration: Standard model vs reasoning model on a hard problem
# This cell illustrates the concept even without an API key

HARD_MATH_PROBLEM = """
A farmer has 17 sheep. All but 9 die. How many sheep are left?
"""

# Classic "trick" problem - standard models often answer 8 (wrong)
# Reasoning models parse "all but 9" correctly = 9 remain

print("Problem:", HARD_MATH_PROBLEM.strip())
print()
print("Common wrong answer: 8  (model calculates 17 - 9 = 8)")
print("Correct answer: 9      ('all but 9' means 9 survive)")
print()
print("Reasoning model approach:")
print("  <think>")
print("  The phrase 'all but 9 die' means 9 sheep do NOT die.")
print("  Therefore 9 sheep are left alive.")
print("  The total count of 17 is a distractor.")
print("  </think>")
print("  Answer: 9")

Part 2: OpenAI Reasoning Models (o1, o3, o3-mini, o4-mini)

Model Lineup

| Model | Release | Strengths | Cost (approx) |
|-------|---------|-----------|---------------|
| o1 | Sep 2024 | Best accuracy, frontier | $15 / $60 per M tokens |
| o1-mini | Sep 2024 | Faster, cheaper | $3 / $12 per M tokens |
| o3 | Apr 2025 | Successor to o1, top AIME | $10 / $40 per M tokens |
| o3-mini | Jan 2025 | Best value reasoning | $1.10 / $4.40 per M tokens |
| o4-mini | Apr 2025 | Latest small, vision | $1.10 / $4.40 per M tokens |

The reasoning_effort Parameter

Unlike standard models, o-series models expose a reasoning_effort knob:

  • low - Fastest, least thinking. Good for simple reasoning tasks.

  • medium - Balanced. Default for most use cases.

  • high - Maximum reasoning. Best accuracy on hard problems, most expensive.

Key Limitations

  • No streaming of thinking tokens (only final answer can be streamed)

  • System prompt restrictions on some older o-series models (use developer role)

  • Reasoning tokens are billed in addition to input/output tokens

  • No temperature support (model manages its own exploration)
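For the system-prompt restriction above, the workaround on older o-series models is the `developer` role. A minimal sketch (the instruction text is illustrative):

```python
# Older o-series models reject the "system" role; the "developer" role
# carries the same kind of instruction instead.
messages = [
    {"role": "developer", "content": "Answer with only the final number."},
    {"role": "user", "content": "What is 17 * 23?"},
]

# This list is passed as-is to client.chat.completions.create(...)
print(messages[0]["role"])
```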

from openai import OpenAI

def call_o3_mini(problem: str, effort: str = "medium") -> dict:
    """
    Call o3-mini with a specified reasoning_effort level.
    Returns response text, token usage, and latency.
    """
    if not OPENAI_API_KEY:
        print("[SKIP] OPENAI_API_KEY not set.")
        return {"text": None, "usage": None, "latency": None}

    client = OpenAI(api_key=OPENAI_API_KEY)
    start = time.time()

    response = client.chat.completions.create(
        model="o3-mini",
        reasoning_effort=effort,   # "low" | "medium" | "high"
        messages=[
            {"role": "user", "content": problem}
        ]
    )

    latency = round(time.time() - start, 2)
    message = response.choices[0].message.content
    usage = response.usage

    return {
        "text": message,
        "latency": latency,
        "input_tokens": usage.prompt_tokens,
        "output_tokens": usage.completion_tokens,
        "reasoning_tokens": getattr(usage.completion_tokens_details, "reasoning_tokens", "N/A"),
    }


# --- Example problem ---
LOGIC_PROBLEM = """
There are 5 houses in a row. Each house is painted a different color and
inhabited by a person of a different nationality, who drinks a different
beverage, smokes a different brand of cigars, and keeps a different pet.

Clues:
1. The Brit lives in the red house.
2. The Swede keeps dogs as pets.
3. The Dane drinks tea.
4. The green house is on the left of the white house.
5. The green house owner drinks coffee.
6. The person who smokes Pall Mall rears birds.
7. The owner of the yellow house smokes Dunhill.
8. The man living in the center house drinks milk.
9. The Norwegian lives in the first house.
10. The man who smokes Blends lives next to the one who keeps cats.
11. The man who keeps horses lives next to the man who smokes Dunhill.
12. The owner who smokes BlueMaster drinks beer.
13. The German smokes Prince.
14. The Norwegian lives next to the blue house.
15. The man who smokes Blends has a neighbor who drinks water.

Who owns the fish?
"""

print("Einstein's Riddle (Zebra Puzzle) loaded.")
print("This is a classic logic puzzle that requires systematic reasoning.")
print("We will test it with different reasoning_effort levels.")
# Test with low reasoning effort
print("=" * 60)
print("o3-mini with reasoning_effort='low'")
print("=" * 60)

result_low = call_o3_mini(LOGIC_PROBLEM, effort="low")

if result_low["text"]:
    print(f"Answer: {result_low['text'][:500]}...")
    print(f"\nLatency: {result_low['latency']}s")
    print(f"Input tokens: {result_low['input_tokens']}")
    print(f"Output tokens: {result_low['output_tokens']}")
    print(f"Reasoning tokens: {result_low['reasoning_tokens']}")
# Test with high reasoning effort
print("=" * 60)
print("o3-mini with reasoning_effort='high'")
print("=" * 60)

result_high = call_o3_mini(LOGIC_PROBLEM, effort="high")

if result_high["text"]:
    print(f"Answer: {result_high['text'][:500]}...")
    print(f"\nLatency: {result_high['latency']}s")
    print(f"Input tokens: {result_high['input_tokens']}")
    print(f"Output tokens: {result_high['output_tokens']}")
    print(f"Reasoning tokens: {result_high['reasoning_tokens']}")

    # Compare
    if result_low["latency"]:
        print(f"\n--- Effort Comparison ---")
        print(f"low  reasoning_tokens: {result_low['reasoning_tokens']}  | latency: {result_low['latency']}s")
        print(f"high reasoning_tokens: {result_high['reasoning_tokens']} | latency: {result_high['latency']}s")
# Cost calculator for reasoning tokens
# o3-mini pricing (as of 2025): $1.10 per M input, $4.40 per M output/reasoning

def estimate_cost_o3_mini(input_tokens: int, output_tokens: int, reasoning_tokens: int) -> float:
    """
    Estimate cost for o3-mini.
    Pricing: $1.10/M input, $4.40/M output (reasoning tokens billed as output).
    """
    input_cost  = (input_tokens / 1_000_000) * 1.10
    output_cost = ((output_tokens + reasoning_tokens) / 1_000_000) * 4.40
    return input_cost + output_cost


# Simulate a batch of 1000 requests
print("Cost simulation: 1000 identical o3-mini requests")
print("=" * 50)

scenarios = [
    ("low",    300,  100,  200),
    ("medium", 300,  150,  800),
    ("high",   300,  200, 3000),
]

for effort, inp, out, reason in scenarios:
    cost_per_req = estimate_cost_o3_mini(inp, out, reason)
    cost_1000    = cost_per_req * 1000
    print(f"reasoning_effort='{effort}':")
    print(f"  Estimated reasoning tokens per req : {reason}")
    print(f"  Cost per request                   : ${cost_per_req:.5f}")
    print(f"  Cost for 1000 requests             : ${cost_1000:.2f}")
    print()

Best Use Cases for OpenAI o-series

| Use Case | Reasoning Effort | Why |
|----------|------------------|-----|
| Math olympiad / AIME | high | Requires deep multi-step proof |
| LeetCode hard / competitive programming | high | Subtle edge cases |
| Complex debugging (multi-file) | medium-high | Root cause analysis |
| SQL query optimization | medium | Logical planning |
| Scientific hypothesis checking | medium | Literature reasoning |
| Simple arithmetic / classification | Skip o-series | Standard GPT-4o is cheaper |
| Creative writing | Skip o-series | Reasoning doesn’t help creativity |

Part 3: DeepSeek R1 - Open Source Reasoning

Why DeepSeek R1 Matters

DeepSeek R1 was released in January 2025 and shocked the AI community:

  • Matches o1 on benchmarks at a fraction of training cost

  • Fully open weights (MIT license) - you can run it locally

  • Thinking is visible - <think> tags expose the full reasoning trace

  • GRPO training - uses Group Relative Policy Optimization instead of PPO

Model Sizes (Distilled)

R1 knowledge was distilled into smaller Qwen/Llama base models:

| Model | Parameters | VRAM Required | Use Case |
|-------|------------|---------------|----------|
| deepseek-r1:1.5b | 1.5B | ~2 GB | Embedded, mobile |
| deepseek-r1:7b | 7B | ~6 GB | Laptop GPU |
| deepseek-r1:8b | 8B | ~8 GB | Laptop GPU |
| deepseek-r1:14b | 14B | ~12 GB | Desktop GPU |
| deepseek-r1:32b | 32B | ~24 GB | High-end GPU |
| deepseek-r1:70b | 70B | ~48 GB | Multi-GPU |
| deepseek-r1 (full) | 671B | ~400 GB | Cluster |

GRPO Training (Brief Overview)

Standard RLHF uses PPO (Proximal Policy Optimization) which requires a separate critic model. DeepSeek R1 used GRPO (Group Relative Policy Optimization):

PPO:  Model --> Critic (separate) --> Reward --> Update
GRPO: Model --> Sample group of answers --> Rank them --> Update using relative rewards

GRPO eliminates the critic model, cutting memory and compute requirements by ~50%. Full R1 training details are covered in Phase 12 (RLHF).
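The core of the group-relative update can be sketched in a few lines: score a group of sampled answers, then normalize each reward against the group's own statistics. This is a toy illustration of the advantage computation, not the full GRPO objective:

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantage: each sample is compared against the
    mean of its own group, so no separate learned critic is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against identical rewards
    return [(r - mean) / std for r in rewards]

# 4 answers sampled for one prompt, scored by a verifier (1 = correct)
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```

Correct answers get positive advantage and wrong ones negative, purely from within-group comparison; that is what lets GRPO drop PPO's critic model.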

# --- Option A: DeepSeek R1 via OpenRouter (cloud API) ---
# OpenRouter provides unified access to many models including DeepSeek R1.
# API is OpenAI-compatible - just change the base_url and model name.

OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")

def call_deepseek_r1_openrouter(problem: str) -> dict:
    """
    Call DeepSeek R1 via OpenRouter.
    OpenRouter API is OpenAI-compatible.
    Sign up at https://openrouter.ai to get a free API key.
    """
    if not OPENROUTER_API_KEY:
        print("[SKIP] OPENROUTER_API_KEY not set. Get one free at https://openrouter.ai")
        return {"text": None, "think": None, "latency": None}

    client = OpenAI(
        api_key=OPENROUTER_API_KEY,
        base_url="https://openrouter.ai/api/v1"
    )

    start = time.time()
    response = client.chat.completions.create(
        model="deepseek/deepseek-r1",
        messages=[{"role": "user", "content": problem}]
    )
    latency = round(time.time() - start, 2)

    full_text = response.choices[0].message.content
    think_text, answer_text = parse_think_tags(full_text)

    return {
        "text": answer_text,
        "think": think_text,
        "latency": latency
    }


def parse_think_tags(text: str) -> tuple[str, str]:
    """
    Parse <think>...</think> tags from DeepSeek R1 output.
    Returns (thinking_content, final_answer).
    """
    if text is None:
        return "", ""

    think_match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    think_content = think_match.group(1).strip() if think_match else ""

    # Answer is everything after the closing </think> tag
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

    return think_content, answer


# Test the parser on a synthetic example
sample_r1_output = """
<think>
Let me think about this step by step.
The problem asks for the sum of the first 10 natural numbers.
Using the formula n*(n+1)/2 where n=10:
10 * 11 / 2 = 55
</think>

The sum of the first 10 natural numbers is **55**.
"""

think, answer = parse_think_tags(sample_r1_output)
print("Parsed <think> content:")
print(think)
print("\nParsed final answer:")
print(answer)
# Call DeepSeek R1 via OpenRouter on our logic problem
MATH_PROBLEM_SIMPLE = """
A train leaves City A at 9:00 AM traveling at 60 mph toward City B.
Another train leaves City B at 10:00 AM traveling at 80 mph toward City A.
The cities are 280 miles apart. At what time will the trains meet?
"""

print("Problem:", MATH_PROBLEM_SIMPLE.strip())
print()

result_r1_cloud = call_deepseek_r1_openrouter(MATH_PROBLEM_SIMPLE)

if result_r1_cloud["text"]:
    print("=" * 60)
    print("DeepSeek R1 Thinking Process:")
    print("=" * 60)
    print(result_r1_cloud["think"][:1000] if result_r1_cloud["think"] else "(thinking hidden)")
    print()
    print("=" * 60)
    print("Final Answer:")
    print("=" * 60)
    print(result_r1_cloud["text"])
    print(f"\nLatency: {result_r1_cloud['latency']}s")
# --- Option B: DeepSeek R1 via Ollama (local, fully private) ---
# Requirements:
#   1. Install Ollama: https://ollama.ai
#   2. Pull the model: ollama pull deepseek-r1:7b
#   3. Ollama must be running (it starts automatically on install)

try:
    import ollama
    OLLAMA_AVAILABLE = True
except ImportError:
    OLLAMA_AVAILABLE = False
    print("ollama package not installed. Run: pip install ollama")


def call_deepseek_r1_ollama(problem: str, model: str = "deepseek-r1:7b") -> dict:
    """
    Call DeepSeek R1 locally via Ollama.
    
    First time setup:
        brew install ollama          # macOS
        ollama pull deepseek-r1:7b   # ~5 GB download
    
    Available sizes: 1.5b, 7b, 8b, 14b, 32b, 70b
    """
    if not OLLAMA_AVAILABLE:
        print("[SKIP] ollama not installed.")
        return {"text": None, "think": None, "latency": None}

    # Check that the Ollama daemon is running and see which models are pulled.
    # ollama.list() returned a plain dict in older SDK versions and returns a
    # typed response (items exposing .model) in newer ones, so handle both.
    try:
        listing = ollama.list()
        raw = listing.get("models", []) if isinstance(listing, dict) else getattr(listing, "models", [])
        available_models = [
            (m.get("name") or m.get("model", "")) if isinstance(m, dict) else getattr(m, "model", "")
            for m in raw
        ]
    except Exception:
        print("[SKIP] Ollama daemon not running. Start with: ollama serve")
        return {"text": None, "think": None, "latency": None}

    if not any(model.split(":")[0] in m for m in available_models):
        print(f"[SKIP] Model '{model}' not found locally.")
        print(f"       Pull it with: ollama pull {model}")
        print(f"       Available models: {available_models}")
        return {"text": None, "think": None, "latency": None}

    start = time.time()
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": problem}]
    )
    latency = round(time.time() - start, 2)

    full_text = response["message"]["content"]
    think_content, answer = parse_think_tags(full_text)

    return {
        "text": answer,
        "think": think_content,
        "full_response": full_text,
        "latency": latency
    }


# Run locally
result_r1_local = call_deepseek_r1_ollama(MATH_PROBLEM_SIMPLE, model="deepseek-r1:7b")

if result_r1_local["text"]:
    print("DeepSeek R1 7B (local via Ollama)")
    print("=" * 60)
    print("Thinking:")
    print(result_r1_local["think"][:800])
    print("\nAnswer:")
    print(result_r1_local["text"])
    print(f"\nLatency: {result_r1_local['latency']}s (local inference)")
# Advanced: Streaming DeepSeek R1 from Ollama
# Streaming lets you see the <think> tokens as they arrive

def stream_deepseek_r1_ollama(problem: str, model: str = "deepseek-r1:7b"):
    """
    Stream DeepSeek R1 output from Ollama.
    Prints thinking tokens in real-time.
    """
    if not OLLAMA_AVAILABLE:
        print("[SKIP] ollama not installed.")
        return

    try:
        stream = ollama.chat(
            model=model,
            messages=[{"role": "user", "content": problem}],
            stream=True
        )

        print("Streaming response (including <think> tokens):")
        print("-" * 40)

        in_think = False
        seen_close = False

        for chunk in stream:
            token = chunk["message"]["content"]

            # Label the transition between thinking and answer.
            # (Assumes each tag arrives within a single chunk, which is
            # typical for R1's tokenizer.)
            if "<think>" in token and not seen_close:
                in_think = True
                print("[THINK] ", end="")
            if "</think>" in token:
                in_think = False
                seen_close = True
                print("\n[ANS]   ", end="")

            print(token, end="", flush=True)

        print("\n" + "-" * 40)

    except Exception as e:
        print(f"[SKIP] Streaming failed: {e}")


# Uncomment to stream (requires Ollama + model)
# stream_deepseek_r1_ollama("What is 15 factorial?")
print("Streaming function defined. Uncomment the last line to test with a running Ollama instance.")

Part 4: Anthropic Claude Extended Thinking (Claude Opus 4.6)

How Claude Extended Thinking Works

Claude’s extended thinking is activated by passing a thinking configuration block:

thinking={"type": "enabled", "budget_tokens": N}

The response will contain a list of content blocks, some of type "thinking" and others of type "text".

Budget Tokens

  • Minimum: 1,024 tokens

  • Maximum: 100,000 tokens (with max_tokens >= budget_tokens + 1)

  • Recommended starting point: 5,000-10,000 for most hard problems

  • When to increase: If the model says β€œI need more space to think” or gives incorrect answers

Thinking tokens are billed at the same rate as output tokens.
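Because thinking is billed at the output rate, the worst-case thinking cost of one request is simply `budget_tokens` times that rate. For example, at the $75/M Claude Opus output price used in the Part 5 pricing table:

```python
budget_tokens = 10_000
output_rate_per_M = 75.00  # $/M output tokens (Claude Opus rate from Part 5)

# Thinking tokens are billed as output, so the budget bounds the cost.
max_thinking_cost = budget_tokens / 1_000_000 * output_rate_per_M
print(f"Worst-case thinking cost per request: ${max_thinking_cost:.2f}")  # $0.75
```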

Key Differences from o-series

| Feature | OpenAI o-series | Claude Extended Thinking |
|---------|-----------------|--------------------------|
| Thinking visible? | No (hidden) | Yes (thinking blocks) |
| Control granularity | reasoning_effort (3 levels) | budget_tokens (continuous) |
| Streaming | Final answer only | Thinking + answer streamable |
| System prompt | Restricted on older models | Normal support |

import anthropic

def call_claude_extended_thinking(
    problem: str,
    budget_tokens: int = 10000,
    model: str = "claude-opus-4-6"
) -> dict:
    """
    Call Claude with extended thinking enabled.
    Returns thinking blocks and final answer separately.

    Args:
        problem: The problem to solve
        budget_tokens: Max tokens Claude can use for thinking (min 1024)
        model: Claude model - must support extended thinking (claude-opus-4-6)
    """
    if not ANTHROPIC_API_KEY:
        print("[SKIP] ANTHROPIC_API_KEY not set.")
        return {"thinking": None, "answer": None, "latency": None}

    client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)

    # max_tokens must be > budget_tokens to leave room for the final answer
    max_tokens = budget_tokens + 4000

    start = time.time()

    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        thinking={
            "type": "enabled",
            "budget_tokens": budget_tokens
        },
        messages=[
            {"role": "user", "content": problem}
        ]
    )

    latency = round(time.time() - start, 2)

    # Separate thinking blocks from text blocks
    thinking_blocks = []
    text_blocks = []

    for block in response.content:
        if block.type == "thinking":
            thinking_blocks.append(block.thinking)
        elif block.type == "text":
            text_blocks.append(block.text)

    return {
        "thinking": "\n\n".join(thinking_blocks),
        "answer": "\n\n".join(text_blocks),
        "latency": latency,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        # cache_read_input_tokens available in newer SDK versions
        "stop_reason": response.stop_reason
    }


print("Claude extended thinking function defined.")
# Test Claude extended thinking on a hard math problem
HARD_MATH = """
Find all integer solutions (x, y) to the equation:
    x^2 - y^2 = 2024

Show all solutions and prove there are no others.
"""

print("Problem:", HARD_MATH.strip())
print()

result_claude = call_claude_extended_thinking(HARD_MATH, budget_tokens=8000)

if result_claude["thinking"]:
    print("=" * 60)
    print("Claude's Thinking (first 800 chars):")
    print("=" * 60)
    print(result_claude["thinking"][:800])
    print()
    print("=" * 60)
    print("Final Answer:")
    print("=" * 60)
    print(result_claude["answer"])
    print(f"\nLatency: {result_claude['latency']}s")
    print(f"Input tokens: {result_claude['input_tokens']}")
    print(f"Output tokens: {result_claude['output_tokens']} (includes thinking)")
# Streaming Claude Extended Thinking
# With streaming, thinking blocks arrive as they are generated

def stream_claude_extended_thinking(problem: str, budget_tokens: int = 5000):
    """
    Stream Claude extended thinking response.
    Thinking blocks stream as they are generated.
    """
    if not ANTHROPIC_API_KEY:
        print("[SKIP] ANTHROPIC_API_KEY not set.")
        return

    client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)

    print("Streaming extended thinking response...")
    print("-" * 40)

    with client.messages.stream(
        model="claude-opus-4-6",
        max_tokens=budget_tokens + 4000,
        thinking={
            "type": "enabled",
            "budget_tokens": budget_tokens
        },
        messages=[{"role": "user", "content": problem}]
    ) as stream:
        current_block_type = None

        for event in stream:
            # SDK stream events expose a .type string such as
            # "content_block_start" / "content_block_delta" / "content_block_stop"
            event_type = getattr(event, "type", "")

            if event_type == "content_block_start":
                block = event.content_block
                current_block_type = block.type
                if block.type == "thinking":
                    print("\n[THINKING BLOCK STARTED]")
                elif block.type == "text":
                    print("\n[ANSWER BLOCK STARTED]")

            elif event_type == "content_block_delta":
                delta = event.delta
                if hasattr(delta, "thinking"):
                    print(delta.thinking, end="", flush=True)
                elif hasattr(delta, "text"):
                    print(delta.text, end="", flush=True)

            elif event_type == "content_block_stop":
                if current_block_type == "thinking":
                    print("\n[THINKING BLOCK ENDED]")
                elif current_block_type == "text":
                    print("\n[ANSWER BLOCK ENDED]")

    print("-" * 40)


# Test streaming (requires Anthropic API key)
SIMPLE_REASONING = "What is the 20th Fibonacci number? Show your work."

stream_claude_extended_thinking(SIMPLE_REASONING, budget_tokens=3000)

Part 5: When to Use Reasoning Models

Decision Framework

Is the problem well-defined with a verifiable answer?
    NO  --> Standard model (GPT-4o, Claude 3.5 Sonnet)
    YES -->
        Does it require multiple logical steps?
        NO  --> Standard model
        YES -->
            Is latency critical (< 2s)?
            YES --> Standard model + chain-of-thought prompt
            NO  -->
                Is cost critical?
                YES --> DeepSeek R1 7B (local) or o3-mini low
                NO  --> o3-mini high or Claude extended thinking
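The same tree, encoded as a small helper you could drop into routing logic (the function and its return strings are illustrative, not a library API):

```python
def pick_model(verifiable: bool, multi_step: bool,
               latency_critical: bool, cost_critical: bool) -> str:
    """Walk the decision tree above and return a model recommendation."""
    if not verifiable or not multi_step:
        return "standard model (GPT-4o / Claude 3.5 Sonnet)"
    if latency_critical:
        return "standard model + chain-of-thought prompt"
    if cost_critical:
        return "DeepSeek R1 7B (local) or o3-mini low"
    return "o3-mini high or Claude extended thinking"

print(pick_model(verifiable=True, multi_step=True,
                 latency_critical=False, cost_critical=True))
```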

Problem Taxonomy

| Problem Type | Standard | Reasoning | Notes |
|--------------|----------|-----------|-------|
| Math olympiad (AIME) | Poor | Excellent | Deep multi-step proof |
| LeetCode hard | Poor | Excellent | Algorithm design |
| Simple arithmetic | Good | Overkill | Waste of money |
| Multi-step debugging | Fair | Good | Root cause analysis |
| Simple bug fix | Good | Overkill | |
| Complex planning | Fair | Good | Constraint satisfaction |
| Simple Q&A | Excellent | Overkill | |
| Code generation (hard) | Fair | Good | Edge cases |
| Creative writing | Excellent | Worse | Reasoning hurts creativity |
| Summarization | Excellent | Overkill | |

# Cost-benefit calculator for reasoning models

PRICING = {
    "gpt-4o": {
        "input_per_M": 5.00,
        "output_per_M": 15.00,
        "reasoning_per_M": 0,
    },
    "o3-mini-low": {
        "input_per_M": 1.10,
        "output_per_M": 4.40,
        "reasoning_per_M": 4.40,
    },
    "o3-mini-high": {
        "input_per_M": 1.10,
        "output_per_M": 4.40,
        "reasoning_per_M": 4.40,
    },
    "claude-opus-4-6": {
        "input_per_M": 15.00,
        "output_per_M": 75.00,
        "reasoning_per_M": 75.00,   # thinking tokens billed as output
    },
    "deepseek-r1-api": {
        "input_per_M": 0.55,
        "output_per_M": 2.19,
        "reasoning_per_M": 0,       # thinking visible but not billed separately
    },
    "deepseek-r1-local": {
        "input_per_M": 0,
        "output_per_M": 0,
        "reasoning_per_M": 0,       # electricity cost only
    },
}

# Typical token estimates per task type
TASK_PROFILES = {
    "math_olympiad": {
        "description": "AIME-level competition problem",
        "input": 500,
        "output": 800,
        "gpt4o_accuracy": "20%",
        "o3_reasoning_tokens": {"o3-mini-low": 3000, "o3-mini-high": 15000},
        "claude_thinking": 12000,
    },
    "leetcode_hard": {
        "description": "LeetCode hard coding challenge",
        "input": 400,
        "output": 600,
        "gpt4o_accuracy": "45%",
        "o3_reasoning_tokens": {"o3-mini-low": 1500, "o3-mini-high": 8000},
        "claude_thinking": 6000,
    },
    "simple_qa": {
        "description": "Simple factual question",
        "input": 100,
        "output": 100,
        "gpt4o_accuracy": "98%",
        "o3_reasoning_tokens": {"o3-mini-low": 200, "o3-mini-high": 500},
        "claude_thinking": 1024,
    },
}


def calculate_cost(model: str, input_tokens: int, output_tokens: int, reasoning_tokens: int = 0) -> float:
    p = PRICING[model]
    return (
        (input_tokens / 1_000_000) * p["input_per_M"]
        + (output_tokens / 1_000_000) * p["output_per_M"]
        + (reasoning_tokens / 1_000_000) * p["reasoning_per_M"]
    )


print("Cost-Benefit Analysis: Reasoning vs Standard Models")
print("=" * 65)

for task_name, task in TASK_PROFILES.items():
    print(f"\nTask: {task['description']}")
    print(f"  GPT-4o accuracy (baseline): {task['gpt4o_accuracy']}")
    print(f"  {'Model':<22} {'$/request':>10} {'$1000 reqs':>12}")
    print(f"  {'-'*22} {'-'*10} {'-'*12}")

    for model in ["gpt-4o", "o3-mini-low", "o3-mini-high", "deepseek-r1-api", "claude-opus-4-6"]:
        r_tokens = 0
        if model in task["o3_reasoning_tokens"]:
            r_tokens = task["o3_reasoning_tokens"][model]
        elif model == "claude-opus-4-6":
            r_tokens = task["claude_thinking"]

        cost = calculate_cost(model, task["input"], task["output"], r_tokens)
        print(f"  {model:<22} ${cost:>9.5f} ${cost*1000:>11.2f}")

Part 6: Practical Comparison - Same Problem Across Models

We will run the same challenging problem through:

  1. Standard GPT-4o (fast thinking)

  2. o3-mini with high reasoning (slow thinking)

  3. DeepSeek R1 via cloud API (open source reasoning)

Comparison dimensions: accuracy, latency, reasoning trace quality, token cost

COMPARISON_PROBLEM = """
A snail is at the bottom of a 30-foot well.
Each day it climbs up 3 feet, but each night it slides back 2 feet.
On what day does the snail reach or pass the top of the well?

Also: if the well were 100 feet deep instead of 30, what day would it escape?
Provide a general formula for a well of depth D, day-climb C, night-slide S.
"""

results = {}

# --- Model 1: Standard GPT-4o ---
def call_gpt4o(problem: str) -> dict:
    if not OPENAI_API_KEY:
        return {"text": None, "latency": None}

    client = OpenAI(api_key=OPENAI_API_KEY)
    start = time.time()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": problem}]
    )

    return {
        "text": response.choices[0].message.content,
        "latency": round(time.time() - start, 2),
        "input_tokens": response.usage.prompt_tokens,
        "output_tokens": response.usage.completion_tokens,
        "reasoning_tokens": 0
    }


print("Running comparison across models...")
print("(Models will be skipped if API keys are not set)")
print()

# GPT-4o
print("[1/3] Testing GPT-4o...")
results["gpt-4o"] = call_gpt4o(COMPARISON_PROBLEM)
if results["gpt-4o"]["text"]:
    print(f"    Done in {results['gpt-4o']['latency']}s")

# o3-mini high
print("[2/3] Testing o3-mini (high reasoning)...")
results["o3-mini-high"] = call_o3_mini(COMPARISON_PROBLEM, effort="high")
if results["o3-mini-high"]["text"]:
    print(f"    Done in {results['o3-mini-high']['latency']}s")

# DeepSeek R1 cloud
print("[3/3] Testing DeepSeek R1 (OpenRouter)...")
results["deepseek-r1"] = call_deepseek_r1_openrouter(COMPARISON_PROBLEM)
if results["deepseek-r1"]["text"]:
    print(f"    Done in {results['deepseek-r1']['latency']}s")

print("\nAll models tested.")
# Display comparison results
print("=" * 65)
print("COMPARISON RESULTS")
print("=" * 65)

for model_name, result in results.items():
    print(f"\nModel: {model_name}")
    print("-" * 40)

    if result.get("text") is None:
        print("  [Not run - API key not available]")
        continue

    print(f"  Latency        : {result.get('latency')}s")
    print(f"  Input tokens   : {result.get('input_tokens', 'N/A')}")
    print(f"  Output tokens  : {result.get('output_tokens', 'N/A')}")
    print(f"  Reasoning tkns : {result.get('reasoning_tokens', 'N/A')}")

    # Show thinking trace if available (DeepSeek)
    if result.get("think"):
        think_preview = result["think"][:300].replace("\n", " ")
        print(f"  Thinking trace : {think_preview}...")

    # Show answer
    answer_preview = result["text"][:400]
    print(f"  Answer preview : {answer_preview}...")

print()
print("Note: The correct answers are:")
print("  30-foot well : Day 28 (snail reaches top on day 28)")
print("  100-foot well: Day 98")
print("  General formula: ceil((D - C) / (C - S)) + 1 when C > S")

Part 7: Prompt Engineering for Reasoning Models

How Reasoning Models Differ

Standard models need prompts that guide them step-by-step. Reasoning models have internalized this.

Key differences:

| Technique | Standard Model | Reasoning Model |
|-----------|----------------|-----------------|
| Chain-of-thought examples | Very helpful | Redundant (can hurt) |
| "Think step by step" | Essential | Unnecessary |
| Verbose system prompts | OK | Worse (can constrain reasoning) |
| Specific output format | Required | State clearly once |
| Few-shot examples | Very helpful | Minimal needed |
| Constraints | Enumerate all | Trust model judgment |

Best Practices

  1. Be concise in the system prompt - long system prompts can anchor the reasoning in wrong directions

  2. Don’t demonstrate reasoning - the model does it internally; examples can confuse it

  3. State the goal, not the method - let the model find the best path

  4. Avoid over-constraining - don’t say β€œfirst do X, then do Y” unless required

  5. Specify output format clearly once - the model will follow formatting without being told how to think

# Prompt engineering: bad vs good prompts for reasoning models

CODING_PROBLEM = """
Given an array of integers, find the length of the longest subsequence
where the absolute difference between any two elements is at most 1.

Example: [1, 3, 2, 2, 5, 2, 3, 7] -> 5 (subsequence [3, 2, 2, 2, 3])
"""

# BAD prompt for reasoning model - over-constrains the thinking
BAD_PROMPT_SYSTEM = """
You are a coding assistant. When solving coding problems:
Step 1: First read the problem carefully.
Step 2: Think about brute force solutions.
Step 3: Then think about optimizations.
Step 4: Consider time complexity.
Step 5: Consider edge cases.
Step 6: Write the final solution.
Always show your reasoning. Use chain of thought.
Think out loud before writing code.
"""

# GOOD prompt for reasoning model - states the goal, trusts the model
GOOD_PROMPT_SYSTEM = """
You are an expert Python programmer.
Provide the optimal solution with time and space complexity analysis.
"""

print("BAD system prompt (over-constrains reasoning):")
print("-" * 40)
print(BAD_PROMPT_SYSTEM)
print("Problems:")
print("  - Forces a specific reasoning order that may not be optimal")
print("  - 'Think out loud' is redundant (reasoning model already does this internally)")
print("  - Over-verbose instructions consume input token budget")
print()
print("GOOD system prompt (concise, goal-focused):")
print("-" * 40)
print(GOOD_PROMPT_SYSTEM)
print("Why it's better:")
print("  - Concise: doesn't interfere with internal reasoning")
print("  - States the goal (optimal solution + analysis)")
print("  - Trusts the model to figure out HOW to reason")
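For reference (written for this notebook, not model output), the problem above has a compact O(n) frequency-count solution: any valid subsequence uses at most two adjacent values v and v+1, so counting occurrences of each value and taking the best adjacent pair suffices. `longest_close_subsequence` is a name coined here:

```python
from collections import Counter

def longest_close_subsequence(nums):
    """Length of the longest subsequence whose elements pairwise
    differ by at most 1: best count[v] + count[v + 1] over all v."""
    if not nums:
        return 0
    counts = Counter(nums)
    return max(counts[v] + counts[v + 1] for v in counts)

print(longest_close_subsequence([1, 3, 2, 2, 5, 2, 3, 7]))  # -> 5
```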
# Additional prompt engineering tips with examples

prompt_tips = [
    {
        "tip": "Avoid few-shot examples that show reasoning steps",
        "bad": """Example 1:
Q: What is 5+3?
A: Let me think step by step. First I have 5. Then I add 3. So 5+3=8.

Example 2:
Q: What is 7*6?
A: Let me think step by step. I know 7*6... [etc]""",
        "good": """Q: What is 5+3? A: 8
Q: What is 7*6? A: 42""",
        "reason": "Showing reasoning examples trains the model to mimic your style, "
                  "overriding its more capable internal reasoning."
    },
    {
        "tip": "State constraints in the problem, not as process instructions",
        "bad": "First check if x > 0, then check if x < 100, then compute log(x)",
        "good": "Compute log(x) for x in range (0, 100), exclusive. Handle edge cases.",
        "reason": "Let the model decide HOW to validate. State WHAT is needed."
    },
    {
        "tip": "For multi-part problems, number parts clearly but don't prescribe order",
        "bad": "Part 1: Do A. Then for Part 2: build on Part 1 to do B. Then Part 3...",
        "good": "Answer all three parts:\n1. [Part A description]\n2. [Part B description]\n3. [Part C description]",
        "reason": "Clear structure without imposing sequential dependency the model may not need."
    }
]

for i, tip in enumerate(prompt_tips, 1):
    print(f"Tip {i}: {tip['tip']}")
    print(f"  Why: {tip['reason']}")
    print()
# Live test: concise prompt with o3-mini on a coding problem
from openai import OpenAI


def call_o3_mini_with_system(system: str, user: str, effort: str = "medium") -> dict:
    if not OPENAI_API_KEY:
        return {"text": None}

    client = OpenAI(api_key=OPENAI_API_KEY)
    start = time.time()

    response = client.chat.completions.create(
        model="o3-mini",
        reasoning_effort=effort,
        messages=[
            {"role": "developer", "content": system},  # o-series uses 'developer' role
            {"role": "user", "content": user}
        ]
    )

    return {
        "text": response.choices[0].message.content,
        "latency": round(time.time() - start, 2),
        "reasoning_tokens": getattr(
            response.usage.completion_tokens_details, "reasoning_tokens", "N/A"
        )
    }


print("Testing good prompt vs bad prompt on o3-mini:")
print()

print("[1/2] Good prompt (concise system):")
r_good = call_o3_mini_with_system(GOOD_PROMPT_SYSTEM, CODING_PROBLEM, effort="medium")
if r_good["text"]:
    print(f"  Latency: {r_good['latency']}s | Reasoning tokens: {r_good['reasoning_tokens']}")
    print(f"  Answer preview: {r_good['text'][:400]}...")

print()
print("[2/2] Bad prompt (over-constrained system):")
r_bad = call_o3_mini_with_system(BAD_PROMPT_SYSTEM, CODING_PROBLEM, effort="medium")
if r_bad["text"]:
    print(f"  Latency: {r_bad['latency']}s | Reasoning tokens: {r_bad['reasoning_tokens']}")
    print(f"  Answer preview: {r_bad['text'][:400]}...")

Part 8: Benchmark Comparison Table

Key Benchmarks Explained

| Benchmark | What It Tests | Difficulty |
|---|---|---|
| AIME 2024/2025 | American Invitational Math Exam | Extreme (top 5% of math competitors) |
| MATH-500 | Hendrycks MATH dataset (500 problems) | Hard (grad school math) |
| HumanEval | Python function generation (164 problems) | Medium (interview-level coding) |
| SWE-bench Verified | Real GitHub issue resolution | Hard (professional software engineering) |
| GPQA Diamond | PhD-level science questions | Extreme |

Performance Scores (as of early 2025)

# Benchmark comparison table
# Sources: OpenAI, Anthropic, DeepSeek technical reports (Jan-Apr 2025)

benchmarks = {
    "Model": [
        "GPT-4o (standard)",
        "o1 (2024)",
        "o3",
        "o3-mini (high)",
        "o4-mini",
        "DeepSeek R1 (671B)",
        "DeepSeek R1-Distill-7B",
        "Claude 3.7 Sonnet (extended)",
        "Claude Opus 4.6 (extended)",
    ],
    "Type": [
        "Standard",
        "Reasoning",
        "Reasoning",
        "Reasoning",
        "Reasoning",
        "Reasoning (OSS)",
        "Reasoning (OSS, small)",
        "Extended Thinking",
        "Extended Thinking",
    ],
    "AIME 2025 (%)": [
        "9.3",
        "74.3",
        "86.7",
        "79.6",
        "92.7",
        "70.0",
        "52.8",
        "80.0",
        "~85 (est.)",
    ],
    "MATH-500 (%)": [
        "74.6",
        "96.4",
        "97.9",
        "97.1",
        "97.6",
        "97.3",
        "89.1",
        "96.2",
        "~97 (est.)",
    ],
    "HumanEval (%)": [
        "90.2",
        "92.4",
        "~95",
        "94.0",
        "95.2",
        "92.6",
        "79.3",
        "93.7",
        "~95 (est.)",
    ],
    "SWE-bench (%)": [
        "38.5",
        "48.9",
        "71.7",
        "49.3",
        "68.1",
        "49.2",
        "N/A",
        "70.3",
        "~72 (est.)",
    ],
    "GPQA Diamond (%)": [
        "53.6",
        "78.3",
        "87.7",
        "79.7",
        "~81",
        "71.5",
        "49.1",
        "84.8",
        "~86 (est.)",
    ],
    "Open Source": [
        "No",
        "No",
        "No",
        "No",
        "No",
        "Yes (MIT)",
        "Yes (MIT)",
        "No",
        "No",
    ],
    "Run Locally": [
        "No",
        "No",
        "No",
        "No",
        "No",
        "Needs cluster",
        "Yes (7B via Ollama)",
        "No",
        "No",
    ],
}

# Print as a formatted table
col_widths = {k: max(len(k), max(len(str(v)) for v in vals))
              for k, vals in benchmarks.items()}

header = "  ".join(k.ljust(col_widths[k]) for k in benchmarks)
print(header)
print("-" * len(header))

n_rows = len(list(benchmarks.values())[0])
for i in range(n_rows):
    row = "  ".join(str(benchmarks[k][i]).ljust(col_widths[k]) for k in benchmarks)
    print(row)
# Key insights from the benchmark table

print("Key Insights from Benchmark Comparison")
print("=" * 55)
print()

insights = [
    {
        "finding": "Reasoning models dominate on AIME",
        "detail": "GPT-4o scores 9.3% vs o4-mini at 92.7% - a 10x improvement."
                  " Standard models fail at competition math."
    },
    {
        "finding": "DeepSeek R1 matches o1 at fraction of cost",
        "detail": "R1 671B scores 70% AIME vs o1 74.3%, but costs 10-20x less per token."
                  " Remarkable given open-source nature."
    },
    {
        "finding": "Small distilled models are surprisingly capable",
        "detail": "DeepSeek R1-Distill-7B scores 52.8% on AIME - beating GPT-4o (9.3%)"
                  " despite being 100x smaller."
    },
    {
        "finding": "SWE-bench is where reasoning shines for code",
        "detail": "o3 and Claude 3.7 both exceed 70% on SWE-bench Verified (real bug fixes)."
                  " GPT-4o is at 38.5%."
    },
    {
        "finding": "MATH-500 is saturating - AIME is the harder discriminator",
        "detail": "Most reasoning models exceed 96% on MATH-500, so it no longer"
                  " separates them; AIME 2025 still does."
    },
]

for i, insight in enumerate(insights, 1):
    print(f"{i}. {insight['finding']}")
    print(f"   {insight['detail']}")
    print()

Summary

What We Covered

  1. Reasoning models use test-time compute scaling to "think" before answering

  2. OpenAI o-series (o3, o4-mini) use reasoning_effort to control the compute budget

  3. DeepSeek R1 is open-source reasoning: run via API or locally with Ollama; exposes <think> tags

  4. Claude Extended Thinking uses budget_tokens to control thinking depth; thinking blocks are readable

  5. Prompt engineering for reasoning models means being concise and trusting the model's reasoning

  6. Cost-benefit analysis shows reasoning models are only worth using for problems where accuracy matters

Quick Reference

| Goal | Recommended Model | Key Parameter |
|---|---|---|
| Hard math / olympiad | o3 or o4-mini | reasoning_effort="high" |
| Balanced reasoning + cost | o3-mini | reasoning_effort="medium" |
| Open source reasoning (cloud) | DeepSeek R1 via OpenRouter | Standard chat API |
| Open source reasoning (local) | DeepSeek R1:7B via Ollama | ollama.chat() |
| Transparent reasoning trace | Claude Opus 4.6 extended | budget_tokens=10000 |
| Fast, cheap, good enough | GPT-4o | Standard completion |
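The table above can be collapsed into a small routing sketch. This is illustrative only: `pick_model` is a helper coined here, and the exact model identifier strings (especially the Claude id) are assumptions to verify against your provider's current model list.

```python
# Illustrative routing sketch based on the Quick Reference table.
# Model ids and parameter shapes are assumptions, not a real library API.
def pick_model(goal: str) -> dict:
    routes = {
        "hard_math":   {"model": "o3",      "params": {"reasoning_effort": "high"}},
        "balanced":    {"model": "o3-mini", "params": {"reasoning_effort": "medium"}},
        "oss_cloud":   {"model": "deepseek/deepseek-r1", "params": {}},  # via OpenRouter
        "oss_local":   {"model": "deepseek-r1:7b",       "params": {}},  # via ollama.chat()
        "transparent": {"model": "claude-opus-4-6",      # placeholder model id
                        "params": {"thinking": {"type": "enabled", "budget_tokens": 10000}}},
        "fast_cheap":  {"model": "gpt-4o",  "params": {}},
    }
    # Default to the cheap standard model when the goal is unrecognized.
    return routes.get(goal, routes["fast_cheap"])

print(pick_model("hard_math"))
```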

Next Steps

  • Phase 12: RLHF and GRPO training - learn how R1 was built

  • Notebook 09: Agentic workflows with reasoning models (o3 as the "brain")

  • Advanced: Combining reasoning models with tool use for complex agent tasks

# Final exercise: choose the right model for each task

print("Exercise: Match the task to the right model")
print("=" * 55)

tasks = [
    {"task": "Summarize a 5-page document",                     "answer": "GPT-4o or Claude 3.5 Sonnet (no reasoning needed)"},
    {"task": "Solve a 3-variable system of equations",          "answer": "GPT-4o is usually enough; o3-mini if accuracy critical"},
    {"task": "Debug why a distributed system deadlocks",        "answer": "o3-mini high or Claude extended thinking"},
    {"task": "Write a product description",                     "answer": "Standard model - creativity, not reasoning"},
    {"task": "Prove a number theory theorem",                   "answer": "o3 high or Claude extended (budget_tokens=20000)"},
    {"task": "Generate 10 tweet variations",                    "answer": "Standard model (GPT-4o, Claude 3.5 Sonnet)"},
    {"task": "Optimize a complex SQL query with 8 joins",       "answer": "o3-mini medium or DeepSeek R1"},
    {"task": "Private: analyze confidential medical data",      "answer": "DeepSeek R1:7B locally via Ollama (no data leaves machine)"},
]

for i, item in enumerate(tasks, 1):
    print(f"{i}. Task: {item['task']}")
    print(f"   Best choice: {item['answer']}")
    print()

print("Remember: reasoning models are more expensive and slower.")
print("Only use them when the accuracy boost justifies the cost.")