import os
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
client = OpenAI()

def ask(prompt, model="gpt-3.5-turbo", temperature=0):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature
    )
    return response.choices[0].message.content

1. The Classic Example

The "bat and ball" problem is a well-known cognitive illusion from behavioral economics. Most humans instinctively answer "$0.10" because the system-1 brain subtracts $1.00 from $1.10 without considering the "more than" constraint. The correct answer requires solving a simple system of equations: \(\text{bat} + \text{ball} = 1.10\) and \(\text{bat} = \text{ball} + 1.00\), yielding \(\text{ball} = 0.05\). LLMs without chain-of-thought prompting make the same mistake because they pattern-match on surface features. Adding "Let's think step by step" forces the model to decompose the problem, dramatically increasing accuracy.
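The algebra can be verified directly in Python, as a quick sanity check independent of any model call:

```python
# bat + ball = 1.10 and bat = ball + 1.00
# Substituting: (ball + 1.00) + ball = 1.10  ->  2 * ball = 0.10
total = 1.10
difference = 1.00
ball = (total - difference) / 2
bat = ball + difference
print(f"ball = ${ball:.2f}, bat = ${bat:.2f}")  # ball = $0.05, bat = $1.05
```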

# ❌ Without CoT - Often gets it wrong!
prompt_no_cot = """
A bat and a ball cost $1.10 in total.
The bat costs $1.00 more than the ball.
How much does the ball cost?
"""

print("Without CoT:")
print(ask(prompt_no_cot))
print()

# βœ… With CoT - Much better!
prompt_with_cot = prompt_no_cot + "\nLet's think step by step:"

print("With CoT:")
print(ask(prompt_with_cot))

2. Zero-Shot CoT

Zero-shot chain-of-thought requires no examples at all: just append a trigger phrase like "Let's think step by step" to the prompt. This simple technique, introduced by Kojima et al. (2022), was shown to improve accuracy on arithmetic, symbolic reasoning, and commonsense tasks by 10-40%. The trigger phrase causes the model to generate intermediate reasoning tokens, which then condition the final answer. For the average speed problem, the model needs to compute total distance and total time before dividing, and the CoT trigger ensures these intermediate calculations appear explicitly in the output.

# Math problem
problem = """
If a train travels 120 miles in 2 hours, and then 90 miles in 1.5 hours,
what is its average speed for the entire journey?

Let's think step by step:
"""

print(ask(problem))

# Logic puzzle
puzzle = """
All roses are flowers.
Some flowers fade quickly.
Therefore, do all roses fade quickly?

Let's reason through this step by step:
"""

print(ask(puzzle))

3. Few-Shot CoT

Few-shot CoT combines the benefits of in-context examples with explicit reasoning chains. By showing complete worked examples, including intermediate steps and the final answer, you teach the model both the reasoning style and the expected output format. The examples below demonstrate a consistent pattern: state the starting value, show each arithmetic operation with its result, and label the final answer. This consistency helps the model replicate the same disciplined approach for novel problems, and the quality of the few-shot examples has a direct impact on reasoning accuracy.

prompt = """
Solve these word problems step-by-step:

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
   Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls.
   Step 1: He bought 2 cans, each with 3 balls: 2 Γ— 3 = 6 balls
   Step 2: Add to his original: 5 + 6 = 11 balls
   Answer: 11 tennis balls

Q: The cafeteria had 23 apples. If they used 20 to make lunch and
   bought 6 more, how many apples do they have?
A: Started with 23 apples.
   Step 1: Used 20 for lunch: 23 - 20 = 3 apples left
   Step 2: Bought 6 more: 3 + 6 = 9 apples
   Answer: 9 apples

Q: A parking lot has 12 spaces. 8 are occupied. 3 cars leave and
   5 new cars arrive. How many spaces are now occupied?
A:
"""

print(ask(prompt))
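The expected answer to the final question can be checked with plain arithmetic (8 occupied, minus 3 departures, plus 5 arrivals):

```python
# Parking lot: 12 spaces, 8 occupied, 3 cars leave, 5 arrive
occupied = 8 - 3 + 5
print(occupied)  # 10 spaces occupied (2 free out of 12)
```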

4. Self-Consistency

Self-consistency (Wang et al., 2022) improves CoT accuracy by sampling multiple reasoning paths and taking a majority vote on the final answer. The intuition is that while any single reasoning chain may contain errors, the correct answer tends to appear more frequently across diverse chains. By setting temperature > 0, each sample explores a different reasoning trajectory, and the consensus filters out occasional mistakes. For the egg carton problem, the correct answer should emerge consistently across paths: \(15 \times 2 = 30\) eggs, \(\lfloor 30 / 6 \rfloor = 5\) cartons. Self-consistency typically uses 5-20 samples and improves accuracy by 5-15% over single-sample CoT.

from collections import Counter
import re

def extract_answer(text):
    """Extract final numerical answer."""
    # Look for "Answer: X" pattern
    match = re.search(r'Answer:?\s*\$?([\d.]+)', text, re.IGNORECASE)
    if match:
        return match.group(1)
    # Look for last number
    numbers = re.findall(r'\b\d+\.?\d*\b', text)
    return numbers[-1] if numbers else None

problem = """
A farmer has 15 chickens. Each chicken lays 2 eggs per day.
The farmer sells eggs in cartons of 6.
How many full cartons can he fill in one day?

Let's solve this step by step:
"""

# Generate 5 different reasoning paths
answers = []
print("Generating 5 reasoning paths...\n")

for i in range(5):
    response = ask(problem, temperature=0.7)  # Higher temperature for diversity
    answer = extract_answer(response)
    answers.append(answer)
    print(f"Path {i+1}: Answer = {answer}")

# Majority vote
counter = Counter(answers)
final_answer = counter.most_common(1)[0][0]

print(f"\nMajority vote: {final_answer}")
print(f"Vote distribution: {dict(counter)}")

5. Structured CoT

Structured CoT prescribes explicit step labels (Step 1, Step 2, ...) that force the model to address specific aspects of the problem in a defined order. For a product review analysis, the structure ensures the model separately identifies positives, negatives, and value considerations before forming an overall judgment, rather than jumping to a conclusion based on the first feature mentioned. This technique is especially valuable for complex analytical tasks where you need the model's reasoning to be auditable and where skipping a step could lead to an incomplete or biased conclusion.

prompt = """
Analyze this product review using the following structure:

Review: "The laptop is fast and the screen is beautiful, but it gets very hot
and the battery only lasts 3 hours. For $1200, I expected better."

Please analyze:

Step 1 - Identify positive aspects:
Step 2 - Identify negative aspects:
Step 3 - Consider price-value relationship:
Step 4 - Overall sentiment (positive/negative/mixed):
Step 5 - Recommendation (buy/don't buy/consider alternatives):
"""

print(ask(prompt))

6. CoT for Code Debugging

Chain-of-thought is highly effective for code debugging because it mirrors how experienced developers diagnose bugs: understand the intent, trace execution on the failing input, identify the discrepancy, and propose a fix. The function below initializes max_num = 0, which works for positive numbers but silently returns 0 for all-negative lists. By prompting the model to trace through find_max([-5, -2, -8]) step by step, it discovers that no element exceeds the initial value of 0, and recommends initializing with float('-inf') or numbers[0] instead.

code_debug = """
This function is supposed to find the maximum value in a list, but it's not working:

```python
def find_max(numbers):
    max_num = 0
    for num in numbers:
        if num > max_num:
            max_num = num
    return max_num
```

Test case that fails: find_max([-5, -2, -8]) returns 0, but should return -2

Let's debug this step by step:
1. What is the function trying to do?
2. What does it do on the failing test case?
3. Why does it fail?
4. How to fix it?
"""

print(ask(code_debug))
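For reference, here is one corrected version along the lines the model should suggest (starting from the first element; initializing with float('-inf') works equally well):

```python
def find_max_fixed(numbers):
    """Find the maximum value in a list, handling all-negative lists correctly."""
    if not numbers:
        raise ValueError("numbers must be non-empty")
    max_num = numbers[0]        # start from a real element, not 0
    for num in numbers[1:]:
        if num > max_num:
            max_num = num
    return max_num

print(find_max_fixed([-5, -2, -8]))  # -2, as the failing test case expects
```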

7. Least-to-Most Prompting

Least-to-most prompting (Zhou et al., 2022) breaks complex problems into a sequence of simpler subproblems, solving each one before tackling the next. The first LLM call decomposes the original question into subquestions (e.g., "What are hotel options?", "What about transportation?"), and subsequent calls solve each subquestion in order, with each answer available as context for the next. This approach outperforms standard CoT on problems that require compositional reasoning, where the answer to one part depends on answers to earlier parts. It is particularly effective for planning tasks, multi-constraint optimization, and long-form analysis.

# First: Decompose the problem
decompose = """
Task: Plan a 3-day trip to Paris for a family of 4 on a $3000 budget.

First, let's break this into smaller questions we need to answer:
"""

print("=== Step 1: Decomposition ===")
subproblems = ask(decompose)
print(subproblems)

# Second: Solve each subproblem
solve_first = f"""
Based on these subproblems:
{subproblems}

Let's solve the first one in detail:
"""

print("\n=== Step 2: Solving First Subproblem ===")
print(ask(solve_first))
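The two cells above cover decomposition and the first subproblem; a full least-to-most loop feeds each answer into the next prompt. A sketch under simple assumptions (subquestions already split into a list; `ask_fn` stands in for `ask` so the loop can be exercised without an API call):

```python
def least_to_most(task, subquestions, ask_fn):
    """Solve subquestions in order, feeding earlier answers into later prompts."""
    context = []
    for i, sub in enumerate(subquestions, 1):
        prior = "\n".join(context)
        prompt = f"Task: {task}\nAnswered so far:\n{prior}\nNow answer: {sub}"
        answer = ask_fn(prompt)
        context.append(f"Q{i}: {sub}\nA{i}: {answer}")
    return context

# Example with a stub in place of the real ask():
steps = least_to_most(
    "Plan a 3-day trip to Paris for a family of 4 on a $3000 budget",
    ["What are hotel options?", "What about transportation?"],
    ask_fn=lambda p: "stub answer",
)
print(len(steps))  # 2
```

In practice you would pass `ask_fn=ask` and obtain the subquestion list by splitting the decomposition output from Step 1.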

Best Practices

When to Use CoT

βœ… DO use CoT for:

  • Math and arithmetic

  • Logic puzzles

  • Commonsense reasoning

  • Multi-step tasks

  • Complex analysis

❌ DON'T use CoT for:

  • Simple lookups

  • Straightforward classification

  • When speed > accuracy

  • Very clear, simple tasks

Tips

  1. "Let's think step by step" works surprisingly well

  2. Temperature = 0 for consistent reasoning

  3. Self-consistency for important decisions (5-10 samples)

  4. Structure your reasoning format for complex tasks

  5. Few-shot examples improve quality significantly

Cost vs. Quality

CoT uses more tokens (2-5x), but:

  • Higher accuracy (often 20-30% improvement)

  • Easier debugging

  • Better for high-stakes decisions

  • Worth it for complex reasoning

Exercise: Build a CoT Solver

Build a reusable function that combines chain-of-thought prompting with optional self-consistency to solve math word problems. The function should append the CoT trigger phrase, optionally sample multiple reasoning paths at higher temperature, extract numerical answers using regex, and return the majority vote. Test it on problems of varying difficulty to see where single-sample CoT fails and self-consistency recovers.

def solve_math_problem(problem, use_self_consistency=False, n_samples=5):
    """Solve math problem with CoT.
    
    Args:
        problem: Math word problem as string
        use_self_consistency: Whether to use multiple samples
        n_samples: Number of samples for self-consistency
    
    Returns:
        Final answer
    """
    prompt = f"{problem}\n\nLet's solve this step by step:"
    
    if use_self_consistency:
        answers = []
        for _ in range(n_samples):
            response = ask(prompt, temperature=0.7)
            answer = extract_answer(response)
            answers.append(answer)
        
        # Return majority vote
        counter = Counter(answers)
        return counter.most_common(1)[0][0]
    else:
        response = ask(prompt)
        return extract_answer(response)

# Test it
problem = """
A store sells notebooks for $3 each and pens for $2 each.
Sarah bought 5 notebooks and 8 pens.
She paid with a $50 bill.
How much change should she receive?
"""

print("Simple CoT:")
print(solve_math_problem(problem))

print("\nWith self-consistency:")
print(solve_math_problem(problem, use_self_consistency=True))
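As an offline check of what the solver should return, the expected change can be computed directly (5 notebooks at $3, 8 pens at $2, paid with $50):

```python
# Expected answer for the test problem above
cost = 5 * 3 + 8 * 2   # $15 in notebooks + $16 in pens = $31
change = 50 - cost
print(change)  # 19
```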

Key Takeaways

  1. CoT = Better Reasoning: Step-by-step improves accuracy

  2. "Let's think step by step": Simple magic phrase

  3. Self-consistency: Multiple paths → robust answers

  4. Structure helps: Guide the reasoning format

  5. Trade-off: More tokens but better results

Next Steps

  • 03_react_prompting.ipynb - Add tool use to reasoning

  • 04_tree_of_thoughts.ipynb - Explore multiple reasoning branches

  • 06_optimization.ipynb - Test and improve your prompts