import os
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
client = OpenAI()
def ask(prompt, model="gpt-3.5-turbo", temperature=0):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response.choices[0].message.content
1. The Classic Example
The "bat and ball" problem is a well-known cognitive illusion from behavioral economics. Most humans instinctively answer "$0.10" because the System 1 brain subtracts $1.00 from $1.10 without considering the "more than" constraint. The correct answer requires solving a simple system of equations: \(\text{bat} + \text{ball} = 1.10\) and \(\text{bat} = \text{ball} + 1.00\), yielding \(\text{ball} = 0.05\). LLMs without chain-of-thought prompting make the same mistake because they pattern-match on surface features. Adding "Let's think step by step" forces the model to decompose the problem, dramatically increasing accuracy.
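The algebra can be sanity-checked directly in Python (a quick verification, separate from the prompting workflow):

```python
# bat + ball = 1.10 and bat = ball + 1.00
# Substituting: (ball + 1.00) + ball = 1.10  ->  2 * ball = 0.10  ->  ball = 0.05
ball = (1.10 - 1.00) / 2
bat = ball + 1.00
print(f"ball = ${ball:.2f}, bat = ${bat:.2f}")  # ball = $0.05, bat = $1.05
```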
# ❌ Without CoT - Often gets it wrong!
prompt_no_cot = """
A bat and a ball cost $1.10 in total.
The bat costs $1.00 more than the ball.
How much does the ball cost?
"""
print("Without CoT:")
print(ask(prompt_no_cot))
print()
# ✅ With CoT - Much better!
prompt_with_cot = prompt_no_cot + "\nLet's think step by step:"
print("With CoT:")
print(ask(prompt_with_cot))
2. Zero-Shot CoT
Zero-shot chain-of-thought requires no examples at all: you simply append a trigger phrase such as "Let's think step by step" to the prompt. This simple technique, introduced by Kojima et al. (2022), was shown to improve accuracy on arithmetic, symbolic reasoning, and commonsense tasks by 10-40%. The trigger phrase causes the model to generate intermediate reasoning tokens, which then condition the final answer. For the average speed problem, the model needs to compute total distance and total time before dividing, and the CoT trigger ensures these intermediate calculations appear explicitly in the output.
# Math problem
problem = """
If a train travels 120 miles in 2 hours, and then 90 miles in 1.5 hours,
what is its average speed for the entire journey?
Let's think step by step:
"""
print(ask(problem))
# Logic puzzle
puzzle = """
All roses are flowers.
Some flowers fade quickly.
Therefore, do all roses fade quickly?
Let's reason through this step by step:
"""
print(ask(puzzle))
3. Few-Shot CoT
Few-shot CoT combines the benefits of in-context examples with explicit reasoning chains. By showing complete worked examples, including intermediate steps and the final answer, you teach the model both the reasoning style and the expected output format. The examples below demonstrate a consistent pattern: state the starting value, show each arithmetic operation with its result, and label the final answer. This consistency helps the model replicate the same disciplined approach for novel problems, and the quality of the few-shot examples has a direct impact on reasoning accuracy.
prompt = """
Solve these word problems step-by-step:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls.
Step 1: He bought 2 cans, each with 3 balls: 2 × 3 = 6 balls
Step 2: Add to his original: 5 + 6 = 11 balls
Answer: 11 tennis balls
Q: The cafeteria had 23 apples. If they used 20 to make lunch and
bought 6 more, how many apples do they have?
A: Started with 23 apples.
Step 1: Used 20 for lunch: 23 - 20 = 3 apples left
Step 2: Bought 6 more: 3 + 6 = 9 apples
Answer: 9 apples
Q: A parking lot has 12 spaces. 8 are occupied. 3 cars leave and
5 new cars arrive. How many spaces are now occupied?
A:
"""
print(ask(prompt))
4. Self-Consistency
Self-consistency (Wang et al., 2022) improves CoT accuracy by sampling multiple reasoning paths and taking a majority vote on the final answer. The intuition is that while any single reasoning chain may contain errors, the correct answer tends to appear more frequently across diverse chains. By setting temperature > 0, each sample explores a different reasoning trajectory, and the consensus filters out occasional mistakes. For the egg carton problem, the correct answer should emerge consistently across paths: \(15 \times 2 = 30\) eggs, \(\lfloor 30 / 6 \rfloor = 5\) cartons. Self-consistency typically uses 5-20 samples and improves accuracy by 5-15% over single-sample CoT.
from collections import Counter
import re
def extract_answer(text):
    """Extract the final numerical answer from a response."""
    # Look for an "Answer: X" pattern first
    match = re.search(r'Answer:?\s*([\d.]+)', text, re.IGNORECASE)
    if match:
        return match.group(1)
    # Fall back to the last number in the text
    numbers = re.findall(r'\b\d+\.?\d*\b', text)
    return numbers[-1] if numbers else None
problem = """
A farmer has 15 chickens. Each chicken lays 2 eggs per day.
The farmer sells eggs in cartons of 6.
How many full cartons can he fill in one day?
Let's solve this step by step:
"""
# Generate 5 different reasoning paths
answers = []
print("Generating 5 reasoning paths...\n")
for i in range(5):
    response = ask(problem, temperature=0.7)  # Higher temperature for diversity
    answer = extract_answer(response)
    answers.append(answer)
    print(f"Path {i+1}: Answer = {answer}")
# Majority vote
counter = Counter(answers)
final_answer = counter.most_common(1)[0][0]
print(f"\nMajority vote: {final_answer}")
print(f"Vote distribution: {dict(counter)}")
5. Structured CoT
Structured CoT prescribes explicit step labels (Step 1, Step 2, ...) that force the model to address specific aspects of the problem in a defined order. For a product review analysis, the structure ensures the model separately identifies positives, negatives, and value considerations before forming an overall judgment, rather than jumping to a conclusion based on the first feature mentioned. This technique is especially valuable for complex analytical tasks where you need the model's reasoning to be auditable and where skipping a step could lead to an incomplete or biased conclusion.
prompt = """
Analyze this product review using the following structure:
Review: "The laptop is fast and the screen is beautiful, but it gets very hot
and the battery only lasts 3 hours. For $1200, I expected better."
Please analyze:
Step 1 - Identify positive aspects:
Step 2 - Identify negative aspects:
Step 3 - Consider price-value relationship:
Step 4 - Overall sentiment (positive/negative/mixed):
Step 5 - Recommendation (buy/don't buy/consider alternatives):
"""
print(ask(prompt))
6. CoT for Code Debugging
Chain-of-thought is highly effective for code debugging because it mirrors how experienced developers diagnose bugs: understand the intent, trace execution on the failing input, identify the discrepancy, and propose a fix. The function below initializes max_num = 0, which works for positive numbers but silently returns 0 for all-negative lists. When prompted to trace through find_max([-5, -2, -8]) step by step, the model discovers that no element ever exceeds the initial value of 0 and recommends initializing with float('-inf') or numbers[0] instead.
code_debug = """
This function is supposed to find the maximum value in a list, but it's not working:
```python
def find_max(numbers):
    max_num = 0
    for num in numbers:
        if num > max_num:
            max_num = num
    return max_num
```
Test case that fails: find_max([-5, -2, -8]) returns 0, but should return -2
Let's debug this step by step:
1. What is the function trying to do?
2. What does it do on the failing test case?
3. Why does it fail?
4. How to fix it?
"""
print(ask(code_debug))
7. Least-to-Most Prompting
Least-to-most prompting (Zhou et al., 2022) breaks complex problems into a sequence of simpler subproblems, solving each one before tackling the next. The first LLM call decomposes the original question into subquestions (e.g., "What are hotel options?", "What about transportation?"), and subsequent calls solve each subquestion in order, with each answer available as context for the next. This approach outperforms standard CoT on problems that require compositional reasoning, where the answer to one part depends on answers to earlier parts. It is particularly effective for planning tasks, multi-constraint optimization, and long-form analysis.
# First: Decompose the problem
decompose = """
Task: Plan a 3-day trip to Paris for a family of 4 on a $3000 budget.
First, let's break this into smaller questions we need to answer:
"""
print("=== Step 1: Decomposition ===")
subproblems = ask(decompose)
print(subproblems)
# Second: Solve each subproblem
solve_first = f"""
Based on these subproblems:
{subproblems}
Let's solve the first one in detail:
"""
print("\n=== Step 2: Solving First Subproblem ===")
print(ask(solve_first))
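The cell above stops after the first subproblem; carrying the pattern through means looping over every subquestion and feeding earlier answers into each new call. A minimal sketch, assuming the decomposition comes back as numbered lines (the helper name `solve_all` and the prompt wording are my own):

```python
import re

def solve_all(task, subproblems_text, ask_fn):
    """Solve each numbered subproblem in order, carrying earlier answers as context."""
    # Pull out lines that start with a number, e.g. "1. What are hotel options?"
    subqs = re.findall(r'^\s*\d+[\.\)]\s*(.+)$', subproblems_text, re.MULTILINE)
    answers = []
    for subq in subqs:
        # Earlier Q/A pairs become context for the next call
        context = "\n".join(f"Q{j}: {q}\nA{j}: {a}"
                            for j, (q, a) in enumerate(zip(subqs, answers), 1))
        prompt = f"Task: {task}\n\nAnswered so far:\n{context}\n\nNow answer: {subq}"
        answers.append(ask_fn(prompt))
    return list(zip(subqs, answers))
```

Usage would be `solve_all("Plan a 3-day trip to Paris ...", subproblems, ask)`; passing `ask` in as a parameter also makes the loop easy to test with a stub.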
Best Practices
When to Use CoT
✅ DO use CoT for:
Math and arithmetic
Logic puzzles
Commonsense reasoning
Multi-step tasks
Complex analysis
❌ DON'T use CoT for:
Simple lookups
Straightforward classification
When speed > accuracy
Very clear, simple tasks
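The checklist above can be folded into a small routing heuristic that decides at call time whether to append the CoT trigger. This is an illustrative sketch only; keyword matching is a crude proxy for "multi-step task", and the hint list is my own:

```python
# Assumed keyword list: phrases that tend to signal multi-step reasoning
REASONING_HINTS = ("how many", "calculate", "average", "total",
                   "why", "prove", "compare", "plan", "step")

def maybe_add_cot(prompt):
    """Append the CoT trigger only when the prompt looks like multi-step reasoning."""
    if any(hint in prompt.lower() for hint in REASONING_HINTS):
        return prompt + "\n\nLet's think step by step:"
    return prompt  # simple lookup/classification: skip the extra tokens

print(maybe_add_cot("What is the capital of France?"))
print(maybe_add_cot("How many apples are left after lunch?"))
```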
Tips
"Let's think step by step" works surprisingly well
Temperature = 0 for consistent reasoning
Self-consistency for important decisions (5-10 samples)
Structure your reasoning format for complex tasks
Few-shot examples improve quality significantly
Cost vs. Quality
CoT uses more tokens (2-5x), but:
Higher accuracy (often 20-30% improvement)
Easier debugging
Better for high-stakes decisions
Worth it for complex reasoning
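A back-of-the-envelope calculation makes the trade-off concrete, using the 2-5x expansion and the 5-sample self-consistency figures from this page (the token counts are illustrative, not measurements):

```python
direct_tokens = 100              # plain answer, no reasoning
cot_tokens = direct_tokens * 4   # CoT output is typically 2-5x longer
sc_tokens = cot_tokens * 5       # self-consistency: 5 independent samples
print(f"direct: {direct_tokens}, CoT: {cot_tokens}, CoT + self-consistency: {sc_tokens}")
# Even a ~20x token cost can be worthwhile when a wrong answer is expensive.
```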
Exercise: Build a CoT Solver
Build a reusable function that combines chain-of-thought prompting with optional self-consistency to solve math word problems. The function should append the CoT trigger phrase, optionally sample multiple reasoning paths at higher temperature, extract numerical answers using regex, and return the majority vote. Test it on problems of varying difficulty to see where single-sample CoT fails and self-consistency recovers.
def solve_math_problem(problem, use_self_consistency=False, n_samples=5):
    """Solve a math word problem with CoT.

    Args:
        problem: Math word problem as a string
        use_self_consistency: Whether to sample multiple reasoning paths
        n_samples: Number of samples for self-consistency

    Returns:
        Final answer as a string
    """
    prompt = f"{problem}\n\nLet's solve this step by step:"
    if use_self_consistency:
        answers = []
        for _ in range(n_samples):
            response = ask(prompt, temperature=0.7)
            answer = extract_answer(response)
            answers.append(answer)
        # Return the majority vote
        counter = Counter(answers)
        return counter.most_common(1)[0][0]
    else:
        response = ask(prompt)
        return extract_answer(response)
# Test it
problem = """
A store sells notebooks for $3 each and pens for $2 each.
Sarah bought 5 notebooks and 8 pens.
She paid with a $50 bill.
How much change should she receive?
"""
print("Simple CoT:")
print(solve_math_problem(problem))
print("\nWith self-consistency:")
print(solve_math_problem(problem, use_self_consistency=True))
Key Takeaways
CoT = Better Reasoning: Step-by-step improves accuracy
"Let's think step by step": Simple magic phrase
Self-consistency: Multiple paths → robust answers
Structure helps: Guide the reasoning format
Trade-off: More tokens but better results
Next Steps
03_react_prompting.ipynb - Add tool use to reasoning
04_tree_of_thoughts.ipynb - Explore multiple reasoning branches
06_optimization.ipynb - Test and improve your prompts