# Setup (December 2025)
import os
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
def ask(prompt, model="gpt-4o-mini", temperature=0.7):
    """Helper to call LLM.

    December 2025 Models:
    - gpt-4o: Best overall (multimodal, 128k context)
    - gpt-4o-mini: Fast & cheap (good for examples)
    - o1-preview: Best reasoning (complex problems)
    - o1-mini: Fast reasoning (coding, math)
    - gpt-4-turbo: Legacy flagship
    """
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response.choices[0].message.content
1. Zero-Shot Prompting
Zero-shot prompting asks the model to perform a task without providing any examples, relying entirely on knowledge acquired during pre-training. For simple factual questions ("What is the capital of France?"), zero-shot works reliably because the answer is well-represented in training data. For more complex tasks like sentiment classification, zero-shot still performs well when the task description is clear and the output format is constrained. Zero-shot is the fastest way to prototype any LLM application, and it serves as the baseline against which few-shot and fine-tuned approaches are measured.
# Zero-shot: Direct question
prompt = "What is the capital of France?"
print(ask(prompt))
# Zero-shot: Complex task
prompt = """
Classify the sentiment of this review as positive, negative, or neutral:
Review: "The product works okay but the customer service was terrible."
Sentiment:
"""
print(ask(prompt, temperature=0))
2. One-Shot Prompting
One-shot prompting provides a single input-output example to guide the model's behavior. Even one example can dramatically improve output quality by demonstrating the expected format, level of detail, and style. The example below shows how to convert a product description into a specific JSON schema: the model learns the exact key names, data types, and extraction logic from this single demonstration. One-shot is particularly effective when the output format is non-obvious or when you need the model to follow a specific convention that differs from its default behavior.
# One-shot: Show format
prompt = """
Convert product descriptions to JSON format.
Example:
Input: "iPhone 15 Pro in blue, 256GB storage, $999"
Output: {"product": "iPhone 15 Pro", "color": "blue", "storage": "256GB", "price": 999}
Now convert this:
Input: "Samsung Galaxy S24 in black, 512GB storage, $1199"
Output:
"""
print(ask(prompt, temperature=0))
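Because the one-shot example pins down an exact JSON schema, the model's reply can be validated before use. A minimal sketch (the `parse_product` helper and `EXPECTED_KEYS` set are illustrative additions, not part of the notebook's code; the keys come from the demonstration above):

```python
import json

# Keys established by the one-shot demonstration
EXPECTED_KEYS = {"product", "color", "storage", "price"}

def parse_product(raw: str) -> dict:
    """Parse a model reply into a product dict, checking the expected keys."""
    data = json.loads(raw)
    missing = EXPECTED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data

# Validate the example output shown in the prompt
example = '{"product": "iPhone 15 Pro", "color": "blue", "storage": "256GB", "price": 999}'
product = parse_product(example)
print(product["price"])  # 999
```

A validation step like this catches the common failure mode where the model wraps the JSON in prose or drops a key, so you can retry instead of crashing downstream.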
3. Few-Shot Prompting
Few-shot prompting provides multiple examples (typically 2-5) to establish a clear pattern. Each additional example reduces ambiguity: with three classification examples covering Bug, Feature Request, and Question, the model has seen every possible output class and understands the decision boundary between them. Few-shot performance tends to scale with example diversity: covering edge cases and boundary examples matters more than raw quantity. The examples also implicitly teach tone, format, and length, making few-shot the most reliable technique for production classification and extraction tasks.
# Few-shot: Classification task
prompt = """
Classify customer feedback as Bug, Feature Request, or Question.
Example 1:
Feedback: "The app crashes when I upload large files"
Category: Bug
Example 2:
Feedback: "Can you add dark mode?"
Category: Feature Request
Example 3:
Feedback: "How do I export my data?"
Category: Question
Now classify:
Feedback: "It would be great to have keyboard shortcuts"
Category:
"""
print(ask(prompt, temperature=0))
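Hand-writing the Example 1 / Example 2 blocks gets tedious once the example pool grows. A small helper can assemble the same prompt from a list of dicts (`build_few_shot_prompt` is a hypothetical convenience function, not part of the notebook's code):

```python
def build_few_shot_prompt(instruction, examples, query,
                          input_label="Feedback", output_label="Category"):
    """Assemble a few-shot prompt from a list of {'input', 'output'} dicts."""
    parts = [instruction, ""]
    for i, ex in enumerate(examples, 1):
        parts.append(f"Example {i}:")
        parts.append(f"{input_label}: {ex['input']}")
        parts.append(f"{output_label}: {ex['output']}")
        parts.append("")
    # Leave the final label open for the model to complete
    parts.append("Now classify:")
    parts.append(f"{input_label}: {query}")
    parts.append(f"{output_label}:")
    return "\n".join(parts)

examples = [
    {"input": "The app crashes when I upload large files", "output": "Bug"},
    {"input": "Can you add dark mode?", "output": "Feature Request"},
    {"input": "How do I export my data?", "output": "Question"},
]
prompt = build_few_shot_prompt(
    "Classify customer feedback as Bug, Feature Request, or Question.",
    examples,
    "It would be great to have keyboard shortcuts",
)
print(prompt)
```

Generating the prompt programmatically keeps the format identical across calls and makes it trivial to swap in the dynamically selected examples from section 4.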
# Few-shot: Code generation with style
prompt = """
Write Python functions following this style:

Example 1:
Task: Calculate factorial
def factorial(n: int) -> int:
    '''Calculate factorial of n.

    Args:
        n: Non-negative integer

    Returns:
        Factorial of n
    '''
    if n <= 1:
        return 1
    return n * factorial(n - 1)

Example 2:
Task: Check if prime
def is_prime(n: int) -> bool:
    '''Check if number is prime.

    Args:
        n: Integer to check

    Returns:
        True if prime, False otherwise
    '''
    if n < 2:
        return False
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return False
    return True

Now write:
Task: Calculate fibonacci number
"""
print(ask(prompt, temperature=0))
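For reference, a completion that matches the demonstrated style might look like the sketch below (one plausible implementation, not the model's actual output; it assumes the 0-indexed convention `fibonacci(0) == 0`):

```python
def fibonacci(n: int) -> int:
    '''Calculate the nth Fibonacci number.

    Args:
        n: Non-negative integer index (fibonacci(0) == 0)

    Returns:
        The nth Fibonacci number
    '''
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fibonacci(10))  # 55
```

Note that a styled few-shot prompt constrains the docstring format and type hints, but not the algorithm: the model may return a recursive version (matching the factorial example) or an iterative one like this.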
4. Dynamic Few-Shot Selection
Rather than using the same fixed examples for every input, dynamic few-shot selection retrieves the most relevant examples from a pool based on semantic similarity to the current query. The approach uses embeddings to represent both the query and each candidate example as vectors, then selects the top-\(k\) most similar examples via cosine similarity: \(\text{similarity} = \frac{\mathbf{q} \cdot \mathbf{e}}{\|\mathbf{q}\| \|\mathbf{e}\|}\). This ensures the model sees the examples closest to the current input, improving accuracy on diverse inputs. In production, the example pool is stored in a vector database and retrieved at inference time, a pattern closely related to retrieval-augmented generation (RAG).
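The formula itself can be sanity-checked with plain NumPy before relying on scikit-learn's implementation (a standalone sketch; `cosine_sim` is an illustrative helper):

```python
import numpy as np

def cosine_sim(q, e):
    """Cosine similarity between a query vector q and an example vector e."""
    q, e = np.asarray(q, dtype=float), np.asarray(e, dtype=float)
    return float(q @ e / (np.linalg.norm(q) * np.linalg.norm(e)))

print(cosine_sim([1, 0], [1, 0]))            # 1.0 (same direction)
print(cosine_sim([1, 0], [0, 1]))            # 0.0 (orthogonal)
print(round(cosine_sim([1, 1], [1, 0]), 4))  # 0.7071
```

Because cosine similarity depends only on direction, it compares the semantic content of texts without being skewed by embedding magnitude.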
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# Example pool
examples = [
    {"input": "The food was delicious!", "output": "positive"},
    {"input": "Terrible service, never coming back", "output": "negative"},
    {"input": "It was okay, nothing special", "output": "neutral"},
    {"input": "Best restaurant in town!", "output": "positive"},
    {"input": "Worst experience ever", "output": "negative"},
]

def get_embeddings(texts):
    """Get embeddings for similarity."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return [item.embedding for item in response.data]

def select_examples(query, examples, k=2):
    """Select k most similar examples."""
    # Embed the query and all candidate inputs in one API call
    all_texts = [query] + [ex["input"] for ex in examples]
    embeddings = get_embeddings(all_texts)
    query_emb = np.array(embeddings[0]).reshape(1, -1)
    example_embs = np.array(embeddings[1:])
    # Compute similarities
    similarities = cosine_similarity(query_emb, example_embs)[0]
    # Get top k (highest similarity first)
    top_indices = similarities.argsort()[-k:][::-1]
    return [examples[i] for i in top_indices]
# Test
query = "Amazing food and great atmosphere!"
selected = select_examples(query, examples, k=2)
# Build prompt with selected examples
prompt = "Classify sentiment:\n\n"
for ex in selected:
prompt += f"Input: {ex['input']}\nOutput: {ex['output']}\n\n"
prompt += f"Input: {query}\nOutput:"
print("Selected examples:", [ex["input"] for ex in selected])
print("\nResult:", ask(prompt, temperature=0))
5. Best Practices

✅ DO
Use few-shot for consistency
Provide diverse examples
Show edge cases
Use clear delimiters
Set temperature=0 for consistency

❌ DON'T
Use conflicting examples
Provide too many examples (>10)
Forget to test edge cases
Use vague examples
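The "clear delimiters" advice is simple to apply: fence user-supplied text behind explicit markers so the model cannot confuse it with your instructions. A minimal sketch (the `delimited_prompt` helper and the `<<<`/`>>>` markers are illustrative choices, not a fixed convention):

```python
def delimited_prompt(instruction: str, text: str) -> str:
    """Wrap untrusted input between explicit delimiter lines."""
    return f"{instruction}\n\n<<<\n{text}\n>>>"

print(delimited_prompt(
    "Classify the sentiment of the text below as positive, negative, or neutral.",
    "The product works okay but the customer service was terrible.",
))
```

Any distinctive, consistent marker works (triple quotes, XML-style tags, fenced blocks); what matters is that the model can unambiguously tell instructions apart from data.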
Exercise: Build Your Own Classifier
Put the techniques above into practice by building a few-shot prompt for programming language identification. Start with 2-3 examples covering languages with distinct syntax (Python, JavaScript, Java), then test with edge cases like TypeScript or Ruby to see where the classifier struggles. Experiment with adding more examples or adjusting the prompt structure to improve accuracy.
# Your turn!
prompt = """
Identify the programming language:
Example 1:
Code: def hello(): print("Hello")
Language: Python
# Add 2-3 more examples
# ...
Now identify:
Code: const greeting = () => console.log("Hello");
Language:
"""
# Test your prompt
# print(ask(prompt, temperature=0))
Key Takeaways
Zero-shot: Quick, works for simple tasks
One-shot: Good for showing format
Few-shot: Best for consistency and complex tasks
Dynamic selection: Choose relevant examples automatically
Quality > Quantity: 3-5 good examples beat 20 mediocre ones
Model Selection (December 2025)

For Prompting Tasks:
gpt-4o-mini: Best value for most examples (fast, cheap, quality)
gpt-4o: Best overall quality (multimodal, 128k context)
o1-mini: When reasoning matters (code, math, logic)
o1-preview: Complex reasoning (research, analysis)
Cost vs Quality:
Development/Testing: gpt-4o-mini ($0.15/1M input)
Production: gpt-4o ($2.50/1M input)
Reasoning: o1-mini ($3/1M input)
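The input-token rates quoted above can be turned into a quick cost estimator (rates hardcoded from this section and dated December 2025, so verify current pricing; output-token rates differ and are omitted here):

```python
# Input-token price per 1M tokens, as quoted in this section (December 2025)
INPUT_PRICE_PER_1M = {
    "gpt-4o-mini": 0.15,
    "gpt-4o": 2.50,
    "o1-mini": 3.00,
}

def input_cost(model: str, tokens: int) -> float:
    """Estimated input cost in dollars for a given token count."""
    return tokens / 1_000_000 * INPUT_PRICE_PER_1M[model]

print(f"${input_cost('gpt-4o-mini', 2_000_000):.2f}")  # $0.30
print(f"${input_cost('gpt-4o', 100_000):.2f}")         # $0.25
```

A back-of-the-envelope check like this makes the development-vs-production tradeoff concrete: at these rates, gpt-4o input tokens cost roughly 16x more than gpt-4o-mini.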
Next Steps
02_chain_of_thought.ipynb - Make models reason step-by-step
03_react_prompting.ipynb - Combine reasoning with actions
05_prompt_templates.ipynb - Build reusable prompt systems