Structured LLM Outputs & Programmatic Prompting (2025-2026)
LLMs return free-form text. Applications need structured, typed data. This notebook covers the four most important tools for taming LLM outputs:
| Tool | Approach | Best For |
|---|---|---|
| Instructor | Retry + parse after generation | API-backed models, multi-provider |
| Outlines | Token-level constraints at generation time | Local models, guaranteed format |
| PydanticAI | Type-safe agents with dependency injection | Production agents, typed pipelines |
| DSPy | Declarative self-improving prompts | Auto-optimized pipelines, prompt brittleness |
Prerequisites
pip install instructor outlines pydantic-ai dspy openai anthropic transformers torch
import os
from dotenv import load_dotenv
load_dotenv()
# Verify key is available
api_key = os.getenv('OPENAI_API_KEY')
print('OpenAI key found:', bool(api_key))
Part 1 – Instructor
3 Million Monthly Downloads. The Standard for Structured Extraction.
The problem: Every app that uses LLMs eventually writes the same boilerplate:
Call the model
Parse the text response
Validate it matches expected structure
Retry if parsing fails
Map it to a Python object
Instructor's solution: Wrap any OpenAI-compatible client with one line. Pass a Pydantic model as response_model. Get back a typed Python object, with automatic retry on validation failure.
# pip install instructor
import instructor
from openai import OpenAI
from pydantic import BaseModel
# One-line patch: wraps the standard OpenAI client
client = instructor.from_openai(OpenAI())
# Define your expected shape with Pydantic
class User(BaseModel):
name: str
age: int
email: str
# Extract structured data from natural language
user = client.chat.completions.create(
model="gpt-4o-mini",
response_model=User,
messages=[{
"role": "user",
"content": "Extract: John Doe, 30 years old, john@example.com"
}]
)
print(f"Name : {user.name}")
print(f"Age : {user.age}")
print(f"Email: {user.email}")
print(f"Type : {type(user)}")
1.2 Nested Models – Complex Document Extraction
Real documents have hierarchy. Instructor handles nested Pydantic models naturally.
from pydantic import BaseModel
from typing import List, Optional
class Address(BaseModel):
street: str
city: str
state: str
zip_code: str
class LineItem(BaseModel):
description: str
quantity: int
unit_price: float
total: float
class Invoice(BaseModel):
invoice_number: str
vendor_name: str
vendor_address: Address
line_items: List[LineItem]
subtotal: float
tax_rate: float
total_due: float
due_date: str
raw_invoice = """
INVOICE #INV-2025-0042
From: Acme Corp, 123 Main St, Springfield, IL 62701
Due: March 15, 2025
Items:
- 5x Widget Pro @ $49.99 each = $249.95
- 2x Support Contract @ $199.00 each = $398.00
- 1x Setup Fee @ $75.00 = $75.00
Subtotal: $722.95
Tax (8.5%): $61.45
TOTAL DUE: $784.40
"""
invoice = client.chat.completions.create(
model="gpt-4o-mini",
response_model=Invoice,
messages=[{
"role": "user",
"content": f"Extract all invoice data from this document:\n{raw_invoice}"
}]
)
print(f"Invoice #: {invoice.invoice_number}")
print(f"Vendor: {invoice.vendor_name}")
print(f"City: {invoice.vendor_address.city}, {invoice.vendor_address.state}")
print(f"\nLine Items:")
for item in invoice.line_items:
print(f" {item.description}: {item.quantity} x ${item.unit_price:.2f} = ${item.total:.2f}")
print(f"\nTotal Due: ${invoice.total_due:.2f}")
print(f"Due Date: {invoice.due_date}")
1.3 Semantic Validation – LLM-Powered Field Validation
Pydantic validators normally use code logic. Instructor lets you write validators in plain English using llm_validator.
from instructor import llm_validator
from pydantic import field_validator
from typing import Annotated
class ProductReview(BaseModel):
product_name: str
rating: int # 1-5
review_text: Annotated[
str,
llm_validator(
"Must be a genuine product review. Should not contain spam, "
"promotional content, or irrelevant information.",
client=instructor.from_openai(OpenAI())
)
]
@field_validator('rating')
@classmethod
def rating_in_range(cls, v):
if not 1 <= v <= 5:
raise ValueError('Rating must be between 1 and 5')
return v
# Valid review
valid = client.chat.completions.create(
model="gpt-4o-mini",
response_model=ProductReview,
messages=[{
"role": "user",
"content": "Product: Wireless Earbuds. Rating: 4. Review: Great sound quality, "
"comfortable fit, battery lasts 8 hours. Mic quality could be better."
}]
)
print("Valid review extracted:")
print(f" Product: {valid.product_name}")
print(f" Rating: {valid.rating}/5")
print(f" Review: {valid.review_text[:80]}...")
1.4 Automatic Retry on Validation Failure
When the LLM returns something that fails Pydantic validation, Instructor automatically sends the error back to the model and asks it to fix the output, up to max_retries times.
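Conceptually, the correction loop works like this. A simplified, self-contained sketch with a fake LLM callable (the helper `extract_with_retry` is hypothetical; this is not Instructor's actual implementation):

```python
from pydantic import BaseModel, ValidationError

class Person(BaseModel):
    name: str
    age: int

def extract_with_retry(call_llm, prompt: str, max_retries: int = 3) -> Person:
    """Sketch of a validate-and-retry loop (hypothetical helper)."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_retries):
        raw = call_llm(messages)  # the LLM returns a JSON string
        try:
            return Person.model_validate_json(raw)
        except ValidationError as e:
            # Feed the validation error back so the model can self-correct
            messages.append({"role": "assistant", "content": raw})
            messages.append(
                {"role": "user", "content": f"Fix these validation errors: {e}"}
            )
    raise RuntimeError("Validation failed after retries")

# Fake LLM: returns invalid JSON once, then a valid object
responses = iter(['{"name": "Bo", "age": "unknown"}',
                  '{"name": "Bo", "age": 30}'])
person = extract_with_retry(lambda msgs: next(responses), "Extract: Bo, 30")
print(person)  # name='Bo' age=30
```

The first response fails because `"unknown"` is not an int; the error message goes back into the conversation, and the second attempt validates.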
from pydantic import field_validator
class StrictAge(BaseModel):
name: str
age: int
@field_validator('age')
@classmethod
def must_be_adult(cls, v):
if v < 18:
raise ValueError(f'Age {v} is below minimum of 18. Must provide age for an adult.')
return v
# Instructor will retry if the model gives age < 18
# max_retries=3 means up to 3 correction attempts
person = client.chat.completions.create(
model="gpt-4o-mini",
response_model=StrictAge,
max_retries=3,
messages=[{
"role": "user",
"content": "Extract adult contact: Sarah Johnson, 25 years old."
}]
)
print(f"Extracted adult: {person.name}, age {person.age}")
1.5 Streaming Partial Objects
For long responses or large structured objects, Instructor supports streaming. You receive partial, validated objects as tokens arrive.
class ResearchSummary(BaseModel):
title: str
key_findings: List[str]
methodology: str
conclusion: str
print("Streaming partial object...\n")
# create_partial returns an iterator of partial ResearchSummary objects
for partial_summary in client.chat.completions.create_partial(
model="gpt-4o-mini",
response_model=ResearchSummary,
messages=[{
"role": "user",
"content": "Summarize a study on how sleep affects memory consolidation."
}]
):
# Each iteration gives more of the object filled in
if partial_summary.title:
print(f"\rTitle: {partial_summary.title[:60]}", end="", flush=True)
print("\n\nFinal object:")
print(f"Title: {partial_summary.title}")
print(f"Findings: {len(partial_summary.key_findings or [])} items")
print(f"Conclusion: {(partial_summary.conclusion or '')[:100]}...")
1.6 Multi-Provider Support
Instructor patches any OpenAI-compatible client with the same API. Swap providers by changing one line.
# ----- Anthropic Claude -----
# import anthropic
# claude_client = instructor.from_anthropic(anthropic.Anthropic())
# result = claude_client.messages.create(
# model="claude-3-5-haiku-20241022",
# response_model=User,
# messages=[{"role": "user", "content": "Extract: Alice, 28, alice@example.com"}],
# max_tokens=1024
# )
# ----- Ollama (local) -----
# from openai import OpenAI as OllamaClient
# ollama_client = instructor.from_openai(
# OllamaClient(base_url="http://localhost:11434/v1", api_key="ollama"),
# mode=instructor.Mode.JSON
# )
# result = ollama_client.chat.completions.create(
# model="llama3.2",
# response_model=User,
# messages=[{"role": "user", "content": "Extract: Alice, 28, alice@example.com"}]
# )
# ----- Google Gemini -----
# import google.generativeai as genai
# gemini_client = instructor.from_gemini(
# client=genai.GenerativeModel(model_name="models/gemini-1.5-flash-latest")
# )
print("Instructor supports 15+ providers:")
providers = [
"OpenAI (gpt-4o, gpt-4o-mini, o1)",
"Anthropic (claude-3-5-sonnet, claude-3-haiku)",
"Google Gemini (gemini-1.5-flash, gemini-1.5-pro)",
"Ollama (llama3, mistral, qwen – local)",
"Cohere, Mistral, Groq, Fireworks, Together AI",
"Azure OpenAI, AWS Bedrock, Vertex AI",
]
for p in providers:
print(f" - {p}")
print("\nSame response_model= API across all providers.")
Part 2 – Outlines
Token-Level Constrained Generation
The key difference from Instructor:
Instructor: Generate text freely, then parse + retry until it fits
Outlines: At each token step, mask out tokens that would violate the schema
This means Outlines physically cannot produce invalid output. No retries needed. Works at inference time on local models.
Step 1: Model wants to emit next token
Step 2: Outlines checks JSON schema / regex / grammar
Step 3: Any token that would make output invalid gets probability = 0
Step 4: Model picks from valid tokens only
Result: Output is always valid β by construction
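These steps can be simulated without a model. A toy sketch of choice-constrained decoding with single-character tokens (helper names are hypothetical; the real Outlines compiles schemas into finite-state machines over the tokenizer's vocabulary):

```python
def allowed_next_tokens(partial: str, options: list[str]) -> set[str]:
    """Tokens that keep `partial` a prefix of at least one allowed option.
    Everything else would get probability 0 (steps 2-3 above)."""
    return {opt[len(partial)] for opt in options
            if opt.startswith(partial) and len(opt) > len(partial)}

def constrained_decode(options: list[str]) -> str:
    """Greedy stand-in for the model: always picks the first valid token."""
    out = ""
    while out not in options:
        tokens = allowed_next_tokens(out, options)
        if not tokens:  # dead end: cannot happen with a correct mask
            raise ValueError("no valid continuation")
        out += sorted(tokens)[0]
    return out

# Valid by construction: no token outside the option set can ever be emitted
print(constrained_decode(["positive", "negative", "neutral"]))  # negative
```

After the first token `n`, only continuations of "negative" and "neutral" remain reachable; the output is always one of the three options, never free text.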
# pip install outlines transformers torch
# Outlines works with local models via HuggingFace transformers
# ----- Basic JSON generation from local model -----
# import outlines
# from pydantic import BaseModel
#
# # Load a small local model (downloads ~1GB on first run)
# model = outlines.models.transformers("Qwen/Qwen2.5-1.5B-Instruct")
#
# class User(BaseModel):
# name: str
# age: int
# email: str
#
# # Create a JSON generator constrained to User schema
# generator = outlines.generate.json(model, User)
#
# user = generator(
# "Extract user information from: Alice Smith, 32 years old, alice@example.com"
# )
# print(user)        # Always a valid User instance, guaranteed
# print(type(user))  # <class '__main__.User'>
print("Outlines constrained generation concepts:")
print()
print("outlines.generate.json(model, PydanticModel) -> always-valid JSON")
print("outlines.generate.regex(model, pattern) -> regex-constrained output")
print("outlines.generate.choice(model, options) -> force pick from list")
print("outlines.generate.cfg(model, grammar) -> context-free grammar")
print()
print("Supported backends:")
print(" outlines.models.transformers('Qwen/Qwen2.5-1.5B-Instruct')")
print(" outlines.models.llamacpp(model_path)")
print(" outlines.models.vllm('mistralai/Mistral-7B-v0.1')")
2.2 Regex-Constrained Generation
Enforce exact formats (dates, phone numbers, IDs) without any post-processing.
# ----- Regex patterns Outlines can enforce -----
# US phone number: (555) 867-5309
PHONE_PATTERN = r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}"
# ISO date: 2025-03-15
DATE_PATTERN = r"[0-9]{4}-[0-9]{2}-[0-9]{2}"
# US ZIP code: 90210 or 90210-1234
ZIP_PATTERN = r"[0-9]{5}(-[0-9]{4})?"
# Credit card number: 4242 4242 4242 4242
CC_PATTERN = r"[0-9]{4} [0-9]{4} [0-9]{4} [0-9]{4}"
# IPv4 address (simplified: allows out-of-range octets like 999)
IPV4_PATTERN = r"([0-9]{1,3}\.){3}[0-9]{1,3}"
# With Outlines (commented since requires GPU/large model):
# generator = outlines.generate.regex(model, DATE_PATTERN)
# date = generator("When did WWII end? Answer with just the date:")
# # Output will ALWAYS match YYYY-MM-DD, no parsing needed
import re
# Simulate what Outlines guarantees (demo without local model)
examples = [
("Phone", PHONE_PATTERN, "(415) 555-1234"),
("Date", DATE_PATTERN, "2025-03-15"),
("ZIP", ZIP_PATTERN, "94102-3456"),
("IPv4", IPV4_PATTERN, "192.168.1.100"),
]
print("Pattern validation (what Outlines guarantees at token level):")
for name, pattern, example in examples:
match = bool(re.fullmatch(pattern, example))
print(f" {name:10} | Pattern: {pattern:35} | Example: {example:20} | Valid: {match}")
2.3 Choice Selection and Outlines with Ollama
Force the model to pick from a predefined list of options, useful for classification, routing, and enum fields.
# ----- Choice selection (local model) -----
# With Outlines, the model CAN ONLY emit one of the listed tokens
# No "I think it's probably positive", just "positive"
#
# import outlines
#
# model = outlines.models.transformers("Qwen/Qwen2.5-1.5B-Instruct")
#
# # Force sentiment to exactly one of three values
# sentiment_gen = outlines.generate.choice(model, ["positive", "negative", "neutral"])
# sentiment = sentiment_gen("Classify: I absolutely love this product!")
# # sentiment is guaranteed to be "positive", "negative", or "neutral", nothing else
#
# # Force routing decision
# route_gen = outlines.generate.choice(model, ["billing", "technical", "sales", "general"])
# department = route_gen("My credit card was charged twice. Route to:")
# ----- Outlines + Ollama -----
# import outlines
# model = outlines.models.ollama("llama3.2", "http://localhost:11434")
# generator = outlines.generate.json(model, InvoiceSchema)
# result = generator("Extract invoice from: ...")
print("Outlines vs Instructor – when to use which:")
print()
comparison = [
("Generation approach", "Retry after failure", "Block invalid tokens"),
("Model type", "API models (GPT, Claude)","Local (transformers, vLLM, llama.cpp)"),
("Guarantee", "Eventually valid (retries)","Always valid (construction)"),
("Latency", "Higher (retries cost tokens)","Lower (no retries)"),
("Formats", "Any Pydantic schema", "JSON, regex, choice, CFG"),
("GPU required", "No", "Yes (local inference)"),
("Cost", "Per token (API)", "Infrastructure (local)"),
]
print(f"{'Feature':<22} | {'Instructor':<35} | {'Outlines'}")
print("-" * 90)
for row in comparison:
print(f"{row[0]:<22} | {row[1]:<35} | {row[2]}")
Part 3 – PydanticAI
Type-Safe Agents with Dependency Injection
Instructor gives you structured extraction. PydanticAI gives you structured agents, with:
Type-safe result models (Pydantic)
Dependency injection for tools (databases, APIs, config)
Automatic retry when validation fails
Streaming structured outputs
First-class support for multi-step reasoning
# pip install pydantic-ai
from pydantic_ai import Agent
from pydantic import BaseModel
class WeatherResult(BaseModel):
city: str
temperature_f: float
temperature_c: float
description: str
recommendation: str
# Agent with structured result type
weather_agent = Agent(
'openai:gpt-4o-mini',
result_type=WeatherResult,
system_prompt=(
"You are a helpful weather assistant. Always provide both Fahrenheit and Celsius. "
"Give a practical recommendation for the conditions."
)
)
result = weather_agent.run_sync("What's a typical mid-March day like in Chicago?")
weather = result.data
print(f"City: {weather.city}")
print(f"Temperature: {weather.temperature_f}F / {weather.temperature_c}C")
print(f"Conditions: {weather.description}")
print(f"Recommendation: {weather.recommendation}")
3.2 Tools with Dependency Injection
PydanticAI's key feature: inject real dependencies (databases, API clients, config) into tools without globals.
from pydantic_ai import Agent, RunContext
from pydantic import BaseModel
from dataclasses import dataclass
from typing import Dict
# --- Dependency: a simple product catalog ---
@dataclass
class ProductDB:
"""Injected database dependency."""
products: Dict[str, dict]
def get_product(self, name: str) -> dict | None:
return self.products.get(name.lower())
def search(self, category: str) -> list:
return [p for p in self.products.values() if p.get('category') == category]
class ProductRecommendation(BaseModel):
product_name: str
price: float
reason: str
in_stock: bool
# Create agent with typed dependency
product_agent: Agent[ProductDB, ProductRecommendation] = Agent(
'openai:gpt-4o-mini',
deps_type=ProductDB,
result_type=ProductRecommendation,
system_prompt="You are a product recommendation assistant. Use the available tools to look up products."
)
@product_agent.tool
def lookup_product(ctx: RunContext[ProductDB], product_name: str) -> str:
"""Look up a specific product by name."""
product = ctx.deps.get_product(product_name)
if product:
return str(product)
return f"Product '{product_name}' not found"
@product_agent.tool
def search_category(ctx: RunContext[ProductDB], category: str) -> str:
"""Search products in a category."""
items = ctx.deps.search(category)
return str(items) if items else f"No products in category '{category}'"
# Inject real dependency at runtime
db = ProductDB(products={
"wireless headphones": {"name": "Wireless Headphones", "price": 79.99, "category": "audio", "in_stock": True},
"noise cancelling buds": {"name": "Noise Cancelling Buds", "price": 149.99, "category": "audio", "in_stock": True},
"wired earphones": {"name": "Wired Earphones", "price": 29.99, "category": "audio", "in_stock": False},
})
result = product_agent.run_sync(
"Recommend a good audio product for someone on a $100 budget who works in a noisy office.",
deps=db
)
rec = result.data
print(f"Recommended: {rec.product_name}")
print(f"Price: ${rec.price:.2f}")
print(f"In Stock: {rec.in_stock}")
print(f"Reason: {rec.reason}")
3.3 Streaming Structured Outputs
PydanticAI supports streaming structured results. You get validated partial objects as the model generates.
import asyncio
from pydantic_ai import Agent
from pydantic import BaseModel
from typing import List
class TechAnalysis(BaseModel):
technology: str
pros: List[str]
cons: List[str]
verdict: str
score: int # 1-10
analysis_agent = Agent(
'openai:gpt-4o-mini',
result_type=TechAnalysis,
system_prompt="Provide balanced technical analysis."
)
async def stream_analysis(topic: str):
async with analysis_agent.run_stream(f"Analyze: {topic}") as stream:
async for partial in stream.stream():
# Each iteration = more fields filled in (partially validated)
if getattr(partial, "technology", None):
print(f"\rAnalyzing: {partial.technology}", end="", flush=True)
result = await stream.get_data()
return result
# Run async
analysis = asyncio.run(stream_analysis("Rust vs Python for data pipelines"))
print(f"\nTechnology: {analysis.technology}")
print(f"Score: {analysis.score}/10")
print(f"Pros: {analysis.pros[:2]}")
print(f"Verdict: {analysis.verdict[:100]}...")
3.4 PydanticAI vs LangChain
Both build LLM applications. They have different philosophies.
comparison = [
("Type safety", "Full Pydantic types end-to-end", "Partial (improving in v0.3)"),
("Dependency inject.", "First-class, typed RunContext", "Workaround via config/callback"),
("Learning curve", "Low (standard Python + Pydantic)", "Steep (many abstractions)"),
("Ecosystem", "Smaller, focused", "Large (100s of integrations)"),
("Streaming", "Typed partial models", "Token streams + callbacks"),
("Testing", "Easy (inject mock deps)", "Complex (mock chains)"),
("Observability", "Logfire (native)", "LangSmith, Langfuse"),
("Memory", "Bring your own", "Built-in (ConversationBuffer, etc)"),
("Agent framework", "Agents, tools, retries built-in", "LangGraph for complex flows"),
("Best for", "Type-safe production agents", "Rapid prototyping, broad integrations"),
]
print(f"{'Feature':<22} | {'PydanticAI':<40} | {'LangChain'}")
print("-" * 100)
for row in comparison:
print(f"{row[0]:<22} | {row[1]:<40} | {row[2]}")
Part 4 – DSPy
Declarative Self-Improving Python
The core problem with manual prompt engineering:
You spend days crafting a prompt that works great on GPT-4o
You switch to Claude or Llama and it breaks
You update the model version and it degrades
You can't explain why a particular phrasing works
Adding more examples or context is trial and error
DSPy's answer: Stop writing prompts. Declare what you want (input fields → output fields). Let the optimizer find the best prompts automatically through data-driven search.
Traditional: You → write prompts → LLM → outputs
DSPy: You → define signatures → DSPy optimizer → optimized prompts → LLM → outputs
# pip install dspy
import dspy
from typing import Literal
# Configure the language model
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)
# --- Signature: the contract between input and output ---
# The docstring becomes the task description in the prompt
class SentimentAnalysis(dspy.Signature):
"""Classify the sentiment of the given text."""
text: str = dspy.InputField(desc="The text to analyze")
sentiment: Literal["positive", "negative", "neutral"] = dspy.OutputField(
desc="Sentiment classification"
)
confidence: float = dspy.OutputField(
desc="Confidence score between 0 and 1"
)
# Predict module: simplest DSPy module
classifier = dspy.Predict(SentimentAnalysis)
# Test it
tests = [
"I absolutely love this product! Best purchase I've made.",
"The package arrived damaged and customer support was useless.",
"It's okay. Does what it says, nothing special.",
]
print("DSPy Predict (no manual prompt needed):")
print()
for text in tests:
result = classifier(text=text)
print(f"Text: {text[:60]}..." if len(text) > 60 else f"Text: {text}")
print(f"Sentiment: {result.sentiment}")
print(f"Confidence: {result.confidence:.2f}")
print()
4.2 ChainOfThought – Automatic Reasoning
Replace dspy.Predict with dspy.ChainOfThought and DSPy automatically adds step-by-step reasoning to the prompt; no manual "think step by step" required.
class MathWordProblem(dspy.Signature):
"""Solve a math word problem step by step."""
problem: str = dspy.InputField()
answer: float = dspy.OutputField(desc="The numerical answer")
unit: str = dspy.OutputField(desc="Unit of the answer (e.g. 'dollars', 'miles', 'hours')")
# ChainOfThought automatically includes reasoning in the prompt
solver = dspy.ChainOfThought(MathWordProblem)
problems = [
"A train travels at 60 mph. If it leaves at 9am and arrives at 2pm, how far did it travel?",
"A store has 144 apples. They sell 60% on Monday and 25% of what's left on Tuesday. How many remain?",
]
for problem in problems:
result = solver(problem=problem)
print(f"Problem: {problem}")
print(f"Reasoning: {result.reasoning[:150]}...")
print(f"Answer: {result.answer} {result.unit}")
print()
4.3 Building a Multi-Step DSPy Pipeline
DSPy modules compose like PyTorch layers. Build complex pipelines by combining signatures.
from typing import List
# --- Step 1: Extract key claims from a document ---
class ExtractClaims(dspy.Signature):
"""Extract the main factual claims from a document."""
document: str = dspy.InputField()
claims: List[str] = dspy.OutputField(
desc="List of specific factual claims made in the document"
)
# --- Step 2: Assess credibility of each claim ---
class AssessClaim(dspy.Signature):
"""Assess whether a claim is likely true, false, or uncertain."""
claim: str = dspy.InputField()
verdict: Literal["likely_true", "likely_false", "uncertain"] = dspy.OutputField()
reasoning: str = dspy.OutputField(desc="Brief explanation of the assessment")
# --- Step 3: Compose into a fact-checking pipeline ---
class FactChecker(dspy.Module):
def __init__(self):
super().__init__()
self.extract = dspy.Predict(ExtractClaims)
self.assess = dspy.ChainOfThought(AssessClaim)
def forward(self, document: str) -> dict:
# Extract claims from document
extraction = self.extract(document=document)
# Assess each claim
assessments = []
for claim in extraction.claims[:3]: # limit to 3 for speed
assessment = self.assess(claim=claim)
assessments.append({
"claim": claim,
"verdict": assessment.verdict,
"reasoning": assessment.reasoning
})
return {
"claims_found": len(extraction.claims),
"assessments": assessments
}
fact_checker = FactChecker()
article = """
Scientists at MIT announced that daily coffee consumption reduces Alzheimer's risk by 65%.
The study followed 10,000 participants over 20 years. Coffee contains antioxidants that
protect neurons. The WHO recommends 3 cups per day for adults over 50. Green tea has
similar effects, according to separate research from Harvard Medical School.
"""
results = fact_checker(document=article)
print(f"Claims found: {results['claims_found']}")
print()
for a in results['assessments']:
print(f"Claim: {a['claim'][:80]}..." if len(a['claim']) > 80 else f"Claim: {a['claim']}")
print(f"Verdict: {a['verdict']}")
print(f"Reason: {a['reasoning'][:100]}...")
print()
4.4 MIPRO Optimizer – Auto-Generate Better Prompts
MIPRO (Multi-prompt Instruction PRoposal Optimizer) is DSPy's most powerful optimizer. It:
Proposes new instruction candidates
Generates few-shot demonstrations from your training data
Runs a Bayesian search over instruction + demo combinations
Evaluates each combination against your metric
Returns the best-performing prompt configuration
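The search loop above can be sketched in a few lines. A toy stand-in that exhaustively scores (instruction, demo) configurations against a made-up metric, where real MIPROv2 uses Bayesian optimization over sampled configurations:

```python
from itertools import combinations, product

def optimize(instructions, demo_pool, metric, k=1):
    """Toy stand-in for MIPRO: score every (instruction, demo-subset)
    configuration and keep the best. Hypothetical helper, not DSPy's API."""
    best, best_score = None, float("-inf")
    for inst, demos in product(instructions, combinations(demo_pool, k)):
        score = metric(inst, demos)
        if score > best_score:
            best, best_score = (inst, demos), score
    return best, best_score

# Made-up metric: reward instructions that ask for concision and short demos
metric = lambda inst, demos: ("concise" in inst) + sum(len(d) < 20 for d in demos)

best, score = optimize(
    ["Answer the question.", "Answer concisely and accurately."],
    ["Q: 2+2? A: 4", "Q: capital of France? A: Paris is the capital city"],
    metric,
)
print(best[0], score)  # Answer concisely and accurately. 2
```

In real use the metric is your task-level evaluation (like exact_match_metric below), so the optimizer's "best configuration" is the prompt that actually scores highest on your data.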
# Full optimization pipeline (requires training data + evaluation metric)
# This demonstrates the structure; run with larger data for real optimization
import dspy
from dspy.evaluate import Evaluate
from dspy.teleprompt import MIPROv2
# --- 1. Define the task ---
class QuestionAnswer(dspy.Signature):
"""Answer questions accurately and concisely."""
question: str = dspy.InputField()
answer: str = dspy.OutputField(desc="A concise, accurate answer")
class QAModule(dspy.Module):
def __init__(self):
super().__init__()
self.qa = dspy.ChainOfThought(QuestionAnswer)
def forward(self, question: str) -> dspy.Prediction:
return self.qa(question=question)
# --- 2. Training data (DSPy examples) ---
train_data = [
dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question"),
dspy.Example(question="Who wrote Hamlet?", answer="William Shakespeare").with_inputs("question"),
dspy.Example(question="What is 15% of 200?", answer="30").with_inputs("question"),
dspy.Example(question="What year did the Berlin Wall fall?", answer="1989").with_inputs("question"),
dspy.Example(question="How many bones in the human body?", answer="206").with_inputs("question"),
]
# Dev set for evaluation
dev_data = [
dspy.Example(question="What is the speed of light in km/s?", answer="299,792").with_inputs("question"),
dspy.Example(question="Who painted the Mona Lisa?", answer="Leonardo da Vinci").with_inputs("question"),
]
# --- 3. Evaluation metric ---
def exact_match_metric(example, prediction, trace=None):
"""Check if key words from the expected answer appear in the prediction."""
expected = example.answer.lower()
predicted = prediction.answer.lower()
# Simple check: expected answer is contained in prediction
return expected in predicted
# Test baseline (unoptimized)
baseline = QAModule()
evaluator = Evaluate(devset=dev_data, metric=exact_match_metric, num_threads=1)
baseline_score = evaluator(baseline)
print(f"Baseline score: {baseline_score:.1f}%")
# --- 4. Optimize with MIPROv2 ---
# (auto="light" keeps the search small for demonstration; production would use 'medium' or 'heavy')
print("\nRunning MIPROv2 optimization...")
optimizer = MIPROv2(
metric=exact_match_metric,
auto="light"  # 'light', 'medium', or 'heavy' controls the trial budget
)
optimized_qa = optimizer.compile(
QAModule(),
trainset=train_data,
requires_permission_to_run=False  # skip the interactive confirmation prompt
)
# Evaluate optimized version
optimized_score = evaluator(optimized_qa)
print(f"Optimized score: {optimized_score:.1f}%")
print(f"Improvement: +{optimized_score - baseline_score:.1f}%")
4.5 Inspect What the Optimizer Found
DSPy prompts are transparent: you can see exactly what the optimizer generated.
# Inspect the optimized prompt
print("=== Optimized Prompt (auto-generated by MIPRO) ===")
print()
try:
# Show the most recent prompt actually sent to the LLM
dspy.inspect_history(n=1)
except Exception:
pass
# See the signatures with their optimized instructions
for name, module in optimized_qa.named_predictors():
print(f"Module: {name}")
print(f"Instructions: {module.signature.instructions}")
print(f"Demos: {len(module.demos)} few-shot examples")
if module.demos:
print(f"First demo: {module.demos[0]}")
print()
# Save optimized program for reuse
# optimized_qa.save("optimized_qa.json")
# Later: loaded_qa = QAModule(); loaded_qa.load("optimized_qa.json")
print("To persist: optimized_qa.save('optimized_qa.json')")
print("To reload: qa = QAModule(); qa.load('optimized_qa.json')")
4.6 DSPy vs Manual Prompting – When to Use Each
print("=== DSPy vs Manual Prompting ===")
print()
use_dspy = [
"You have labeled data and want to maximize accuracy",
"Prompts break when you switch models (GPT -> Claude, etc.)",
"You have a complex multi-step pipeline to optimize end-to-end",
"You need to iterate quickly without manual prompt engineering",
"You update your LLM provider and don't want to re-engineer prompts",
"Production pipelines where small accuracy gains have high ROI",
]
use_manual = [
"One-off scripts or prototypes",
"You have no training data to optimize against",
"The task is simple and prompts are stable",
"Tight latency requirements (optimization adds overhead)",
"You need full control over exact prompt wording",
]
print("USE DSPy when:")
for item in use_dspy:
print(f" + {item}")
print()
print("USE manual prompts when:")
for item in use_manual:
print(f" - {item}")
print()
print("=" * 60)
print("Key insight: DSPy doesn't replace prompts; it writes them for you.")
print("You define WHAT you want. DSPy figures out HOW to ask for it.")
Summary – Choosing the Right Tool
STRUCTURED OUTPUT LANDSCAPE (2025-2026)

API MODEL?                          LOCAL MODEL?
Need typed extraction?              Need 100% format guarantee?
        |                                   |
   Instructor                           Outlines
 (parse + retry)                   (token-level mask)

Building agents?                    Optimizing pipelines?
        |                                   |
   PydanticAI                             DSPy
(type-safe deps)                    (auto-prompt search)
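The decision tree above reduces to a tiny routing helper (illustrative only):

```python
def pick_tool(local_model: bool, building_agents: bool,
              optimizing_pipeline: bool) -> str:
    """Toy routing over the decision tree above (illustrative, not exhaustive)."""
    if optimizing_pipeline:
        return "DSPy"        # data-driven prompt search
    if building_agents:
        return "PydanticAI"  # type-safe agents with injected deps
    # plain structured extraction: pick by where the model runs
    return "Outlines" if local_model else "Instructor"

print(pick_tool(local_model=True, building_agents=False,
                optimizing_pipeline=False))  # Outlines
```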
| Tool | Install | Core Value |
|---|---|---|
| Instructor | `pip install instructor` | Pydantic models from any LLM API, 15+ providers |
| Outlines | `pip install outlines` | Token-level constraints for local models |
| PydanticAI | `pip install pydantic-ai` | Type-safe agents with dependency injection |
| DSPy | `pip install dspy` | Auto-optimize prompts from data, not intuition |
The production stack (2025-2026):
Use Instructor for data extraction from documents, APIs, emails
Use Outlines for local model deployments where format must be guaranteed
Use PydanticAI for multi-tool agents that need testability and type safety
Use DSPy when prompt brittleness is costing you; let data drive the prompts
# Quick reference: install commands
installs = [
("Instructor", "pip install instructor"),
("Outlines", "pip install outlines transformers torch"),
("PydanticAI", "pip install pydantic-ai"),
("DSPy", "pip install dspy"),
("All at once", "pip install instructor outlines pydantic-ai dspy"),
]
print("Install commands:")
for name, cmd in installs:
print(f" {name:<12} {cmd}")
print()
print("Documentation:")
docs = [
("Instructor", "https://python.useinstructor.com"),
("Outlines", "https://dottxt-ai.github.io/outlines"),
("PydanticAI", "https://ai.pydantic.dev"),
("DSPy", "https://dspy.ai"),
]
for name, url in docs:
print(f" {name:<12} {url}")