Structured LLM Outputs & Programmatic Prompting (2025-2026)¶

LLMs return free-form text. Applications need structured, typed data. This notebook covers the four most important tools for taming LLM outputs:

| Tool | Approach | Best For |
| --- | --- | --- |
| Instructor | Retry + parse after generation | API-backed models, multi-provider |
| Outlines | Token-level constraints at generation time | Local models, guaranteed format |
| PydanticAI | Type-safe agents with dependency injection | Production agents, typed pipelines |
| DSPy | Declarative self-improving prompts | Auto-optimized pipelines, prompt brittleness |

Prerequisites

pip install instructor outlines pydantic-ai dspy openai anthropic transformers torch
import os
from dotenv import load_dotenv

load_dotenv()

# Verify key is available
api_key = os.getenv('OPENAI_API_KEY')
print('OpenAI key found:', bool(api_key))

Part 1 – Instructor¶

3 Million Monthly Downloads. The Standard for Structured Extraction.¶

The problem: Every app that uses LLMs eventually writes the same boilerplate:

  1. Call the model

  2. Parse the text response

  3. Validate it matches expected structure

  4. Retry if parsing fails

  5. Map it to a Python object
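Written out by hand, that loop looks roughly like the sketch below. This is a toy version: `call_model` is a hypothetical stand-in for any chat-completion call, faked here with canned replies so the retry path actually runs.

```python
import json
from pydantic import BaseModel, ValidationError

class User(BaseModel):
    name: str
    age: int
    email: str

def extract_user(call_model, prompt: str, max_retries: int = 3) -> User:
    """Manual call -> parse -> validate -> retry loop (what Instructor automates)."""
    messages = [{"role": "user", "content": prompt}]
    last_error = None
    for _ in range(max_retries):
        raw = call_model(messages)                         # 1. call the model
        try:
            data = json.loads(raw)                         # 2. parse the text response
            return User.model_validate(data)               # 3 + 5. validate, map to object
        except (json.JSONDecodeError, ValidationError) as exc:
            last_error = exc                               # 4. feed the error back, retry
            messages.append({"role": "user",
                             "content": f"Invalid output ({exc}). Return valid JSON only."})
    raise RuntimeError(f"No valid output after {max_retries} attempts: {last_error}")

# Demo with a fake model that fails once, then returns valid JSON
replies = iter(["not json",
                '{"name": "John Doe", "age": 30, "email": "john@example.com"}'])
user = extract_user(lambda messages: next(replies),
                    "Extract: John Doe, 30 years old, john@example.com")
print(user)
```

Every application reinvents some variant of this; Instructor's value is collapsing it into the `response_model=` argument.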

Instructor's solution: Wrap any OpenAI-compatible client with one line. Pass a Pydantic model as response_model. Get back a typed Python object – with automatic retry on validation failure.

# pip install instructor
import instructor
from openai import OpenAI
from pydantic import BaseModel

# One-line patch: wraps the standard OpenAI client
client = instructor.from_openai(OpenAI())

# Define your expected shape with Pydantic
class User(BaseModel):
    name: str
    age: int
    email: str

# Extract structured data from natural language
user = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=User,
    messages=[{
        "role": "user",
        "content": "Extract: John Doe, 30 years old, john@example.com"
    }]
)

print(f"Name : {user.name}")
print(f"Age  : {user.age}")
print(f"Email: {user.email}")
print(f"Type : {type(user)}")

1.2 Nested Models – Complex Document Extraction¶

Real documents have hierarchy. Instructor handles nested Pydantic models naturally.

from pydantic import BaseModel
from typing import List, Optional

class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float

class Invoice(BaseModel):
    invoice_number: str
    vendor_name: str
    vendor_address: Address
    line_items: List[LineItem]
    subtotal: float
    tax_rate: float
    total_due: float
    due_date: str

raw_invoice = """
INVOICE #INV-2025-0042
From: Acme Corp, 123 Main St, Springfield, IL 62701
Due: March 15, 2025

Items:
- 5x Widget Pro @ $49.99 each = $249.95
- 2x Support Contract @ $199.00 each = $398.00
- 1x Setup Fee @ $75.00 = $75.00

Subtotal: $722.95
Tax (8.5%): $61.45
TOTAL DUE: $784.40
"""

invoice = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=Invoice,
    messages=[{
        "role": "user",
        "content": f"Extract all invoice data from this document:\n{raw_invoice}"
    }]
)

print(f"Invoice #: {invoice.invoice_number}")
print(f"Vendor:    {invoice.vendor_name}")
print(f"City:      {invoice.vendor_address.city}, {invoice.vendor_address.state}")
print(f"\nLine Items:")
for item in invoice.line_items:
    print(f"  {item.description}: {item.quantity} x ${item.unit_price:.2f} = ${item.total:.2f}")
print(f"\nTotal Due: ${invoice.total_due:.2f}")
print(f"Due Date:  {invoice.due_date}")

1.3 Semantic Validation – LLM-Powered Field Validation¶

Pydantic validators normally use code logic. Instructor lets you write validators in plain English using llm_validator.

from instructor import llm_validator
from pydantic import field_validator
from typing import Annotated

class ProductReview(BaseModel):
    product_name: str
    rating: int  # 1-5
    review_text: Annotated[
        str,
        llm_validator(
            "Must be a genuine product review. Should not contain spam, "
            "promotional content, or irrelevant information.",
            client=client  # recent Instructor versions expect an Instructor-patched client
        )
    ]

    @field_validator('rating')
    @classmethod
    def rating_in_range(cls, v):
        if not 1 <= v <= 5:
            raise ValueError('Rating must be between 1 and 5')
        return v

# Valid review
valid = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=ProductReview,
    messages=[{
        "role": "user",
        "content": "Product: Wireless Earbuds. Rating: 4. Review: Great sound quality, "
                   "comfortable fit, battery lasts 8 hours. Mic quality could be better."
    }]
)
print("Valid review extracted:")
print(f"  Product: {valid.product_name}")
print(f"  Rating:  {valid.rating}/5")
print(f"  Review:  {valid.review_text[:80]}...")

1.4 Automatic Retry on Validation Failure¶

When the LLM returns something that fails Pydantic validation, Instructor automatically sends the error back to the model and asks it to fix the output – up to max_retries times.

from pydantic import field_validator

class StrictAge(BaseModel):
    name: str
    age: int

    @field_validator('age')
    @classmethod
    def must_be_adult(cls, v):
        if v < 18:
            raise ValueError(f'Age {v} is below minimum of 18. Must provide age for an adult.')
        return v

# Instructor will retry if the model gives age < 18
# max_retries=3 means up to 3 correction attempts
person = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=StrictAge,
    max_retries=3,
    messages=[{
        "role": "user",
        "content": "Extract adult contact: Sarah Johnson, 25 years old."
    }]
)
print(f"Extracted adult: {person.name}, age {person.age}")

1.5 Streaming Partial Objects¶

For long responses or large structured objects, Instructor supports streaming. You receive partial, validated objects as tokens arrive.

class ResearchSummary(BaseModel):
    title: str
    key_findings: List[str]
    methodology: str
    conclusion: str

print("Streaming partial object...\n")

# create_partial yields progressively more complete ResearchSummary objects
for partial_summary in client.chat.completions.create_partial(
    model="gpt-4o-mini",
    response_model=ResearchSummary,
    messages=[{
        "role": "user",
        "content": "Summarize a study on how sleep affects memory consolidation."
    }]
):
    # Each iteration gives more of the object filled in
    if partial_summary.title:
        print(f"\rTitle: {partial_summary.title[:60]}", end="", flush=True)

print("\n\nFinal object:")
print(f"Title: {partial_summary.title}")
print(f"Findings: {len(partial_summary.key_findings or [])} items")
print(f"Conclusion: {(partial_summary.conclusion or '')[:100]}...")
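Mechanically, "partial" streaming means keeping a buffer of the streamed text and re-attempting a tolerant JSON parse on every chunk. A toy illustration of the idea (not Instructor's actual parser, which handles nesting and validation as well):

```python
import json

def parse_partial_json(buffer: str):
    """Best-effort parse of an incomplete JSON object by trying plausible closings."""
    for candidate in (buffer, buffer + '"', buffer + '"}', buffer + '}'):
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None  # not parseable yet

# Simulated token chunks arriving from the model
chunks = ['{"title": "Sle', 'ep and Memory"', ', "conclusion": "Sleep helps"}']
buffer = ""
for chunk in chunks:
    buffer += chunk
    partial = parse_partial_json(buffer)
    print(partial)  # each print shows more of the object filled in
```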

1.6 Multi-Provider Support¶

Instructor patches any OpenAI-compatible client with the same API. Swap providers by changing one line.

# ----- Anthropic Claude -----
# import anthropic
# claude_client = instructor.from_anthropic(anthropic.Anthropic())
# result = claude_client.messages.create(
#     model="claude-3-5-haiku-20241022",
#     response_model=User,
#     messages=[{"role": "user", "content": "Extract: Alice, 28, alice@example.com"}],
#     max_tokens=1024
# )

# ----- Ollama (local) -----
# from openai import OpenAI as OllamaClient
# ollama_client = instructor.from_openai(
#     OllamaClient(base_url="http://localhost:11434/v1", api_key="ollama"),
#     mode=instructor.Mode.JSON
# )
# result = ollama_client.chat.completions.create(
#     model="llama3.2",
#     response_model=User,
#     messages=[{"role": "user", "content": "Extract: Alice, 28, alice@example.com"}]
# )

# ----- Google Gemini -----
# import google.generativeai as genai
# gemini_client = instructor.from_gemini(
#     client=genai.GenerativeModel(model_name="models/gemini-1.5-flash-latest")
# )

print("Instructor supports 15+ providers:")
providers = [
    "OpenAI (gpt-4o, gpt-4o-mini, o1)",
    "Anthropic (claude-3-5-sonnet, claude-3-haiku)",
    "Google Gemini (gemini-1.5-flash, gemini-1.5-pro)",
    "Ollama (llama3, mistral, qwen – local)",
    "Cohere, Mistral, Groq, Fireworks, Together AI",
    "Azure OpenAI, AWS Bedrock, Vertex AI",
]
for p in providers:
    print(f"  - {p}")

print("\nSame response_model= API across all providers.")

Part 2 – Outlines¶

Token-Level Constrained Generation¶

The key difference from Instructor:

  • Instructor: Generate text freely, then parse + retry until it fits

  • Outlines: At each token step, mask out tokens that would violate the schema

This means Outlines physically cannot produce invalid output. No retries needed. Works at inference time on local models.

Step 1: Model wants to emit next token
Step 2: Outlines checks JSON schema / regex / grammar
Step 3: Any token that would make output invalid gets probability = 0
Step 4: Model picks from valid tokens only
Result: Output is always valid – by construction

# pip install outlines transformers torch
# Outlines works with local models via HuggingFace transformers

# ----- Basic JSON generation from local model -----
# import outlines
# from pydantic import BaseModel
#
# # Load a small local model (downloads ~1GB on first run)
# model = outlines.models.transformers("Qwen/Qwen2.5-1.5B-Instruct")
#
# class User(BaseModel):
#     name: str
#     age: int
#     email: str
#
# # Create a JSON generator constrained to User schema
# generator = outlines.generate.json(model, User)
#
# user = generator(
#     "Extract user information from: Alice Smith, 32 years old, alice@example.com"
# )
# print(user)  # Always valid User β€” guaranteed
# print(type(user))  # <class '__main__.User'>

print("Outlines constrained generation concepts:")
print()
print("outlines.generate.json(model, PydanticModel)  -> always-valid JSON")
print("outlines.generate.regex(model, pattern)       -> regex-constrained output")
print("outlines.generate.choice(model, options)      -> force pick from list")
print("outlines.generate.cfg(model, grammar)         -> context-free grammar")
print()
print("Supported backends:")
print("  outlines.models.transformers('Qwen/Qwen2.5-1.5B-Instruct')")
print("  outlines.models.llamacpp(model_path)")
print("  outlines.models.vllm('mistralai/Mistral-7B-v0.1')")
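The four-step masking loop above can be sketched in a few lines of pure Python. This is a toy: the "schema" is just the character shape dddd-dd-dd of an ISO date, and the "model" samples randomly among surviving tokens (real Outlines compiles the schema to a finite-state machine over the tokenizer's vocabulary):

```python
import random
import re

VOCAB = ["0", "1", "2", "5", "9", "-", "a", "!", " "]   # toy token vocabulary
TEMPLATE = "dddd-dd-dd"                                  # shape of YYYY-MM-DD

def allowed_tokens(prefix: str) -> list[str]:
    """Steps 2+3: zero out any token that would violate the format."""
    if len(prefix) >= len(TEMPLATE):
        return []                                        # format complete - stop
    slot = TEMPLATE[len(prefix)]
    return [t for t in VOCAB if (t.isdigit() if slot == "d" else t == slot)]

random.seed(0)
out = ""
while (valid := allowed_tokens(out)):
    out += random.choice(valid)                          # step 4: sample valid tokens only
print(out)                                               # always matches \d{4}-\d{2}-\d{2}
```

Because invalid tokens never have nonzero probability, no retry or post-hoc parsing is ever needed.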

2.2 Regex-Constrained Generation¶

Enforce exact formats – dates, phone numbers, IDs – without any post-processing.

# ----- Regex patterns Outlines can enforce -----

# US phone number: (555) 867-5309
PHONE_PATTERN = r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}"

# ISO date: 2025-03-15
DATE_PATTERN = r"[0-9]{4}-[0-9]{2}-[0-9]{2}"

# US ZIP code: 90210 or 90210-1234
ZIP_PATTERN = r"[0-9]{5}(-[0-9]{4})?"

# Credit card number: 4242 4242 4242 4242
CC_PATTERN = r"[0-9]{4} [0-9]{4} [0-9]{4} [0-9]{4}"

# IPv4 address
IPV4_PATTERN = r"([0-9]{1,3}\.){3}[0-9]{1,3}"

# With Outlines (commented since requires GPU/large model):
# generator = outlines.generate.regex(model, DATE_PATTERN)
# date = generator("When did WWII end? Answer with just the date:")
# # Output will ALWAYS match YYYY-MM-DD – no parsing needed

import re

# Simulate what Outlines guarantees (demo without local model)
examples = [
    ("Phone", PHONE_PATTERN, "(415) 555-1234"),
    ("Date", DATE_PATTERN, "2025-03-15"),
    ("ZIP", ZIP_PATTERN, "94102-3456"),
    ("IPv4", IPV4_PATTERN, "192.168.1.100"),
]

print("Pattern validation (what Outlines guarantees at token level):")
for name, pattern, example in examples:
    match = bool(re.fullmatch(pattern, example))
    print(f"  {name:10} | Pattern: {pattern:35} | Example: {example:20} | Valid: {match}")

2.3 Choice Selection and Outlines with Ollama¶

Force the model to pick from a predefined list of options – useful for classification, routing, and enum fields.

# ----- Choice selection (local model) -----
# With Outlines, the model CAN ONLY emit one of the listed tokens
# No "I think it's probably positive" – just "positive"
#
# import outlines
#
# model = outlines.models.transformers("Qwen/Qwen2.5-1.5B-Instruct")
#
# # Force sentiment to exactly one of three values
# sentiment_gen = outlines.generate.choice(model, ["positive", "negative", "neutral"])
# sentiment = sentiment_gen("Classify: I absolutely love this product!")
# # sentiment is guaranteed to be "positive", "negative", or "neutral" – nothing else
#
# # Force routing decision
# route_gen = outlines.generate.choice(model, ["billing", "technical", "sales", "general"])
# department = route_gen("My credit card was charged twice. Route to:")

# ----- Outlines + Ollama -----
# import outlines
# model = outlines.models.ollama("llama3.2", "http://localhost:11434")
# generator = outlines.generate.json(model, InvoiceSchema)
# result = generator("Extract invoice from: ...")
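Since the cells above need a local model, a runnable stand-in can illustrate the guarantee choice provides: the output is always one of the allowed labels, because selection is a max over allowed options only. The keyword scorer below is entirely synthetic, standing in for the model's token probabilities.

```python
OPTIONS = ["positive", "negative", "neutral"]
POSITIVE_WORDS = {"love", "great", "excellent"}
NEGATIVE_WORDS = {"hate", "terrible", "awful"}

def classify(text: str) -> str:
    """Always returns one of OPTIONS - the 'constraint' is the max over allowed labels."""
    words = set(text.lower().replace("!", "").replace(".", "").split())
    scores = {
        "positive": len(words & POSITIVE_WORDS),
        "negative": len(words & NEGATIVE_WORDS),
        "neutral": 0.5,                      # synthetic prior for the toy scorer
    }
    return max(OPTIONS, key=lambda o: scores[o])

print(classify("I absolutely love this product!"))   # positive
print(classify("It arrived on time"))                # neutral
```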

print("Outlines vs Instructor – when to use which:")
print()
comparison = [
    ("Generation approach", "Retry after failure",    "Block invalid tokens"),
    ("Model type",          "API models (GPT, Claude)","Local (transformers, vLLM, llama.cpp)"),
    ("Guarantee",           "Eventually valid (retries)","Always valid (construction)"),
    ("Latency",             "Higher (retries cost tokens)","Lower (no retries)"),
    ("Formats",             "Any Pydantic schema",    "JSON, regex, choice, CFG"),
    ("GPU required",        "No",                     "Yes (local inference)"),
    ("Cost",                "Per token (API)",         "Infrastructure (local)"),
]
print(f"{'Feature':<22} | {'Instructor':<35} | {'Outlines'}")
print("-" * 90)
for row in comparison:
    print(f"{row[0]:<22} | {row[1]:<35} | {row[2]}")

Part 3 – PydanticAI¶

Type-Safe Agents with Dependency Injection¶

Instructor gives you structured extraction. PydanticAI gives you structured agents – with:

  • Type-safe result models (Pydantic)

  • Dependency injection for tools (databases, APIs, config)

  • Automatic retry when validation fails

  • Streaming structured outputs

  • First-class support for multi-step reasoning

# pip install pydantic-ai
from pydantic_ai import Agent
from pydantic import BaseModel

class WeatherResult(BaseModel):
    city: str
    temperature_f: float
    temperature_c: float
    description: str
    recommendation: str

# Agent with structured result type
weather_agent = Agent(
    'openai:gpt-4o-mini',
    result_type=WeatherResult,
    system_prompt=(
        "You are a helpful weather assistant. Always provide both Fahrenheit and Celsius. "
        "Give a practical recommendation for the conditions."
    )
)

result = weather_agent.run_sync("What's a typical mid-March day like in Chicago?")

weather = result.data
print(f"City:           {weather.city}")
print(f"Temperature:    {weather.temperature_f}F / {weather.temperature_c}C")
print(f"Conditions:     {weather.description}")
print(f"Recommendation: {weather.recommendation}")

3.2 Tools with Dependency Injection¶

PydanticAI's key feature: inject real dependencies (databases, API clients, config) into tools without globals.

from pydantic_ai import Agent, RunContext
from pydantic import BaseModel
from dataclasses import dataclass
from typing import Dict

# --- Dependency: a simple product catalog ---
@dataclass
class ProductDB:
    """Injected database dependency."""
    products: Dict[str, dict]

    def get_product(self, name: str) -> dict | None:
        return self.products.get(name.lower())

    def search(self, category: str) -> list:
        return [p for p in self.products.values() if p.get('category') == category]

class ProductRecommendation(BaseModel):
    product_name: str
    price: float
    reason: str
    in_stock: bool

# Create agent with typed dependency
product_agent: Agent[ProductDB, ProductRecommendation] = Agent(
    'openai:gpt-4o-mini',
    deps_type=ProductDB,
    result_type=ProductRecommendation,
    system_prompt="You are a product recommendation assistant. Use the available tools to look up products."
)

@product_agent.tool
def lookup_product(ctx: RunContext[ProductDB], product_name: str) -> str:
    """Look up a specific product by name."""
    product = ctx.deps.get_product(product_name)
    if product:
        return str(product)
    return f"Product '{product_name}' not found"

@product_agent.tool
def search_category(ctx: RunContext[ProductDB], category: str) -> str:
    """Search products in a category."""
    items = ctx.deps.search(category)
    return str(items) if items else f"No products in category '{category}'"

# Inject real dependency at runtime
db = ProductDB(products={
    "wireless headphones": {"name": "Wireless Headphones", "price": 79.99, "category": "audio", "in_stock": True},
    "noise cancelling buds": {"name": "Noise Cancelling Buds", "price": 149.99, "category": "audio", "in_stock": True},
    "wired earphones": {"name": "Wired Earphones", "price": 29.99, "category": "audio", "in_stock": False},
})

result = product_agent.run_sync(
    "Recommend a good audio product for someone on a $100 budget who works in a noisy office.",
    deps=db
)

rec = result.data
print(f"Recommended: {rec.product_name}")
print(f"Price:       ${rec.price:.2f}")
print(f"In Stock:    {rec.in_stock}")
print(f"Reason:      {rec.reason}")

3.3 Streaming Structured Outputs¶

PydanticAI supports streaming structured results. You get validated partial objects as the model generates.

import asyncio
from pydantic_ai import Agent
from pydantic import BaseModel
from typing import List

class TechAnalysis(BaseModel):
    technology: str
    pros: List[str]
    cons: List[str]
    verdict: str
    score: int  # 1-10

analysis_agent = Agent(
    'openai:gpt-4o-mini',
    result_type=TechAnalysis,
    system_prompt="Provide balanced technical analysis."
)

async def stream_analysis(topic: str):
    async with analysis_agent.run_stream(f"Analyze: {topic}") as stream:
        # stream() yields progressively more complete, validated partial results
        async for partial in stream.stream():
            if getattr(partial, "technology", None):
                print(f"\rAnalyzing: {partial.technology}", end="", flush=True)
        result = await stream.get_data()
    return result

# Run async
analysis = asyncio.run(stream_analysis("Rust vs Python for data pipelines"))
print(f"\nTechnology: {analysis.technology}")
print(f"Score:      {analysis.score}/10")
print(f"Pros:       {analysis.pros[:2]}")
print(f"Verdict:    {analysis.verdict[:100]}...")

3.4 PydanticAI vs LangChain¶

Both build LLM applications. They have different philosophies.

comparison = [
    ("Type safety",        "Full Pydantic types end-to-end",    "Partial (improving in v0.3)"),
    ("Dependency inject.", "First-class, typed RunContext",      "Workaround via config/callback"),
    ("Learning curve",     "Low (standard Python + Pydantic)",  "Steep (many abstractions)"),
    ("Ecosystem",          "Smaller, focused",                   "Large (100s of integrations)"),
    ("Streaming",          "Typed partial models",               "Token streams + callbacks"),
    ("Testing",            "Easy (inject mock deps)",            "Complex (mock chains)"),
    ("Observability",      "Logfire (native)",                   "LangSmith, Langfuse"),
    ("Memory",             "Bring your own",                     "Built-in (ConversationBuffer, etc)"),
    ("Agent framework",    "Agents, tools, retries built-in",    "LangGraph for complex flows"),
    ("Best for",           "Type-safe production agents",        "Rapid prototyping, broad integrations"),
]

print(f"{'Feature':<22} | {'PydanticAI':<40} | {'LangChain'}")
print("-" * 100)
for row in comparison:
    print(f"{row[0]:<22} | {row[1]:<40} | {row[2]}")

Part 4 – DSPy¶

Declarative Self-Improving Python¶

The core problem with manual prompt engineering:

  1. You spend days crafting a prompt that works great on GPT-4o

  2. You switch to Claude or Llama – it breaks

  3. You update the model version – it degrades

  4. You can't explain why a particular phrasing works

  5. Adding more examples or context is trial and error

DSPy's answer: Stop writing prompts. Declare what you want (input fields → output fields). Let the optimizer find the best prompts automatically through data-driven search.

Traditional:  You → write prompts → LLM → outputs
DSPy:         You → define signatures → DSPy optimizer → optimized prompts → LLM → outputs

# pip install dspy
import dspy
from typing import Literal

# Configure the language model
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

# --- Signature: the contract between input and output ---
# The docstring becomes the task description in the prompt
class SentimentAnalysis(dspy.Signature):
    """Classify the sentiment of the given text."""
    text: str = dspy.InputField(desc="The text to analyze")
    sentiment: Literal["positive", "negative", "neutral"] = dspy.OutputField(
        desc="Sentiment classification"
    )
    confidence: float = dspy.OutputField(
        desc="Confidence score between 0 and 1"
    )

# Predict module: simplest DSPy module
classifier = dspy.Predict(SentimentAnalysis)

# Test it
tests = [
    "I absolutely love this product! Best purchase I've made.",
    "The package arrived damaged and customer support was useless.",
    "It's okay. Does what it says, nothing special.",
]

print("DSPy Predict (no manual prompt needed):")
print()
for text in tests:
    result = classifier(text=text)
    label = text if len(text) <= 60 else text[:60] + "..."
    print(f"Text:       {label}")
    print(f"Sentiment:  {result.sentiment}")
    print(f"Confidence: {result.confidence:.2f}")
    print()

4.2 ChainOfThought – Automatic Reasoning¶

Replace dspy.Predict with dspy.ChainOfThought and DSPy automatically adds step-by-step reasoning to the prompt – no manual "think step by step" required.

class MathWordProblem(dspy.Signature):
    """Solve a math word problem step by step."""
    problem: str = dspy.InputField()
    answer: float = dspy.OutputField(desc="The numerical answer")
    unit: str = dspy.OutputField(desc="Unit of the answer (e.g. 'dollars', 'miles', 'hours')")

# ChainOfThought automatically includes reasoning in the prompt
solver = dspy.ChainOfThought(MathWordProblem)

problems = [
    "A train travels at 60 mph. If it leaves at 9am and arrives at 2pm, how far did it travel?",
    "A store has 144 apples. They sell 60% on Monday and 25% of what's left on Tuesday. How many remain?",
]

for problem in problems:
    result = solver(problem=problem)
    print(f"Problem: {problem}")
    print(f"Reasoning: {result.reasoning[:150]}...")
    print(f"Answer: {result.answer} {result.unit}")
    print()

4.3 Building a Multi-Step DSPy Pipeline¶

DSPy modules compose like PyTorch layers. Build complex pipelines by combining signatures.

from typing import List

# --- Step 1: Extract key claims from a document ---
class ExtractClaims(dspy.Signature):
    """Extract the main factual claims from a document."""
    document: str = dspy.InputField()
    claims: List[str] = dspy.OutputField(
        desc="List of specific factual claims made in the document"
    )

# --- Step 2: Assess credibility of each claim ---
class AssessClaim(dspy.Signature):
    """Assess whether a claim is likely true, false, or uncertain."""
    claim: str = dspy.InputField()
    verdict: Literal["likely_true", "likely_false", "uncertain"] = dspy.OutputField()
    reasoning: str = dspy.OutputField(desc="Brief explanation of the assessment")

# --- Step 3: Compose into a fact-checking pipeline ---
class FactChecker(dspy.Module):
    def __init__(self):
        super().__init__()
        self.extract = dspy.Predict(ExtractClaims)
        self.assess = dspy.ChainOfThought(AssessClaim)

    def forward(self, document: str) -> dict:
        # Extract claims from document
        extraction = self.extract(document=document)
        
        # Assess each claim
        assessments = []
        for claim in extraction.claims[:3]:  # limit to 3 for speed
            assessment = self.assess(claim=claim)
            assessments.append({
                "claim": claim,
                "verdict": assessment.verdict,
                "reasoning": assessment.reasoning
            })
        
        return {
            "claims_found": len(extraction.claims),
            "assessments": assessments
        }

fact_checker = FactChecker()

article = """
Scientists at MIT announced that daily coffee consumption reduces Alzheimer's risk by 65%.
The study followed 10,000 participants over 20 years. Coffee contains antioxidants that 
protect neurons. The WHO recommends 3 cups per day for adults over 50. Green tea has
similar effects, according to separate research from Harvard Medical School.
"""

results = fact_checker(document=article)
print(f"Claims found: {results['claims_found']}")
print()
for a in results['assessments']:
    claim_label = a['claim'] if len(a['claim']) <= 80 else a['claim'][:80] + "..."
    print(f"Claim:    {claim_label}")
    print(f"Verdict:  {a['verdict']}")
    print(f"Reason:   {a['reasoning'][:100]}...")
    print()

4.4 MIPRO Optimizer – Auto-Generate Better Prompts¶

MIPRO (Multi-prompt Instruction PRoposal Optimizer) is DSPy's most powerful optimizer. It:

  1. Proposes new instruction candidates

  2. Generates few-shot demonstrations from your training data

  3. Runs a Bayesian search over instruction + demo combinations

  4. Evaluates each combination against your metric

  5. Returns the best-performing prompt configuration
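The outer loop can be caricatured without any LLM calls. Below, candidate instructions and demo subsets are enumerated and scored by a metric; everything (the instruction strings, the demo pool, the scoring function) is synthetic, standing in for MIPROv2's LLM-proposed instructions, real dev-set accuracy, and Bayesian search rather than brute-force enumeration.

```python
import itertools

instructions = [
    "Answer the question.",
    "Answer accurately and concisely.",
    "Think step by step, then answer concisely.",
]
demo_pool = [
    "Q: What is the capital of France? A: Paris",
    "Q: What is 15% of 200? A: 30",
    "Q: Who wrote Hamlet? A: William Shakespeare",
]

def score(instruction: str, demos: tuple) -> float:
    """Synthetic metric standing in for dev-set accuracy of a compiled program."""
    return 0.4 + 0.1 * len(demos) + (0.2 if "step by step" in instruction else 0.0)

# Enumerate every (instruction, demo-subset) candidate and keep the best
candidates = [
    (inst, demos)
    for inst in instructions
    for r in range(len(demo_pool) + 1)
    for demos in itertools.combinations(demo_pool, r)
]
best = max(candidates, key=lambda c: score(*c))
print("Best instruction:", best[0])
print("Few-shot demos:  ", len(best[1]))
```

The real optimizer's advantage is exactly that the search, not a human, decides which instruction and which demonstrations win.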

# Full optimization pipeline (requires training data + evaluation metric)
# This demonstrates the structure – run with larger data for real optimization

import dspy
from dspy.evaluate import Evaluate
from dspy.teleprompt import MIPROv2

# --- 1. Define the task ---
class QuestionAnswer(dspy.Signature):
    """Answer questions accurately and concisely."""
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="A concise, accurate answer")

class QAModule(dspy.Module):
    def __init__(self):
        super().__init__()
        self.qa = dspy.ChainOfThought(QuestionAnswer)

    def forward(self, question: str) -> dspy.Prediction:
        return self.qa(question=question)

# --- 2. Training data (DSPy examples) ---
train_data = [
    dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question"),
    dspy.Example(question="Who wrote Hamlet?", answer="William Shakespeare").with_inputs("question"),
    dspy.Example(question="What is 15% of 200?", answer="30").with_inputs("question"),
    dspy.Example(question="What year did the Berlin Wall fall?", answer="1989").with_inputs("question"),
    dspy.Example(question="How many bones in the human body?", answer="206").with_inputs("question"),
]

# Dev set for evaluation
dev_data = [
    dspy.Example(question="What is the speed of light in km/s?", answer="299,792").with_inputs("question"),
    dspy.Example(question="Who painted the Mona Lisa?", answer="Leonardo da Vinci").with_inputs("question"),
]

# --- 3. Evaluation metric ---
def exact_match_metric(example, prediction, trace=None):
    """Check if key words from the expected answer appear in the prediction."""
    expected = example.answer.lower()
    predicted = prediction.answer.lower()
    # Simple check: expected answer is contained in prediction
    return expected in predicted

# Test baseline (unoptimized)
baseline = QAModule()
evaluator = Evaluate(devset=dev_data, metric=exact_match_metric, num_threads=1)
baseline_score = evaluator(baseline)
print(f"Baseline score: {baseline_score:.1f}%")

# --- 4. Optimize with MIPROv2 ---
# (Using light settings for demonstration – production would use more trials)
print("\nRunning MIPROv2 optimization...")
optimizer = MIPROv2(
    metric=exact_match_metric,
    auto="light"  # 'light', 'medium', or 'heavy'
)

optimized_qa = optimizer.compile(
    QAModule(),
    trainset=train_data,
    num_trials=10,   # more trials = better results
    requires_permission_to_run=False  # skip interactive prompt
)

# Evaluate optimized version
optimized_score = evaluator(optimized_qa)
print(f"Optimized score: {optimized_score:.1f}%")
print(f"Improvement: +{optimized_score - baseline_score:.1f}%")

4.5 Inspect What the Optimizer Found¶

DSPy prompts are transparent – you can see exactly what the optimizer generated.

# Inspect the optimized prompt
print("=== Optimized Prompt (auto-generated by MIPRO) ===")
print()
try:
    # Get the actual prompt that will be sent to the LLM
    lm.inspect_history(n=1)
except Exception:
    pass

# See the signatures with their optimized instructions
for name, module in optimized_qa.named_predictors():
    print(f"Module: {name}")
    print(f"Instructions: {module.signature.instructions}")
    print(f"Demos: {len(module.demos)} few-shot examples")
    if module.demos:
        print(f"First demo: {module.demos[0]}")
    print()

# Save optimized program for reuse
# optimized_qa.save("optimized_qa.json")
# Later: loaded_qa = QAModule(); loaded_qa.load("optimized_qa.json")
print("To persist: optimized_qa.save('optimized_qa.json')")
print("To reload:  qa = QAModule(); qa.load('optimized_qa.json')")

4.6 DSPy vs Manual Prompting – When to Use Each¶

print("=== DSPy vs Manual Prompting ===")
print()

use_dspy = [
    "You have labeled data and want to maximize accuracy",
    "Prompts break when you switch models (GPT -> Claude, etc.)",
    "You have a complex multi-step pipeline to optimize end-to-end",
    "You need to iterate quickly without manual prompt engineering",
    "You update your LLM provider and don't want to re-engineer prompts",
    "Production pipelines where small accuracy gains have high ROI",
]

use_manual = [
    "One-off scripts or prototypes",
    "You have no training data to optimize against",
    "The task is simple and prompts are stable",
    "Tight latency requirements (optimization adds overhead)",
    "You need full control over exact prompt wording",
]

print("USE DSPy when:")
for item in use_dspy:
    print(f"  + {item}")

print()
print("USE manual prompts when:")
for item in use_manual:
    print(f"  - {item}")

print()
print("=" * 60)
print("Key insight: DSPy doesn't replace prompts – it writes them for you.")
print("You define WHAT you want. DSPy figures out HOW to ask for it.")

Summary – Choosing the Right Tool¶

                    STRUCTURED OUTPUT LANDSCAPE (2025-2026)

  API MODEL?                      LOCAL MODEL?
  
  Need typed extraction? ──────── Need 100% format guarantee?
       |                                   |
   Instructor                           Outlines
  (parse + retry)                 (token-level mask)
  
  Building agents?                 Optimizing pipelines?
       |                                   |
   PydanticAI                           DSPy
  (type-safe deps)                 (auto-prompt search)
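The decision tree above, written as a literal (and deliberately rough) dispatch function:

```python
def choose_tool(local_model: bool, building_agents: bool, optimizing_pipeline: bool) -> str:
    """Rough dispatch mirroring the diagram above - a heuristic, not a hard rule."""
    if optimizing_pipeline:
        return "DSPy"         # data-driven prompt optimization
    if building_agents:
        return "PydanticAI"   # typed agents with dependency injection
    return "Outlines" if local_model else "Instructor"

print(choose_tool(local_model=False, building_agents=False, optimizing_pipeline=False))
print(choose_tool(local_model=True, building_agents=False, optimizing_pipeline=False))
```

In practice these combine: an Instructor extraction step can feed a PydanticAI agent, and DSPy can optimize the prompts of either.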

| Tool | Install | Core Value |
| --- | --- | --- |
| Instructor | pip install instructor | Pydantic models from any LLM API, 15+ providers |
| Outlines | pip install outlines | Token-level constraints for local models |
| PydanticAI | pip install pydantic-ai | Type-safe agents with dependency injection |
| DSPy | pip install dspy | Auto-optimize prompts from data, not intuition |

The production stack (2025-2026):

  • Use Instructor for data extraction from documents, APIs, emails

  • Use Outlines for local model deployments where format must be guaranteed

  • Use PydanticAI for multi-tool agents that need testability and type safety

  • Use DSPy when prompt brittleness is costing you β€” let data drive the prompts

# Quick reference: install commands
installs = [
    ("Instructor",  "pip install instructor"),
    ("Outlines",    "pip install outlines transformers torch"),
    ("PydanticAI",  "pip install pydantic-ai"),
    ("DSPy",        "pip install dspy"),
    ("All at once", "pip install instructor outlines pydantic-ai dspy"),
]

print("Install commands:")
for name, cmd in installs:
    print(f"  {name:<12}  {cmd}")

print()
print("Documentation:")
docs = [
    ("Instructor",  "https://python.useinstructor.com"),
    ("Outlines",    "https://dottxt-ai.github.io/outlines"),
    ("PydanticAI",  "https://ai.pydantic.dev"),
    ("DSPy",        "https://dspy.ai"),
]
for name, url in docs:
    print(f"  {name:<12}  {url}")