Structured LLM Outputs & Programmatic Prompting (2025-2026)
LLMs return free-form text. Applications need structured, typed data. This notebook covers the four most important tools for taming LLM outputs:
| Tool | Approach | Best For |
|---|---|---|
| Instructor | Retry + parse after generation | API-backed models, multi-provider |
| Outlines | Token-level constraints at generation time | Local models, guaranteed format |
| PydanticAI | Type-safe agents with dependency injection | Production agents, typed pipelines |
| DSPy | Declarative self-improving prompts | Auto-optimized pipelines, prompt brittleness |
Prerequisites
pip install instructor outlines pydantic-ai dspy openai anthropic transformers torch
import os
from dotenv import load_dotenv
load_dotenv()
# Verify key is available
api_key = os.getenv('OPENAI_API_KEY')
print('OpenAI key found:', bool(api_key))
Part 1 – Instructor
3 Million Monthly Downloads. The Standard for Structured Extraction.
The problem: Every app that uses LLMs eventually writes the same boilerplate:
Call the model
Parse the text response
Validate it matches expected structure
Retry if parsing fails
Map it to a Python object
Instructor's solution: Wrap any OpenAI-compatible client with one line. Pass a Pydantic model as response_model. Get back a typed Python object, with automatic retry on validation failure.
# pip install instructor
import instructor
from openai import OpenAI
from pydantic import BaseModel
# One-line patch: wraps the standard OpenAI client
client = instructor.from_openai(OpenAI())
# Define your expected shape with Pydantic
class User(BaseModel):
name: str
age: int
email: str
# Extract structured data from natural language
user = client.chat.completions.create(
model="gpt-4o-mini",
response_model=User,
messages=[{
"role": "user",
"content": "Extract: John Doe, 30 years old, john@example.com"
}]
)
print(f"Name : {user.name}")
print(f"Age : {user.age}")
print(f"Email: {user.email}")
print(f"Type : {type(user)}")
1.2 Nested Models – Complex Document Extraction
Real documents have hierarchy. Instructor handles nested Pydantic models naturally.
from pydantic import BaseModel
from typing import List, Optional
class Address(BaseModel):
street: str
city: str
state: str
zip_code: str
class LineItem(BaseModel):
description: str
quantity: int
unit_price: float
total: float
class Invoice(BaseModel):
invoice_number: str
vendor_name: str
vendor_address: Address
line_items: List[LineItem]
subtotal: float
tax_rate: float
total_due: float
due_date: str
raw_invoice = """
INVOICE #INV-2025-0042
From: Acme Corp, 123 Main St, Springfield, IL 62701
Due: March 15, 2025
Items:
- 5x Widget Pro @ $49.99 each = $249.95
- 2x Support Contract @ $199.00 each = $398.00
- 1x Setup Fee @ $75.00 = $75.00
Subtotal: $722.95
Tax (8.5%): $61.45
TOTAL DUE: $784.40
"""
invoice = client.chat.completions.create(
model="gpt-4o-mini",
response_model=Invoice,
messages=[{
"role": "user",
"content": f"Extract all invoice data from this document:\n{raw_invoice}"
}]
)
print(f"Invoice #: {invoice.invoice_number}")
print(f"Vendor: {invoice.vendor_name}")
print(f"City: {invoice.vendor_address.city}, {invoice.vendor_address.state}")
print(f"\nLine Items:")
for item in invoice.line_items:
print(f" {item.description}: {item.quantity} x ${item.unit_price:.2f} = ${item.total:.2f}")
print(f"\nTotal Due: ${invoice.total_due:.2f}")
print(f"Due Date: {invoice.due_date}")
1.3 Semantic Validation – LLM-Powered Field Validation
Pydantic validators normally use code logic. Instructor lets you write validators in plain English using llm_validator.
from instructor import llm_validator
from pydantic import field_validator
from typing import Annotated
class ProductReview(BaseModel):
product_name: str
rating: int # 1-5
review_text: Annotated[
str,
llm_validator(
"Must be a genuine product review. Should not contain spam, "
"promotional content, or irrelevant information.",
client=instructor.from_openai(OpenAI())
)
]
@field_validator('rating')
@classmethod
def rating_in_range(cls, v):
if not 1 <= v <= 5:
raise ValueError('Rating must be between 1 and 5')
return v
# Valid review
valid = client.chat.completions.create(
model="gpt-4o-mini",
response_model=ProductReview,
messages=[{
"role": "user",
"content": "Product: Wireless Earbuds. Rating: 4. Review: Great sound quality, "
"comfortable fit, battery lasts 8 hours. Mic quality could be better."
}]
)
print("Valid review extracted:")
print(f" Product: {valid.product_name}")
print(f" Rating: {valid.rating}/5")
print(f" Review: {valid.review_text[:80]}...")
1.4 Automatic Retry on Validation Failure
When the LLM returns something that fails Pydantic validation, Instructor automatically sends the error back to the model and asks it to fix the output, up to max_retries times.
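Conceptually, the correction loop works like this. A simplified, self-contained sketch with a fake LLM callable (the helper `extract_with_retry` is hypothetical; this is not Instructor's actual implementation):

```python
from pydantic import BaseModel, ValidationError

class Person(BaseModel):
    name: str
    age: int

def extract_with_retry(call_llm, prompt: str, max_retries: int = 3) -> Person:
    """Sketch of a validate-and-retry loop (hypothetical helper)."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_retries):
        raw = call_llm(messages)  # the LLM returns a JSON string
        try:
            return Person.model_validate_json(raw)
        except ValidationError as e:
            # Feed the validation error back so the model can self-correct
            messages.append({"role": "assistant", "content": raw})
            messages.append(
                {"role": "user", "content": f"Fix these validation errors: {e}"}
            )
    raise RuntimeError("Validation failed after retries")

# Fake LLM: returns invalid JSON once, then a valid object
responses = iter(['{"name": "Bo", "age": "unknown"}',
                  '{"name": "Bo", "age": 30}'])
person = extract_with_retry(lambda msgs: next(responses), "Extract: Bo, 30")
print(person)  # name='Bo' age=30
```

The first response fails because `"unknown"` is not an int; the error message goes back into the conversation, and the second attempt validates.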
from pydantic import field_validator
class StrictAge(BaseModel):
name: str
age: int
@field_validator('age')
@classmethod
def must_be_adult(cls, v):
if v < 18:
raise ValueError(f'Age {v} is below minimum of 18. Must provide age for an adult.')
return v
# Instructor will retry if the model gives age < 18
# max_retries=3 means up to 3 correction attempts
person = client.chat.completions.create(
model="gpt-4o-mini",
response_model=StrictAge,
max_retries=3,
messages=[{
"role": "user",
"content": "Extract adult contact: Sarah Johnson, 25 years old."
}]
)
print(f"Extracted adult: {person.name}, age {person.age}")
1.5 Streaming Partial Objects
For long responses or large structured objects, Instructor supports streaming. You receive partial, validated objects as tokens arrive.
class ResearchSummary(BaseModel):
title: str
key_findings: List[str]
methodology: str
conclusion: str
print("Streaming partial object...\n")
# create_partial returns an iterator of partial ResearchSummary objects
for partial_summary in client.chat.completions.create_partial(
model="gpt-4o-mini",
response_model=ResearchSummary,
messages=[{
"role": "user",
"content": "Summarize a study on how sleep affects memory consolidation."
}]
):
# Each iteration gives more of the object filled in
if partial_summary.title:
print(f"\rTitle: {partial_summary.title[:60]}", end="", flush=True)
print("\n\nFinal object:")
print(f"Title: {partial_summary.title}")
print(f"Findings: {len(partial_summary.key_findings or [])} items")
print(f"Conclusion: {(partial_summary.conclusion or '')[:100]}...")
1.6 Multi-Provider Support
Instructor patches any OpenAI-compatible client with the same API. Swap providers by changing one line.
# ----- Anthropic Claude -----
# import anthropic
# claude_client = instructor.from_anthropic(anthropic.Anthropic())
# result = claude_client.messages.create(
# model="claude-3-5-haiku-20241022",
# response_model=User,
# messages=[{"role": "user", "content": "Extract: Alice, 28, alice@example.com"}],
# max_tokens=1024
# )
# ----- Ollama (local) -----
# from openai import OpenAI as OllamaClient
# ollama_client = instructor.from_openai(
# OllamaClient(base_url="http://localhost:11434/v1", api_key="ollama"),
# mode=instructor.Mode.JSON
# )
# result = ollama_client.chat.completions.create(
# model="llama3.2",
# response_model=User,
# messages=[{"role": "user", "content": "Extract: Alice, 28, alice@example.com"}]
# )
# ----- Google Gemini -----
# import google.generativeai as genai
# gemini_client = instructor.from_gemini(
# client=genai.GenerativeModel(model_name="models/gemini-1.5-flash-latest")
# )
print("Instructor supports 15+ providers:")
providers = [
"OpenAI (gpt-4o, gpt-4o-mini, o1)",
"Anthropic (claude-3-5-sonnet, claude-3-haiku)",
"Google Gemini (gemini-1.5-flash, gemini-1.5-pro)",
"Ollama (llama3, mistral, qwen – local)",
"Cohere, Mistral, Groq, Fireworks, Together AI",
"Azure OpenAI, AWS Bedrock, Vertex AI",
]
for p in providers:
print(f" - {p}")
print("\nSame response_model= API across all providers.")
Part 2 – Outlines
Token-Level Constrained Generation
The key difference from Instructor:
Instructor: Generate text freely, then parse + retry until it fits
Outlines: At each token step, mask out tokens that would violate the schema
This means Outlines physically cannot produce invalid output. No retries needed. Works at inference time on local models.
Step 1: Model wants to emit next token
Step 2: Outlines checks JSON schema / regex / grammar
Step 3: Any token that would make output invalid gets probability = 0
Step 4: Model picks from valid tokens only
Result: Output is always valid β by construction
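These steps can be simulated without a model. A toy sketch of choice-constrained decoding with single-character tokens (helper names are hypothetical; the real Outlines compiles schemas into finite-state machines over the tokenizer's vocabulary):

```python
def allowed_next_tokens(partial: str, options: list[str]) -> set[str]:
    """Tokens that keep `partial` a prefix of at least one allowed option.
    Everything else would get probability 0 (steps 2-3 above)."""
    return {opt[len(partial)] for opt in options
            if opt.startswith(partial) and len(opt) > len(partial)}

def constrained_decode(options: list[str]) -> str:
    """Greedy stand-in for the model: always picks the first valid token."""
    out = ""
    while out not in options:
        tokens = allowed_next_tokens(out, options)
        if not tokens:  # dead end: cannot happen with a correct mask
            raise ValueError("no valid continuation")
        out += sorted(tokens)[0]
    return out

# Valid by construction: no token outside the option set can ever be emitted
print(constrained_decode(["positive", "negative", "neutral"]))  # negative
```

After the first token `n`, only continuations of "negative" and "neutral" remain reachable; the output is always one of the three options, never free text.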
# pip install outlines transformers torch
# Outlines works with local models via HuggingFace transformers
# ----- Basic JSON generation from local model -----
# import outlines
# from pydantic import BaseModel
#
# # Load a small local model (downloads ~1GB on first run)
# model = outlines.models.transformers("Qwen/Qwen2.5-1.5B-Instruct")
#
# class User(BaseModel):
# name: str
# age: int
# email: str
#
# # Create a JSON generator constrained to User schema
# generator = outlines.generate.json(model, User)
#
# user = generator(
# "Extract user information from: Alice Smith, 32 years old, alice@example.com"
# )
# print(user)        # Always a valid User instance, guaranteed
# print(type(user))  # <class '__main__.User'>
print("Outlines constrained generation concepts:")
print()
print("outlines.generate.json(model, PydanticModel) -> always-valid JSON")
print("outlines.generate.regex(model, pattern) -> regex-constrained output")
print("outlines.generate.choice(model, options) -> force pick from list")
print("outlines.generate.cfg(model, grammar) -> context-free grammar")
print()
print("Supported backends:")
print(" outlines.models.transformers('Qwen/Qwen2.5-1.5B-Instruct')")
print(" outlines.models.llamacpp(model_path)")
print(" outlines.models.vllm('mistralai/Mistral-7B-v0.1')")
2.2 Regex-Constrained Generation
Enforce exact formats (dates, phone numbers, IDs) without any post-processing.
# ----- Regex patterns Outlines can enforce -----
# US phone number: (555) 867-5309
PHONE_PATTERN = r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}"
# ISO date: 2025-03-15
DATE_PATTERN = r"[0-9]{4}-[0-9]{2}-[0-9]{2}"
# US ZIP code: 90210 or 90210-1234
ZIP_PATTERN = r"[0-9]{5}(-[0-9]{4})?"
# Credit card number: 4242 4242 4242 4242
CC_PATTERN = r"[0-9]{4} [0-9]{4} [0-9]{4} [0-9]{4}"
# IPv4 address (simplified: allows out-of-range octets like 999)
IPV4_PATTERN = r"([0-9]{1,3}\.){3}[0-9]{1,3}"
# With Outlines (commented since requires GPU/large model):
# generator = outlines.generate.regex(model, DATE_PATTERN)
# date = generator("When did WWII end? Answer with just the date:")
# # Output will ALWAYS match YYYY-MM-DD, no parsing needed
import re
# Simulate what Outlines guarantees (demo without local model)
examples = [
("Phone", PHONE_PATTERN, "(415) 555-1234"),
("Date", DATE_PATTERN, "2025-03-15"),
("ZIP", ZIP_PATTERN, "94102-3456"),
("IPv4", IPV4_PATTERN, "192.168.1.100"),
]
print("Pattern validation (what Outlines guarantees at token level):")
for name, pattern, example in examples:
match = bool(re.fullmatch(pattern, example))
print(f" {name:10} | Pattern: {pattern:35} | Example: {example:20} | Valid: {match}")
2.3 Choice Selection and Outlines with Ollama
Force the model to pick from a predefined list of options, useful for classification, routing, and enum fields.
# ----- Choice selection (local model) -----
# With Outlines, the model CAN ONLY emit one of the listed tokens
# No "I think it's probably positive", just "positive"
#
# import outlines
#
# model = outlines.models.transformers("Qwen/Qwen2.5-1.5B-Instruct")
#
# # Force sentiment to exactly one of three values
# sentiment_gen = outlines.generate.choice(model, ["positive", "negative", "neutral"])
# sentiment = sentiment_gen("Classify: I absolutely love this product!")
# # sentiment is guaranteed to be "positive", "negative", or "neutral", nothing else
#
# # Force routing decision
# route_gen = outlines.generate.choice(model, ["billing", "technical", "sales", "general"])
# department = route_gen("My credit card was charged twice. Route to:")
# ----- Outlines + Ollama -----
# import outlines
# model = outlines.models.ollama("llama3.2", "http://localhost:11434")
# generator = outlines.generate.json(model, InvoiceSchema)
# result = generator("Extract invoice from: ...")
print("Outlines vs Instructor – when to use which:")
print()
comparison = [
("Generation approach", "Retry after failure", "Block invalid tokens"),
("Model type", "API models (GPT, Claude)","Local (transformers, vLLM, llama.cpp)"),
("Guarantee", "Eventually valid (retries)","Always valid (construction)"),
("Latency", "Higher (retries cost tokens)","Lower (no retries)"),
("Formats", "Any Pydantic schema", "JSON, regex, choice, CFG"),
("GPU required", "No", "Yes (local inference)"),
("Cost", "Per token (API)", "Infrastructure (local)"),
]
print(f"{'Feature':<22} | {'Instructor':<35} | {'Outlines'}")
print("-" * 90)
for row in comparison:
print(f"{row[0]:<22} | {row[1]:<35} | {row[2]}")
Part 3 – PydanticAI
Type-Safe Agents with Dependency Injection
Instructor gives you structured extraction. PydanticAI gives you structured agents, with:
Type-safe result models (Pydantic)
Dependency injection for tools (databases, APIs, config)
Automatic retry when validation fails
Streaming structured outputs
First-class support for multi-step reasoning
# pip install pydantic-ai
from pydantic_ai import Agent
from pydantic import BaseModel
class WeatherResult(BaseModel):
city: str
temperature_f: float
temperature_c: float
description: str
recommendation: str
# Agent with structured result type
weather_agent = Agent(
'openai:gpt-4o-mini',
result_type=WeatherResult,
system_prompt=(
"You are a helpful weather assistant. Always provide both Fahrenheit and Celsius. "
"Give a practical recommendation for the conditions."
)
)
result = weather_agent.run_sync("What's a typical mid-March day like in Chicago?")
weather = result.data
print(f"City: {weather.city}")
print(f"Temperature: {weather.temperature_f}F / {weather.temperature_c}C")
print(f"Conditions: {weather.description}")
print(f"Recommendation: {weather.recommendation}")
3.2 Tools with Dependency Injection
PydanticAI's key feature: inject real dependencies (databases, API clients, config) into tools without globals.
from pydantic_ai import Agent, RunContext
from pydantic import BaseModel
from dataclasses import dataclass
from typing import Dict
# --- Dependency: a simple product catalog ---
@dataclass
class ProductDB:
"""Injected database dependency."""
products: Dict[str, dict]
def get_product(self, name: str) -> dict | None:
return self.products.get(name.lower())
def search(self, category: str) -> list:
return [p for p in self.products.values() if p.get('category') == category]
class ProductRecommendation(BaseModel):
product_name: str
price: float
reason: str
in_stock: bool
# Create agent with typed dependency
product_agent: Agent[ProductDB, ProductRecommendation] = Agent(
'openai:gpt-4o-mini',
deps_type=ProductDB,
result_type=ProductRecommendation,
system_prompt="You are a product recommendation assistant. Use the available tools to look up products."
)
@product_agent.tool
def lookup_product(ctx: RunContext[ProductDB], product_name: str) -> str:
"""Look up a specific product by name."""
product = ctx.deps.get_product(product_name)
if product:
return str(product)
return f"Product '{product_name}' not found"
@product_agent.tool
def search_category(ctx: RunContext[ProductDB], category: str) -> str:
"""Search products in a category."""
items = ctx.deps.search(category)
return str(items) if items else f"No products in category '{category}'"
# Inject real dependency at runtime
db = ProductDB(products={
"wireless headphones": {"name": "Wireless Headphones", "price": 79.99, "category": "audio", "in_stock": True},
"noise cancelling buds": {"name": "Noise Cancelling Buds", "price": 149.99, "category": "audio", "in_stock": True},
"wired earphones": {"name": "Wired Earphones", "price": 29.99, "category": "audio", "in_stock": False},
})
result = product_agent.run_sync(
"Recommend a good audio product for someone on a $100 budget who works in a noisy office.",
deps=db
)
rec = result.data
print(f"Recommended: {rec.product_name}")
print(f"Price: ${rec.price:.2f}")
print(f"In Stock: {rec.in_stock}")
print(f"Reason: {rec.reason}")
3.3 Streaming Structured Outputs
PydanticAI supports streaming structured results. You get validated partial objects as the model generates.
import asyncio
from pydantic_ai import Agent
from pydantic import BaseModel
from typing import List
class TechAnalysis(BaseModel):
technology: str
pros: List[str]
cons: List[str]
verdict: str
score: int # 1-10
analysis_agent = Agent(
'openai:gpt-4o-mini',
result_type=TechAnalysis,
system_prompt="Provide balanced technical analysis."
)
async def stream_analysis(topic: str):
async with analysis_agent.run_stream(f"Analyze: {topic}") as stream:
async for partial in stream.stream():
# Each iteration = more fields filled in (partially validated)
if getattr(partial, "technology", None):
print(f"\rAnalyzing: {partial.technology}", end="", flush=True)
result = await stream.get_data()
return result
# Run async
analysis = asyncio.run(stream_analysis("Rust vs Python for data pipelines"))
print(f"\nTechnology: {analysis.technology}")
print(f"Score: {analysis.score}/10")
print(f"Pros: {analysis.pros[:2]}")
print(f"Verdict: {analysis.verdict[:100]}...")
3.4 PydanticAI vs LangChain
Both build LLM applications. They have different philosophies.
comparison = [
("Type safety", "Full Pydantic types end-to-end", "Partial (improving in v0.3)"),
("Dependency inject.", "First-class, typed RunContext", "Workaround via config/callback"),
("Learning curve", "Low (standard Python + Pydantic)", "Steep (many abstractions)"),
("Ecosystem", "Smaller, focused", "Large (100s of integrations)"),
("Streaming", "Typed partial models", "Token streams + callbacks"),
("Testing", "Easy (inject mock deps)", "Complex (mock chains)"),
("Observability", "Logfire (native)", "LangSmith, Langfuse"),
("Memory", "Bring your own", "Built-in (ConversationBuffer, etc)"),
("Agent framework", "Agents, tools, retries built-in", "LangGraph for complex flows"),
("Best for", "Type-safe production agents", "Rapid prototyping, broad integrations"),
]
print(f"{'Feature':<22} | {'PydanticAI':<40} | {'LangChain'}")
print("-" * 100)
for row in comparison:
print(f"{row[0]:<22} | {row[1]:<40} | {row[2]}")
Part 4 – DSPy
Declarative Self-Improving Python
The core problem with manual prompt engineering:
You spend days crafting a prompt that works great on GPT-4o
You switch to Claude or Llama and it breaks
You update the model version and it degrades
You can't explain why a particular phrasing works
Adding more examples or context is trial and error
DSPy's answer: Stop writing prompts. Declare what you want (input fields → output fields). Let the optimizer find the best prompts automatically through data-driven search.
Traditional: You → write prompts → LLM → outputs
DSPy: You → define signatures → DSPy optimizer → optimized prompts → LLM → outputs
# pip install dspy
import dspy
from typing import Literal
# Configure the language model
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)
# --- Signature: the contract between input and output ---
# The docstring becomes the task description in the prompt
class SentimentAnalysis(dspy.Signature):
"""Classify the sentiment of the given text."""
text: str = dspy.InputField(desc="The text to analyze")
sentiment: Literal["positive", "negative", "neutral"] = dspy.OutputField(
desc="Sentiment classification"
)
confidence: float = dspy.OutputField(
desc="Confidence score between 0 and 1"
)
# Predict module: simplest DSPy module
classifier = dspy.Predict(SentimentAnalysis)
# Test it
tests = [
"I absolutely love this product! Best purchase I've made.",
"The package arrived damaged and customer support was useless.",
"It's okay. Does what it says, nothing special.",
]
print("DSPy Predict (no manual prompt needed):")
print()
for text in tests:
result = classifier(text=text)
print(f"Text: {text[:60]}..." if len(text) > 60 else f"Text: {text}")
print(f"Sentiment: {result.sentiment}")
print(f"Confidence: {result.confidence:.2f}")
print()
4.2 ChainOfThought – Automatic Reasoning
Replace dspy.Predict with dspy.ChainOfThought and DSPy automatically adds step-by-step reasoning to the prompt; no manual "think step by step" required.
class MathWordProblem(dspy.Signature):
"""Solve a math word problem step by step."""
problem: str = dspy.InputField()
answer: float = dspy.OutputField(desc="The numerical answer")
unit: str = dspy.OutputField(desc="Unit of the answer (e.g. 'dollars', 'miles', 'hours')")
# ChainOfThought automatically includes reasoning in the prompt
solver = dspy.ChainOfThought(MathWordProblem)
problems = [
"A train travels at 60 mph. If it leaves at 9am and arrives at 2pm, how far did it travel?",
"A store has 144 apples. They sell 60% on Monday and 25% of what's left on Tuesday. How many remain?",
]
for problem in problems:
result = solver(problem=problem)
print(f"Problem: {problem}")
print(f"Reasoning: {result.reasoning[:150]}...")
print(f"Answer: {result.answer} {result.unit}")
print()
4.3 Building a Multi-Step DSPy Pipeline
DSPy modules compose like PyTorch layers. Build complex pipelines by combining signatures.
from typing import List
# --- Step 1: Extract key claims from a document ---
class ExtractClaims(dspy.Signature):
"""Extract the main factual claims from a document."""
document: str = dspy.InputField()
claims: List[str] = dspy.OutputField(
desc="List of specific factual claims made in the document"
)
# --- Step 2: Assess credibility of each claim ---
class AssessClaim(dspy.Signature):
"""Assess whether a claim is likely true, false, or uncertain."""
claim: str = dspy.InputField()
verdict: Literal["likely_true", "likely_false", "uncertain"] = dspy.OutputField()
reasoning: str = dspy.OutputField(desc="Brief explanation of the assessment")
# --- Step 3: Compose into a fact-checking pipeline ---
class FactChecker(dspy.Module):
def __init__(self):
super().__init__()
self.extract = dspy.Predict(ExtractClaims)
self.assess = dspy.ChainOfThought(AssessClaim)
def forward(self, document: str) -> dict:
# Extract claims from document
extraction = self.extract(document=document)
# Assess each claim
assessments = []
for claim in extraction.claims[:3]: # limit to 3 for speed
assessment = self.assess(claim=claim)
assessments.append({
"claim": claim,
"verdict": assessment.verdict,
"reasoning": assessment.reasoning
})
return {
"claims_found": len(extraction.claims),
"assessments": assessments
}
fact_checker = FactChecker()
article = """
Scientists at MIT announced that daily coffee consumption reduces Alzheimer's risk by 65%.
The study followed 10,000 participants over 20 years. Coffee contains antioxidants that
protect neurons. The WHO recommends 3 cups per day for adults over 50. Green tea has
similar effects, according to separate research from Harvard Medical School.
"""
results = fact_checker(document=article)
print(f"Claims found: {results['claims_found']}")
print()
for a in results['assessments']:
print(f"Claim: {a['claim'][:80]}..." if len(a['claim']) > 80 else f"Claim: {a['claim']}")
print(f"Verdict: {a['verdict']}")
print(f"Reason: {a['reasoning'][:100]}...")
print()
4.4 MIPRO Optimizer – Auto-Generate Better Prompts
MIPRO (Multi-prompt Instruction PRoposal Optimizer) is DSPy's most powerful optimizer. It:
Proposes new instruction candidates
Generates few-shot demonstrations from your training data
Runs a Bayesian search over instruction + demo combinations
Evaluates each combination against your metric
Returns the best-performing prompt configuration
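The search loop above can be sketched in a few lines. A toy stand-in that exhaustively scores (instruction, demo) configurations against a made-up metric, where real MIPROv2 uses Bayesian optimization over sampled configurations:

```python
from itertools import combinations, product

def optimize(instructions, demo_pool, metric, k=1):
    """Toy stand-in for MIPRO: score every (instruction, demo-subset)
    configuration and keep the best. Hypothetical helper, not DSPy's API."""
    best, best_score = None, float("-inf")
    for inst, demos in product(instructions, combinations(demo_pool, k)):
        score = metric(inst, demos)
        if score > best_score:
            best, best_score = (inst, demos), score
    return best, best_score

# Made-up metric: reward instructions that ask for concision and short demos
metric = lambda inst, demos: ("concise" in inst) + sum(len(d) < 20 for d in demos)

best, score = optimize(
    ["Answer the question.", "Answer concisely and accurately."],
    ["Q: 2+2? A: 4", "Q: capital of France? A: Paris is the capital city"],
    metric,
)
print(best[0], score)  # Answer concisely and accurately. 2
```

In real use the metric is your task-level evaluation (like exact_match_metric below), so the optimizer's "best configuration" is the prompt that actually scores highest on your data.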
# Full optimization pipeline (requires training data + evaluation metric)
# This demonstrates the structure; run with larger data for real optimization
import dspy
from dspy.evaluate import Evaluate
from dspy.teleprompt import MIPROv2
# --- 1. Define the task ---
class QuestionAnswer(dspy.Signature):
"""Answer questions accurately and concisely."""
question: str = dspy.InputField()
answer: str = dspy.OutputField(desc="A concise, accurate answer")
class QAModule(dspy.Module):
def __init__(self):
super().__init__()
self.qa = dspy.ChainOfThought(QuestionAnswer)
def forward(self, question: str) -> dspy.Prediction:
return self.qa(question=question)
# --- 2. Training data (DSPy examples) ---
train_data = [
dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question"),
dspy.Example(question="Who wrote Hamlet?", answer="William Shakespeare").with_inputs("question"),
dspy.Example(question="What is 15% of 200?", answer="30").with_inputs("question"),
dspy.Example(question="What year did the Berlin Wall fall?", answer="1989").with_inputs("question"),
dspy.Example(question="How many bones in the human body?", answer="206").with_inputs("question"),
]
# Dev set for evaluation
dev_data = [
dspy.Example(question="What is the speed of light in km/s?", answer="299,792").with_inputs("question"),
dspy.Example(question="Who painted the Mona Lisa?", answer="Leonardo da Vinci").with_inputs("question"),
]
# --- 3. Evaluation metric ---
def exact_match_metric(example, prediction, trace=None):
"""Check if key words from the expected answer appear in the prediction."""
expected = example.answer.lower()
predicted = prediction.answer.lower()
# Simple check: expected answer is contained in prediction
return expected in predicted
# Test baseline (unoptimized)
baseline = QAModule()
evaluator = Evaluate(devset=dev_data, metric=exact_match_metric, num_threads=1)
baseline_score = evaluator(baseline)
print(f"Baseline score: {baseline_score:.1f}%")
# --- 4. Optimize with MIPROv2 ---
# (auto="light" keeps the search small for demonstration; production would use 'medium' or 'heavy')
print("\nRunning MIPROv2 optimization...")
optimizer = MIPROv2(
metric=exact_match_metric,
auto="light"  # 'light', 'medium', or 'heavy' controls the trial budget
)
optimized_qa = optimizer.compile(
QAModule(),
trainset=train_data,
requires_permission_to_run=False  # skip the interactive confirmation prompt
)
# Evaluate optimized version
optimized_score = evaluator(optimized_qa)
print(f"Optimized score: {optimized_score:.1f}%")
print(f"Improvement: +{optimized_score - baseline_score:.1f}%")
4.5 Inspect What the Optimizer Found
DSPy prompts are transparent: you can see exactly what the optimizer generated.
# Inspect the optimized prompt
print("=== Optimized Prompt (auto-generated by MIPRO) ===")
print()
try:
# Show the most recent prompt actually sent to the LLM
dspy.inspect_history(n=1)
except Exception:
pass
# See the signatures with their optimized instructions
for name, module in optimized_qa.named_predictors():
print(f"Module: {name}")
print(f"Instructions: {module.signature.instructions}")
print(f"Demos: {len(module.demos)} few-shot examples")
if module.demos:
print(f"First demo: {module.demos[0]}")
print()
# Save optimized program for reuse
# optimized_qa.save("optimized_qa.json")
# Later: loaded_qa = QAModule(); loaded_qa.load("optimized_qa.json")
print("To persist: optimized_qa.save('optimized_qa.json')")
print("To reload: qa = QAModule(); qa.load('optimized_qa.json')")
4.6 DSPy vs Manual Prompting – When to Use Each
print("=== DSPy vs Manual Prompting ===")
print()
use_dspy = [
"You have labeled data and want to maximize accuracy",
"Prompts break when you switch models (GPT -> Claude, etc.)",
"You have a complex multi-step pipeline to optimize end-to-end",
"You need to iterate quickly without manual prompt engineering",
"You update your LLM provider and don't want to re-engineer prompts",
"Production pipelines where small accuracy gains have high ROI",
]
use_manual = [
"One-off scripts or prototypes",
"You have no training data to optimize against",
"The task is simple and prompts are stable",
"Tight latency requirements (optimization adds overhead)",
"You need full control over exact prompt wording",
]
print("USE DSPy when:")
for item in use_dspy:
print(f" + {item}")
print()
print("USE manual prompts when:")
for item in use_manual:
print(f" - {item}")
print()
print("=" * 60)
print("Key insight: DSPy doesn't replace prompts; it writes them for you.")
print("You define WHAT you want. DSPy figures out HOW to ask for it.")
Summary – Choosing the Right Tool
STRUCTURED OUTPUT LANDSCAPE (2025-2026)

API MODEL?                          LOCAL MODEL?
Need typed extraction?              Need 100% format guarantee?
        |                                   |
   Instructor                           Outlines
 (parse + retry)                   (token-level mask)

Building agents?                    Optimizing pipelines?
        |                                   |
   PydanticAI                             DSPy
(type-safe deps)                    (auto-prompt search)
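The decision tree above reduces to a tiny routing helper (illustrative only):

```python
def pick_tool(local_model: bool, building_agents: bool,
              optimizing_pipeline: bool) -> str:
    """Toy routing over the decision tree above (illustrative, not exhaustive)."""
    if optimizing_pipeline:
        return "DSPy"        # data-driven prompt search
    if building_agents:
        return "PydanticAI"  # type-safe agents with injected deps
    # plain structured extraction: pick by where the model runs
    return "Outlines" if local_model else "Instructor"

print(pick_tool(local_model=True, building_agents=False,
                optimizing_pipeline=False))  # Outlines
```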
| Tool | Install | Core Value |
|---|---|---|
| Instructor | `pip install instructor` | Pydantic models from any LLM API, 15+ providers |
| Outlines | `pip install outlines` | Token-level constraints for local models |
| PydanticAI | `pip install pydantic-ai` | Type-safe agents with dependency injection |
| DSPy | `pip install dspy` | Auto-optimize prompts from data, not intuition |
The production stack (2025-2026):
Use Instructor for data extraction from documents, APIs, emails
Use Outlines for local model deployments where format must be guaranteed
Use PydanticAI for multi-tool agents that need testability and type safety
Use DSPy when prompt brittleness is costing you; let data drive the prompts
# Quick reference: install commands
installs = [
("Instructor", "pip install instructor"),
("Outlines", "pip install outlines transformers torch"),
("PydanticAI", "pip install pydantic-ai"),
("DSPy", "pip install dspy"),
("All at once", "pip install instructor outlines pydantic-ai dspy"),
]
print("Install commands:")
for name, cmd in installs:
print(f" {name:<12} {cmd}")
print()
print("Documentation:")
docs = [
("Instructor", "https://python.useinstructor.com"),
("Outlines", "https://dottxt-ai.github.io/outlines"),
("PydanticAI", "https://ai.pydantic.dev"),
("DSPy", "https://dspy.ai"),
]
for name, url in docs:
print(f" {name:<12} {url}")