Long-Context Strategies: Working with 128K–1M Token Windows¶
Modern LLMs support massive context windows — but bigger isn’t always better. Learn when to use full context, when to use RAG, and how to handle the “lost in the middle” problem.
Context Window Comparison (2026)¶
| Model | Context | Best For |
|---|---|---|
| GPT-4o | 128K | General long documents |
| Claude 3.5 Sonnet | 200K | Long documents, codebases |
| Gemini 1.5 Pro | 1M | Video, multi-document |
| Gemini 2.0 Flash | 1M | Fast + long context |
| Llama 3.1 405B | 128K | Open-weights, private |
The “Lost in the Middle” Problem¶
LLMs are best at recalling information from the beginning and end of context. Content in the middle is often ignored or misremembered.
Put your most important information at the start or end of context — not buried in the middle.
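One way to see this effect yourself is a simple "needle in a haystack" probe: plant a fact at different relative depths in filler text, then ask the model to recall it. The sketch below only builds the probe prompts; `build_probe` and its filler/needle strings are illustrative names, not part of any library.

```python
def build_probe(needle: str, filler: str, depth: float, total_chars: int = 20_000) -> str:
    """Insert `needle` at a relative position (0.0 = start, 1.0 = end) in filler text."""
    # Repeat the filler until it covers total_chars, then trim
    body = (filler * (total_chars // len(filler) + 1))[:total_chars]
    pos = int(len(body) * depth)
    return body[:pos] + '\n' + needle + '\n' + body[pos:]

needle = 'The vault code is 4921.'
filler = 'This paragraph contains routine background information. '
for depth in (0.1, 0.5, 0.9):
    prompt = build_probe(needle, filler, depth)
    # Send prompt + 'What is the vault code?' to the model and compare recall rates
    print(f'depth={depth:.0%}: needle at char {prompt.find(needle):,} of {len(prompt):,}')
```

In published evaluations, recall is typically strongest at the 10% and 90% positions and weakest near 50%; Exercise 4 below asks you to reproduce this.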
Full Context vs. RAG: When to Use Each¶
| Situation | Strategy | Reason |
|---|---|---|
| Single doc < 100K tokens | Full context | Simple, no retrieval error |
| Multiple large docs | RAG | Retrieve only relevant chunks |
| Codebase analysis | Full context | Needs global understanding |
| Customer support KB | RAG | Thousands of articles |
| Single contract review | Full context | Need complete document |
| Large product catalog | RAG | Too large for context |
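The table above can be collapsed into a rough decision helper. This is a sketch: `choose_strategy` and its thresholds (80% headroom, single-doc map-reduce fallback) are illustrative assumptions, not fixed rules.

```python
def choose_strategy(total_tokens: int, n_documents: int,
                    needs_global_view: bool, context_limit: int = 128_000) -> str:
    """Suggest a long-context strategy for a workload."""
    if total_tokens <= context_limit * 0.8:
        return 'full context'   # everything fits with headroom for the output
    if needs_global_view and n_documents == 1:
        return 'map-reduce'     # too big for one call, but one doc needing synthesis
    return 'RAG'                # many docs or far too large: retrieve relevant chunks

print(choose_strategy(80_000, 1, needs_global_view=True))        # full context
print(choose_strategy(500_000, 1, needs_global_view=True))       # map-reduce
print(choose_strategy(2_000_000, 5000, needs_global_view=False)) # RAG
```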
# Install dependencies
# !pip install openai anthropic tiktoken
1. Token Counting Before Sending¶
Before sending a long document to an LLM, you need to know whether it fits within the model’s context window. The tiktoken library provides exact token counts for OpenAI models by running the same BPE tokenizer used server-side. The check_context_fit() function compares the token count against the model’s limit, reserving space for the output (reserve_for_output). A rough heuristic is that 1 token is approximately 4 characters in English, but this varies by language and content type (code tends to use more tokens per character). Always count tokens programmatically rather than estimating – exceeding the context window silently truncates input or raises an error.
import tiktoken
# OpenAI token counting
def count_tokens_openai(text: str, model: str = 'gpt-4o') -> int:
"""Count tokens for OpenAI models."""
enc = tiktoken.encoding_for_model(model)
return len(enc.encode(text))
# Context window limits
CONTEXT_LIMITS = {
'gpt-4o': 128_000,
'gpt-4o-mini': 128_000,
'claude-sonnet-4-6': 200_000,
'claude-opus-4-6': 200_000,
'gemini-1.5-pro': 1_000_000,
'gemini-2.0-flash': 1_000_000,
}
def check_context_fit(text: str, model: str = 'gpt-4o', reserve_for_output: int = 4096) -> dict:
    """Check if text fits in the model's context window."""
    try:
        token_count = count_tokens_openai(text, model)
    except KeyError:
        # tiktoken only knows OpenAI models; use the gpt-4o tokenizer as an approximation
        token_count = count_tokens_openai(text, 'gpt-4o')
    limit = CONTEXT_LIMITS.get(model, 128_000)
    available = limit - reserve_for_output
    return {
        'tokens': token_count,
        'limit': limit,
        'fits': token_count <= available,
        'utilization': f'{token_count/limit:.1%}',
        'chars_approx': token_count * 4  # rough estimate
    }
# Demo
sample_text = 'The quick brown fox jumps over the lazy dog. ' * 1000
result = check_context_fit(sample_text)
print(f'Text: {len(sample_text):,} chars')
print(f'Tokens: {result["tokens"]:,}')
print(f'Fits in gpt-4o: {result["fits"]} ({result["utilization"]} utilized)')
2. Document Chunking Strategies¶
When a document exceeds the context window, chunking splits it into manageable pieces. Two complementary strategies are shown below. Token-based chunking splits at fixed token boundaries with configurable overlap – the overlap ensures that sentences or ideas spanning a chunk boundary are not lost. Section-based chunking uses structural markers like Markdown headers or chapter titles to split at semantic boundaries, preserving the logical organization of the document. Section-based chunking is generally preferred for structured documents because each chunk is self-contained, while token-based chunking works better for unstructured text like transcripts or raw notes.
from typing import List
import re
def chunk_by_tokens(
text: str,
max_tokens: int = 8000,
overlap_tokens: int = 200,
model: str = 'gpt-4o'
) -> List[str]:
"""
Split text into overlapping token-based chunks.
Overlap helps prevent losing context at boundaries.
"""
enc = tiktoken.encoding_for_model(model)
tokens = enc.encode(text)
chunks = []
start = 0
while start < len(tokens):
end = min(start + max_tokens, len(tokens))
chunk_tokens = tokens[start:end]
chunks.append(enc.decode(chunk_tokens))
start += max_tokens - overlap_tokens
return chunks
def chunk_by_sections(text: str, section_markers: List[str] = None) -> List[dict]:
    """
    Split document by semantic sections (headers, chapters).
    Better than arbitrary token splits for structured documents.
    """
    if section_markers is None:
        # Markdown headers or chapter/section markers
        section_markers = [r'^#{1,3}\s', r'^Chapter\s', r'^Section\s', r'^\d+\.\s']
    pattern = '|'.join(section_markers)

    def finish(title: str, lines: List[str]) -> dict:
        content = '\n'.join(lines)
        return {'title': title, 'content': content, 'tokens': count_tokens_openai(content)}

    sections = []
    current_title = 'Introduction'
    current_content = []
    for line in text.split('\n'):
        if re.match(pattern, line):
            if current_content:
                sections.append(finish(current_title, current_content))
            current_title = line.strip()
            current_content = []
        else:
            current_content.append(line)
    if current_content:
        sections.append(finish(current_title, current_content))
    return sections
# Test
long_text = 'Introduction text here.\n# Chapter 1\nContent of chapter 1.\n# Chapter 2\nContent of chapter 2.'
sections = chunk_by_sections(long_text)
print(f'Document split into {len(sections)} sections:')
for s in sections:
print(f' [{s["tokens"]:4d} tokens] {s["title"]}')
3. Map-Reduce for Very Long Documents¶
Map-reduce is the standard pattern for processing documents that exceed any model’s context window. The map phase sends each chunk to a smaller, cheaper model (e.g., gpt-4o-mini) with a focused task like “extract key facts.” The reduce phase combines all chunk-level results into a single prompt and sends it to a more capable model (e.g., gpt-4o) for synthesis. This two-stage approach can handle documents of arbitrary length – even millions of tokens – because each individual API call stays within context limits. The trade-off is that the map phase loses cross-chunk context, so it works best for tasks like summarization, fact extraction, and classification rather than tasks requiring global reasoning.
from openai import OpenAI

client = OpenAI()
def process_chunk(chunk: str, task: str, model: str = 'gpt-4o-mini') -> str:
"""Process a single chunk (the 'map' step)."""
response = client.chat.completions.create(
model=model,
messages=[
{'role': 'system', 'content': f'You are processing a section of a larger document. Task: {task}'},
{'role': 'user', 'content': chunk}
],
max_tokens=500
)
return response.choices[0].message.content
def map_reduce_document(
document: str,
map_task: str, # e.g., 'Extract key facts'
reduce_task: str, # e.g., 'Synthesize into a final summary'
chunk_size: int = 8000,
model: str = 'gpt-4o-mini',
final_model: str = 'gpt-4o'
) -> str:
"""
Map-Reduce for documents too long for any context window.
1. MAP: process each chunk independently
2. REDUCE: combine all chunk results into final answer
"""
# Step 1: Chunk the document
chunks = chunk_by_tokens(document, max_tokens=chunk_size)
print(f'Processing {len(chunks)} chunks...')
# Step 2: Map — process each chunk
chunk_results = []
for i, chunk in enumerate(chunks):
result = process_chunk(chunk, map_task, model=model)
chunk_results.append(result)
print(f' Chunk {i+1}/{len(chunks)} processed.')
# Step 3: Reduce — combine results
combined = '\n\n'.join([f'[Chunk {i+1} findings]:\n{r}' for i, r in enumerate(chunk_results)])
final_response = client.chat.completions.create(
model=final_model,
messages=[
{'role': 'system', 'content': reduce_task},
{'role': 'user', 'content': combined}
],
max_tokens=2000
)
return final_response.choices[0].message.content
print('Map-Reduce pipeline ready.')
print()
print('Example:')
print(' summary = map_reduce_document(')
print(' very_long_document,')
print(' map_task="Extract key facts and dates from this section",')
print(' reduce_task="Synthesize all extracted facts into a coherent timeline"')
print(' )')
4. Context Window Cost Calculator¶
Long-context requests can be expensive – sending 100K tokens to GPT-4o costs $0.25 per request in input tokens alone. The cost calculator below compares pricing across models to help you choose the right one for your workload. For extraction tasks on long documents, gpt-4o-mini or gemini-2.0-flash can be 15-30x cheaper than flagship models while delivering comparable results. At scale (1000+ requests per day), the difference between models can mean hundreds of dollars per day, making model selection one of the most impactful cost optimization decisions.
# Pricing per 1M tokens (approximate, 2026)
PRICING = {
'gpt-4o': {'input': 2.50, 'output': 10.00},
'gpt-4o-mini': {'input': 0.15, 'output': 0.60},
'claude-sonnet-4-6': {'input': 3.00, 'output': 15.00},
'claude-haiku-4-5': {'input': 0.80, 'output': 4.00},
'gemini-1.5-pro': {'input': 1.25, 'output': 5.00},
'gemini-2.0-flash': {'input': 0.075, 'output': 0.30},
}
def calculate_context_cost(
input_tokens: int,
output_tokens: int = 1000,
model: str = 'gpt-4o',
n_requests: int = 1
) -> dict:
"""Calculate the cost of running a long-context request."""
prices = PRICING.get(model, PRICING['gpt-4o'])
input_cost = (input_tokens / 1_000_000) * prices['input'] * n_requests
output_cost = (output_tokens / 1_000_000) * prices['output'] * n_requests
total = input_cost + output_cost
return {'input_cost': input_cost, 'output_cost': output_cost, 'total': total}
# Compare costs for the same 100K token document
print('Cost comparison: 100K token document, 1000 output tokens')
print('-' * 60)
for model in PRICING:
cost = calculate_context_cost(100_000, 1_000, model)
print(f'{model:25s} → ${cost["total"]:6.3f} per request')
print()
print('At 1000 requests/day:')
for model in ['gpt-4o', 'gpt-4o-mini', 'claude-haiku-4-5', 'gemini-2.0-flash']:
cost = calculate_context_cost(100_000, 1_000, model, n_requests=1_000)
print(f' {model:25s} → ${cost["total"]:,.2f}/day')
5. Caching for Repeated Long Contexts¶
When you ask multiple questions about the same document, prompt caching eliminates the cost of re-processing the document tokens on each call. Anthropic’s explicit cache_control marks specific content blocks as cacheable, giving a 90% discount on cached tokens after the first request. OpenAI provides automatic caching for prompts over 1024 tokens with a 50% discount. For a 50K-token document queried 100 times per day, caching reduces the daily cost from $15 to about $1.50 with Anthropic. The key implementation detail is that the cached portion (the document) must remain identical across requests — only the question changes.
# Prompt caching dramatically reduces cost for repeated long context
# OpenAI: automatic caching for prompts > 1024 tokens (50% off cached tokens)
# Anthropic: explicit cache_control marks (90% off cached tokens)
from anthropic import Anthropic
claude = Anthropic()
def ask_about_document_cached(document: str, question: str) -> str:
"""
Use Anthropic prompt caching for repeated queries on the same document.
First call: full price. Subsequent calls: 90% discount on document tokens.
"""
response = claude.messages.create(
model='claude-sonnet-4-6',
max_tokens=1000,
system=[
{
'type': 'text',
'text': 'You are a document analyst. Answer questions about the provided document.',
'cache_control': {'type': 'ephemeral'} # Cache the system prompt
}
],
messages=[
{
'role': 'user',
'content': [
{
'type': 'text',
'text': f'<document>\n{document}\n</document>',
'cache_control': {'type': 'ephemeral'} # Cache the document
},
{
'type': 'text',
'text': question # Question is NOT cached — changes each time
}
]
}
]
)
# Check cache hit
usage = response.usage
cache_read = getattr(usage, 'cache_read_input_tokens', 0)
cache_write = getattr(usage, 'cache_creation_input_tokens', 0)
if cache_read > 0:
print(f'Cache HIT: {cache_read:,} cached tokens (90% discount applied)')
elif cache_write > 0:
print(f'Cache WRITE: {cache_write:,} tokens cached for future requests')
return response.content[0].text
print('Prompt caching function ready.')
print()
print('Savings example:')
print(' Document: 50K tokens × $3.00/M = $0.15 per request')
print(' With caching (after first call): $0.015 per request (90% off!)')
print(' At 100 queries/day: $15/day → $1.50/day')
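The savings arithmetic above can be checked with a short helper. This is a sketch under stated assumptions: `caching_savings` is a hypothetical name, and it models Anthropic's published cache pricing (cache reads billed at roughly 10% of the base input rate, the initial cache write at roughly a 25% premium), which makes the first request slightly more expensive than the headline 90%-off figure suggests.

```python
def caching_savings(doc_tokens: int, queries_per_day: int,
                    input_price_per_m: float = 3.00) -> dict:
    """Compare daily input cost for repeated document queries, with and without caching."""
    base = doc_tokens / 1e6 * input_price_per_m      # cost to send the document once
    uncached = base * queries_per_day                # every query pays full price
    # One cache write (~1.25x base) + (N-1) cache reads (~0.10x base)
    cached = base * 1.25 + base * 0.10 * (queries_per_day - 1)
    return {'uncached': round(uncached, 2), 'cached': round(cached, 2)}

print(caching_savings(50_000, 100))
```

For the 50K-token document at 100 queries/day this yields $15.00 uncached versus roughly $1.67 cached once the one-time write premium is included — close to the $1.50 figure quoted for cache reads alone.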
6. Best Practices Summary¶
# 1. Place most important information at start or end
# BAD:
prompt = f"Background: {long_background}\nImportant: {critical_info}\nMore background: {more_background}"
# GOOD:
prompt = f"Important: {critical_info}\nBackground: {long_background}"
# 2. Use explicit structure markers
# Helps the model track where information is:
prompt = """
<document id='contract_2024'>
{document_text}
</document>
<question>{question}</question>
"""
# 3. Count tokens before sending
if count_tokens_openai(prompt) > 100_000:
# Use map-reduce or RAG instead
result = map_reduce_document(document, ...)
else:
result = client.chat.completions.create(...)
# 4. Use cheaper models for long context
# 100K tokens in GPT-4o = $0.25 input cost
# 100K tokens in GPT-4o-mini = $0.015 input cost
# Use mini for extraction, gpt-4o for complex reasoning
# 5. Cache repeated documents
# Use Anthropic cache_control or OpenAI auto-caching for repeated queries on same doc
Exercises¶
1. Download a long PDF and measure its token count. Test with GPT-4o vs. RAG — which gives better answers?
2. Implement map-reduce summarization on a 50K+ word document.
3. Use Anthropic prompt caching to query the same document 10 times — measure cost savings.
4. Test the “lost in the middle” effect: hide a specific fact at position 10%, 50%, and 90% and see which the model recalls.
5. Build a cost calculator to decide: for your use case, is full context or RAG cheaper?