Document Processing and Chunking
Why Chunking Matters
LLMs have finite context windows, so long documents must be split into smaller chunks before they can be embedded and retrieved.
Challenges
Chunk too small → lose context
Chunk too large → exceed token limits
Poor splits → break semantic meaning
Strategies
Fixed-size chunks
Sentence-based
Paragraph-based
Semantic chunking
Recursive chunking
import numpy as np
from langchain.text_splitter import RecursiveCharacterTextSplitter
Strategy 1: Fixed-Size Chunks
The Simplest Approach: Split by Character Count
Fixed-size chunking divides text into segments of a predetermined number of characters, optionally with an overlap window so that sentences at chunk boundaries are not lost. The overlap ensures continuity: if a key fact spans two chunks, it will appear in full in at least one of them. While this method is fast and predictable, it pays no attention to sentence or paragraph boundaries, so chunks may start or end mid-sentence. Despite this limitation, fixed-size chunking is a reasonable baseline and is often sufficient when combined with a generous overlap (10-20% of chunk size).
text = """Artificial intelligence has transformed many industries.
Machine learning algorithms can now process vast amounts of data.
Deep learning networks achieve human-level performance in many tasks.
Natural language processing enables computers to understand human language.
Computer vision allows machines to interpret visual information."""
def fixed_size_chunk(text: str, chunk_size: int = 100, overlap: int = 20):
    """Split text into fixed-size chunks with overlap."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks

chunks = fixed_size_chunk(text)
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}: {chunk}")
    print()
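A small refinement on the function above avoids ending a chunk mid-word by backing up to the nearest space before the cut point. This is an illustrative sketch, not from the original text; the name `fixed_size_chunk_word_safe` is made up here:

```python
def fixed_size_chunk_word_safe(text: str, chunk_size: int = 100, overlap: int = 20):
    """Fixed-size chunks that retreat to the last space so chunks don't end mid-word."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        # If the cut lands inside a word, back up to the previous space.
        if end < len(text) and not text[end].isspace():
            space = text.rfind(' ', start, end)
            if space > start:
                end = space
        chunks.append(text[start:end].strip())
        if end == len(text):
            break
        start = max(end - overlap, start + 1)  # keep overlap, but always advance

    return chunks

chunks = fixed_size_chunk_word_safe(
    "the quick brown fox jumps over the lazy dog", chunk_size=20, overlap=5
)
```

Note that overlap can still cause a chunk to *start* mid-word; only the trailing boundary is word-safe in this sketch.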
Strategy 2: Sentence-Based
Preserving Linguistic Boundaries
Sentence-based chunking uses a tokenizer (such as NLTK's sent_tokenize) to split text at sentence boundaries, then groups a fixed number of sentences per chunk. Because it never breaks a sentence in half, each chunk is more coherent than a fixed-character chunk. The trade-off is that chunk sizes vary with sentence length, so some chunks can be much larger or smaller than others. Grouping 2-5 sentences per chunk is a good starting point for Q&A-style RAG systems.
import nltk
# nltk.download('punkt')  # Run once

def sentence_chunk(text: str, sentences_per_chunk: int = 2):
    """Split by sentences."""
    sentences = nltk.sent_tokenize(text)
    chunks = []
    for i in range(0, len(sentences), sentences_per_chunk):
        chunk = ' '.join(sentences[i:i + sentences_per_chunk])
        chunks.append(chunk)
    return chunks, sentences

chunks, sentences = sentence_chunk(text)
print(f"Total sentences: {len(sentences)}")
print(f"Total chunks: {len(chunks)}\n")
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}: {chunk}")
    print()
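Sentence grouping can also carry overlap, the same way fixed-size chunking does, by letting consecutive chunks share their boundary sentences. A minimal sketch (the function name and the regex-based sentence split are simplifications made here; nltk.sent_tokenize is more robust in practice):

```python
import re

def sentence_chunk_overlap(text: str, sentences_per_chunk: int = 3,
                           overlap_sentences: int = 1):
    """Group sentences into chunks that share `overlap_sentences`
    with the previous chunk."""
    # Naive split on ., !, ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    step = max(1, sentences_per_chunk - overlap_sentences)
    chunks = []
    for i in range(0, len(sentences), step):
        chunks.append(' '.join(sentences[i:i + sentences_per_chunk]))
        if i + sentences_per_chunk >= len(sentences):
            break
    return chunks

demo = sentence_chunk_overlap(
    "A one. B two. C three. D four. E five.",
    sentences_per_chunk=2, overlap_sentences=1,
)
```

Each chunk repeats the last sentence of its predecessor, so a fact stated at a chunk boundary is retrievable from either side.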
Strategy 3: Recursive Chunking
Hierarchical Splitting for Structured Text
LangChain's RecursiveCharacterTextSplitter tries a prioritized list of separators: first double newlines (paragraph breaks), then single newlines, then sentence-ending periods, then spaces, and finally individual characters. It picks the highest-priority separator that produces chunks within the target chunk_size. This strategy naturally respects document structure: it splits at paragraph boundaries when possible, falls back to sentences when paragraphs are too long, and resorts to character-level splitting only as a last resort. This makes it the recommended default splitter for most RAG applications.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=30,
    separators=["\n\n", "\n", ". ", " ", ""]
)

long_text = """
Retrieval-Augmented Generation (RAG) is a powerful technique.
It combines two key components:
1. Information Retrieval
2. Text Generation
The retrieval step searches a knowledge base.
The generation step uses an LLM to create responses.
This approach has several advantages.
It reduces hallucinations and provides source attribution.
"""

chunks = splitter.split_text(long_text)
print(f"Created {len(chunks)} chunks:\n")
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i} ({len(chunk)} chars):")
    print(chunk)
    print("-" * 50)
Strategy 4: Semantic Chunking
Splitting Based on Meaning, Not Just Length
Semantic chunking uses an embedding model to measure the cosine similarity between consecutive sentences. When the similarity drops below a threshold, it signals a topic shift, and a new chunk begins. Sentences about the same topic stay together regardless of length, while topic transitions produce clean chunk boundaries. The downside is the added compute cost of embedding every sentence, but for corpora where topical coherence within chunks is critical (e.g., technical documentation, legal text), the retrieval quality improvement justifies the expense.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_chunk(text: str, threshold: float = 0.5):
    """Chunk based on semantic similarity."""
    sentences = nltk.sent_tokenize(text)
    if len(sentences) <= 1:
        return [text]
    # Get embeddings
    embeddings = model.encode(sentences)
    # Compare each sentence with its predecessor
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        # Cosine similarity with the previous sentence
        sim = np.dot(embeddings[i], embeddings[i - 1])
        sim /= (np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i - 1]))
        if sim > threshold:
            current_chunk.append(sentences[i])
        else:
            chunks.append(' '.join(current_chunk))
            current_chunk = [sentences[i]]
    chunks.append(' '.join(current_chunk))
    return chunks

semantic_chunks = semantic_chunk(text, threshold=0.6)
print(f"Semantic chunks: {len(semantic_chunks)}\n")
for i, chunk in enumerate(semantic_chunks, 1):
    print(f"Semantic Chunk {i}:")
    print(chunk)
    print()
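The boundary decision itself is easy to see on toy vectors, without loading an embedding model. The 2-D "embeddings" below are invented for illustration: two vectors pointing the same way stand in for same-topic sentences, and one pointing elsewhere stands in for a topic shift:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 2-D stand-ins for sentence embeddings.
emb = [np.array([1.0, 0.1]),   # sentence 1
       np.array([0.9, 0.2]),   # sentence 2: same direction -> same topic
       np.array([0.1, 1.0])]   # sentence 3: different direction -> topic shift

print(cosine(emb[0], emb[1]))  # high: stays in the current chunk
print(cosine(emb[1], emb[2]))  # low: falls below threshold, new chunk starts
```

With a threshold of 0.6, the split lands exactly before the third "sentence", which is the behavior semantic_chunk relies on.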
Best Practices
Chunk Size Guidelines
| Use Case | Chunk Size (chars) | Overlap (chars) |
|---|---|---|
| Short Q&A | 200-500 | 50-100 |
| Documents | 500-1000 | 100-200 |
| Code | 1000-2000 | 200-300 |
Tips
✅ Keep context intact (don't split mid-sentence)
✅ Add overlap to preserve context across chunks
✅ Include metadata (source, page number, etc.)
✅ Test different strategies for your domain
✅ Monitor retrieval quality
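The metadata tip can be made concrete with plain dicts: carry the source and page alongside each chunk so retrieval results can cite where they came from. The function, field names, and filename below are illustrative, not a required schema:

```python
def chunk_with_metadata(pages, source, chunk_size=200, overlap=40):
    """pages: list of (page_number, text) pairs. Returns chunk dicts
    carrying the metadata needed for source attribution at query time."""
    records = []
    for page_num, page_text in pages:
        start = 0
        while start < len(page_text):
            records.append({
                "text": page_text[start:start + chunk_size],
                "source": source,          # e.g. a filename
                "page": page_num,
                "char_start": start,       # offset for highlighting
            })
            start += chunk_size - overlap
    return records

# Hypothetical single-page document, 300 characters long.
records = chunk_with_metadata([(1, "x" * 300)], source="report.pdf")
```

At query time, the retrieved record's "source" and "page" fields feed directly into the citation shown to the user.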
Next: Using LangChain for production RAG!