Advanced RAG Techniques (2025-2026 State of the Art)

This notebook covers cutting-edge Retrieval-Augmented Generation techniques that go far beyond the basic "chunk-embed-retrieve-generate" pipeline. By the end, you will understand:

  1. Why naive RAG fails on certain query types

  2. Semantic Chunking - break on meaning, not token counts

  3. ColBERT Late Interaction Retrieval - multi-vector token-level matching

  4. RAPTOR Hierarchical RAG - tree-structured indexing for holistic queries

  5. Reranking with Cohere - two-stage retrieval for precision

  6. HyDE - Hypothetical Document Embeddings

  7. Complete Advanced RAG Pipeline with benchmarks

Prerequisites: OpenAI API key, Cohere API key, Python 3.10+

Setup: Install Dependencies

The advanced retrieval techniques in this notebook rely on several specialized libraries: LangChain for chain orchestration, Cohere for re-ranking, ChromaDB as the vector store, and sentence-transformers for local embeddings. The cell below installs everything in one batch. You will also need API keys for OpenAI and optionally Cohere, which should be set as environment variables or placed in a .env file.

# Install all required packages
# Run once, then restart kernel
!pip install -q langchain langchain-openai langchain-community langchain-experimental
!pip install -q chromadb openai cohere
!pip install -q sentence-transformers ragatouille
!pip install -q tiktoken rank_bm25 scikit-learn numpy pandas
!pip install -q python-dotenv
import os
import warnings
warnings.filterwarnings('ignore')

# Load API keys from environment or .env file
from dotenv import load_dotenv
load_dotenv()

# Set your keys here OR via environment variables
# os.environ["OPENAI_API_KEY"] = "sk-..."
# os.environ["COHERE_API_KEY"] = "..."

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_OPENAI_KEY")
COHERE_API_KEY = os.getenv("COHERE_API_KEY", "YOUR_COHERE_KEY")

print("OpenAI key set:", bool(OPENAI_API_KEY and OPENAI_API_KEY != "YOUR_OPENAI_KEY"))
print("Cohere key set:", bool(COHERE_API_KEY and COHERE_API_KEY != "YOUR_COHERE_KEY"))

Part 1: Why Naive RAG Fails

Naive RAG has a simple pipeline:

Document → Fixed-size chunks → Embed chunks → Vector store → Top-k retrieval → LLM

This works well for specific, localized queries ("What is the capital of France?") but breaks down on:

| Query Type | Why Naive RAG Fails |
| --- | --- |
| Aggregate queries | "What are the main themes across all documents?" – no single chunk contains the answer |
| Multi-hop queries | "Who founded the company that acquired Slack?" – answer spans multiple chunks |
| Abstract/holistic | "Summarize the evolution of this technology" – requires global document understanding |
| Comparison queries | "How does approach A differ from approach B?" – relevant chunks may be far apart |
| PDF with tables/charts | Text extraction destroys visual structure – OCR-based chunking loses semantic meaning |

The Core Problems

  1. Fixed-size chunking splits text at arbitrary token boundaries, breaking semantic units

  2. Single-vector representation compresses entire chunks to one embedding, losing fine-grained token signals

  3. No global document understanding – top-k retrieval only finds locally similar chunks

  4. No reranking – cosine similarity scores are noisy proxies for relevance

# Demonstrate the failure of naive fixed-size chunking
from langchain_text_splitters import RecursiveCharacterTextSplitter  # modern import path; installed alongside langchain

# Sample document about a fictional company
SAMPLE_DOCUMENT = """
Acme Corporation was founded in 1985 by Sarah Chen and Marcus Webb in San Francisco.
The company started as a small software consultancy focused on enterprise database solutions.

In 1992, Acme pivoted to internet infrastructure as the web began to emerge. This decision
proved transformative. By 1998, they had grown to 500 employees and were processing
10% of all US internet traffic through their routing systems.

The 2001 dot-com crash hit Acme hard. They lost 60% of their revenue in 18 months.
Sarah Chen, serving as CEO, made the controversial decision to lay off 300 employees
and refocus on cloud storage, a term barely known at the time.

This gamble paid off spectacularly. By 2008, Acme's CloudVault product had 2 million users.
The following year, they acquired TechStartup Inc., a machine learning company founded
by Dr. James Park, for $450 million.

Between 2010 and 2015, Acme expanded internationally, opening offices in London, Tokyo,
and Singapore. Revenue grew from $200M to $1.2B. The company went public in 2013 at
a $4B valuation.

In 2018, the company rebranded from Acme Corporation to NexCloud Inc. to better reflect
its cloud-first strategy. Sarah Chen retired and was replaced by Dr. James Park as CEO.

NexCloud's 2023 annual report showed $8.2B in revenue, 45,000 employees across 30 countries,
and a market cap exceeding $120B. The company's AI division, built on the TechStartup
acquisition, now accounts for 35% of total revenue.
"""

# Naive fixed-size chunking
naive_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
naive_chunks = naive_splitter.split_text(SAMPLE_DOCUMENT)

print(f"Naive chunking produced {len(naive_chunks)} chunks\n")
print("=" * 60)
for i, chunk in enumerate(naive_chunks):
    print(f"\n--- Chunk {i+1} ---")
    print(chunk)
    print(f"  [Length: {len(chunk)} chars]")
# Show how a multi-hop query fails with naive retrieval
# Query: "Who is the current CEO and what company did they originally found?"
# Answer requires: (1) NexCloud CEO = Dr. James Park, (2) James Park founded TechStartup Inc.
# These facts are in DIFFERENT chunks!

print("MULTI-HOP QUERY FAILURE DEMONSTRATION")
print("=" * 50)
print("Query: 'Who is the current CEO and what company did they originally found?'")
print()
print("Relevant facts:")
print("  Fact 1 (chunk ~7): Dr. James Park became CEO when Sarah Chen retired")
print("  Fact 2 (chunk ~4): Dr. James Park founded TechStartup Inc.")
print()
print("Problem: These facts are in separate chunks.")
print("A top-1 retrieval will only return ONE chunk.")
print("Even top-5 may miss the connection between them.")
print()
print("Solution: RAPTOR hierarchical indexing (Part 4) or GraphRAG (Notebook 10)")

Part 2: Semantic Chunking

Instead of splitting every N tokens, semantic chunking identifies natural breakpoints where the topic shifts. LangChain's SemanticChunker does this by:

  1. Split text into sentences

  2. Embed each sentence (or small window)

  3. Compute cosine distance between adjacent embeddings

  4. Place chunk boundaries where distance exceeds a threshold
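The four steps above can be sketched end to end in a few lines of numpy. Note that `toy_embed` below is a deliberately crude bag-of-words stand-in for a real embedding model (an assumption for illustration only; SemanticChunker uses whatever embedding model you pass it):

```python
import numpy as np

def semantic_split(sentences, embed, percentile=95):
    """Split sentences into chunks where the cosine distance between
    adjacent sentence embeddings exceeds the given percentile."""
    vecs = np.array([embed(s) for s in sentences], dtype=float)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    # Cosine distance between each sentence and the next one
    distances = 1.0 - np.sum(vecs[:-1] * vecs[1:], axis=1)
    threshold = np.percentile(distances, percentile)
    chunks, current = [], [sentences[0]]
    for sent, dist in zip(sentences[1:], distances):
        if dist > threshold:  # topic shift detected: start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks

# Crude stand-in embedding: word counts over a tiny vocabulary
VOCAB = ["cloud", "storage", "revenue", "crash", "ceo", "founded"]
def toy_embed(text):
    t = text.lower()
    return np.array([t.count(w) + 0.01 for w in VOCAB])

sents = [
    "Acme was founded by Sarah Chen.",
    "The company was founded in San Francisco.",
    "Revenue collapsed in the crash.",
    "The crash forced layoffs.",
]
print(semantic_split(sents, toy_embed, percentile=90))
```

With a real embedding model the distances are far less sparse, but the mechanics are identical: one distance per adjacent pair, one threshold, and a boundary wherever the threshold is exceeded.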

Breakpoint Types

| Method | Description | Best For |
| --- | --- | --- |
| percentile | Split at the Nth percentile of distances | General use (default) |
| standard_deviation | Split when distance > mean + N*std | Consistent documents |
| interquartile | Split at IQR-based outliers | Documents with topic drift |
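All three breakpoint types reduce to a one-line cutoff over the same array of adjacent-sentence distances. The sketch below uses hypothetical distance values, and the exact formulas inside SemanticChunker may differ in detail, so treat it as an approximation of the idea rather than the library's implementation:

```python
import numpy as np

# Hypothetical cosine distances between adjacent sentence embeddings
distances = np.array([0.05, 0.08, 0.06, 0.41, 0.07, 0.09, 0.38, 0.04])

# percentile: split where distance exceeds the Nth percentile
t_percentile = np.percentile(distances, 95)

# standard_deviation: split where distance > mean + N * std
t_std = distances.mean() + 1.25 * distances.std()

# interquartile: split at IQR-based outliers (mean + N * IQR is one common form)
q1, q3 = np.percentile(distances, [25, 75])
t_iqr = distances.mean() + 1.5 * (q3 - q1)

for name, t in [("percentile", t_percentile), ("std_dev", t_std), ("iqr", t_iqr)]:
    breakpoints = np.where(distances > t)[0].tolist()
    print(f"{name:>10}: threshold={t:.3f}, split after sentences {breakpoints}")
```

Lower thresholds produce more, smaller chunks: with these values the percentile cutoff isolates only the single largest topic shift, while the other two also catch the second spike.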

# Semantic Chunking with LangChain SemanticChunker
# Requires OpenAI embeddings (or any embedding model)

from langchain_openai import OpenAIEmbeddings
from langchain_experimental.text_splitter import SemanticChunker

embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

# --- Percentile method ---
print("SEMANTIC CHUNKING: Percentile Method")
print("=" * 50)

semantic_chunker_percentile = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",   # split at 95th percentile of distances
    breakpoint_threshold_amount=95
)

semantic_chunks_percentile = semantic_chunker_percentile.split_text(SAMPLE_DOCUMENT)
print(f"Produced {len(semantic_chunks_percentile)} chunks\n")
for i, chunk in enumerate(semantic_chunks_percentile):
    print(f"--- Semantic Chunk {i+1} [{len(chunk)} chars] ---")
    print(chunk[:300] + ("..." if len(chunk) > 300 else ""))
    print()
# --- Standard Deviation method ---
print("SEMANTIC CHUNKING: Standard Deviation Method")
print("=" * 50)

semantic_chunker_std = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="standard_deviation",
    breakpoint_threshold_amount=1.25  # split when distance > mean + 1.25 * std
)

semantic_chunks_std = semantic_chunker_std.split_text(SAMPLE_DOCUMENT)
print(f"Produced {len(semantic_chunks_std)} chunks\n")

# --- Interquartile method ---
print("\nSEMANTIC CHUNKING: Interquartile Method")
print("=" * 50)

semantic_chunker_iqr = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="interquartile",
    breakpoint_threshold_amount=1.5
)

semantic_chunks_iqr = semantic_chunker_iqr.split_text(SAMPLE_DOCUMENT)
print(f"Produced {len(semantic_chunks_iqr)} chunks")

print("\n" + "=" * 50)
print("COMPARISON SUMMARY")
print(f"  Naive (200-char chunks):     {len(naive_chunks)} chunks")
print(f"  Semantic (percentile):       {len(semantic_chunks_percentile)} chunks")
print(f"  Semantic (std deviation):    {len(semantic_chunks_std)} chunks")
print(f"  Semantic (interquartile):    {len(semantic_chunks_iqr)} chunks")

Proposition Chunking (Atomic Facts)

Proposition chunking goes even further: it uses an LLM to decompose text into atomic, self-contained facts. Each proposition is a single claim that can stand alone.

Example:

  • Input: "Sarah Chen, serving as CEO, made the controversial decision to lay off 300 employees…"

  • Propositions:

    • "Sarah Chen was the CEO of Acme Corporation during the 2001 restructuring."

    • "Acme Corporation laid off 300 employees during the 2001 restructuring."

    • "Sarah Chen's decision to lay off employees was considered controversial."

# Proposition Chunking via LLM
from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0, openai_api_key=OPENAI_API_KEY)

PROPOSITION_SYSTEM_PROMPT = """You are an expert at decomposing text into atomic propositions.
Given a paragraph, extract each distinct, self-contained factual claim as a separate proposition.
Each proposition should:
- Be a complete sentence that makes sense on its own
- Contain exactly ONE atomic fact
- Include the subject explicitly (no pronouns like 'they', 'it', 'he')
Return ONLY a JSON array of strings, no other text."""

def extract_propositions(text: str) -> list[str]:
    """Extract atomic propositions from a text chunk."""
    import json
    response = llm.invoke([
        SystemMessage(content=PROPOSITION_SYSTEM_PROMPT),
        HumanMessage(content=f"Extract propositions from:\n\n{text}")
    ])
    try:
        # Clean markdown code blocks if present
        content = response.content.strip()
        if content.startswith("```"):
            content = content.split("\n", 1)[1].rsplit("```", 1)[0].strip()
        return json.loads(content)
    except Exception as e:
        print(f"Parse error: {e}")
        return [text]  # fallback: return original

# Apply to a sample paragraph
sample_para = """
In 2018, the company rebranded from Acme Corporation to NexCloud Inc. to better reflect
its cloud-first strategy. Sarah Chen retired and was replaced by Dr. James Park as CEO.
""".strip()

propositions = extract_propositions(sample_para)
print("INPUT PARAGRAPH:")
print(sample_para)
print("\nEXTRACTED PROPOSITIONS:")
for i, prop in enumerate(propositions, 1):
    print(f"  {i}. {prop}")
# Apply proposition chunking across the full document
# Use naive chunks as input paragraphs, then decompose each

paragraph_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=0, separators=["\n\n", "\n"]
)
paragraphs = paragraph_splitter.split_text(SAMPLE_DOCUMENT)

all_propositions = []
for para in paragraphs:
    if len(para.strip()) > 50:  # skip very short fragments
        props = extract_propositions(para)
        all_propositions.extend(props)

print(f"Original paragraphs: {len(paragraphs)}")
print(f"Total propositions extracted: {len(all_propositions)}")
print("\nAll propositions:")
for i, prop in enumerate(all_propositions, 1):
    print(f"  {i:2d}. {prop}")

Part 3: ColBERT Late Interaction Retrieval

Single-Vector vs Multi-Vector Retrieval

Traditional dense retrieval (bi-encoder):

  • Compress entire query into one vector q

  • Compress entire document into one vector d

  • Score = dot(q, d), which loses all token-level information

ColBERT late interaction:

  • Encode query into m token vectors Q = [q1, q2, ..., qm]

  • Encode document into n token vectors D = [d1, d2, ..., dn]

  • MaxSim scoring: for each query token, find its best-matching document token

Score(Q, D) = Σᵢ maxⱼ (qᵢ · dⱼ)

This means a query token like "CEO" will precisely match the document token "CEO" rather than hoping a single compressed vector captures this nuance.

| Aspect | Single-Vector | ColBERT |
| --- | --- | --- |
| Representation | 1 vector per chunk | 1 vector per TOKEN |
| Storage | Low | High (128-dim per token) |
| Precision | Moderate | High |
| Speed | Very fast (ANN) | Slower (MaxSim) |
| Best for | First-stage retrieval | Reranking or small corpora |

# ColBERT with RAGatouille
# RAGatouille wraps the ColBERT library with a simple API

from ragatouille import RAGPretrainedModel

# Load a pretrained ColBERT model
# 'colbert-ir/colbertv2.0' is the standard checkpoint
RAG_colbert = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

print("ColBERT model loaded successfully")
print(f"Model type: {type(RAG_colbert)}")
# Index a corpus with ColBERT
# ColBERT builds a token-level inverted index

# Prepare a small corpus from our document + extra passages
corpus = [
    "Acme Corporation was founded in 1985 by Sarah Chen and Marcus Webb in San Francisco.",
    "In 1992, Acme pivoted to internet infrastructure as the web began to emerge.",
    "By 1998, they had grown to 500 employees and were processing 10% of all US internet traffic.",
    "The 2001 dot-com crash hit Acme hard. They lost 60% of their revenue in 18 months.",
    "Sarah Chen made the controversial decision to lay off 300 employees and refocus on cloud storage.",
    "By 2008, Acme's CloudVault product had 2 million users.",
    "In 2009, Acme acquired TechStartup Inc., founded by Dr. James Park, for $450 million.",
    "Between 2010 and 2015, Acme expanded internationally, opening offices in London, Tokyo, and Singapore.",
    "Revenue grew from $200M to $1.2B between 2010 and 2015. The company went public in 2013.",
    "In 2018, Acme rebranded to NexCloud Inc. Sarah Chen retired and Dr. James Park became CEO.",
    "NexCloud's 2023 revenue was $8.2B with 45,000 employees across 30 countries.",
    "NexCloud's AI division now accounts for 35% of total revenue.",
]

# Index with ColBERT β€” this creates token-level embeddings for every document
index_path = RAG_colbert.index(
    index_name="acme_corp_index",
    collection=corpus,
    split_documents=False  # our docs are already sentence-level
)

print(f"Index created at: {index_path}")
# Query the ColBERT index
# ColBERT's MaxSim scoring handles token-level matching

queries = [
    "Who founded the company and when?",
    "What happened during the 2001 downturn?",
    "Who is the current CEO and what company did they originally found?",
]

for query in queries:
    print(f"\nQUERY: '{query}'")
    print("-" * 60)
    results = RAG_colbert.search(query=query, k=3)
    for i, result in enumerate(results, 1):
        score = result.get('score', result.get('relevance_score', 'N/A'))
        content = result.get('content', result.get('passage', str(result)))
        if isinstance(score, float):
            print(f"  [{i}] Score: {score:.4f}")
        else:
            print(f"  [{i}] Score: {score}")
        print(f"      {content}")
# Demonstrate MaxSim scoring concept manually
import numpy as np

def maxsim_score(query_vectors: np.ndarray, doc_vectors: np.ndarray) -> float:
    """
    Compute ColBERT MaxSim score.
    
    For each query token, find its maximum cosine similarity with any document token.
    Sum these per-token maximums to get the final score.
    
    Args:
        query_vectors: (m, d) array of m query token embeddings
        doc_vectors:   (n, d) array of n document token embeddings
    Returns:
        scalar MaxSim score
    """
    # Normalize vectors
    q_norm = query_vectors / np.linalg.norm(query_vectors, axis=1, keepdims=True)
    d_norm = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    
    # Compute all pairwise similarities: (m, n)
    similarities = q_norm @ d_norm.T
    
    # MaxSim: for each query token, take max over document tokens
    per_token_max = similarities.max(axis=1)  # shape: (m,)
    
    # Sum over all query tokens
    return per_token_max.sum()

# Simulate with random vectors to show the concept
np.random.seed(42)
dim = 128  # ColBERT uses 128-dim representations

query_tokens = np.random.randn(5, dim)   # 5 query tokens
doc_a_tokens = np.random.randn(12, dim)  # Document A: 12 tokens
doc_b_tokens = np.random.randn(20, dim)  # Document B: 20 tokens

score_a = maxsim_score(query_tokens, doc_a_tokens)
score_b = maxsim_score(query_tokens, doc_b_tokens)

print("MaxSim Score Demonstration (random vectors):")
print(f"  Document A (12 tokens): {score_a:.4f}")
print(f"  Document B (20 tokens): {score_b:.4f}")
print()
print("Key insight: MaxSim is computed over token-level vectors, not a single vector.")
print("This means every query token gets to 'vote' on the most relevant doc token.")
print("Longer documents are naturally advantaged (more tokens to match against).")

Part 4: RAPTOR Hierarchical RAG

RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval) builds a tree of summaries from the bottom up:

Level 0 (leaves): Original text chunks
    ↓ embed + cluster (GMM)
Level 1: LLM summaries of each cluster
    ↓ embed + cluster again
Level 2: Summaries of summaries
    ↓ ... repeat ...
Level N (root): One global summary

At query time, retrieval searches ACROSS ALL LEVELS simultaneously. Specific queries match leaf nodes; abstract queries match high-level summaries.

When to Use RAPTOR

  • Long documents (books, research papers, legal contracts)

  • Abstract/holistic questions: "What is the main argument?", "What are the key themes?"

  • Multi-hop questions that require connecting information from multiple sections

# RAPTOR Implementation
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.schema import HumanMessage
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RaptorNode:
    """A single node in the RAPTOR tree."""
    text: str
    level: int  # 0 = leaf (original chunk), higher = more abstract summary
    embedding: Optional[np.ndarray] = None
    children: list = field(default_factory=list)  # child node indices
    cluster_id: int = -1

class RaptorIndex:
    """
    RAPTOR hierarchical RAG index.
    
    Builds a tree of LLM-generated summaries from leaf chunks upward.
    Retrieval searches all levels simultaneously.
    """
    
    def __init__(
        self,
        embeddings_model,
        llm,
        max_levels: int = 3,
        cluster_method: str = "kmeans",  # 'kmeans' or 'gmm'
        n_clusters_per_level: int = 3
    ):
        self.embeddings_model = embeddings_model
        self.llm = llm
        self.max_levels = max_levels
        self.cluster_method = cluster_method
        self.n_clusters = n_clusters_per_level
        self.all_nodes: list[RaptorNode] = []  # all nodes across all levels
    
    def _embed_texts(self, texts: list[str]) -> np.ndarray:
        """Embed a list of texts, returning (N, D) array."""
        vectors = self.embeddings_model.embed_documents(texts)
        return np.array(vectors)
    
    def _cluster(self, embeddings: np.ndarray) -> np.ndarray:
        """Cluster embeddings, return cluster assignment array."""
        n = min(self.n_clusters, len(embeddings))
        if n < 2:
            return np.zeros(len(embeddings), dtype=int)
        
        if self.cluster_method == "gmm":
            gm = GaussianMixture(n_components=n, random_state=42)
            labels = gm.fit_predict(embeddings)
        else:
            km = KMeans(n_clusters=n, random_state=42, n_init=10)
            labels = km.fit_predict(embeddings)
        return labels
    
    def _summarize_cluster(self, texts: list[str]) -> str:
        """Use LLM to generate a coherent summary of a cluster of texts."""
        combined = "\n\n".join(f"- {t}" for t in texts)
        response = self.llm.invoke([
            HumanMessage(content=f"""Provide a concise, informative summary of the following 
related text passages. Preserve all key facts, names, dates, and numbers.

{combined}

Summary:""")
        ])
        return response.content.strip()
    
    def build(self, leaf_texts: list[str]):
        """Build the full RAPTOR tree from leaf chunks."""
        print(f"Building RAPTOR tree with {len(leaf_texts)} leaf chunks...")
        
        # Level 0: Create leaf nodes
        embeddings = self._embed_texts(leaf_texts)
        for i, (text, emb) in enumerate(zip(leaf_texts, embeddings)):
            node = RaptorNode(text=text, level=0, embedding=emb)
            self.all_nodes.append(node)
        
        current_level_nodes = list(range(len(leaf_texts)))
        current_texts = leaf_texts
        current_embeddings = embeddings
        
        # Build higher levels
        for level in range(1, self.max_levels + 1):
            if len(current_level_nodes) <= 1:
                print(f"  Level {level}: Only 1 node remaining. Tree complete.")
                break
            
            print(f"  Level {level}: Clustering {len(current_level_nodes)} nodes...")
            labels = self._cluster(current_embeddings)
            unique_clusters = np.unique(labels)
            
            new_level_nodes = []
            new_texts = []
            
            for cluster_id in unique_clusters:
                cluster_mask = labels == cluster_id
                cluster_texts = [current_texts[i] for i in range(len(current_texts)) if cluster_mask[i]]
                cluster_node_indices = [current_level_nodes[i] for i in range(len(current_level_nodes)) if cluster_mask[i]]
                
                print(f"    Cluster {cluster_id}: {len(cluster_texts)} nodes β†’ summarizing...")
                summary = self._summarize_cluster(cluster_texts)
                
                # Create summary node
                summary_node = RaptorNode(
                    text=summary,
                    level=level,
                    cluster_id=int(cluster_id),
                    children=cluster_node_indices
                )
                self.all_nodes.append(summary_node)
                new_level_nodes.append(len(self.all_nodes) - 1)
                new_texts.append(summary)
            
            # Embed the new summary nodes
            new_embeddings = self._embed_texts(new_texts)
            for idx, node_idx in enumerate(new_level_nodes):
                self.all_nodes[node_idx].embedding = new_embeddings[idx]
            
            current_level_nodes = new_level_nodes
            current_texts = new_texts
            current_embeddings = new_embeddings
            print(f"  Level {level}: Created {len(new_level_nodes)} summary nodes")
        
        levels_built = max(node.level for node in self.all_nodes) + 1
        print(f"\nRAPTOR tree complete: {len(self.all_nodes)} total nodes across {levels_built} levels")
    
    def retrieve(self, query: str, k: int = 5) -> list[RaptorNode]:
        """Retrieve top-k nodes from ALL levels of the tree."""
        query_emb = np.array(self.embeddings_model.embed_query(query))
        
        scores = []
        for node in self.all_nodes:
            if node.embedding is not None:
                # Cosine similarity
                sim = np.dot(query_emb, node.embedding) / (
                    np.linalg.norm(query_emb) * np.linalg.norm(node.embedding) + 1e-9
                )
                scores.append((sim, node))
        
        scores.sort(key=lambda x: x[0], reverse=True)
        return [node for _, node in scores[:k]]

print("RAPTOR classes defined successfully")
# Build a RAPTOR index on our sample document

embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0, openai_api_key=OPENAI_API_KEY)

# Use sentence-level splits as leaf nodes
import re
sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', SAMPLE_DOCUMENT.strip()) if len(s.strip()) > 30]

raptor = RaptorIndex(
    embeddings_model=embeddings,
    llm=llm,
    max_levels=2,
    cluster_method="kmeans",
    n_clusters_per_level=3
)

raptor.build(sentences)
# Test RAPTOR retrieval on different query types

test_queries = [
    # Specific query (should hit leaf nodes)
    ("What year did Acme acquire TechStartup Inc. and for how much?",
     "Specific β€” should match leaf-level chunk"),
    
    # Abstract query (should hit summary nodes)
    ("Describe the company's overall strategic evolution from founding to present.",
     "Abstract β€” should match high-level summary nodes"),
    
    # Multi-hop query
    ("Who became CEO after Sarah Chen and what was their background?",
     "Multi-hop β€” requires connecting acquisition and leadership facts"),
]

for query, query_type in test_queries:
    print(f"\nQUERY: {query}")
    print(f"Type: {query_type}")
    print("-" * 70)
    
    results = raptor.retrieve(query, k=3)
    for node in results:
        print(f"  [Level {node.level}] {node.text[:150]}{'...' if len(node.text) > 150 else ''}")
    print()

Part 5: Reranking with Cohere

The Two-Stage Retrieval Pattern

Stage 1 (Recall):  BM25 or vector search → top-50 candidates (fast, approximate)
Stage 2 (Precision): Cross-encoder reranker → top-5 results (slow, accurate)
Stage 3 (Generation): LLM uses reranked context → answer

Why rerank? Vector search uses bi-encoders: query and document are embedded independently. Cross-encoders (like Cohere Rerank) see both query and document together, giving dramatically better relevance scores at the cost of more compute.
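The pattern can be made concrete with a self-contained toy: a stripped-down BM25 as the stage-1 recall scorer and a token-overlap score standing in for the stage-2 cross-encoder. Both scorers are simplified illustrations (not the production BM25 or Cohere models), but the shape of the pipeline is the same:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Minimal BM25: the cheap stage-1 recall scorer."""
    toks = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in toks) / len(toks)
    N = len(docs)
    df = Counter(w for t in toks for w in set(t))
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for w in query.lower().split():
            if w not in tf:
                continue
            idf = math.log(1 + (N - df[w] + 0.5) / (df[w] + 0.5))
            s += idf * tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores

def toy_cross_score(query, doc):
    """Stage-2 stand-in: fraction of query tokens present in the doc."""
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / len(q)

def two_stage_retrieve(query, docs, recall_k=4, top_n=2):
    # Stage 1: fast scoring over the whole corpus, keep a broad candidate set
    s1 = bm25_scores(query, docs)
    candidates = sorted(range(len(docs)), key=lambda i: s1[i], reverse=True)[:recall_k]
    # Stage 2: expensive pairwise rescoring of the short candidate list only
    reranked = sorted(candidates, key=lambda i: toy_cross_score(query, docs[i]), reverse=True)
    return [docs[i] for i in reranked[:top_n]]

corpus = [
    "sarah chen retired in 2018",
    "james park became ceo in 2018",
    "revenue grew to 8.2 billion",
    "the company went public in 2013",
    "acme acquired techstartup for 450 million",
]
print(two_stage_retrieve("who became ceo in 2018", corpus))
```

Stage 1 is cheap enough to run over the whole corpus; stage 2 only ever scores recall_k candidates, which is what makes an expensive pairwise model affordable.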

Cohere Rerank 3.5

  • State-of-the-art multilingual reranker

  • Supports documents up to 4096 tokens

  • Returns relevance scores in [0, 1]

  • API: co.rerank(query, documents, model="rerank-v3.5", top_n=5)

# Cohere Reranking
import cohere

co = cohere.Client(api_key=COHERE_API_KEY)

# Simulate Stage 1: retrieve a broad set of candidates
# In practice this would come from vector search or BM25
candidate_documents = [
    "Acme Corporation was founded in 1985 by Sarah Chen and Marcus Webb in San Francisco.",
    "In 1992, Acme pivoted to internet infrastructure as the web began to emerge.",
    "The 2001 dot-com crash hit Acme hard. They lost 60% of their revenue in 18 months.",
    "Sarah Chen made the controversial decision to lay off 300 employees and refocus on cloud storage.",
    "By 2008, Acme's CloudVault product had 2 million users.",
    "In 2009, Acme acquired TechStartup Inc., founded by Dr. James Park, for $450 million.",
    "Between 2010 and 2015, Acme expanded internationally to London, Tokyo, and Singapore.",
    "In 2018, the company rebranded to NexCloud Inc. Sarah Chen retired.",
    "Dr. James Park became CEO of NexCloud in 2018 after Sarah Chen's retirement.",
    "NexCloud's 2023 revenue was $8.2B with 45,000 employees across 30 countries.",
    "NexCloud's AI division accounts for 35% of total revenue as of 2023.",
    "The company's market capitalization exceeded $120 billion in 2023.",
]

query = "Who became CEO after Sarah Chen left and what was their background?"

# Stage 2: Rerank with Cohere
results = co.rerank(
    query=query,
    documents=candidate_documents,
    model="rerank-v3.5",
    top_n=5
)

print(f"QUERY: {query}")
print(f"\nTop 5 results after Cohere reranking (from {len(candidate_documents)} candidates):")
print("=" * 70)
for i, result in enumerate(results.results, 1):
    print(f"\n[{i}] Relevance Score: {result.relevance_score:.4f}")
    print(f"    Original Index: {result.index}")
    print(f"    Text: {candidate_documents[result.index]}")
# Local Cross-Encoder Reranking with sentence-transformers
# Use this when you don't want to use an external API

from sentence_transformers import CrossEncoder

# Load a local cross-encoder model
# ms-marco-MiniLM-L-6-v2 is a fast, small model good for reranking
local_reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Prepare query-document pairs
pairs = [(query, doc) for doc in candidate_documents]

# Get scores β€” cross-encoder sees query+document together
scores = local_reranker.predict(pairs)

# Sort by score
ranked = sorted(zip(scores, candidate_documents), key=lambda x: x[0], reverse=True)

print(f"QUERY: {query}")
print(f"\nTop 5 results β€” Local Cross-Encoder Reranking:")
print("=" * 70)
for i, (score, doc) in enumerate(ranked[:5], 1):
    print(f"\n[{i}] Score: {score:.4f}")
    print(f"    {doc}")
# Compare ordering: before vs after reranking
import pandas as pd

# Show the dramatic reordering that happens
local_scores = local_reranker.predict(pairs)

df = pd.DataFrame({
    "Original Rank": range(1, len(candidate_documents) + 1),
    "Document": [d[:60] + "..." if len(d) > 60 else d for d in candidate_documents],
    "Rerank Score": local_scores
})
df["Reranked Position"] = df["Rerank Score"].rank(ascending=False).astype(int)
df["Rank Change"] = df["Original Rank"] - df["Reranked Position"]

df_sorted = df.sort_values("Reranked Position")
print("RERANKING RESULTS:")
print(df_sorted[["Reranked Position", "Original Rank", "Rerank Score", "Document"]].to_string(index=False))

Part 6: HyDE (Hypothetical Document Embeddings)

HyDE (Gao et al., 2022) addresses a fundamental mismatch: queries and documents live in different embedding spaces. A question like "What year was the company founded?" looks very different from the answer "The company was founded in 1985."

HyDE approach:

  1. Use an LLM to generate a hypothetical answer to the query (even if hallucinated)

  2. Embed the hypothetical answer (not the original query)

  3. Use that embedding to search the vector store

The hypothetical answer is in the same linguistic style and topic space as real documents, making the embedding a much better search vector.

Query: "What year was NexCloud founded?"
    ↓ LLM generates hypothetical answer
HyDE doc: "NexCloud was founded in [year]. The company was established by..."
    ↓ embed this hypothetical document
Search with hypothetical embedding → much better recall
# HyDE Implementation
import chromadb
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document

# Build a vector store with our corpus
embeddings_model = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

docs = [Document(page_content=text, metadata={"index": i}) 
        for i, text in enumerate(candidate_documents)]

vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings_model,
    collection_name="acme_hyde_demo"
)

print(f"Vector store built with {len(docs)} documents")
# HyDE retrieval function

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.3, openai_api_key=OPENAI_API_KEY)

HYDE_SYSTEM_PROMPT = """You are a knowledgeable assistant. Given a question, write a short, 
factual-sounding passage that would directly answer the question. Write as if from a corporate 
history document. Be concise (2-3 sentences). Do not say you don't know; generate a 
plausible hypothetical answer based on common patterns."""

def hyde_retrieve(query: str, vectorstore, k: int = 3, verbose: bool = True) -> list:
    """
    HyDE (Hypothetical Document Embedding) retrieval.
    
    1. Generate hypothetical answer with LLM
    2. Embed the hypothetical answer
    3. Use that embedding as the search vector
    """
    # Step 1: Generate hypothetical document
    response = llm.invoke([
        SystemMessage(content=HYDE_SYSTEM_PROMPT),
        HumanMessage(content=f"Question: {query}")
    ])
    hypothetical_doc = response.content.strip()
    
    if verbose:
        print(f"HYPOTHETICAL DOCUMENT:")
        print(f"  '{hypothetical_doc}'")
        print()
    
    # Step 2: Use the hypothetical document for retrieval
    # The vectorstore will embed it and find similar real documents
    results = vectorstore.similarity_search(hypothetical_doc, k=k)
    return results

def standard_retrieve(query: str, vectorstore, k: int = 3) -> list:
    """Standard retrieval: embed the query directly."""
    return vectorstore.similarity_search(query, k=k)

# Compare standard vs HyDE retrieval
test_query = "When was the company originally established and by whom?"

print("=" * 70)
print(f"QUERY: {test_query}")
print("=" * 70)

print("\n--- STANDARD RETRIEVAL ---")
std_results = standard_retrieve(test_query, vectorstore)
for i, doc in enumerate(std_results, 1):
    print(f"[{i}] {doc.page_content}")

print("\n--- HyDE RETRIEVAL ---")
hyde_results = hyde_retrieve(test_query, vectorstore)
print("Retrieved documents:")
for i, doc in enumerate(hyde_results, 1):
    print(f"[{i}] {doc.page_content}")
# Test HyDE on a more challenging abstract query
abstract_query = "How did the leadership transition unfold at the company?"

print("=" * 70)
print(f"QUERY: {abstract_query}")
print("=" * 70)

print("\n--- STANDARD RETRIEVAL ---")
for i, doc in enumerate(standard_retrieve(abstract_query, vectorstore), 1):
    print(f"[{i}] {doc.page_content}")

print("\n--- HyDE RETRIEVAL ---")
for i, doc in enumerate(hyde_retrieve(abstract_query, vectorstore), 1):
    print(f"[{i}] {doc.page_content}")

Part 7: Complete Advanced RAG Pipeline

Now we combine the core techniques into a production-grade pipeline. (ColBERT is omitted from this implementation for simplicity; it would slot in as an additional reranking stage between vector retrieval and Cohere Rerank.)

Document
    ↓ Semantic Chunking (LangChain SemanticChunker)
Semantic Chunks → Vector Store (ChromaDB)

Query
    ↓ HyDE query expansion (optional, at query time)
    ↓ Top-k candidate retrieval from the vector store
    ↓ Cohere Rerank → top-n
Final Context
    ↓ LLM generation (GPT-4o-mini)
Answer
# Complete Advanced RAG Pipeline

import cohere
from langchain_experimental.text_splitter import SemanticChunker

class AdvancedRAGPipeline:
    """
    Production-grade RAG pipeline combining:
    - Semantic chunking
    - HyDE query expansion
    - Vector store retrieval
    - Cohere reranking
    - LLM answer generation
    """
    
    def __init__(
        self,
        openai_api_key: str,
        cohere_api_key: str,
        use_hyde: bool = True,
        use_reranking: bool = True,
        retrieval_k: int = 20,
        final_k: int = 5
    ):
        self.use_hyde = use_hyde
        self.use_reranking = use_reranking
        self.retrieval_k = retrieval_k
        self.final_k = final_k
        
        self.embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
        self.llm = ChatOpenAI(model="gpt-4o-mini", temperature=0, openai_api_key=openai_api_key)
        self.co = cohere.Client(api_key=cohere_api_key)
        self.vectorstore = None
    
    def ingest(self, document: str, collection_name: str = "advanced_rag"):
        """Chunk document semantically and index into vector store."""
        print("[1/2] Semantic chunking...")
        chunker = SemanticChunker(
            self.embeddings,
            breakpoint_threshold_type="percentile",
            breakpoint_threshold_amount=90
        )
        chunks = chunker.split_text(document)
        print(f"      Created {len(chunks)} semantic chunks")
        
        print("[2/2] Indexing into vector store...")
        docs = [Document(page_content=c, metadata={"chunk_id": i}) for i, c in enumerate(chunks)]
        self.vectorstore = Chroma.from_documents(
            documents=docs,
            embedding=self.embeddings,
            collection_name=collection_name
        )
        print(f"      Indexed {len(docs)} documents")
        return self
    
    def _expand_query_hyde(self, query: str) -> str:
        """Generate a hypothetical document for HyDE retrieval."""
        response = self.llm.invoke([
            SystemMessage(content=HYDE_SYSTEM_PROMPT),
            HumanMessage(content=f"Question: {query}")
        ])
        return response.content.strip()
    
    def retrieve_and_rerank(self, query: str, verbose: bool = False) -> list[str]:
        """Full retrieval pipeline with optional HyDE + reranking."""
        # Stage 1: Query expansion (optional)
        search_query = query
        if self.use_hyde:
            search_query = self._expand_query_hyde(query)
            if verbose:
                print(f"  HyDE expansion: '{search_query[:100]}...'")
        
        # Stage 2: Vector retrieval
        raw_results = self.vectorstore.similarity_search(search_query, k=self.retrieval_k)
        candidate_texts = [doc.page_content for doc in raw_results]
        
        if verbose:
            print(f"  Retrieved {len(candidate_texts)} candidates from vector store")
        
        if not candidate_texts:
            return []
        
        # Stage 3: Reranking (optional)
        if self.use_reranking and len(candidate_texts) > self.final_k:
            rerank_results = self.co.rerank(
                query=query,  # use original query for reranking, not HyDE
                documents=candidate_texts,
                model="rerank-v3.5",
                top_n=self.final_k
            )
            final_docs = [candidate_texts[r.index] for r in rerank_results.results]
            if verbose:
                print(f"  Reranked to {len(final_docs)} documents")
        else:
            final_docs = candidate_texts[:self.final_k]
        
        return final_docs
    
    def answer(self, query: str, verbose: bool = False) -> dict:
        """Full RAG pipeline: retrieve β†’ rerank β†’ generate."""
        context_docs = self.retrieve_and_rerank(query, verbose=verbose)
        
        if not context_docs:
            return {"query": query, "answer": "No relevant context found.", "context": []}
        
        context_str = "\n\n".join(f"[{i+1}] {doc}" for i, doc in enumerate(context_docs))
        
        response = self.llm.invoke([
            SystemMessage(content="You are a precise question-answering assistant. Answer based ONLY on the provided context. Be concise and cite specific facts."),
            HumanMessage(content=f"Context:\n{context_str}\n\nQuestion: {query}")
        ])
        
        return {
            "query": query,
            "answer": response.content.strip(),
            "context": context_docs
        }

print("AdvancedRAGPipeline class defined")
# Build the advanced pipeline
pipeline = AdvancedRAGPipeline(
    openai_api_key=OPENAI_API_KEY,
    cohere_api_key=COHERE_API_KEY,
    use_hyde=True,
    use_reranking=True,
    retrieval_k=10,
    final_k=4
)

pipeline.ingest(SAMPLE_DOCUMENT, collection_name="advanced_rag_demo")
# Benchmark: Naive RAG vs Advanced RAG

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Build a naive RAG for comparison
naive_chunks_for_rag = RecursiveCharacterTextSplitter(
    chunk_size=200, chunk_overlap=20
).split_text(SAMPLE_DOCUMENT)

naive_docs = [Document(page_content=c) for c in naive_chunks_for_rag]
naive_vectorstore = Chroma.from_documents(
    documents=naive_docs,
    embedding=OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY),
    collection_name="naive_rag_demo"
)

llm_for_naive = ChatOpenAI(model="gpt-4o-mini", temperature=0, openai_api_key=OPENAI_API_KEY)

def naive_rag_answer(query: str) -> str:
    docs = naive_vectorstore.similarity_search(query, k=3)
    context = "\n\n".join(d.page_content for d in docs)
    resp = llm_for_naive.invoke([
        SystemMessage(content="Answer based ONLY on context. Be concise."),
        HumanMessage(content=f"Context:\n{context}\n\nQuestion: {query}")
    ])
    return resp.content.strip()

# Test questions
benchmark_questions = [
    "What were the major strategic pivots Acme made over its history?",
    "Who founded the AI division and through what mechanism did it come to be part of the company?",
    "Describe the CEO succession history of the company.",
]

print("BENCHMARK: NAIVE RAG vs ADVANCED RAG")
print("=" * 70)

for q in benchmark_questions:
    print(f"\nQ: {q}")
    print("-" * 70)
    
    naive_ans = naive_rag_answer(q)
    print(f"NAIVE RAG:\n  {naive_ans}")
    
    adv_result = pipeline.answer(q, verbose=False)
    print(f"\nADVANCED RAG:\n  {adv_result['answer']}")
    print()
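Eyeballing the two answers works for a demo, but a lightweight automatic check is keyword coverage: what fraction of expected key facts appear in each answer. Below is a minimal sketch; the `expected` fact list is hypothetical (you would fill in the real facts for your corpus), and this is a crude proxy, not a substitute for human or LLM-based judging.

```python
def keyword_coverage(answer: str, expected_facts: list[str]) -> float:
    """Fraction of expected key facts that appear (case-insensitively) in an answer."""
    if not expected_facts:
        return 0.0
    answer_lower = answer.lower()
    hits = sum(1 for fact in expected_facts if fact.lower() in answer_lower)
    return hits / len(expected_facts)

# Hypothetical expected facts for one benchmark question:
expected = ["2009", "hardware", "cloud", "AI"]
print(keyword_coverage("Acme pivoted from hardware to cloud services, then to AI.", expected))
# 0.75 -- three of four facts present
```

Scoring both pipelines' answers this way over the benchmark set gives a quick, repeatable signal for A/B-ing retrieval configurations.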

Summary: Choosing the Right Technique

| Technique | When to Use | Key Benefit | Tradeoff |
|---|---|---|---|
| Semantic Chunking | Always (replace fixed-size) | Better chunk coherence | Slightly higher indexing cost |
| Proposition Chunking | High-precision Q&A, fact retrieval | Maximum granularity | Many LLM calls during indexing |
| ColBERT | When precision matters on small corpora | Token-level matching | High storage (N tokens × 128 dims) |
| RAPTOR | Long docs, abstract/holistic queries | Multi-level understanding | Expensive to build (many LLM calls) |
| Cohere Rerank | Almost always (cheap, high ROI) | Dramatic precision improvement | Extra API call, ~50ms latency |
| HyDE | When queries are very different from document style | Bridges query-doc gap | Extra LLM call, may introduce noise |
| Full Advanced Pipeline | Production systems, high-quality requirements | Best overall performance | Higher latency + cost |
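To make the ColBERT storage tradeoff concrete, a back-of-envelope comparison. The corpus size and token counts below are made-up assumptions (10,000 chunks of ~100 tokens each, fp16 ColBERT token vectors vs. one fp32 1536-dim embedding per chunk):

```python
# Back-of-envelope storage comparison: single-vector index vs ColBERT multi-vector index.
num_chunks = 10_000       # hypothetical corpus size
tokens_per_chunk = 100    # hypothetical average chunk length

# Single-vector: one 1536-dim fp32 embedding per chunk (4 bytes/dim).
single_vector_bytes = num_chunks * 1536 * 4

# ColBERT: one 128-dim fp16 vector per *token* (2 bytes/dim).
colbert_bytes = num_chunks * tokens_per_chunk * 128 * 2

print(f"Single-vector index: {single_vector_bytes / 1e6:.0f} MB")  # 61 MB
print(f"ColBERT index:       {colbert_bytes / 1e6:.0f} MB")        # 256 MB
print(f"Ratio: {colbert_bytes / single_vector_bytes:.1f}x")        # 4.2x
```

Even with fp16 compression, the per-token index is several times larger, which is why the table recommends ColBERT for small corpora or as a second-stage reranker.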

Key Takeaways

  1. Reranking has the best ROI: It's cheap (one extra API call, roughly 50ms of latency) and dramatically improves precision. Use it by default.

  2. Semantic chunking over fixed-size chunking: Almost always better with minimal added cost.

  3. RAPTOR is expensive but powerful: Reserve it for truly long documents or questions requiring global understanding.

  4. ColBERT is best as a reranker: Its token-level precision is overkill for first-stage retrieval but excellent for reranking a small candidate set.

  5. HyDE helps with abstract queries: When users ask "explain X" or "how does Y work" rather than keyword-style queries.
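The token-level matching behind takeaway 4 is ColBERT's MaxSim operator: each query token is scored against its best-matching document token, and the per-token maxima are summed. Below is a toy NumPy sketch of MaxSim with hand-made 2-d "embeddings", not RAGatouille's actual implementation:

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token, take its best
    (max dot-product) match among the document tokens, then sum over query tokens."""
    # query_emb: (num_query_tokens, dim), doc_emb: (num_doc_tokens, dim)
    sim = query_emb @ doc_emb.T          # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())  # best doc token per query token

# Toy example: 2 query tokens, 3-token documents, dim=2.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]])  # exact match for each query token
doc_b = np.array([[0.7, 0.7], [0.7, 0.7], [0.7, 0.7]])  # only diffuse partial matches

print(maxsim_score(q, doc_a))  # 2.0
print(maxsim_score(q, doc_b))  # 1.4
```

Because every query token must find its own best match, documents that cover all query terms score higher than documents that are only vaguely similar overall, which is exactly the precision property that makes ColBERT useful as a reranker.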