Speculative Decoding: 2-3x Faster LLM Inference¶

Use a small draft model to propose tokens, then verify them with the large model in parallel — matching the large model's output quality while approaching the small model's speed.

The Core Idea¶

Standard autoregressive decoding generates one token at a time:

large_model → token 1 → token 2 → token 3 ... (slow)

Speculative decoding generates K tokens in parallel:

small_model (draft):    token 1, token 2, token 3, token 4, token 5  (fast, K tokens)
large_model (verify):   ✓        ✓        ✓        ✗        -        (accept 3, correct token 4, discard token 5)
                        ↑ all K tokens checked in ONE forward pass of the large model!

Result: Same quality as large model, 2-3x faster throughput.

Why It Works¶

The large model verifies K draft tokens in a single forward pass (because it processes the full sequence, not just the next token). If most draft tokens are accepted, you get K tokens per large model pass instead of 1.
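
The accept/reject step can be sketched in miniature. The toy below assumes greedy decoding and represents the target model as a plain next-token function; note that a real implementation scores every draft position in one target forward pass, while the loop here is sequential only for clarity:

```python
def verify(draft_tokens, target_next_token, context):
    """Greedy verification: accept draft tokens until the first mismatch,
    then take the target's own prediction at that position as a bonus token."""
    accepted = []
    for tok in draft_tokens:
        expected = target_next_token(context + accepted)
        if tok == expected:
            accepted.append(tok)       # draft agreed with the target
        else:
            accepted.append(expected)  # target's correction, at no extra cost
            break
    return accepted

# Toy demo: the "target" always continues 1, 2, 3, ...; the draft guesses [1, 2, 9, 9]
target = lambda seq: len(seq) + 1
print(verify([1, 2, 9, 9], target, []))  # [1, 2, 3]: two tokens accepted, one corrected
```

Even on a rejection, the pass is never wasted: the target's verification logits already contain the correct next token, so every pass emits at least one token.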

Acceptance rate determines speedup:

  • 80% acceptance rate → ~3x speedup

  • 60% acceptance rate → ~2x speedup

  • 40% acceptance rate → ~1.5x speedup (not worth it)
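
These rules of thumb follow from a simple expectation: if each draft token is accepted independently with probability \(\alpha\), then \(K\) drafts plus the target's bonus token yield \((1 - \alpha^{K+1}) / (1 - \alpha)\) tokens per target pass (the standard speculative-sampling analysis). A quick sketch:

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    # Expected tokens produced per target forward pass when each of K draft
    # tokens is accepted independently with probability alpha, plus the
    # target's bonus token after a rejection: (1 - alpha^(K+1)) / (1 - alpha).
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.8, 0.6, 0.4):
    print(f'{alpha:.0%} acceptance, K=5 -> '
          f'{expected_tokens_per_pass(alpha, 5):.2f} tokens per target pass')
```

Wall-clock speedup is somewhat lower than this token count, since running the draft model is not free.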

# Install dependencies
# !pip install transformers torch accelerate

1. Speculative Decoding with Hugging Face¶

Hugging Face Transformers supports speculative decoding natively through the assistant_model parameter in .generate(). You load two models – a large target model and a small draft model from the same family (sharing the same tokenizer). During generation, the draft model proposes \(K\) tokens quickly, then the target model verifies all \(K\) tokens in a single forward pass. Accepted tokens become part of the output; rejected tokens are regenerated by the target model. The key constraint is that both models must use the same tokenizer, which is why models from the same family (e.g., Llama 3 70B + Llama 3 8B) produce the best acceptance rates.

import torch
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

device = 'cuda' if torch.cuda.is_available() else 'cpu'

def load_model_pair(target_model_id: str, draft_model_id: str) -> tuple:
    """
    Load target (large) and draft (small) model pair.
    
    Good pairs:
    - Target: Llama 3 70B, Draft: Llama 3 8B (same family!)
    - Target: GPT-2 XL (1.5B), Draft: GPT-2 (124M)
    
    Key requirement: Same tokenizer (or compatible vocabulary)
    """
    tokenizer = AutoTokenizer.from_pretrained(target_model_id)
    
    # Large target model
    target = AutoModelForCausalLM.from_pretrained(
        target_model_id,
        torch_dtype=torch.float16,
        device_map='auto'
    )
    
    # Small draft model
    draft = AutoModelForCausalLM.from_pretrained(
        draft_model_id,
        torch_dtype=torch.float16,
        device_map='auto'
    )
    
    return target, draft, tokenizer

def generate_with_speculative_decoding(
    target_model,
    draft_model,
    tokenizer,
    prompt: str,
    max_new_tokens: int = 200,
    num_assistant_tokens: int = 5  # K draft tokens per verification step
) -> dict:
    """
    Generate text using speculative decoding.
    HuggingFace supports this natively via assistant_model parameter.
    """
    inputs = tokenizer(prompt, return_tensors='pt').to(device)
    input_len = inputs.input_ids.shape[1]
    
    start = time.time()
    
    # Speculative decoding: pass draft model as assistant
    outputs = target_model.generate(
        **inputs,
        assistant_model=draft_model,         # ← This enables speculative decoding!
        max_new_tokens=max_new_tokens,
        num_assistant_tokens=num_assistant_tokens,  # K draft tokens
        do_sample=False                      # Greedy for deterministic output
    )
    
    elapsed = time.time() - start
    new_tokens = outputs.shape[1] - input_len
    
    return {
        'text': tokenizer.decode(outputs[0][input_len:], skip_special_tokens=True),
        'tokens': new_tokens,
        'time': elapsed,
        'tokens_per_sec': new_tokens / elapsed
    }

def generate_standard(target_model, tokenizer, prompt: str, max_new_tokens: int = 200) -> dict:
    """Standard generation (no speculative decoding) for comparison."""
    inputs = tokenizer(prompt, return_tensors='pt').to(device)
    input_len = inputs.input_ids.shape[1]
    start = time.time()
    outputs = target_model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    elapsed = time.time() - start
    new_tokens = outputs.shape[1] - input_len
    return {'text': tokenizer.decode(outputs[0][input_len:], skip_special_tokens=True), 'tokens': new_tokens, 'time': elapsed, 'tokens_per_sec': new_tokens / elapsed}

print('Speculative decoding functions ready.')
print()
print('Usage:')
print('  target, draft, tok = load_model_pair("gpt2-xl", "gpt2")')
print('  result = generate_with_speculative_decoding(target, draft, tok, "Hello, world")')
print("  # → {'text': ..., 'tokens': ..., 'time': ..., 'tokens_per_sec': ...}")

2. Benchmark: Standard vs. Speculative Decoding¶

To quantify the speedup from speculative decoding, we compare tokens-per-second between standard autoregressive generation and speculative generation on the same prompts. The benchmark function below runs both methods, measures wall-clock time, and computes the speedup ratio. On CPU with small models (GPT-2 family), speedups are modest (~1.5x) because the overhead of running two models partially offsets the gains. On GPU with larger model pairs (Llama 3 70B + 8B), speedups of 2-3x are typical because the verification step is heavily parallelized and the draft model adds negligible cost relative to the target.

# Quick demo using GPT-2 family (small, runs on CPU)
# For real speedup, use GPU with larger model pairs

def benchmark_speculative_vs_standard(
    target_id: str = 'gpt2',
    draft_id: str = 'distilgpt2',
    prompts: list = None
) -> None:
    """Compare speculative vs standard decoding on the same prompts."""
    if prompts is None:
        prompts = [
            'The future of artificial intelligence is',
            'In machine learning, the most important concept is',
            'Python is a programming language that',
        ]
    
    tokenizer = AutoTokenizer.from_pretrained(target_id)
    # Move both models to the same device that generate_standard() sends inputs to
    target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.float32).to(device)
    draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.float32).to(device)
    
    print(f'Model pair: {target_id} (target) + {draft_id} (draft)\n')
    
    total_standard_tps = 0
    total_speculative_tps = 0
    
    for prompt in prompts:
        std = generate_standard(target, tokenizer, prompt, max_new_tokens=50)
        spec = generate_with_speculative_decoding(target, draft, tokenizer, prompt, max_new_tokens=50)
        
        speedup = spec['tokens_per_sec'] / std['tokens_per_sec']
        total_standard_tps += std['tokens_per_sec']
        total_speculative_tps += spec['tokens_per_sec']
        
        print(f'Prompt: "{prompt[:40]}..."')
        print(f'  Standard:    {std["tokens_per_sec"]:6.1f} tokens/sec ({std["time"]:.2f}s)')
        print(f'  Speculative: {spec["tokens_per_sec"]:6.1f} tokens/sec ({spec["time"]:.2f}s)')
        print(f'  Speedup: {speedup:.2f}x')
        print()
    
    avg_speedup = total_speculative_tps / total_standard_tps
    print(f'Average speedup: {avg_speedup:.2f}x')

# Note: Full benchmark needs GPU for meaningful results
print('benchmark_speculative_vs_standard() ready.')
print('Run on GPU for meaningful speedup numbers.')
print()
print('Expected results on GPU (Llama 3 70B + Llama 3 8B):')
print('  Standard:    ~20 tokens/sec')
print('  Speculative: ~50 tokens/sec')
print('  Speedup:     ~2.5x')

3. Choosing Draft-Target Pairs¶

The effectiveness of speculative decoding hinges on the acceptance rate – the fraction of draft tokens that the target model agrees with. Higher acceptance rates mean more tokens per verification pass and greater speedup. The ideal draft model is from the same family as the target (ensuring compatible tokenizers and similar output distributions), but much smaller. A 3B draft model paired with a 70B target typically achieves 75-80% acceptance rates, translating to 2-3x speedup. Cross-family pairs (e.g., Mistral draft for Llama target) work but with lower acceptance rates due to distribution mismatch.
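
One sanity check worth running before committing to a pair is vocabulary compatibility. The helper below is a minimal sketch (`vocabs_compatible` is a hypothetical name, demonstrated on plain dicts); with real models you would pass `AutoTokenizer.from_pretrained(...).get_vocab()` for each side:

```python
def vocabs_compatible(target_vocab: dict, draft_vocab: dict) -> bool:
    # Speculative decoding splices draft token ids directly into the target's
    # sequence, so every draft token must map to the same id in the target vocab.
    return all(target_vocab.get(tok) == idx for tok, idx in draft_vocab.items())

# With real models (requires transformers + network access), e.g.:
#   t = AutoTokenizer.from_pretrained('gpt2-xl').get_vocab()
#   d = AutoTokenizer.from_pretrained('distilgpt2').get_vocab()
#   vocabs_compatible(t, d)  # GPT-2 family shares one vocabulary

print(vocabs_compatible({'the': 0, 'cat': 1}, {'the': 0}))  # True
print(vocabs_compatible({'the': 0, 'cat': 1}, {'the': 5}))  # False
```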

# The draft model MUST share the same tokenizer as the target
# Best results when draft is from the same model family

GOOD_MODEL_PAIRS = [
    {
        'target': 'meta-llama/Meta-Llama-3.1-70B-Instruct',
        'draft':  'meta-llama/Llama-3.2-3B-Instruct',
        'speedup': '2-3x',
        'acceptance_rate': '~80%',
        'notes': 'Best combo — same family, same tokenizer'
    },
    {
        'target': 'mistralai/Mixtral-8x7B-Instruct-v0.1',
        'draft':  'mistralai/Mistral-7B-Instruct-v0.2',
        'speedup': '1.5-2x',
        'acceptance_rate': '~70%',
        'notes': 'Good pair, same vocabulary'
    },
    {
        'target': 'Qwen/Qwen2.5-72B-Instruct',
        'draft':  'Qwen/Qwen2.5-7B-Instruct',
        'speedup': '2-2.5x',
        'acceptance_rate': '~75%',
        'notes': 'Qwen family, excellent acceptance rate'
    },
    {
        'target': 'gpt2-xl',
        'draft':  'distilgpt2',
        'speedup': '1.5x',
        'acceptance_rate': '~65%',
        'notes': 'CPU-friendly demo pair'
    },
]

print('Recommended draft-target model pairs:')
print()
for pair in GOOD_MODEL_PAIRS:
    print(f"Target: {pair['target'].split('/')[-1]}")
    print(f"Draft:  {pair['draft'].split('/')[-1]}")
    print(f"  Speedup: {pair['speedup']}, Acceptance: {pair['acceptance_rate']}")
    print(f"  {pair['notes']}")
    print()

4. Self-Speculative Decoding (No Draft Model Needed)¶

Self-speculative decoding eliminates the need for a separate draft model by reusing parts of the target model itself for token proposal. Several approaches have emerged: Medusa adds lightweight prediction heads that output tokens at positions \(t+1, t+2, \ldots, t+K\) simultaneously; LayerSkip uses early exit from shallow layers as a fast draft; Lookahead Decoding generates multiple token branches in parallel without any extra training; and EAGLE-2 trains a small draft head on intermediate representations. These methods are especially valuable when no suitable same-family smaller model exists, or when you want to avoid the memory overhead of loading a second model.
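
To make the Medusa idea concrete, here is a shape-level toy with random weights and no training, purely to illustrate how K extra linear heads read one hidden state and emit K draft tokens at once (this is an illustration of the concept, not Medusa's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, vocab_size, K = 16, 100, 4

# K extra linear heads, each responsible for one future offset t+1 .. t+K.
heads = [rng.standard_normal((hidden_dim, vocab_size)) for _ in range(K)]

def propose(hidden_state: np.ndarray) -> list:
    """One read of the final hidden state yields K draft tokens simultaneously."""
    return [int(np.argmax(hidden_state @ W)) for W in heads]

draft_tokens = propose(rng.standard_normal(hidden_dim))
print(len(draft_tokens))  # 4 draft tokens, with no separate draft model
```

The target model then verifies these K proposals in one forward pass, exactly as in two-model speculative decoding.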

# Self-speculative decoding: use the SAME model but exit early for drafting
# No separate draft model needed — shallow layers draft, the full model verifies
# Used in: Draft & Verify, LayerSkip (Meta), Lookahead Decoding

# Medusa approach: add multiple prediction heads to predict K future tokens simultaneously
# (Medusa is installed from its GitHub repository rather than via a plain pip package)

print('Self-speculative decoding approaches (2024-2025):')
print()
SELF_SPEC_METHODS = [
    {
        'name': 'Medusa',
        'idea': 'Add K extra heads to predict tokens at positions t+1, t+2, ..., t+K',
        'speedup': '2-3x',
        'requirement': 'Needs fine-tuning to add heads'
    },
    {
        'name': 'LayerSkip (Meta)',
        'idea': 'Exit early from shallow layers for drafting, use full model for verification',
        'speedup': '1.5-2x',
        'requirement': 'Needs special training'
    },
    {
        'name': 'Lookahead Decoding',
        'idea': 'Generate multiple token branches in parallel, pick the best one',
        'speedup': '1.5x',
        'requirement': 'No training needed — plug-and-play'
    },
    {
        'name': 'EAGLE-2',
        'idea': 'Lightweight draft head trained on feature-level representations',
        'speedup': '3-4x',
        'requirement': 'Small draft head to train'
    },
]

for m in SELF_SPEC_METHODS:
    print(f"{m['name']} ({m['speedup']})")
    print(f"  Idea: {m['idea']}")
    print(f"  Req:  {m['requirement']}")
    print()

5. vLLM – Production Speculative Decoding¶

vLLM combines speculative decoding with PagedAttention and continuous batching for maximum production throughput. PagedAttention manages GPU memory like virtual memory, allowing the server to handle many more concurrent requests. Continuous batching dynamically groups requests into batches as they arrive, maximizing GPU utilization. When speculative decoding is layered on top, per-request latency drops by 2-3x while throughput remains high. The configuration below shows how to enable speculative decoding with a Llama 3 draft-target pair across multiple GPUs using tensor parallelism. vLLM also exposes an OpenAI-compatible API, making it a drop-in replacement for cloud APIs.

# vLLM supports speculative decoding with PagedAttention for maximum throughput
# !pip install vllm

# vLLM speculative decoding config (YAML or Python)
# Note: the argument names below follow the older vLLM LLM() API; newer releases
# group them under a single `speculative_config` dict, so check your version's docs.
vllm_speculative_config = {
    'model': 'meta-llama/Meta-Llama-3.1-70B-Instruct',
    'speculative_model': 'meta-llama/Llama-3.2-3B-Instruct',
    'num_speculative_tokens': 5,  # K draft tokens
    'tensor_parallel_size': 2,    # Multi-GPU
    'dtype': 'float16',
}

# Equivalent Python API:
# from vllm import LLM, SamplingParams
# llm = LLM(
#     model='meta-llama/Meta-Llama-3.1-70B-Instruct',
#     speculative_model='meta-llama/Llama-3.2-3B-Instruct',
#     num_speculative_tokens=5,
#     tensor_parallel_size=2
# )
# outputs = llm.generate(['Tell me about AI'], SamplingParams(max_tokens=200))

print('vLLM speculative decoding config:')
import json
print(json.dumps(vllm_speculative_config, indent=2))
print()
print('vLLM advantages for production:')
print('  • PagedAttention: 2-4x more GPU memory efficiency')
print('  • Continuous batching: process multiple requests simultaneously')
print('  • OpenAI-compatible API server')
print('  • Speculative decoding: 2-3x faster per-request latency')
print()
print('Combined speedup: up to 6-8x over naive HF inference')

Summary¶

Situation                          → Recommendation
────────────────────────────────────────────────────────
Single GPU, latency matters        → Speculative decoding (HF assistant_model)
Multi-GPU production server        → vLLM + speculative decoding
No suitable draft model            → Medusa or EAGLE-2 heads
CPU / Apple Silicon                → GGUF Q4_K_M (llama.cpp already optimized)
OpenAI API (no local model)        → Prompt caching (50% savings)

Exercises¶

  1. Run the GPT-2 + DistilGPT-2 benchmark and measure actual speedup on your machine.

  2. Set up vLLM with a local 7B model and measure throughput with vs. without speculative decoding.

  3. Try different values of num_speculative_tokens (2, 5, 10) and measure acceptance rate vs. speedup.

  4. Compare GGUF Q4_K_M vs. standard float16 inference speed for the same model.

  5. Implement a simple token acceptance rate tracker and plot it over a 100-token generation.