Dataset Preparation for LLM Fine-tuning
Phase 12 – Notebook 01
Before you write a single line of training code, you need a high-quality dataset. This notebook covers everything you need to know about preparing data for LLM fine-tuning, from raw data to training-ready format.
What You Will Learn
Why data quality matters more than quantity
The three main dataset formats: instruction-response, conversations, preference pairs
How to load and explore datasets from Hugging Face Hub
Data cleaning: deduplication, quality filtering, length filtering
How to build a custom dataset from scratch
Data augmentation techniques for small datasets
Formatting functions for Llama 3, Qwen 2.5, Phi-4, and Gemma 2
Train/validation split best practices
Saving and versioning your datasets
Prerequisite: Complete 00_START_HERE.ipynb before this notebook.
1. Install Dependencies
# Install required libraries
# Uncomment and run if in a fresh environment (e.g., Colab)
# !pip install -q datasets transformers pandas numpy matplotlib tqdm
# !pip install -q datasketch # For fuzzy deduplication
print("Dependencies ready. Uncomment the lines above if you need to install them.")
2. Imports
import json
import re
import hashlib
import random
import warnings
from collections import Counter, defaultdict
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from datasets import (
Dataset,
DatasetDict,
load_dataset,
concatenate_datasets,
)
from transformers import AutoTokenizer
warnings.filterwarnings("ignore")
random.seed(42)
np.random.seed(42)
print("All imports successful.")
print()
# Verify key library versions
import datasets
import transformers
print(f"datasets: {datasets.__version__}")
print(f"transformers: {transformers.__version__}")
print(f"pandas: {pd.__version__}")
print(f"numpy: {np.__version__}")
3. Why Data Quality Matters More Than Quantity
The single most important factor in fine-tuning success is data quality, not dataset size.
The Research Evidence
LIMA (2023): Fine-tuning Llama on just 1,000 carefully curated examples matched models trained on 50,000+ examples. The paper's conclusion: "Almost all knowledge in large language models is learned during pretraining, with only limited instruction tuning data necessary to teach models to produce high quality output."
Alpaca (2023): 52,000 instruction-following examples generated by GPT-3.5. Quality was mediocre, leading to a model with noticeable hallucinations and inconsistencies.
Orca (2023): Fewer examples, but with chain-of-thought reasoning traces, which produced significantly better model behavior.
The Garbage In, Garbage Out Principle
Low-quality data effects:
- Model learns incorrect facts
- Inconsistent response style
- Hallucinations increase
- Model behavior becomes unpredictable
- Wasted GPU compute
High-quality data effects:
- Consistent, reliable responses
- Correct factual information
- Predictable behavior
- Better generalization
- Faster convergence (fewer epochs needed)
Practical Guidance on Dataset Size
| Task Complexity | Minimum Examples | Sweet Spot | Diminishing Returns |
|---|---|---|---|
| Simple formatting | 100–200 | 500–1K | >5K |
| Domain adaptation | 500–1K | 2K–10K | >50K |
| Instruction following | 1K–2K | 5K–20K | >100K |
| Complex reasoning | 2K–5K | 10K–50K | >200K |
| Alignment (DPO) | 500 pairs | 2K–10K pairs | >50K pairs |
Rule of thumb: If you have 1,000 excellent examples, you will almost certainly outperform 100,000 mediocre ones.
4. Dataset Formats
There are three primary dataset formats for LLM fine-tuning. The format you need depends on your fine-tuning objective.
Format 1: Instruction-Response (Alpaca Format)
Used for: Supervised fine-tuning (SFT) to teach instruction following.
{
"instruction": "Translate the following English text to French.",
"input": "The weather is beautiful today.",
"output": "Le temps est magnifique aujourd'hui."
}
The input field is optional (for tasks that need context beyond the instruction).
Format 2: Conversations (ChatML Format)
Used for: Supervised fine-tuning (SFT) on multi-turn dialogue.
{
  "conversations": [
    {"role": "user", "content": "What is overfitting?"},
    {"role": "assistant", "content": "Overfitting is when a model learns the training data too well, including noise, causing poor performance on new data."}
  ]
}
Each message has a role (system, user, or assistant) and content; a conversation can span multiple turns.
Format 3: Preference Pairs (DPO Format)
Used for: Direct Preference Optimization (DPO) to align with human preferences.
{
"prompt": "Explain why the sky is blue.",
"chosen": "The sky appears blue due to Rayleigh scattering. When sunlight enters Earth's atmosphere, it collides with gas molecules. Blue light has a shorter wavelength and scatters more than red light, so we see a blue sky.",
"rejected": "The sky is blue because that's just how it is. God made it blue or something like that."
}
# Demonstrate all three formats as Python objects
# Format 1: Instruction-Response (Alpaca)
alpaca_example = {
"instruction": "Summarize the following text in one sentence.",
"input": """Machine learning is a branch of artificial intelligence that allows
computers to learn from data without being explicitly programmed. It uses
algorithms to analyze patterns and make decisions with minimal human intervention.""",
"output": "Machine learning is an AI branch where computers learn patterns from data to make decisions automatically."
}
# Format 2: Conversations (ChatML)
chatml_example = {
"conversations": [
{"role": "system", "content": "You are an expert data scientist."},
{"role": "user", "content": "What's the difference between precision and recall?"},
{"role": "assistant", "content": "Precision measures how many of your positive predictions were actually positive (avoiding false alarms). Recall measures how many actual positives you correctly identified (avoiding misses). High precision = few false positives. High recall = few false negatives."},
{"role": "user", "content": "When would I prioritize one over the other?"},
{"role": "assistant", "content": "Prioritize recall when missing a positive is costly (e.g., cancer screening β missing a case is worse than a false alarm). Prioritize precision when false positives are costly (e.g., spam detection β legitimate emails in spam folder is annoying)."}
]
}
# Format 3: Preference Pairs (DPO)
dpo_example = {
"prompt": "How do I get better at programming?",
"chosen": """To improve at programming: (1) Build real projects β don't just follow tutorials.
(2) Read others' code on GitHub. (3) Practice daily, even 30 minutes.
(4) Struggle with problems before looking up answers. (5) Join communities (Reddit, Discord, Stack Overflow).
The key is deliberate practice with feedback loops.""",
"rejected": "Just watch YouTube videos and copy the code. Do a lot of tutorials."
}
print("=== Format 1: Instruction-Response (Alpaca) ===")
print(json.dumps(alpaca_example, indent=2))
print()
print("=== Format 2: Conversations (ChatML) ===")
print(json.dumps(chatml_example, indent=2))
print()
print("=== Format 3: Preference Pairs (DPO) ===")
print(json.dumps(dpo_example, indent=2))
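For single-turn SFT, all three formats carry the same information, so converting between them is mechanical. A minimal sketch of the Alpaca-to-ChatML direction (the helper name alpaca_to_chatml is ours, not a library function):

```python
def alpaca_to_chatml(example, system_prompt="You are a helpful assistant."):
    """Convert an Alpaca-style record into the ChatML conversations format."""
    user_content = example["instruction"]
    if example.get("input"):
        # Fold the optional input/context into the user turn
        user_content += "\n\n" + example["input"]
    return {
        "conversations": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": example["output"]},
        ]
    }

record = {
    "instruction": "Translate the following English text to French.",
    "input": "The weather is beautiful today.",
    "output": "Le temps est magnifique aujourd'hui.",
}
converted = alpaca_to_chatml(record)
print(converted["conversations"][1]["content"])
```

The reverse direction (ChatML to Alpaca) only works cleanly for single-turn conversations; multi-turn data should stay in the conversations format.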
5. Loading Datasets from Hugging Face Hub
The Hugging Face Hub hosts thousands of fine-tuning datasets. Here are popular datasets for each use case.
# Popular fine-tuning datasets reference
popular_datasets = {
"Instruction Following": [
("databricks/databricks-dolly-15k", "15K human-written instruction pairs, open license"),
("tatsu-lab/alpaca", "52K GPT-3.5 generated, Alpaca format"),
("HuggingFaceH4/ultrachat_200k", "200K multi-turn conversations (filtered)"),
("teknium/OpenHermes-2.5", "1M+ GPT-4 quality instruction pairs"),
],
"Coding": [
("iamtarun/python_code_instructions_18k_alpaca", "18K Python coding instructions"),
("codeparrot/github-code", "Large code corpus for continual pre-training"),
("bigcode/the-stack-smol", "Permissive-license code, many languages"),
],
"Math & Reasoning": [
("gsm8k", "8.5K grade school math word problems with solutions"),
("lighteval/MATH", "12.5K competition math problems"),
("nvidia/OpenMathInstruct-2", "14M math instruction pairs from Llama 3"),
],
"Alignment (DPO)": [
("Anthropic/hh-rlhf", "Human preference data from Anthropic"),
("HuggingFaceH4/ultrafeedback_binarized", "250K preference pairs, high quality"),
("argilla/dpo-mix-7k", "7K curated DPO pairs"),
],
}
print("Popular Fine-tuning Datasets on Hugging Face Hub")
print("=" * 60)
for category, datasets_list in popular_datasets.items():
print(f"\n{category}:")
for dataset_id, description in datasets_list:
print(f" {dataset_id}")
print(f" -> {description}")
# Load the Dolly dataset, one of the best open-license instruction datasets
print("Loading databricks/databricks-dolly-15k...")
dolly_dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
print(f"Dataset size: {len(dolly_dataset):,} examples")
print(f"Column names: {dolly_dataset.column_names}")
print(f"Dataset type: {type(dolly_dataset)}")
print()
# Show category distribution
categories = Counter(dolly_dataset["category"])
print("Category distribution:")
for cat, count in sorted(categories.items(), key=lambda x: -x[1]):
bar = "#" * (count // 100)
print(f" {cat:<35} {count:>5} {bar}")
# Explore the dataset structure
print("=" * 60)
print("Sample Examples from Dolly Dataset")
print("=" * 60)
# Show 3 diverse examples
example_categories = ["open_qa", "creative_writing", "summarization"]
for target_cat in example_categories:
# Find first example of this category
example = next(
(ex for ex in dolly_dataset if ex["category"] == target_cat),
None
)
if example:
print(f"\nCategory: {example['category']}")
print(f"Instruction: {example['instruction'][:150]}..." if len(example['instruction']) > 150 else f"Instruction: {example['instruction']}")
if example['context']:
print(f"Context: {example['context'][:100]}..." if len(example['context']) > 100 else f"Context: {example['context']}")
print(f"Response: {example['response'][:200]}..." if len(example['response']) > 200 else f"Response: {example['response']}")
print("-" * 40)
# Convert to pandas for easy exploration
dolly_df = dolly_dataset.to_pandas()
print("Basic statistics:")
print(dolly_df.describe(include="all"))
print()
# Compute text lengths
dolly_df["instruction_len"] = dolly_df["instruction"].str.len()
dolly_df["response_len"] = dolly_df["response"].str.len()
dolly_df["context_len"] = dolly_df["context"].str.len()
print("Text length statistics (characters):")
print(dolly_df[["instruction_len", "response_len", "context_len"]].describe().round(0))
6. Data Cleaning: Deduplication
Duplicate examples in training data cause the model to over-weight certain patterns, leading to memorization artifacts and reduced generalization. Always deduplicate before training.
# --- Exact Deduplication ---
# Hash each example and remove exact duplicates
def exact_dedup(dataset, text_field="instruction"):
"""
Remove exact duplicate examples by hashing a key field.
Args:
dataset: HuggingFace Dataset
text_field: field to use for deduplication key
Returns:
Deduplicated Dataset
"""
seen_hashes = set()
keep_indices = []
for idx, example in enumerate(dataset):
# Create a hash from the instruction field
text = example[text_field].strip().lower()
text_hash = hashlib.md5(text.encode()).hexdigest()
if text_hash not in seen_hashes:
seen_hashes.add(text_hash)
keep_indices.append(idx)
removed = len(dataset) - len(keep_indices)
print(f"Exact deduplication:")
print(f" Before: {len(dataset):,} examples")
print(f" After: {len(keep_indices):,} examples")
print(f" Removed: {removed:,} exact duplicates ({removed/len(dataset)*100:.1f}%)")
return dataset.select(keep_indices)
# Apply to dolly dataset
dolly_deduped = exact_dedup(dolly_dataset, text_field="instruction")
print()
# --- Near-Duplicate Detection (MinHash sketch) ---
# For true production use, install: pip install datasketch
# Below is a lightweight n-gram based similarity approach
def compute_ngram_fingerprint(text, n=5, num_shingles=100):
"""
Create a character n-gram fingerprint for fuzzy deduplication.
A lightweight alternative to MinHash for small datasets.
"""
text = text.lower().strip()
# Generate character n-grams
ngrams = [text[i:i+n] for i in range(len(text) - n + 1)]
if not ngrams:
return frozenset()
# Sample for efficiency on long texts
if len(ngrams) > num_shingles:
ngrams = random.sample(ngrams, num_shingles)
return frozenset(ngrams)
def jaccard_similarity(set_a, set_b):
"""Compute Jaccard similarity between two sets."""
if not set_a or not set_b:
return 0.0
intersection = len(set_a & set_b)
union = len(set_a | set_b)
return intersection / union
def fuzzy_dedup_small(dataset, text_field="instruction", threshold=0.8, max_comparisons=5000):
"""
Near-duplicate removal for smaller datasets (<10K examples).
For large datasets, use MinHash LSH (datasketch library).
Args:
threshold: Jaccard similarity above this = considered duplicate (0.8 = 80% similar)
"""
# Limit for demo purposes
sample_size = min(len(dataset), max_comparisons)
# Compute fingerprints
fingerprints = [
compute_ngram_fingerprint(dataset[i][text_field])
for i in range(sample_size)
]
removed_as_dup = set()
for i in range(sample_size):
if i in removed_as_dup:
continue
for j in range(i + 1, min(i + 50, sample_size)): # Compare with next 50 only
if j in removed_as_dup:
continue
sim = jaccard_similarity(fingerprints[i], fingerprints[j])
if sim >= threshold:
removed_as_dup.add(j)
keep_indices = [i for i in range(sample_size) if i not in removed_as_dup]
removed = sample_size - len(keep_indices)
print(f"Fuzzy deduplication (threshold={threshold}):")
print(f" Analyzed: {sample_size:,} examples")
print(f" Near-duplicates removed: {removed:,} ({removed/sample_size*100:.1f}%)")
print(f" Kept: {len(keep_indices):,} examples")
print()
print("For production with large datasets, use datasketch MinHash LSH:")
print(" from datasketch import MinHash, MinHashLSH")
return dataset.select(keep_indices)
# Demo fuzzy dedup on a small sample
dolly_fuzzy_deduped = fuzzy_dedup_small(
dolly_deduped.select(range(1000)), # Demo on 1000 examples
text_field="instruction",
threshold=0.85
)
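The MinHash approach recommended above for production works by hashing every shingle under many seeds and keeping the per-seed minimum; the fraction of matching minima between two signatures estimates their Jaccard similarity. A self-contained pure-Python sketch of that idea (illustrative only; use datasketch's MinHash LSH for real workloads):

```python
import hashlib

def minhash_signature(shingles, num_perm=64):
    """For each seed, keep the minimum hash over all shingles (the MinHash signature)."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles
        ))
    return sig

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = set("the quick brown fox jumps over the lazy dog".split())
b = set("the quick brown fox leaps over the lazy dog".split())
sig_a, sig_b = minhash_signature(a), minhash_signature(b)
print(f"True Jaccard:     {len(a & b) / len(a | b):.2f}")  # 0.78
print(f"MinHash estimate: {estimate_jaccard(sig_a, sig_b):.2f}")
```

The estimate tightens as num_perm grows; LSH then buckets signatures so that only likely duplicates are ever compared, avoiding the quadratic pairwise scan used in the demo above.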
7. Data Cleaning: Quality Filtering
# Quality filtering functions
def is_too_short(text, min_chars=20):
"""Flag examples with very short responses."""
return len(text.strip()) < min_chars
def is_too_long(text, max_chars=8000):
"""Flag examples with very long responses (often data errors)."""
return len(text.strip()) > max_chars
def has_high_repetition(text, threshold=0.3):
"""
Detect repetitive text (a common sign of model-generated garbage).
Returns True if the text has too many repeated n-grams.
"""
words = text.lower().split()
if len(words) < 10:
return False
# Count bigrams
bigrams = [(words[i], words[i+1]) for i in range(len(words)-1)]
bigram_counts = Counter(bigrams)
# If any bigram appears more than 30% of the time, flag as repetitive
most_common_count = bigram_counts.most_common(1)[0][1] if bigram_counts else 0
return most_common_count / len(bigrams) > threshold
def has_encoding_artifacts(text):
"""
Detect common encoding artifacts from web scraping.
"""
    artifacts = [
        r"â€",           # Mojibake: UTF-8 curly quotes decoded as Latin-1
        r"&amp;",        # Unescaped HTML entities
        r"&lt;",
        r"&gt;",
        r"\\n\\n\\n",    # Literal escaped newlines (not real newlines)
        r"</?[a-z]+>",   # Leftover HTML tags
    ]
for pattern in artifacts:
if re.search(pattern, text):
return True
return False
def is_mostly_non_ascii(text, threshold=0.3):
"""
Detect text that is mostly non-ASCII characters.
Useful for language filtering when training English models.
"""
if not text:
return False
non_ascii = sum(1 for c in text if ord(c) > 127)
return non_ascii / len(text) > threshold
def quality_filter(example, response_field="response", instruction_field="instruction"):
"""
Combined quality filter. Returns True if the example passes all checks.
Use with dataset.filter(quality_filter).
"""
instruction = example.get(instruction_field, "")
response = example.get(response_field, "")
# Skip empty examples
if not instruction or not response:
return False
# Response quality checks
if is_too_short(response, min_chars=30):
return False
if is_too_long(response, max_chars=10000):
return False
if has_high_repetition(response):
return False
if has_encoding_artifacts(response):
return False
# Instruction quality checks
if is_too_short(instruction, min_chars=10):
return False
return True
# Apply to dolly dataset
print(f"Before quality filtering: {len(dolly_deduped):,} examples")
dolly_filtered = dolly_deduped.filter(quality_filter)
print(f"After quality filtering: {len(dolly_filtered):,} examples")
removed = len(dolly_deduped) - len(dolly_filtered)
print(f"Removed: {removed:,} low-quality examples ({removed/len(dolly_deduped)*100:.1f}%)")
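The repetition heuristic used above is worth seeing in isolation: the share taken by the single most common word bigram separates natural text from looping, model-generated garbage. A self-contained sketch (re-implemented here so it runs on its own):

```python
from collections import Counter

def bigram_repetition_ratio(text):
    """Share of the most common word bigram among all bigrams in the text."""
    words = text.lower().split()
    bigrams = list(zip(words, words[1:]))
    if not bigrams:
        return 0.0
    return Counter(bigrams).most_common(1)[0][1] / len(bigrams)

clean = "the quick brown fox jumps over the lazy dog near the river bank"
looped = "very good very good very good very good very good very good"
print(f"clean:  {bigram_repetition_ratio(clean):.2f}")   # 0.08 -> passes a 0.3 threshold
print(f"looped: {bigram_repetition_ratio(looped):.2f}")  # 0.55 -> flagged as repetitive
```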
8. Data Cleaning: Length Filtering and Token Analysis
LLMs process text as tokens, not characters. You need to filter by token length to:
Remove examples that exceed your model's context window
Remove micro-responses (too short to be useful)
Understand your data's distribution before setting max_seq_length
# Load a tokenizer for length analysis
# GPT-2 tokenizer is a reasonable proxy for most modern models
# In production, use the actual model's tokenizer
print("Loading tokenizer for length analysis...")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(f"Tokenizer loaded: {tokenizer.__class__.__name__}")
print()
# Compute token counts for a sample of the dataset
SAMPLE_SIZE = 2000 # Use a sample for speed
sample = dolly_filtered.select(range(min(SAMPLE_SIZE, len(dolly_filtered))))
def count_tokens(example, tokenizer, field="response"):
"""Count tokens in a text field."""
text = example.get(field, "")
tokens = tokenizer.encode(text, add_special_tokens=False)
return {f"{field}_tokens": len(tokens)}
print(f"Computing token counts for {len(sample):,} examples...")
sample = sample.map(
lambda ex: count_tokens(ex, tokenizer, "instruction"),
desc="Counting instruction tokens"
)
sample = sample.map(
lambda ex: count_tokens(ex, tokenizer, "response"),
desc="Counting response tokens"
)
instruction_tokens = sample["instruction_tokens"]
response_tokens = sample["response_tokens"]
# Add combined length (instruction + response + some overhead)
total_tokens = [i + r + 10 for i, r in zip(instruction_tokens, response_tokens)]
print("\nToken Length Statistics:")
print(f"{'Metric':<25} {'Instruction':>12} {'Response':>12} {'Total':>12}")
print("-" * 65)
for stat_name, stat_fn in [("Mean", np.mean), ("Median", np.median), ("Std Dev", np.std),
("Min", np.min), ("Max", np.max),
("95th percentile", lambda x: np.percentile(x, 95)),
("99th percentile", lambda x: np.percentile(x, 99))]:
print(f"{stat_name:<25} {stat_fn(instruction_tokens):>12.0f} {stat_fn(response_tokens):>12.0f} {stat_fn(total_tokens):>12.0f}")
print()
# Recommend max_seq_length
p95 = np.percentile(total_tokens, 95)
p99 = np.percentile(total_tokens, 99)
print(f"Recommendation:")
print(f" max_seq_length = {int(p95):,} (covers 95% of examples, lower VRAM usage)")
print(f" max_seq_length = {int(p99):,} (covers 99% of examples, higher VRAM usage)")
# Apply length filtering
MAX_TOKENS = 2048 # Set based on your GPU memory and model context window
MIN_TOKENS = 10 # Filter out micro-responses
def length_filter(example, max_tokens=MAX_TOKENS, min_tokens=MIN_TOKENS):
"""
Filter examples by total token count.
Uses character count as a proxy (avoids tokenizing every example).
Roughly 1 token β 4 characters for English text.
"""
instruction = example.get("instruction", "")
response = example.get("response", "")
context = example.get("context", "")
# Character-based estimate (fast, no tokenizer needed)
total_chars = len(instruction) + len(response) + len(context)
estimated_tokens = total_chars // 4
response_chars = len(response)
estimated_response_tokens = response_chars // 4
return (
estimated_tokens <= max_tokens and
estimated_response_tokens >= min_tokens
)
print(f"Before length filtering: {len(dolly_filtered):,} examples")
dolly_length_filtered = dolly_filtered.filter(length_filter)
print(f"After length filtering: {len(dolly_length_filtered):,} examples")
removed = len(dolly_filtered) - len(dolly_length_filtered)
print(f"Removed: {removed:,} examples outside length bounds")
9. Creating Your Own Custom Dataset
Often you will need to create a dataset from scratch for your specific use case. Here are the main approaches.
# Method 1: Create from a list of dictionaries
custom_examples = [
{
"instruction": "What is the capital of France?",
"input": "",
"output": "The capital of France is Paris."
},
{
"instruction": "Write a Python function to check if a number is prime.",
"input": "",
"output": """def is_prime(n):
if n < 2:
return False
for i in range(2, int(n**0.5) + 1):
if n % i == 0:
return False
return True"""
},
{
"instruction": "Classify the sentiment of the following review.",
"input": "The product arrived on time and works perfectly. Very happy with my purchase!",
"output": "Positive"
},
{
"instruction": "Translate to Spanish.",
"input": "Good morning, how are you today?",
        "output": "Buenos días, ¿cómo estás hoy?"
},
{
"instruction": "Summarize the main idea in one sentence.",
"input": "Neural networks are computing systems inspired by biological neural networks. They learn to perform tasks by analyzing training examples, without being explicitly programmed with task-specific rules.",
"output": "Neural networks are brain-inspired computing systems that learn from examples rather than explicit programming."
},
]
custom_dataset = Dataset.from_list(custom_examples)
print(f"Custom dataset created: {len(custom_dataset)} examples")
print(f"Columns: {custom_dataset.column_names}")
print()
print("First example:")
print(json.dumps(custom_dataset[0], indent=2))
# Method 2: Create from a pandas DataFrame
import pandas as pd
# Simulate reading from a CSV/spreadsheet
data = {
"input_text": [
"Customer: My order hasn't arrived after 2 weeks.",
"Customer: I received the wrong item in my package.",
"Customer: Can I return this product for a refund?",
"Customer: The product is defective and stopped working.",
"Customer: I was charged twice for the same order.",
],
"expected_output": [
"I sincerely apologize for the delay! I'll immediately check your tracking number and escalate this to our shipping team. You'll receive an update within 24 hours, and if the item is lost, we'll send a replacement at no charge.",
"I'm sorry you received the wrong item! Please keep it, and I'll arrange for the correct item to be shipped to you today with express delivery at no additional cost.",
"Absolutely! Our return policy allows returns within 30 days of purchase. I'll email you a prepaid return label, and your refund will be processed within 3-5 business days of receiving the item.",
"I apologize for the defective product! This is covered under our warranty. I'll send you a replacement unit with priority shipping, or if you prefer, issue a full refund. Which would you like?",
"I'm very sorry about the duplicate charge! This was an error on our end. I'll process a refund for the duplicate charge immediately, and it should appear on your statement within 2-3 business days.",
],
"category": ["shipping", "wrong_item", "returns", "defective", "billing"]
}
df = pd.DataFrame(data)
# Convert to the instruction-response format
df["instruction"] = "You are a helpful customer service agent. Respond to the following customer complaint professionally and empathetically."
df["input"] = df["input_text"]
df["output"] = df["expected_output"]
# Create HuggingFace Dataset from DataFrame
customer_service_dataset = Dataset.from_pandas(df[["instruction", "input", "output", "category"]])
print(f"Customer service dataset: {len(customer_service_dataset)} examples")
print(f"Columns: {customer_service_dataset.column_names}")
print()
print("Example entry:")
example = customer_service_dataset[0]
print(f" Instruction: {example['instruction'][:80]}...")
print(f" Input: {example['input']}")
print(f" Output: {example['output'][:100]}...")
# Method 3: Load from JSONL file (common format for fine-tuning data)
import tempfile
import os
# Create a sample JSONL file
sample_jsonl_data = [
{"instruction": "What is machine learning?", "output": "Machine learning is a subset of AI where systems learn from data."},
{"instruction": "Explain gradient descent.", "output": "Gradient descent is an optimization algorithm that iteratively moves toward the minimum of a loss function."},
{"instruction": "What is overfitting?", "output": "Overfitting is when a model learns the training data too well, including noise, causing poor performance on new data."},
]
# Write to a temporary JSONL file
with tempfile.NamedTemporaryFile(mode='w', suffix='.jsonl', delete=False) as f:
for item in sample_jsonl_data:
f.write(json.dumps(item) + '\n')
jsonl_path = f.name
# Load from JSONL
dataset_from_jsonl = load_dataset("json", data_files=jsonl_path, split="train")
print(f"Loaded from JSONL: {len(dataset_from_jsonl)} examples")
print(f"Columns: {dataset_from_jsonl.column_names}")
# Clean up temp file
os.unlink(jsonl_path)
print("\nAll dataset creation methods demonstrated successfully.")
10. Data Augmentation Techniques
When you have a small dataset, augmentation can increase diversity and reduce overfitting. LLM-based augmentation is the most effective approach in 2026.
# Technique 1: Instruction Paraphrasing
# Generate multiple phrasings of the same instruction to increase diversity
def paraphrase_instruction(instruction, num_variations=3):
"""
Generate paraphrased versions of an instruction using template patterns.
In production, use an LLM (e.g., GPT-4, Claude) to generate better paraphrases:
prompt = f'Paraphrase this instruction {num_variations} different ways: {instruction}'
"""
# Template-based paraphrasing patterns
prefix_patterns = [
"Please {}",
"Can you {}?",
"I need you to {}",
"Your task is to {}",
"Help me {}",
]
# Simple heuristic: lower the instruction and add different prefixes
base = instruction.rstrip('.').rstrip('?').lower()
# Remove existing imperative prefixes
for prefix in ["please ", "can you ", "help me ", "write ", "create ", "generate "]:
if base.startswith(prefix):
base = base[len(prefix):]
break
variations = [instruction] # Always keep original
used_patterns = random.sample(prefix_patterns, min(num_variations, len(prefix_patterns)))
for pattern in used_patterns:
variation = pattern.format(base)
if variation != instruction:
variations.append(variation)
return variations[:num_variations + 1]
# Demonstrate
test_instruction = "Explain the concept of recursion in programming."
variations = paraphrase_instruction(test_instruction, num_variations=4)
print("Instruction paraphrasing example:")
print(f" Original: {variations[0]}")
for i, var in enumerate(variations[1:], 1):
print(f" Variation {i}: {var}")
# Technique 2: Augment dataset by expanding with paraphrased instructions
def augment_with_paraphrases(dataset, instruction_field="instruction",
response_field="output", num_variations=2):
"""
Expand dataset by adding paraphrased instruction variants.
Each original example generates num_variations additional examples.
"""
augmented_examples = []
for example in dataset:
original_instruction = example[instruction_field]
response = example[response_field]
# Keep original
augmented_examples.append({
"instruction": original_instruction,
"output": response,
"augmented": False
})
# Add paraphrased versions
variations = paraphrase_instruction(original_instruction, num_variations=num_variations)
for var in variations[1:][:num_variations]: # Skip original, take variations
augmented_examples.append({
"instruction": var,
"output": response,
"augmented": True
})
return Dataset.from_list(augmented_examples)
# Demo on small subset
small_sample = custom_dataset.select(range(min(5, len(custom_dataset))))
augmented = augment_with_paraphrases(small_sample, response_field="output")
print(f"Before augmentation: {len(small_sample)} examples")
print(f"After augmentation: {len(augmented)} examples")
print(f"Augmented examples: {sum(augmented['augmented'])}")
print()
print("Note: For production, use LLM-based paraphrasing (GPT-4, Claude API)")
print("for higher quality variations. Template paraphrasing is a quick baseline.")
# Technique 3: LLM-assisted data generation (production approach)
# This shows the pattern for using an LLM to generate training data
# Requires an API key - shown as a template
def generate_training_examples_with_llm(topic, num_examples=5, openai_client=None):
"""
Generate training examples using an LLM.
This is the production approach for building high-quality datasets.
Args:
topic: The topic/domain for example generation
num_examples: How many examples to generate
openai_client: Initialized OpenAI client (or None for demo)
"""
generation_prompt = f"""
Generate {num_examples} high-quality instruction-response pairs for fine-tuning a language model.
Topic: {topic}
Requirements:
- Instructions should be clear and unambiguous
- Responses should be accurate, helpful, and appropriately detailed
- Vary the complexity and style of instructions
- Include both simple and complex examples
Output as JSON array with format:
[{{"instruction": "...", "response": "..."}}]
"""
if openai_client is None:
# Demo mode: return synthetic examples
print(f"[Demo mode - would call LLM with prompt:]")
print(generation_prompt[:300], "...")
return [
{"instruction": f"Example instruction about {topic} #{i+1}",
"response": f"Example response about {topic} #{i+1}"}
for i in range(num_examples)
]
else:
# Real LLM call
response = openai_client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": generation_prompt}],
temperature=0.8
)
content = response.choices[0].message.content
# Parse JSON from response
json_match = re.search(r'\[.*\]', content, re.DOTALL)
if json_match:
return json.loads(json_match.group())
return []
# Demo the function
generated = generate_training_examples_with_llm("Python pandas data manipulation", num_examples=3)
print(f"\nGenerated {len(generated)} examples")
print("\nIn production:")
print(" 1. Use GPT-4, Claude, or another capable model as the data generator")
print(" 2. Generate 1000-5000 examples per topic")
print(" 3. Always review a sample manually before training")
print(" 4. Filter generated data with the same quality checks as human data")
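Step 4 above (filtering generated data) starts with basic schema validation before any quality checks run. A minimal sketch of that step (validate_generated_examples is a hypothetical helper, not part of any library):

```python
import json

def validate_generated_examples(raw_json):
    """Parse LLM output and keep only records with non-empty string
    'instruction' and 'response' fields; return [] on malformed JSON."""
    try:
        records = json.loads(raw_json)
    except json.JSONDecodeError:
        return []
    valid = []
    for rec in records:
        if (isinstance(rec, dict)
                and isinstance(rec.get("instruction"), str) and rec["instruction"].strip()
                and isinstance(rec.get("response"), str) and rec["response"].strip()):
            valid.append(rec)
    return valid

raw = '[{"instruction": "Explain joins.", "response": "A join combines rows from two tables."}, {"instruction": ""}]'
print(len(validate_generated_examples(raw)))  # 1 - the incomplete record is dropped
```

Records that survive this gate should then go through the same deduplication and quality filters applied to human-written data earlier in this notebook.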
11. Formatting Functions for Different Model Families
Each model family uses a different chat template. You must format your data to match the exact template the model was trained with; using a different format degrades performance significantly.
# Manual formatting functions for major model families
# These are useful for understanding the formats, but in practice
# use tokenizer.apply_chat_template() (shown in the next cell)
def format_for_llama3(instruction, response, system_prompt=None, input_text=None):
"""
Format for Llama 3.x models (Meta-Llama-3-8B-Instruct, Llama-3.1-*, Llama-3.3-*)
"""
system = system_prompt or "You are a helpful, harmless, and honest assistant."
user_content = f"{instruction}\n\n{input_text}" if input_text else instruction
return (
f"<|begin_of_text|>"
f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
f"<|start_header_id|>user<|end_header_id|>\n\n{user_content}<|eot_id|>"
f"<|start_header_id|>assistant<|end_header_id|>\n\n{response}<|eot_id|>"
)
def format_for_qwen(instruction, response, system_prompt=None, input_text=None):
"""
Format for Qwen 2.5 models (uses ChatML format)
Also works for: OpenHermes, Mistral (v0.3+), and many others
"""
system = system_prompt or "You are a helpful assistant."
user_content = f"{instruction}\n\n{input_text}" if input_text else instruction
return (
f"<|im_start|>system\n{system}<|im_end|>\n"
f"<|im_start|>user\n{user_content}<|im_end|>\n"
f"<|im_start|>assistant\n{response}<|im_end|>"
)
def format_for_phi4(instruction, response, system_prompt=None, input_text=None):
"""
Format for Phi-4 (Microsoft)
"""
system = system_prompt or "You are a helpful AI assistant."
user_content = f"{instruction}\n\n{input_text}" if input_text else instruction
return (
f"<|system|>\n{system}<|end|>\n"
f"<|user|>\n{user_content}<|end|>\n"
f"<|assistant|>\n{response}<|end|>"
)
def format_for_gemma2(instruction, response, input_text=None):
"""
Format for Gemma 2 (Google)
Note: Gemma 2 does not use a system prompt in its standard template
"""
user_content = f"{instruction}\n\n{input_text}" if input_text else instruction
return (
f"<start_of_turn>user\n{user_content}<end_of_turn>\n"
f"<start_of_turn>model\n{response}<end_of_turn>"
)
# Demonstrate all formats with the same example
test_instruction = "What is the difference between a list and a tuple in Python?"
test_response = "Lists are mutable (can be changed), while tuples are immutable (cannot be changed after creation). Lists use square brackets [], tuples use parentheses (). Tuples are slightly faster and use less memory."
formatters = [
("Llama 3.x", format_for_llama3),
("Qwen 2.5 / ChatML", format_for_qwen),
("Phi-4", format_for_phi4),
("Gemma 2", format_for_gemma2),
]
for model_name, formatter in formatters:
print(f"\n{'='*60}")
print(f"Format: {model_name}")
print("='*60")
if model_name == "Gemma 2":
formatted = formatter(test_instruction, test_response)
else:
formatted = formatter(test_instruction, test_response)
print(formatted)
# The RECOMMENDED approach: use tokenizer.apply_chat_template()
# This is the modern, correct way - it uses the exact template the model was designed for
def format_with_chat_template(example, tokenizer, response_field="response",
instruction_field="instruction",
input_field="input",
system_prompt=None):
"""
Format a dataset example using the tokenizer's built-in chat template.
This is the correct approach for any Hugging Face model.
Args:
example: Dataset example dict
tokenizer: The model's tokenizer with chat_template defined
response_field: Name of the response/output field
instruction_field: Name of the instruction field
input_field: Name of the input context field
system_prompt: Optional system prompt string
Returns:
Dict with 'text' key containing formatted string
"""
instruction = example.get(instruction_field, "")
response = example.get(response_field, "")
input_text = example.get(input_field, "")
# Build user content
if input_text and input_text.strip():
user_content = f"{instruction}\n\n{input_text}"
else:
user_content = instruction
# Build messages list
messages = []
if system_prompt:
messages.append({"role": "system", "content": system_prompt})
messages.append({"role": "user", "content": user_content})
messages.append({"role": "assistant", "content": response})
# Apply the tokenizer's chat template
text = tokenizer.apply_chat_template(
messages,
tokenize=False, # Return string, not token IDs
add_generation_prompt=False  # Don't append an empty assistant header; the response is already in messages
)
return {"text": text}
# This cell prints the recommended usage instead of downloading a tokenizer.
# In production, load the actual model's tokenizer, e.g.:
# tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
print("apply_chat_template() demo:")
print()
print("In production, load the actual model's tokenizer:")
print(" # For Qwen 2.5:")
print(" tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-7B-Instruct')")
print()
print(" # For Llama 3.3:")
print(" tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-3.3-70B-Instruct')")
print()
print(" # Then apply to dataset:")
print(" dataset = dataset.map(lambda ex: format_with_chat_template(ex, tokenizer))")
print()
print("The tokenizer's chat_template ensures correct special tokens are used.")
print("Never manually construct chat templates for models you plan to use in production.")
# Demonstrate the messages structure
sample_messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain gradient descent in one paragraph."},
{"role": "assistant", "content": "Gradient descent is an optimization algorithm that iteratively adjusts model parameters in the direction that reduces a loss function."}
]
print()
print("Messages format (model-agnostic):")
print(json.dumps(sample_messages, indent=2))
12. Train/Validation Split Best Practices
# Best practices for train/validation splitting
def create_splits(dataset, train_size=0.9, val_size=0.1, seed=42, stratify_field=None):
"""
Create train/validation splits with optional stratification.
Args:
dataset: HuggingFace Dataset
train_size: Fraction for training (default 0.9)
val_size: Fraction for validation (default 0.1)
seed: Random seed for reproducibility
stratify_field: Column name to stratify by (ensures balanced categories)
Returns:
DatasetDict with 'train' and 'validation' keys
"""
assert abs(train_size + val_size - 1.0) < 1e-6, "train_size + val_size must equal 1.0"
if stratify_field and stratify_field in dataset.column_names:
# Stratified split: preserve category proportions in both splits
# Get unique categories
categories = sorted(set(dataset[stratify_field]))  # sorted for run-to-run reproducibility
train_indices = []
val_indices = []
rng = random.Random(seed)
# Read the stratify column once; iterating full row dicts is slow on large datasets
strat_values = dataset[stratify_field]
for cat in categories:
# Get all indices for this category
cat_indices = [i for i, c in enumerate(strat_values) if c == cat]
rng.shuffle(cat_indices)
# Split proportionally
split_idx = max(1, int(len(cat_indices) * train_size))
train_indices.extend(cat_indices[:split_idx])
val_indices.extend(cat_indices[split_idx:])
# Shuffle final indices
rng.shuffle(train_indices)
rng.shuffle(val_indices)
return DatasetDict({
"train": dataset.select(train_indices),
"validation": dataset.select(val_indices)
})
else:
# Simple random split
split = dataset.train_test_split(
test_size=val_size,
seed=seed
)
return DatasetDict({
"train": split["train"],
"validation": split["test"]
})
# Apply to dolly dataset with stratification
print("Creating stratified train/validation split...")
dataset_splits = create_splits(
dolly_length_filtered,
train_size=0.9,
val_size=0.1,
seed=42,
stratify_field="category"
)
print(f"Total examples: {len(dolly_length_filtered):,}")
print(f"Train split: {len(dataset_splits['train']):,} ({len(dataset_splits['train'])/len(dolly_length_filtered)*100:.1f}%)")
print(f"Val split: {len(dataset_splits['validation']):,} ({len(dataset_splits['validation'])/len(dolly_length_filtered)*100:.1f}%)")
print()
# Verify category distribution is maintained
print("Category distribution in splits:")
train_cats = Counter(dataset_splits['train']['category'])
val_cats = Counter(dataset_splits['validation']['category'])
all_cats = sorted(set(list(train_cats.keys()) + list(val_cats.keys())))
print(f"{'Category':<35} {'Train':>8} {'Val':>8} {'Train%':>8}")
print("-" * 62)
for cat in all_cats:
t = train_cats.get(cat, 0)
v = val_cats.get(cat, 0)
total = t + v
pct = t / total * 100 if total > 0 else 0
print(f"{cat:<35} {t:>8} {v:>8} {pct:>7.1f}%")
# Validation set size guidelines
split_guidelines = [
("< 500 examples", "20% val (100 min)", "You need enough validation examples to get reliable metrics"),
("500β2,000 examples", "15% val", "Balance between training data and reliable evaluation"),
("2Kβ10K examples", "10% val", "Standard split for most fine-tuning tasks"),
("10Kβ100K examples", "5% val", "Large enough training set, smaller val% is OK"),
("> 100K examples", "1β2% val", "Even 1K validation examples is plenty for metrics"),
]
print("Train/Validation Split Guidelines:")
print()
print(f"{'Dataset Size':<25} {'Val Split':>12} Notes")
print("-" * 75)
for size, split, notes in split_guidelines:
print(f"{size:<25} {split:>12} {notes}")
print()
print("IMPORTANT: Never evaluate on training data.")
print("IMPORTANT: Choose validation examples that represent real-world usage.")
print("IMPORTANT: Keep test set completely separate (do not tune on it).")
13. Dataset Statistics and Visualization
# Comprehensive dataset visualization
fig = plt.figure(figsize=(16, 12))
fig.suptitle("Dataset Analysis: Databricks Dolly-15K (Processed)", fontsize=15, fontweight="bold", y=0.98)
gs = gridspec.GridSpec(2, 3, figure=fig, hspace=0.4, wspace=0.35)
# Use the filtered dataset for visualization
viz_df = dolly_length_filtered.to_pandas()
viz_df["instruction_chars"] = viz_df["instruction"].str.len()
viz_df["response_chars"] = viz_df["response"].str.len()
viz_df["context_chars"] = viz_df["context"].str.len()
# Estimate token lengths
viz_df["instruction_tokens"] = (viz_df["instruction_chars"] / 4).astype(int)
viz_df["response_tokens"] = (viz_df["response_chars"] / 4).astype(int)
viz_df["total_tokens"] = viz_df["instruction_tokens"] + viz_df["response_tokens"]
# --- Plot 1: Response token length distribution ---
ax1 = fig.add_subplot(gs[0, 0])
ax1.hist(viz_df["response_tokens"], bins=50, color="steelblue", alpha=0.85, edgecolor="white")
ax1.axvline(viz_df["response_tokens"].median(), color="red", linestyle="--", linewidth=1.5,
label=f"Median: {viz_df['response_tokens'].median():.0f}")
ax1.axvline(viz_df["response_tokens"].quantile(0.95), color="orange", linestyle=":",
linewidth=1.5, label=f"P95: {viz_df['response_tokens'].quantile(0.95):.0f}")
ax1.set_xlabel("Response Length (est. tokens)")
ax1.set_ylabel("Count")
ax1.set_title("Response Length Distribution")
ax1.legend(fontsize=8)
# --- Plot 2: Instruction token length distribution ---
ax2 = fig.add_subplot(gs[0, 1])
ax2.hist(viz_df["instruction_tokens"], bins=40, color="seagreen", alpha=0.85, edgecolor="white")
ax2.axvline(viz_df["instruction_tokens"].median(), color="red", linestyle="--", linewidth=1.5,
label=f"Median: {viz_df['instruction_tokens'].median():.0f}")
ax2.set_xlabel("Instruction Length (est. tokens)")
ax2.set_ylabel("Count")
ax2.set_title("Instruction Length Distribution")
ax2.legend(fontsize=8)
# --- Plot 3: Category distribution (horizontal bar) ---
ax3 = fig.add_subplot(gs[0, 2])
cat_counts = viz_df["category"].value_counts()
colors = plt.cm.Set3(np.linspace(0, 1, len(cat_counts)))
bars = ax3.barh(range(len(cat_counts)), cat_counts.values, color=colors)
ax3.set_yticks(range(len(cat_counts)))
ax3.set_yticklabels(cat_counts.index, fontsize=8)
ax3.set_xlabel("Number of Examples")
ax3.set_title("Category Distribution")
for bar, val in zip(bars, cat_counts.values):
ax3.text(bar.get_width() + 5, bar.get_y() + bar.get_height()/2,
str(val), va='center', fontsize=7)
# --- Plot 4: Total token length (training sequence length) ---
ax4 = fig.add_subplot(gs[1, 0])
ax4.hist(viz_df["total_tokens"], bins=50, color="coral", alpha=0.85, edgecolor="white")
for pct, color, label in [(50, "blue", "P50"), (90, "orange", "P90"), (95, "red", "P95")]:
val = viz_df["total_tokens"].quantile(pct/100)
ax4.axvline(val, color=color, linestyle="--", linewidth=1.5,
label=f"{label}: {val:.0f} tokens")
ax4.set_xlabel("Total Sequence Length (est. tokens)")
ax4.set_ylabel("Count")
ax4.set_title("Total Sequence Length\n(Instruction + Response)")
ax4.legend(fontsize=7)
# --- Plot 5: Scatter: instruction length vs response length ---
ax5 = fig.add_subplot(gs[1, 1])
sample_mask = viz_df.sample(min(1000, len(viz_df)), random_state=42).index
ax5.scatter(
viz_df.loc[sample_mask, "instruction_tokens"],
viz_df.loc[sample_mask, "response_tokens"],
alpha=0.3, s=10, color="purple"
)
ax5.set_xlabel("Instruction Tokens")
ax5.set_ylabel("Response Tokens")
ax5.set_title("Instruction vs Response Length\n(1K random sample)")
# --- Plot 6: Context length distribution (examples with context) ---
ax6 = fig.add_subplot(gs[1, 2])
has_context = viz_df[viz_df["context_chars"] > 0]
no_context_count = len(viz_df) - len(has_context)
pie_data = [len(has_context), no_context_count]
pie_labels = [f"With context\n({len(has_context):,})", f"No context\n({no_context_count:,})"]
ax6.pie(pie_data, labels=pie_labels, colors=["steelblue", "lightgray"],
autopct="%1.1f%%", startangle=90)
ax6.set_title("Examples With / Without Context")
plt.savefig("/tmp/dataset_analysis.png", dpi=120, bbox_inches="tight")
plt.show()
print("Dataset analysis visualization complete.")
print("Saved to /tmp/dataset_analysis.png")
# Print a comprehensive statistics summary
def print_dataset_summary(dataset, name="Dataset", response_field="response",
instruction_field="instruction"):
"""
Print a formatted summary of dataset statistics.
"""
df = dataset.to_pandas() if not isinstance(dataset, pd.DataFrame) else dataset
print(f"\n{'='*60}")
print(f"DATASET SUMMARY: {name}")
print(f"{'='*60}")
print(f"Total examples: {len(df):,}")
print(f"Columns: {list(df.columns)}")
if response_field in df.columns:
resp_chars = df[response_field].str.len()
resp_tokens = (resp_chars / 4).astype(int)
print(f"\nResponse Statistics:")
print(f" Avg length: {resp_chars.mean():.0f} chars / {resp_tokens.mean():.0f} tokens")
print(f" Median length: {resp_chars.median():.0f} chars / {resp_tokens.median():.0f} tokens")
print(f" Min length: {resp_chars.min():.0f} chars / {resp_tokens.min():.0f} tokens")
print(f" Max length: {resp_chars.max():.0f} chars / {resp_tokens.max():.0f} tokens")
print(f" P95 length: {resp_chars.quantile(0.95):.0f} chars / {resp_tokens.quantile(0.95):.0f} tokens")
if instruction_field in df.columns:
instr_chars = df[instruction_field].str.len()
print(f"\nInstruction Statistics:")
print(f" Avg length: {instr_chars.mean():.0f} chars")
print(f" Median length: {instr_chars.median():.0f} chars")
if "category" in df.columns:
print(f"\nCategory Distribution:")
for cat, count in df["category"].value_counts().items():
pct = count / len(df) * 100
print(f" {cat:<35} {count:>6} ({pct:.1f}%)")
print(f"{'='*60}\n")
print_dataset_summary(dolly_length_filtered, name="Dolly-15K (Processed)")
print_dataset_summary(dataset_splits['train'], name="Train Split")
print_dataset_summary(dataset_splits['validation'], name="Validation Split")
14. Saving and Versioning Datasets
# Saving datasets: multiple formats
import os
SAVE_DIR = "/tmp/finetuning_dataset"
os.makedirs(SAVE_DIR, exist_ok=True)
# --- Method 1: Save as Hugging Face Dataset (Arrow format) ---
# Best for: Fast loading in Python, large datasets, used with .load_from_disk()
save_path_arrow = f"{SAVE_DIR}/dolly_processed"
dataset_splits.save_to_disk(save_path_arrow)
print(f"Saved as Arrow format to: {save_path_arrow}")
print(f" Files: {os.listdir(save_path_arrow)}")
# Reload
from datasets import load_from_disk
reloaded = load_from_disk(save_path_arrow)
print(f" Reloaded successfully: train={len(reloaded['train']):,}, val={len(reloaded['validation']):,}")
print()
# --- Method 2: Save as JSONL (most portable format) ---
# Best for: Sharing with others, version control, inspecting with text editors
def save_as_jsonl(dataset, filepath):
"""Save a HuggingFace Dataset as JSONL."""
with open(filepath, 'w', encoding='utf-8') as f:
for example in dataset:
f.write(json.dumps(example, ensure_ascii=False) + '\n')
size_mb = os.path.getsize(filepath) / (1024 * 1024)
print(f"Saved {len(dataset):,} examples to {filepath} ({size_mb:.2f} MB)")
train_jsonl_path = f"{SAVE_DIR}/train.jsonl"
val_jsonl_path = f"{SAVE_DIR}/validation.jsonl"
save_as_jsonl(dataset_splits['train'], train_jsonl_path)
save_as_jsonl(dataset_splits['validation'], val_jsonl_path)
# Reload from JSONL
reloaded_train = load_dataset("json", data_files=train_jsonl_path, split="train")
print(f"Reloaded from JSONL: {len(reloaded_train):,} examples")
print()
# --- Method 3: Push to Hugging Face Hub ---
# Best for: Sharing publicly, collaboration, dataset versioning
# Requires: huggingface-cli login or HF_TOKEN environment variable
print("Pushing to Hugging Face Hub (template β requires HF token):")
print()
print("""
# First, login:
# huggingface-cli login
# OR set environment variable: export HF_TOKEN=your_token
# Push full DatasetDict:
dataset_splits.push_to_hub(
repo_id="your-username/dolly-15k-processed",
private=True # Set False to make public
)
# Push just the training split:
dataset_splits['train'].push_to_hub(
repo_id="your-username/dolly-15k-processed",
split="train",
private=True
)
# Load it back from Hub:
from datasets import load_dataset
dataset = load_dataset("your-username/dolly-15k-processed")
""")
print()
# --- Method 4: Dataset card / versioning best practices ---
print("Dataset versioning best practices:")
print(" 1. Use semantic versioning: v1.0.0, v1.1.0, v2.0.0")
print(" 2. Document the processing steps (what was filtered, why)")
print(" 3. Save the exact filtering code alongside the data")
print(" 4. Log data statistics (size, length distribution, categories)")
print(" 5. Record the tokenizer used for length filtering")
print(" 6. Store train/val split seeds for reproducibility")
# Save a metadata file alongside the dataset
metadata = {
"version": "1.0.0",
"base_dataset": "databricks/databricks-dolly-15k",
"processing_date": "2026-02-28",
"total_examples": len(dolly_length_filtered),
"train_examples": len(dataset_splits['train']),
"val_examples": len(dataset_splits['validation']),
"train_val_split": "90/10 stratified by category",
"random_seed": 42,
"processing_steps": [
"exact_deduplication",
"quality_filtering_min_response_30_chars",
"quality_filtering_max_response_10000_chars",
"repetition_filter_bigram_threshold_0.3",
"length_filter_max_2048_estimated_tokens",
"stratified_train_val_split"
],
"tokenizer_for_length_estimate": "gpt2 (4 chars per token approximation)"
}
metadata_path = f"{SAVE_DIR}/dataset_metadata.json"
with open(metadata_path, 'w') as f:
json.dump(metadata, f, indent=2)
print(f"\nMetadata saved to: {metadata_path}")
print(json.dumps(metadata, indent=2))
15. Complete Dataset Preparation Pipeline
Putting it all together into a reusable function.
def prepare_dataset(
dataset_id,
instruction_field="instruction",
response_field="response",
input_field=None,
category_field=None,
min_response_chars=30,
max_total_tokens=2048,
train_size=0.9,
seed=42,
save_dir=None,
):
"""
Complete dataset preparation pipeline.
Applies: deduplication → quality filtering → length filtering → train/val split.
Args:
dataset_id: HuggingFace dataset ID (e.g., 'databricks/databricks-dolly-15k')
instruction_field: Column name for instructions
response_field: Column name for responses/outputs
input_field: Column name for input context (optional)
category_field: Column to stratify splits by (optional)
min_response_chars: Minimum response character length
max_total_tokens: Maximum total token estimate
train_size: Fraction of data for training
seed: Random seed
save_dir: Directory to save processed dataset (optional)
Returns:
DatasetDict with 'train' and 'validation' keys
"""
print(f"[1/6] Loading dataset: {dataset_id}")
raw_dataset = load_dataset(dataset_id, split="train")
print(f" Loaded {len(raw_dataset):,} examples")
print(f"[2/6] Exact deduplication on '{instruction_field}'")
deduped = exact_dedup(raw_dataset, text_field=instruction_field)
print(f"[3/6] Quality filtering")
filtered = deduped.filter(
lambda ex: quality_filter(ex,
response_field=response_field,
instruction_field=instruction_field)
)
print(f" After filtering: {len(filtered):,} examples")
print(f"[4/6] Length filtering (max ~{max_total_tokens} tokens)")
max_chars = max_total_tokens * 4 # chars per token estimate
length_filtered = filtered.filter(
lambda ex: (
len(ex.get(instruction_field, "")) +
len(ex.get(response_field, "")) +
len(ex.get(input_field or "", ""))
) <= max_chars and len(ex.get(response_field, "")) >= min_response_chars
)
print(f" After length filtering: {len(length_filtered):,} examples")
print(f"[5/6] Creating train/validation split (train={train_size:.0%})")
splits = create_splits(
length_filtered,
train_size=train_size,
val_size=1.0 - train_size,
seed=seed,
stratify_field=category_field
)
print(f" Train: {len(splits['train']):,}, Validation: {len(splits['validation']):,}")
if save_dir:
print(f"[6/6] Saving to disk: {save_dir}")
os.makedirs(save_dir, exist_ok=True)
splits.save_to_disk(save_dir)
print(f" Saved successfully")
else:
print(f"[6/6] Skipping save (no save_dir provided)")
print(f"\nDataset preparation complete!")
print(f" Final: {len(splits['train']):,} train + {len(splits['validation']):,} val")
print(f" Original: {len(raw_dataset):,} β Final: {len(splits['train']) + len(splits['validation']):,} ({(len(splits['train']) + len(splits['validation']))/len(raw_dataset)*100:.1f}% retained)")
return splits
# Run the complete pipeline on Dolly
print("Running complete dataset preparation pipeline...")
print()
final_dataset = prepare_dataset(
dataset_id="databricks/databricks-dolly-15k",
instruction_field="instruction",
response_field="response",
input_field="context",
category_field="category",
min_response_chars=30,
max_total_tokens=2048,
train_size=0.9,
seed=42,
save_dir=f"{SAVE_DIR}/final_processed"
)
16. Key Takeaways
What You Learned in This Notebook
Data quality beats data quantity: 1,000 excellent examples outperform 100,000 mediocre ones, as the LIMA paper demonstrated. Always manually inspect your data.
Three dataset formats for different objectives:
Instruction-response (Alpaca): for teaching instruction following
Conversations (ChatML): for multi-turn chat
Preference pairs: for DPO alignment
Deduplication is essential: exact hash dedup plus optional fuzzy dedup prevents your model from memorizing specific examples.
Always filter by token length before training. Examples that exceed max_seq_length are silently truncated, which causes training instability.
Use tokenizer.apply_chat_template() and never manually construct chat templates; each model family has unique special tokens that must be exact.
Stratified splits preserve category balance between train and validation sets, which is critical for accurate evaluation.
Document everything: save metadata with your dataset so you can reproduce it later.
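To make the length-filtering takeaway concrete: before training, count tokens with the model's own tokenizer instead of the 4-chars-per-token estimate used in this notebook. A small sketch; it only assumes `tokenizer` exposes an `encode` method, and the field names follow the Dolly schema used here:

```python
def within_token_budget(example, tokenizer, max_tokens=2048,
                        fields=("instruction", "context", "response")):
    """True if the concatenated fields fit within max_tokens, counted exactly."""
    text = "\n\n".join(example.get(f) or "" for f in fields)
    return len(tokenizer.encode(text)) <= max_tokens

# Usage with a real tokenizer (downloads the model's tokenizer files):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
# dataset = dataset.filter(lambda ex: within_token_budget(ex, tok))
```

Exact counting is slower than the character heuristic, so a common pattern is to filter roughly by characters first and run the exact check only near the boundary.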
Common Pitfalls to Avoid
Forgetting to set labels to -100 for prompt tokens (you should only train on the response)
Using the wrong chat template for your model
Not deduplicating (leads to memorization)
Evaluating on training data
Including PII (personally identifiable information) in training data
Not shuffling before splitting
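The first pitfall deserves a sketch. When building training tensors, labels for prompt tokens are set to -100 (the index PyTorch's cross-entropy loss ignores) so the loss is computed only on the response. `prompt_ids` and `response_ids` below stand in for already-tokenized ID lists:

```python
IGNORE_INDEX = -100  # positions with this label are skipped by cross-entropy loss

def build_labels(prompt_ids, response_ids):
    """Train only on the response: mask prompt positions with -100."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

ids, labels = build_labels([10, 11, 12], [20, 21])
print(ids)     # [10, 11, 12, 20, 21]
print(labels)  # [-100, -100, -100, 20, 21]
```

Trainers such as TRL's SFTTrainer can apply this masking for you, but knowing what it does makes loss curves much easier to interpret.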
Next Steps
You now have a clean, formatted dataset ready for fine-tuning.
Next: 02_supervised_finetuning.ipynb
You will learn how to:
Set up TRL's SFTTrainer for instruction fine-tuning
Configure training hyperparameters
Monitor training with W&B (Weights & Biases)
Run your first full fine-tuning job
Save and evaluate the fine-tuned model
Zero-to-AI Curriculum – Phase 12: LLM Fine-tuning
Last updated: February 2026