Dataset Preparation for LLM Fine-tuning
Phase 12 – Notebook 01
Before you write a single line of training code, you need a high-quality dataset. This notebook covers everything you need to know about preparing data for LLM fine-tuning, from raw data to training-ready format.
What You Will Learn
Why data quality matters more than quantity
The three main dataset formats: instruction-response, conversations, preference pairs
How to load and explore datasets from Hugging Face Hub
Data cleaning: deduplication, quality filtering, length filtering
How to build a custom dataset from scratch
Data augmentation techniques for small datasets
Formatting functions for Llama 3, Qwen 2.5, Phi-4, and Gemma 2
Train/validation split best practices
Saving and versioning your datasets
Prerequisite: Complete 00_START_HERE.ipynb before this notebook.
1. Install Dependencies
# Install required libraries
# Uncomment and run if in a fresh environment (e.g., Colab)
# !pip install -q datasets transformers pandas numpy matplotlib tqdm
# !pip install -q datasketch # For fuzzy deduplication
print("Dependencies ready. Uncomment the lines above if you need to install them.")
2. Imports
import json
import re
import hashlib
import random
import warnings
from collections import Counter, defaultdict
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from datasets import (
Dataset,
DatasetDict,
load_dataset,
concatenate_datasets,
)
from transformers import AutoTokenizer
warnings.filterwarnings("ignore")
random.seed(42)
np.random.seed(42)
print("All imports successful.")
print()
# Verify key library versions
import datasets
import transformers
print(f"datasets: {datasets.__version__}")
print(f"transformers: {transformers.__version__}")
print(f"pandas: {pd.__version__}")
print(f"numpy: {np.__version__}")
3. Why Data Quality Matters More Than Quantity
The single most important factor in fine-tuning success is data quality, not dataset size.
The Research Evidence
LIMA (2023): Fine-tuning Llama on just 1,000 carefully curated examples matched models trained on 50,000+ examples. The paper's conclusion: "Almost all knowledge in large language models is learned during pretraining, with only limited instruction tuning data necessary to teach models to produce high quality output."
Alpaca (2023): 52,000 instruction-following examples generated by GPT-3.5. Quality was mediocre, leading to a model with noticeable hallucinations and inconsistencies.
Orca (2023): Fewer examples, but with chain-of-thought reasoning traces, which produced significantly better model behavior.
The Garbage In, Garbage Out Principle
Low-quality data effects:
- Model learns incorrect facts
- Inconsistent response style
- Hallucinations increase
- Model behavior becomes unpredictable
- Wasted GPU compute
High-quality data effects:
- Consistent, reliable responses
- Correct factual information
- Predictable behavior
- Better generalization
- Faster convergence (fewer epochs needed)
Practical Guidance on Dataset Size
| Task Complexity | Minimum Examples | Sweet Spot | Diminishing Returns |
|---|---|---|---|
| Simple formatting | 100–200 | 500–1K | >5K |
| Domain adaptation | 500–1K | 2K–10K | >50K |
| Instruction following | 1K–2K | 5K–20K | >100K |
| Complex reasoning | 2K–5K | 10K–50K | >200K |
| Alignment (DPO) | 500 pairs | 2K–10K pairs | >50K pairs |
Rule of thumb: If you have 1,000 excellent examples, you will almost certainly outperform 100,000 mediocre ones.
4. Dataset Formats
There are three primary dataset formats for LLM fine-tuning. The format you need depends on your fine-tuning objective.
Format 1: Instruction-Response (Alpaca Format)
Used for: Supervised fine-tuning (SFT) to teach instruction following.
{
"instruction": "Translate the following English text to French.",
"input": "The weather is beautiful today.",
"output": "Le temps est magnifique aujourd'hui."
}
The input field is optional (for tasks that need context beyond the instruction).
Format 2: Conversations (ChatML Format)
Used for: Supervised fine-tuning (SFT) on multi-turn dialogue.
{
  "conversations": [
    {"role": "user", "content": "What is overfitting?"},
    {"role": "assistant", "content": "Overfitting is when a model learns the training data too well, including noise, causing poor performance on new data."}
  ]
}
Each message has a role (system, user, or assistant) and content; a conversation can span multiple turns.
Format 3: Preference Pairs (DPO Format)
Used for: Direct Preference Optimization (DPO) to align with human preferences.
{
"prompt": "Explain why the sky is blue.",
"chosen": "The sky appears blue due to Rayleigh scattering. When sunlight enters Earth's atmosphere, it collides with gas molecules. Blue light has a shorter wavelength and scatters more than red light, so we see a blue sky.",
"rejected": "The sky is blue because that's just how it is. God made it blue or something like that."
}
# Demonstrate all three formats as Python objects
# Format 1: Instruction-Response (Alpaca)
alpaca_example = {
"instruction": "Summarize the following text in one sentence.",
"input": """Machine learning is a branch of artificial intelligence that allows
computers to learn from data without being explicitly programmed. It uses
algorithms to analyze patterns and make decisions with minimal human intervention.""",
"output": "Machine learning is an AI branch where computers learn patterns from data to make decisions automatically."
}
# Format 2: Conversations (ChatML)
chatml_example = {
"conversations": [
{"role": "system", "content": "You are an expert data scientist."},
{"role": "user", "content": "What's the difference between precision and recall?"},
{"role": "assistant", "content": "Precision measures how many of your positive predictions were actually positive (avoiding false alarms). Recall measures how many actual positives you correctly identified (avoiding misses). High precision = few false positives. High recall = few false negatives."},
{"role": "user", "content": "When would I prioritize one over the other?"},
{"role": "assistant", "content": "Prioritize recall when missing a positive is costly (e.g., cancer screening β missing a case is worse than a false alarm). Prioritize precision when false positives are costly (e.g., spam detection β legitimate emails in spam folder is annoying)."}
]
}
# Format 3: Preference Pairs (DPO)
dpo_example = {
"prompt": "How do I get better at programming?",
"chosen": """To improve at programming: (1) Build real projects β don't just follow tutorials.
(2) Read others' code on GitHub. (3) Practice daily, even 30 minutes.
(4) Struggle with problems before looking up answers. (5) Join communities (Reddit, Discord, Stack Overflow).
The key is deliberate practice with feedback loops.""",
"rejected": "Just watch YouTube videos and copy the code. Do a lot of tutorials."
}
print("=== Format 1: Instruction-Response (Alpaca) ===")
print(json.dumps(alpaca_example, indent=2))
print()
print("=== Format 2: Conversations (ChatML) ===")
print(json.dumps(chatml_example, indent=2))
print()
print("=== Format 3: Preference Pairs (DPO) ===")
print(json.dumps(dpo_example, indent=2))
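For single-turn SFT, all three formats carry the same information, so converting between them is mechanical. A minimal sketch of the Alpaca-to-ChatML direction (the helper name alpaca_to_chatml is ours, not a library function):

```python
def alpaca_to_chatml(example, system_prompt="You are a helpful assistant."):
    """Convert an Alpaca-style record into the ChatML conversations format."""
    user_content = example["instruction"]
    if example.get("input"):
        # Fold the optional input/context into the user turn
        user_content += "\n\n" + example["input"]
    return {
        "conversations": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": example["output"]},
        ]
    }

record = {
    "instruction": "Translate the following English text to French.",
    "input": "The weather is beautiful today.",
    "output": "Le temps est magnifique aujourd'hui.",
}
converted = alpaca_to_chatml(record)
print(converted["conversations"][1]["content"])
```

The reverse direction (ChatML to Alpaca) only works cleanly for single-turn conversations; multi-turn data should stay in the conversations format.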
5. Loading Datasets from Hugging Face Hub
The Hugging Face Hub hosts thousands of fine-tuning datasets. Here are popular datasets for each use case.
# Popular fine-tuning datasets reference
popular_datasets = {
"Instruction Following": [
("databricks/databricks-dolly-15k", "15K human-written instruction pairs, open license"),
("tatsu-lab/alpaca", "52K GPT-3.5 generated, Alpaca format"),
("HuggingFaceH4/ultrachat_200k", "200K multi-turn conversations (filtered)"),
("teknium/OpenHermes-2.5", "1M+ GPT-4 quality instruction pairs"),
],
"Coding": [
("iamtarun/python_code_instructions_18k_alpaca", "18K Python coding instructions"),
("codeparrot/github-code", "Large code corpus for continual pre-training"),
("bigcode/the-stack-smol", "Permissive-license code, many languages"),
],
"Math & Reasoning": [
("gsm8k", "8.5K grade school math word problems with solutions"),
("lighteval/MATH", "12.5K competition math problems"),
("nvidia/OpenMathInstruct-2", "14M math instruction pairs from Llama 3"),
],
"Alignment (DPO)": [
("Anthropic/hh-rlhf", "Human preference data from Anthropic"),
("HuggingFaceH4/ultrafeedback_binarized", "250K preference pairs, high quality"),
("argilla/dpo-mix-7k", "7K curated DPO pairs"),
],
}
print("Popular Fine-tuning Datasets on Hugging Face Hub")
print("=" * 60)
for category, datasets_list in popular_datasets.items():
print(f"\n{category}:")
for dataset_id, description in datasets_list:
print(f" {dataset_id}")
print(f" -> {description}")
# Load the Dolly dataset, one of the best open-license instruction datasets
print("Loading databricks/databricks-dolly-15k...")
dolly_dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
print(f"Dataset size: {len(dolly_dataset):,} examples")
print(f"Column names: {dolly_dataset.column_names}")
print(f"Dataset type: {type(dolly_dataset)}")
print()
# Show category distribution
categories = Counter(dolly_dataset["category"])
print("Category distribution:")
for cat, count in sorted(categories.items(), key=lambda x: -x[1]):
bar = "#" * (count // 100)
print(f" {cat:<35} {count:>5} {bar}")
# Explore the dataset structure
print("=" * 60)
print("Sample Examples from Dolly Dataset")
print("=" * 60)
# Show 3 diverse examples
example_categories = ["open_qa", "creative_writing", "summarization"]
for target_cat in example_categories:
# Find first example of this category
example = next(
(ex for ex in dolly_dataset if ex["category"] == target_cat),
None
)
if example:
print(f"\nCategory: {example['category']}")
print(f"Instruction: {example['instruction'][:150]}..." if len(example['instruction']) > 150 else f"Instruction: {example['instruction']}")
if example['context']:
print(f"Context: {example['context'][:100]}..." if len(example['context']) > 100 else f"Context: {example['context']}")
print(f"Response: {example['response'][:200]}..." if len(example['response']) > 200 else f"Response: {example['response']}")
print("-" * 40)
# Convert to pandas for easy exploration
dolly_df = dolly_dataset.to_pandas()
print("Basic statistics:")
print(dolly_df.describe(include="all"))
print()
# Compute text lengths
dolly_df["instruction_len"] = dolly_df["instruction"].str.len()
dolly_df["response_len"] = dolly_df["response"].str.len()
dolly_df["context_len"] = dolly_df["context"].str.len()
print("Text length statistics (characters):")
print(dolly_df[["instruction_len", "response_len", "context_len"]].describe().round(0))
6. Data Cleaning: Deduplication
Duplicate examples in training data cause the model to over-weight certain patterns, leading to memorization artifacts and reduced generalization. Always deduplicate before training.
# --- Exact Deduplication ---
# Hash each example and remove exact duplicates
def exact_dedup(dataset, text_field="instruction"):
"""
Remove exact duplicate examples by hashing a key field.
Args:
dataset: HuggingFace Dataset
text_field: field to use for deduplication key
Returns:
Deduplicated Dataset
"""
seen_hashes = set()
keep_indices = []
for idx, example in enumerate(dataset):
# Create a hash from the instruction field
text = example[text_field].strip().lower()
text_hash = hashlib.md5(text.encode()).hexdigest()
if text_hash not in seen_hashes:
seen_hashes.add(text_hash)
keep_indices.append(idx)
removed = len(dataset) - len(keep_indices)
print(f"Exact deduplication:")
print(f" Before: {len(dataset):,} examples")
print(f" After: {len(keep_indices):,} examples")
print(f" Removed: {removed:,} exact duplicates ({removed/len(dataset)*100:.1f}%)")
return dataset.select(keep_indices)
# Apply to dolly dataset
dolly_deduped = exact_dedup(dolly_dataset, text_field="instruction")
print()
# --- Near-Duplicate Detection (MinHash sketch) ---
# For true production use, install: pip install datasketch
# Below is a lightweight n-gram based similarity approach
def compute_ngram_fingerprint(text, n=5, num_shingles=100):
"""
Create a character n-gram fingerprint for fuzzy deduplication.
A lightweight alternative to MinHash for small datasets.
"""
text = text.lower().strip()
# Generate character n-grams
ngrams = [text[i:i+n] for i in range(len(text) - n + 1)]
if not ngrams:
return frozenset()
# Sample for efficiency on long texts
if len(ngrams) > num_shingles:
ngrams = random.sample(ngrams, num_shingles)
return frozenset(ngrams)
def jaccard_similarity(set_a, set_b):
"""Compute Jaccard similarity between two sets."""
if not set_a or not set_b:
return 0.0
intersection = len(set_a & set_b)
union = len(set_a | set_b)
return intersection / union
def fuzzy_dedup_small(dataset, text_field="instruction", threshold=0.8, max_comparisons=5000):
"""
Near-duplicate removal for smaller datasets (<10K examples).
For large datasets, use MinHash LSH (datasketch library).
Args:
threshold: Jaccard similarity above this = considered duplicate (0.8 = 80% similar)
"""
# Limit for demo purposes
sample_size = min(len(dataset), max_comparisons)
# Compute fingerprints
fingerprints = [
compute_ngram_fingerprint(dataset[i][text_field])
for i in range(sample_size)
]
removed_as_dup = set()
for i in range(sample_size):
if i in removed_as_dup:
continue
for j in range(i + 1, min(i + 50, sample_size)): # Compare with next 50 only
if j in removed_as_dup:
continue
sim = jaccard_similarity(fingerprints[i], fingerprints[j])
if sim >= threshold:
removed_as_dup.add(j)
keep_indices = [i for i in range(sample_size) if i not in removed_as_dup]
removed = sample_size - len(keep_indices)
print(f"Fuzzy deduplication (threshold={threshold}):")
print(f" Analyzed: {sample_size:,} examples")
print(f" Near-duplicates removed: {removed:,} ({removed/sample_size*100:.1f}%)")
print(f" Kept: {len(keep_indices):,} examples")
print()
print("For production with large datasets, use datasketch MinHash LSH:")
print(" from datasketch import MinHash, MinHashLSH")
return dataset.select(keep_indices)
# Demo fuzzy dedup on a small sample
dolly_fuzzy_deduped = fuzzy_dedup_small(
dolly_deduped.select(range(1000)), # Demo on 1000 examples
text_field="instruction",
threshold=0.85
)
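The MinHash approach recommended above for production works by hashing every shingle under many seeds and keeping the per-seed minimum; the fraction of matching minima between two signatures estimates their Jaccard similarity. A self-contained pure-Python sketch of that idea (illustrative only; use datasketch's MinHash LSH for real workloads):

```python
import hashlib

def minhash_signature(shingles, num_perm=64):
    """For each seed, keep the minimum hash over all shingles (the MinHash signature)."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles
        ))
    return sig

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = set("the quick brown fox jumps over the lazy dog".split())
b = set("the quick brown fox leaps over the lazy dog".split())
sig_a, sig_b = minhash_signature(a), minhash_signature(b)
print(f"True Jaccard:     {len(a & b) / len(a | b):.2f}")  # 0.78
print(f"MinHash estimate: {estimate_jaccard(sig_a, sig_b):.2f}")
```

The estimate tightens as num_perm grows; LSH then buckets signatures so that only likely duplicates are ever compared, avoiding the quadratic pairwise scan used in the demo above.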
7. Data Cleaning: Quality Filtering
# Quality filtering functions
def is_too_short(text, min_chars=20):
"""Flag examples with very short responses."""
return len(text.strip()) < min_chars
def is_too_long(text, max_chars=8000):
"""Flag examples with very long responses (often data errors)."""
return len(text.strip()) > max_chars
def has_high_repetition(text, threshold=0.3):
"""
Detect repetitive text (a common sign of model-generated garbage).
Returns True if the text has too many repeated n-grams.
"""
words = text.lower().split()
if len(words) < 10:
return False
# Count bigrams
bigrams = [(words[i], words[i+1]) for i in range(len(words)-1)]
bigram_counts = Counter(bigrams)
# If any bigram appears more than 30% of the time, flag as repetitive
most_common_count = bigram_counts.most_common(1)[0][1] if bigram_counts else 0
return most_common_count / len(bigrams) > threshold
def has_encoding_artifacts(text):
"""
Detect common encoding artifacts from web scraping.
"""
    artifacts = [
        r"â€",           # Mojibake: UTF-8 curly quotes decoded as Latin-1
        r"&amp;",        # Unescaped HTML entities
        r"&lt;",
        r"&gt;",
        r"\\n\\n\\n",    # Literal escaped newlines (not real newlines)
        r"</?[a-z]+>",   # Leftover HTML tags
    ]
for pattern in artifacts:
if re.search(pattern, text):
return True
return False
def is_mostly_non_ascii(text, threshold=0.3):
"""
Detect text that is mostly non-ASCII characters.
Useful for language filtering when training English models.
"""
if not text:
return False
non_ascii = sum(1 for c in text if ord(c) > 127)
return non_ascii / len(text) > threshold
def quality_filter(example, response_field="response", instruction_field="instruction"):
"""
Combined quality filter. Returns True if the example passes all checks.
Use with dataset.filter(quality_filter).
"""
instruction = example.get(instruction_field, "")
response = example.get(response_field, "")
# Skip empty examples
if not instruction or not response:
return False
# Response quality checks
if is_too_short(response, min_chars=30):
return False
if is_too_long(response, max_chars=10000):
return False
if has_high_repetition(response):
return False
if has_encoding_artifacts(response):
return False
# Instruction quality checks
if is_too_short(instruction, min_chars=10):
return False
return True
# Apply to dolly dataset
print(f"Before quality filtering: {len(dolly_deduped):,} examples")
dolly_filtered = dolly_deduped.filter(quality_filter)
print(f"After quality filtering: {len(dolly_filtered):,} examples")
removed = len(dolly_deduped) - len(dolly_filtered)
print(f"Removed: {removed:,} low-quality examples ({removed/len(dolly_deduped)*100:.1f}%)")
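The repetition heuristic used above is worth seeing in isolation: the share taken by the single most common word bigram separates natural text from looping, model-generated garbage. A self-contained sketch (re-implemented here so it runs on its own):

```python
from collections import Counter

def bigram_repetition_ratio(text):
    """Share of the most common word bigram among all bigrams in the text."""
    words = text.lower().split()
    bigrams = list(zip(words, words[1:]))
    if not bigrams:
        return 0.0
    return Counter(bigrams).most_common(1)[0][1] / len(bigrams)

clean = "the quick brown fox jumps over the lazy dog near the river bank"
looped = "very good very good very good very good very good very good"
print(f"clean:  {bigram_repetition_ratio(clean):.2f}")   # 0.08 -> passes a 0.3 threshold
print(f"looped: {bigram_repetition_ratio(looped):.2f}")  # 0.55 -> flagged as repetitive
```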
8. Data Cleaning: Length Filtering and Token Analysis
LLMs process text as tokens, not characters. You need to filter by token length to:
Remove examples that exceed your model's context window
Remove micro-responses (too short to be useful)
Understand your data's distribution before setting max_seq_length
# Load a tokenizer for length analysis
# GPT-2 tokenizer is a reasonable proxy for most modern models
# In production, use the actual model's tokenizer
print("Loading tokenizer for length analysis...")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(f"Tokenizer loaded: {tokenizer.__class__.__name__}")
print()
# Compute token counts for a sample of the dataset
SAMPLE_SIZE = 2000 # Use a sample for speed
sample = dolly_filtered.select(range(min(SAMPLE_SIZE, len(dolly_filtered))))
def count_tokens(example, tokenizer, field="response"):
"""Count tokens in a text field."""
text = example.get(field, "")
tokens = tokenizer.encode(text, add_special_tokens=False)
return {f"{field}_tokens": len(tokens)}
print(f"Computing token counts for {len(sample):,} examples...")
sample = sample.map(
lambda ex: count_tokens(ex, tokenizer, "instruction"),
desc="Counting instruction tokens"
)
sample = sample.map(
lambda ex: count_tokens(ex, tokenizer, "response"),
desc="Counting response tokens"
)
instruction_tokens = sample["instruction_tokens"]
response_tokens = sample["response_tokens"]
# Add combined length (instruction + response + some overhead)
total_tokens = [i + r + 10 for i, r in zip(instruction_tokens, response_tokens)]
print("\nToken Length Statistics:")
print(f"{'Metric':<25} {'Instruction':>12} {'Response':>12} {'Total':>12}")
print("-" * 65)
for stat_name, stat_fn in [("Mean", np.mean), ("Median", np.median), ("Std Dev", np.std),
("Min", np.min), ("Max", np.max),
("95th percentile", lambda x: np.percentile(x, 95)),
("99th percentile", lambda x: np.percentile(x, 99))]:
print(f"{stat_name:<25} {stat_fn(instruction_tokens):>12.0f} {stat_fn(response_tokens):>12.0f} {stat_fn(total_tokens):>12.0f}")
print()
# Recommend max_seq_length
p95 = np.percentile(total_tokens, 95)
p99 = np.percentile(total_tokens, 99)
print(f"Recommendation:")
print(f" max_seq_length = {int(p95):,} (covers 95% of examples, lower VRAM usage)")
print(f" max_seq_length = {int(p99):,} (covers 99% of examples, higher VRAM usage)")
# Apply length filtering
MAX_TOKENS = 2048 # Set based on your GPU memory and model context window
MIN_TOKENS = 10 # Filter out micro-responses
def length_filter(example, max_tokens=MAX_TOKENS, min_tokens=MIN_TOKENS):
"""
Filter examples by total token count.
Uses character count as a proxy (avoids tokenizing every example).
Roughly 1 token β 4 characters for English text.
"""
instruction = example.get("instruction", "")
response = example.get("response", "")
context = example.get("context", "")
# Character-based estimate (fast, no tokenizer needed)
total_chars = len(instruction) + len(response) + len(context)
estimated_tokens = total_chars // 4
response_chars = len(response)
estimated_response_tokens = response_chars // 4
return (
estimated_tokens <= max_tokens and
estimated_response_tokens >= min_tokens
)
print(f"Before length filtering: {len(dolly_filtered):,} examples")
dolly_length_filtered = dolly_filtered.filter(length_filter)
print(f"After length filtering: {len(dolly_length_filtered):,} examples")
removed = len(dolly_filtered) - len(dolly_length_filtered)
print(f"Removed: {removed:,} examples outside length bounds")
9. Creating Your Own Custom Dataset
Often you will need to create a dataset from scratch for your specific use case. Here are the main approaches.
# Method 1: Create from a list of dictionaries
custom_examples = [
{
"instruction": "What is the capital of France?",
"input": "",
"output": "The capital of France is Paris."
},
{
"instruction": "Write a Python function to check if a number is prime.",
"input": "",
"output": """def is_prime(n):
if n < 2:
return False
for i in range(2, int(n**0.5) + 1):
if n % i == 0:
return False
return True"""
},
{
"instruction": "Classify the sentiment of the following review.",
"input": "The product arrived on time and works perfectly. Very happy with my purchase!",
"output": "Positive"
},
{
"instruction": "Translate to Spanish.",
"input": "Good morning, how are you today?",
        "output": "Buenos días, ¿cómo estás hoy?"
},
{
"instruction": "Summarize the main idea in one sentence.",
"input": "Neural networks are computing systems inspired by biological neural networks. They learn to perform tasks by analyzing training examples, without being explicitly programmed with task-specific rules.",
"output": "Neural networks are brain-inspired computing systems that learn from examples rather than explicit programming."
},
]
custom_dataset = Dataset.from_list(custom_examples)
print(f"Custom dataset created: {len(custom_dataset)} examples")
print(f"Columns: {custom_dataset.column_names}")
print()
print("First example:")
print(json.dumps(custom_dataset[0], indent=2))
# Method 2: Create from a pandas DataFrame
import pandas as pd
# Simulate reading from a CSV/spreadsheet
data = {
"input_text": [
"Customer: My order hasn't arrived after 2 weeks.",
"Customer: I received the wrong item in my package.",
"Customer: Can I return this product for a refund?",
"Customer: The product is defective and stopped working.",
"Customer: I was charged twice for the same order.",
],
"expected_output": [
"I sincerely apologize for the delay! I'll immediately check your tracking number and escalate this to our shipping team. You'll receive an update within 24 hours, and if the item is lost, we'll send a replacement at no charge.",
"I'm sorry you received the wrong item! Please keep it, and I'll arrange for the correct item to be shipped to you today with express delivery at no additional cost.",
"Absolutely! Our return policy allows returns within 30 days of purchase. I'll email you a prepaid return label, and your refund will be processed within 3-5 business days of receiving the item.",
"I apologize for the defective product! This is covered under our warranty. I'll send you a replacement unit with priority shipping, or if you prefer, issue a full refund. Which would you like?",
"I'm very sorry about the duplicate charge! This was an error on our end. I'll process a refund for the duplicate charge immediately, and it should appear on your statement within 2-3 business days.",
],
"category": ["shipping", "wrong_item", "returns", "defective", "billing"]
}
df = pd.DataFrame(data)
# Convert to the instruction-response format
df["instruction"] = "You are a helpful customer service agent. Respond to the following customer complaint professionally and empathetically."
df["input"] = df["input_text"]
df["output"] = df["expected_output"]
# Create HuggingFace Dataset from DataFrame
customer_service_dataset = Dataset.from_pandas(df[["instruction", "input", "output", "category"]])
print(f"Customer service dataset: {len(customer_service_dataset)} examples")
print(f"Columns: {customer_service_dataset.column_names}")
print()
print("Example entry:")
example = customer_service_dataset[0]
print(f" Instruction: {example['instruction'][:80]}...")
print(f" Input: {example['input']}")
print(f" Output: {example['output'][:100]}...")
# Method 3: Load from JSONL file (common format for fine-tuning data)
import tempfile
import os
# Create a sample JSONL file
sample_jsonl_data = [
{"instruction": "What is machine learning?", "output": "Machine learning is a subset of AI where systems learn from data."},
{"instruction": "Explain gradient descent.", "output": "Gradient descent is an optimization algorithm that iteratively moves toward the minimum of a loss function."},
{"instruction": "What is overfitting?", "output": "Overfitting is when a model learns the training data too well, including noise, causing poor performance on new data."},
]
# Write to a temporary JSONL file
with tempfile.NamedTemporaryFile(mode='w', suffix='.jsonl', delete=False) as f:
for item in sample_jsonl_data:
f.write(json.dumps(item) + '\n')
jsonl_path = f.name
# Load from JSONL
dataset_from_jsonl = load_dataset("json", data_files=jsonl_path, split="train")
print(f"Loaded from JSONL: {len(dataset_from_jsonl)} examples")
print(f"Columns: {dataset_from_jsonl.column_names}")
# Clean up temp file
os.unlink(jsonl_path)
print("\nAll dataset creation methods demonstrated successfully.")
10. Data Augmentation Techniques
When you have a small dataset, augmentation can increase diversity and reduce overfitting. LLM-based augmentation is the most effective approach in 2026.
# Technique 1: Instruction Paraphrasing
# Generate multiple phrasings of the same instruction to increase diversity
def paraphrase_instruction(instruction, num_variations=3):
"""
Generate paraphrased versions of an instruction using template patterns.
In production, use an LLM (e.g., GPT-4, Claude) to generate better paraphrases:
prompt = f'Paraphrase this instruction {num_variations} different ways: {instruction}'
"""
# Template-based paraphrasing patterns
prefix_patterns = [
"Please {}",
"Can you {}?",
"I need you to {}",
"Your task is to {}",
"Help me {}",
]
# Simple heuristic: lower the instruction and add different prefixes
base = instruction.rstrip('.').rstrip('?').lower()
# Remove existing imperative prefixes
for prefix in ["please ", "can you ", "help me ", "write ", "create ", "generate "]:
if base.startswith(prefix):
base = base[len(prefix):]
break
variations = [instruction] # Always keep original
used_patterns = random.sample(prefix_patterns, min(num_variations, len(prefix_patterns)))
for pattern in used_patterns:
variation = pattern.format(base)
if variation != instruction:
variations.append(variation)
return variations[:num_variations + 1]
# Demonstrate
test_instruction = "Explain the concept of recursion in programming."
variations = paraphrase_instruction(test_instruction, num_variations=4)
print("Instruction paraphrasing example:")
print(f" Original: {variations[0]}")
for i, var in enumerate(variations[1:], 1):
print(f" Variation {i}: {var}")
# Technique 2: Augment dataset by expanding with paraphrased instructions
def augment_with_paraphrases(dataset, instruction_field="instruction",
response_field="output", num_variations=2):
"""
Expand dataset by adding paraphrased instruction variants.
Each original example generates num_variations additional examples.
"""
augmented_examples = []
for example in dataset:
original_instruction = example[instruction_field]
response = example[response_field]
# Keep original
augmented_examples.append({
"instruction": original_instruction,
"output": response,
"augmented": False
})
# Add paraphrased versions
variations = paraphrase_instruction(original_instruction, num_variations=num_variations)
for var in variations[1:][:num_variations]: # Skip original, take variations
augmented_examples.append({
"instruction": var,
"output": response,
"augmented": True
})
return Dataset.from_list(augmented_examples)
# Demo on small subset
small_sample = custom_dataset.select(range(min(5, len(custom_dataset))))
augmented = augment_with_paraphrases(small_sample, response_field="output")
print(f"Before augmentation: {len(small_sample)} examples")
print(f"After augmentation: {len(augmented)} examples")
print(f"Augmented examples: {sum(augmented['augmented'])}")
print()
print("Note: For production, use LLM-based paraphrasing (GPT-4, Claude API)")
print("for higher quality variations. Template paraphrasing is a quick baseline.")
# Technique 3: LLM-assisted data generation (production approach)
# This shows the pattern for using an LLM to generate training data
# Requires an API key - shown as a template
def generate_training_examples_with_llm(topic, num_examples=5, openai_client=None):
"""
Generate training examples using an LLM.
This is the production approach for building high-quality datasets.
Args:
topic: The topic/domain for example generation
num_examples: How many examples to generate
openai_client: Initialized OpenAI client (or None for demo)
"""
generation_prompt = f"""
Generate {num_examples} high-quality instruction-response pairs for fine-tuning a language model.
Topic: {topic}
Requirements:
- Instructions should be clear and unambiguous
- Responses should be accurate, helpful, and appropriately detailed
- Vary the complexity and style of instructions
- Include both simple and complex examples
Output as JSON array with format:
[{{"instruction": "...", "response": "..."}}]
"""
if openai_client is None:
# Demo mode: return synthetic examples
print(f"[Demo mode - would call LLM with prompt:]")
print(generation_prompt[:300], "...")
return [
{"instruction": f"Example instruction about {topic} #{i+1}",
"response": f"Example response about {topic} #{i+1}"}
for i in range(num_examples)
]
else:
# Real LLM call
response = openai_client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": generation_prompt}],
temperature=0.8
)
content = response.choices[0].message.content
# Parse JSON from response
json_match = re.search(r'\[.*\]', content, re.DOTALL)
if json_match:
return json.loads(json_match.group())
return []
# Demo the function
generated = generate_training_examples_with_llm("Python pandas data manipulation", num_examples=3)
print(f"\nGenerated {len(generated)} examples")
print("\nIn production:")
print(" 1. Use GPT-4, Claude, or another capable model as the data generator")
print(" 2. Generate 1000-5000 examples per topic")
print(" 3. Always review a sample manually before training")
print(" 4. Filter generated data with the same quality checks as human data")
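Step 4 above (filtering generated data) starts with basic schema validation before any quality checks run. A minimal sketch of that step (validate_generated_examples is a hypothetical helper, not part of any library):

```python
import json

def validate_generated_examples(raw_json):
    """Parse LLM output and keep only records with non-empty string
    'instruction' and 'response' fields; return [] on malformed JSON."""
    try:
        records = json.loads(raw_json)
    except json.JSONDecodeError:
        return []
    valid = []
    for rec in records:
        if (isinstance(rec, dict)
                and isinstance(rec.get("instruction"), str) and rec["instruction"].strip()
                and isinstance(rec.get("response"), str) and rec["response"].strip()):
            valid.append(rec)
    return valid

raw = '[{"instruction": "Explain joins.", "response": "A join combines rows from two tables."}, {"instruction": ""}]'
print(len(validate_generated_examples(raw)))  # 1 - the incomplete record is dropped
```

Records that survive this gate should then go through the same deduplication and quality filters applied to human-written data earlier in this notebook.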
11. Formatting Functions for Different Model Families
Each model family uses a different chat template. You must format your data to match the exact template the model was trained with; using a different format degrades performance significantly.
# Manual formatting functions for major model families
# These are useful for understanding the formats, but in practice
# use tokenizer.apply_chat_template() (shown in the next cell)
def format_for_llama3(instruction, response, system_prompt=None, input_text=None):
"""
Format for Llama 3.x models (Meta-Llama-3-8B-Instruct, Llama-3.1-*, Llama-3.3-*)
"""
system = system_prompt or "You are a helpful, harmless, and honest assistant."
user_content = f"{instruction}\n\n{input_text}" if input_text else instruction
return (
f"<|begin_of_text|>"
f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
f"<|start_header_id|>user<|end_header_id|>\n\n{user_content}<|eot_id|>"
f"<|start_header_id|>assistant<|end_header_id|>\n\n{response}<|eot_id|>"
)
def format_for_qwen(instruction, response, system_prompt=None, input_text=None):
"""
Format for Qwen 2.5 models (uses ChatML format)
Also works for: OpenHermes, Mistral (v0.3+), and many others
"""
system = system_prompt or "You are a helpful assistant."
user_content = f"{instruction}\n\n{input_text}" if input_text else instruction
return (
f"<|im_start|>system\n{system}<|im_end|>\n"
f"<|im_start|>user\n{user_content}<|im_end|>\n"
f"<|im_start|>assistant\n{response}<|im_end|>"
)
def format_for_phi4(instruction, response, system_prompt=None, input_text=None):
"""
Format for Phi-4 (Microsoft)
"""
system = system_prompt or "You are a helpful AI assistant."
user_content = f"{instruction}\n\n{input_text}" if input_text else instruction
return (
f"<|system|>\n{system}<|end|>\n"
f"<|user|>\n{user_content}<|end|>\n"
f"<|assistant|>\n{response}<|end|>"
)
def format_for_gemma2(instruction, response, input_text=None):
"""
Format for Gemma 2 (Google)
Note: Gemma 2 does not use a system prompt in its standard template
"""
user_content = f"{instruction}\n\n{input_text}" if input_text else instruction
return (
f"<start_of_turn>user\n{user_content}<end_of_turn>\n"
f"<start_of_turn>model\n{response}<end_of_turn>"
)
# Demonstrate all formats with the same example
test_instruction = "What is the difference between a list and a tuple in Python?"
test_response = "Lists are mutable (can be changed), while tuples are immutable (cannot be changed after creation). Lists use square brackets [], tuples use parentheses (). Tuples are slightly faster and use less memory."
formatters = [
("Llama 3.x", format_for_llama3),
("Qwen 2.5 / ChatML", format_for_qwen),
("Phi-4", format_for_phi4),
("Gemma 2", format_for_gemma2),
]
for model_name, formatter in formatters:
print(f"\n{'='*60}")
print(f"Format: {model_name}")
print("='*60")
if model_name == "Gemma 2":
formatted = formatter(test_instruction, test_response)
else:
formatted = formatter(test_instruction, test_response)
print(formatted)
# The RECOMMENDED approach: use tokenizer.apply_chat_template()
# This is the modern, correct way - it uses the exact template the model was designed for
def format_with_chat_template(example, tokenizer, response_field="response",
instruction_field="instruction",
input_field="input",
system_prompt=None):
"""
Format a dataset example using the tokenizer's built-in chat template.
This is the correct approach for any Hugging Face model.
Args:
example: Dataset example dict
tokenizer: The model's tokenizer with chat_template defined
response_field: Name of the response/output field
instruction_field: Name of the instruction field
input_field: Name of the input context field
system_prompt: Optional system prompt string
Returns:
Dict with 'text' key containing formatted string
"""
instruction = example.get(instruction_field, "")
response = example.get(response_field, "")
input_text = example.get(input_field, "")
# Build user content
if input_text and input_text.strip():
user_content = f"{instruction}\n\n{input_text}"
else:
user_content = instruction
# Build messages list
messages = []
if system_prompt:
messages.append({"role": "system", "content": system_prompt})
messages.append({"role": "user", "content": user_content})
messages.append({"role": "assistant", "content": response})
# Apply the tokenizer's chat template
text = tokenizer.apply_chat_template(
messages,
tokenize=False, # Return string, not token IDs
add_generation_prompt=False  # Don't append an empty assistant header; the response is already in messages
)
return {"text": text}
# This cell prints the recommended usage instead of downloading a tokenizer.
# In production, load the actual model's tokenizer, e.g.:
# tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
print("apply_chat_template() demo:")
print()
print("In production, load the actual model's tokenizer:")
print(" # For Qwen 2.5:")
print(" tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-7B-Instruct')")
print()
print(" # For Llama 3.3:")
print(" tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-3.3-70B-Instruct')")
print()
print(" # Then apply to dataset:")
print(" dataset = dataset.map(lambda ex: format_with_chat_template(ex, tokenizer))")
print()
print("The tokenizer's chat_template ensures correct special tokens are used.")
print("Never manually construct chat templates for models you plan to use in production.")
# Demonstrate the messages structure
sample_messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain gradient descent in one paragraph."},
{"role": "assistant", "content": "Gradient descent is an optimization algorithm that iteratively adjusts model parameters in the direction that reduces a loss function."}
]
print()
print("Messages format (model-agnostic):")
print(json.dumps(sample_messages, indent=2))
12. Train/Validation Split Best Practices
# Best practices for train/validation splitting
def create_splits(dataset, train_size=0.9, val_size=0.1, seed=42, stratify_field=None):
"""
Create train/validation splits with optional stratification.
Args:
dataset: HuggingFace Dataset
train_size: Fraction for training (default 0.9)
val_size: Fraction for validation (default 0.1)
seed: Random seed for reproducibility
stratify_field: Column name to stratify by (ensures balanced categories)
Returns:
DatasetDict with 'train' and 'validation' keys
"""
assert abs(train_size + val_size - 1.0) < 1e-6, "train_size + val_size must equal 1.0"
if stratify_field and stratify_field in dataset.column_names:
# Stratified split: preserve category proportions in both splits
# Get unique categories
categories = sorted(set(dataset[stratify_field]))  # sorted for run-to-run reproducibility
train_indices = []
val_indices = []
rng = random.Random(seed)
# Read the stratify column once; iterating full row dicts is slow on large datasets
strat_values = dataset[stratify_field]
for cat in categories:
# Get all indices for this category
cat_indices = [i for i, c in enumerate(strat_values) if c == cat]
rng.shuffle(cat_indices)
# Split proportionally
split_idx = max(1, int(len(cat_indices) * train_size))
train_indices.extend(cat_indices[:split_idx])
val_indices.extend(cat_indices[split_idx:])
# Shuffle final indices
rng.shuffle(train_indices)
rng.shuffle(val_indices)
return DatasetDict({
"train": dataset.select(train_indices),
"validation": dataset.select(val_indices)
})
else:
# Simple random split
split = dataset.train_test_split(
test_size=val_size,
seed=seed
)
return DatasetDict({
"train": split["train"],
"validation": split["test"]
})
# Apply to dolly dataset with stratification
print("Creating stratified train/validation split...")
dataset_splits = create_splits(
dolly_length_filtered,
train_size=0.9,
val_size=0.1,
seed=42,
stratify_field="category"
)
print(f"Total examples: {len(dolly_length_filtered):,}")
print(f"Train split: {len(dataset_splits['train']):,} ({len(dataset_splits['train'])/len(dolly_length_filtered)*100:.1f}%)")
print(f"Val split: {len(dataset_splits['validation']):,} ({len(dataset_splits['validation'])/len(dolly_length_filtered)*100:.1f}%)")
print()
# Verify category distribution is maintained
print("Category distribution in splits:")
train_cats = Counter(dataset_splits['train']['category'])
val_cats = Counter(dataset_splits['validation']['category'])
all_cats = sorted(set(list(train_cats.keys()) + list(val_cats.keys())))
print(f"{'Category':<35} {'Train':>8} {'Val':>8} {'Train%':>8}")
print("-" * 62)
for cat in all_cats:
t = train_cats.get(cat, 0)
v = val_cats.get(cat, 0)
total = t + v
pct = t / total * 100 if total > 0 else 0
print(f"{cat:<35} {t:>8} {v:>8} {pct:>7.1f}%")
# Validation set size guidelines
split_guidelines = [
("< 500 examples", "20% val (100 min)", "You need enough validation examples to get reliable metrics"),
("500β2,000 examples", "15% val", "Balance between training data and reliable evaluation"),
("2Kβ10K examples", "10% val", "Standard split for most fine-tuning tasks"),
("10Kβ100K examples", "5% val", "Large enough training set, smaller val% is OK"),
("> 100K examples", "1β2% val", "Even 1K validation examples is plenty for metrics"),
]
print("Train/Validation Split Guidelines:")
print()
print(f"{'Dataset Size':<25} {'Val Split':>12} Notes")
print("-" * 75)
for size, split, notes in split_guidelines:
print(f"{size:<25} {split:>12} {notes}")
print()
print("IMPORTANT: Never evaluate on training data.")
print("IMPORTANT: Choose validation examples that represent real-world usage.")
print("IMPORTANT: Keep test set completely separate (do not tune on it).")
13. Dataset Statistics and Visualization
# Comprehensive dataset visualization
fig = plt.figure(figsize=(16, 12))
fig.suptitle("Dataset Analysis: Databricks Dolly-15K (Processed)", fontsize=15, fontweight="bold", y=0.98)
gs = gridspec.GridSpec(2, 3, figure=fig, hspace=0.4, wspace=0.35)
# Use the filtered dataset for visualization
viz_df = dolly_length_filtered.to_pandas()
viz_df["instruction_chars"] = viz_df["instruction"].str.len()
viz_df["response_chars"] = viz_df["response"].str.len()
viz_df["context_chars"] = viz_df["context"].str.len()
# Estimate token lengths
viz_df["instruction_tokens"] = (viz_df["instruction_chars"] / 4).astype(int)
viz_df["response_tokens"] = (viz_df["response_chars"] / 4).astype(int)
viz_df["total_tokens"] = viz_df["instruction_tokens"] + viz_df["response_tokens"]
# --- Plot 1: Response token length distribution ---
ax1 = fig.add_subplot(gs[0, 0])
ax1.hist(viz_df["response_tokens"], bins=50, color="steelblue", alpha=0.85, edgecolor="white")
ax1.axvline(viz_df["response_tokens"].median(), color="red", linestyle="--", linewidth=1.5,
label=f"Median: {viz_df['response_tokens'].median():.0f}")
ax1.axvline(viz_df["response_tokens"].quantile(0.95), color="orange", linestyle=":",
linewidth=1.5, label=f"P95: {viz_df['response_tokens'].quantile(0.95):.0f}")
ax1.set_xlabel("Response Length (est. tokens)")
ax1.set_ylabel("Count")
ax1.set_title("Response Length Distribution")
ax1.legend(fontsize=8)
# --- Plot 2: Instruction token length distribution ---
ax2 = fig.add_subplot(gs[0, 1])
ax2.hist(viz_df["instruction_tokens"], bins=40, color="seagreen", alpha=0.85, edgecolor="white")
ax2.axvline(viz_df["instruction_tokens"].median(), color="red", linestyle="--", linewidth=1.5,
label=f"Median: {viz_df['instruction_tokens'].median():.0f}")
ax2.set_xlabel("Instruction Length (est. tokens)")
ax2.set_ylabel("Count")
ax2.set_title("Instruction Length Distribution")
ax2.legend(fontsize=8)
# --- Plot 3: Category distribution (horizontal bar) ---
ax3 = fig.add_subplot(gs[0, 2])
cat_counts = viz_df["category"].value_counts()
colors = plt.cm.Set3(np.linspace(0, 1, len(cat_counts)))
bars = ax3.barh(range(len(cat_counts)), cat_counts.values, color=colors)
ax3.set_yticks(range(len(cat_counts)))
ax3.set_yticklabels(cat_counts.index, fontsize=8)
ax3.set_xlabel("Number of Examples")
ax3.set_title("Category Distribution")
for bar, val in zip(bars, cat_counts.values):
ax3.text(bar.get_width() + 5, bar.get_y() + bar.get_height()/2,
str(val), va='center', fontsize=7)
# --- Plot 4: Total token length (training sequence length) ---
ax4 = fig.add_subplot(gs[1, 0])
ax4.hist(viz_df["total_tokens"], bins=50, color="coral", alpha=0.85, edgecolor="white")
for pct, color, label in [(50, "blue", "P50"), (90, "orange", "P90"), (95, "red", "P95")]:
val = viz_df["total_tokens"].quantile(pct/100)
ax4.axvline(val, color=color, linestyle="--", linewidth=1.5,
label=f"{label}: {val:.0f} tokens")
ax4.set_xlabel("Total Sequence Length (est. tokens)")
ax4.set_ylabel("Count")
ax4.set_title("Total Sequence Length\n(Instruction + Response)")
ax4.legend(fontsize=7)
# --- Plot 5: Scatter: instruction length vs response length ---
ax5 = fig.add_subplot(gs[1, 1])
sample_mask = viz_df.sample(min(1000, len(viz_df)), random_state=42).index
ax5.scatter(
viz_df.loc[sample_mask, "instruction_tokens"],
viz_df.loc[sample_mask, "response_tokens"],
alpha=0.3, s=10, color="purple"
)
ax5.set_xlabel("Instruction Tokens")
ax5.set_ylabel("Response Tokens")
ax5.set_title("Instruction vs Response Length\n(1K random sample)")
# --- Plot 6: Context length distribution (examples with context) ---
ax6 = fig.add_subplot(gs[1, 2])
has_context = viz_df[viz_df["context_chars"] > 0]
no_context_count = len(viz_df) - len(has_context)
pie_data = [len(has_context), no_context_count]
pie_labels = [f"With context\n({len(has_context):,})", f"No context\n({no_context_count:,})"]
ax6.pie(pie_data, labels=pie_labels, colors=["steelblue", "lightgray"],
autopct="%1.1f%%", startangle=90)
ax6.set_title("Examples With / Without Context")
plt.savefig("/tmp/dataset_analysis.png", dpi=120, bbox_inches="tight")
plt.show()
print("Dataset analysis visualization complete.")
print("Saved to /tmp/dataset_analysis.png")
# Print a comprehensive statistics summary
def print_dataset_summary(dataset, name="Dataset", response_field="response",
instruction_field="instruction"):
"""
Print a formatted summary of dataset statistics.
"""
df = dataset.to_pandas() if not isinstance(dataset, pd.DataFrame) else dataset
print(f"\n{'='*60}")
print(f"DATASET SUMMARY: {name}")
print(f"{'='*60}")
print(f"Total examples: {len(df):,}")
print(f"Columns: {list(df.columns)}")
if response_field in df.columns:
resp_chars = df[response_field].str.len()
resp_tokens = (resp_chars / 4).astype(int)
print(f"\nResponse Statistics:")
print(f" Avg length: {resp_chars.mean():.0f} chars / {resp_tokens.mean():.0f} tokens")
print(f" Median length: {resp_chars.median():.0f} chars / {resp_tokens.median():.0f} tokens")
print(f" Min length: {resp_chars.min():.0f} chars / {resp_tokens.min():.0f} tokens")
print(f" Max length: {resp_chars.max():.0f} chars / {resp_tokens.max():.0f} tokens")
print(f" P95 length: {resp_chars.quantile(0.95):.0f} chars / {resp_tokens.quantile(0.95):.0f} tokens")
if instruction_field in df.columns:
instr_chars = df[instruction_field].str.len()
print(f"\nInstruction Statistics:")
print(f" Avg length: {instr_chars.mean():.0f} chars")
print(f" Median length: {instr_chars.median():.0f} chars")
if "category" in df.columns:
print(f"\nCategory Distribution:")
for cat, count in df["category"].value_counts().items():
pct = count / len(df) * 100
print(f" {cat:<35} {count:>6} ({pct:.1f}%)")
print(f"{'='*60}\n")
print_dataset_summary(dolly_length_filtered, name="Dolly-15K (Processed)")
print_dataset_summary(dataset_splits['train'], name="Train Split")
print_dataset_summary(dataset_splits['validation'], name="Validation Split")
14. Saving and Versioning Datasets
# Saving datasets: multiple formats
import os
SAVE_DIR = "/tmp/finetuning_dataset"
os.makedirs(SAVE_DIR, exist_ok=True)
# --- Method 1: Save as Hugging Face Dataset (Arrow format) ---
# Best for: Fast loading in Python, large datasets, used with .load_from_disk()
save_path_arrow = f"{SAVE_DIR}/dolly_processed"
dataset_splits.save_to_disk(save_path_arrow)
print(f"Saved as Arrow format to: {save_path_arrow}")
print(f" Files: {os.listdir(save_path_arrow)}")
# Reload
from datasets import load_from_disk
reloaded = load_from_disk(save_path_arrow)
print(f" Reloaded successfully: train={len(reloaded['train']):,}, val={len(reloaded['validation']):,}")
print()
# --- Method 2: Save as JSONL (most portable format) ---
# Best for: Sharing with others, version control, inspecting with text editors
def save_as_jsonl(dataset, filepath):
"""Save a HuggingFace Dataset as JSONL."""
with open(filepath, 'w', encoding='utf-8') as f:
for example in dataset:
f.write(json.dumps(example, ensure_ascii=False) + '\n')
size_mb = os.path.getsize(filepath) / (1024 * 1024)
print(f"Saved {len(dataset):,} examples to {filepath} ({size_mb:.2f} MB)")
train_jsonl_path = f"{SAVE_DIR}/train.jsonl"
val_jsonl_path = f"{SAVE_DIR}/validation.jsonl"
save_as_jsonl(dataset_splits['train'], train_jsonl_path)
save_as_jsonl(dataset_splits['validation'], val_jsonl_path)
# Reload from JSONL
reloaded_train = load_dataset("json", data_files=train_jsonl_path, split="train")
print(f"Reloaded from JSONL: {len(reloaded_train):,} examples")
print()
# --- Method 3: Push to Hugging Face Hub ---
# Best for: Sharing publicly, collaboration, dataset versioning
# Requires: huggingface-cli login or HF_TOKEN environment variable
print("Pushing to Hugging Face Hub (template β requires HF token):")
print()
print("""
# First, login:
# huggingface-cli login
# OR set environment variable: export HF_TOKEN=your_token
# Push full DatasetDict:
dataset_splits.push_to_hub(
repo_id="your-username/dolly-15k-processed",
private=True # Set False to make public
)
# Push just the training split:
dataset_splits['train'].push_to_hub(
repo_id="your-username/dolly-15k-processed",
split="train",
private=True
)
# Load it back from Hub:
from datasets import load_dataset
dataset = load_dataset("your-username/dolly-15k-processed")
""")
print()
# --- Method 4: Dataset card / versioning best practices ---
print("Dataset versioning best practices:")
print(" 1. Use semantic versioning: v1.0.0, v1.1.0, v2.0.0")
print(" 2. Document the processing steps (what was filtered, why)")
print(" 3. Save the exact filtering code alongside the data")
print(" 4. Log data statistics (size, length distribution, categories)")
print(" 5. Record the tokenizer used for length filtering")
print(" 6. Store train/val split seeds for reproducibility")
# Save a metadata file alongside the dataset
metadata = {
"version": "1.0.0",
"base_dataset": "databricks/databricks-dolly-15k",
"processing_date": "2026-02-28",
"total_examples": len(dolly_length_filtered),
"train_examples": len(dataset_splits['train']),
"val_examples": len(dataset_splits['validation']),
"train_val_split": "90/10 stratified by category",
"random_seed": 42,
"processing_steps": [
"exact_deduplication",
"quality_filtering_min_response_30_chars",
"quality_filtering_max_response_10000_chars",
"repetition_filter_bigram_threshold_0.3",
"length_filter_max_2048_estimated_tokens",
"stratified_train_val_split"
],
"tokenizer_for_length_estimate": "gpt2 (4 chars per token approximation)"
}
metadata_path = f"{SAVE_DIR}/dataset_metadata.json"
with open(metadata_path, 'w') as f:
json.dump(metadata, f, indent=2)
print(f"\nMetadata saved to: {metadata_path}")
print(json.dumps(metadata, indent=2))
15. Complete Dataset Preparation Pipeline
Putting it all together into a reusable function.
def prepare_dataset(
dataset_id,
instruction_field="instruction",
response_field="response",
input_field=None,
category_field=None,
min_response_chars=30,
max_total_tokens=2048,
train_size=0.9,
seed=42,
save_dir=None,
):
"""
Complete dataset preparation pipeline.
Applies: deduplication → quality filtering → length filtering → train/val split.
Args:
dataset_id: HuggingFace dataset ID (e.g., 'databricks/databricks-dolly-15k')
instruction_field: Column name for instructions
response_field: Column name for responses/outputs
input_field: Column name for input context (optional)
category_field: Column to stratify splits by (optional)
min_response_chars: Minimum response character length
max_total_tokens: Maximum total token estimate
train_size: Fraction of data for training
seed: Random seed
save_dir: Directory to save processed dataset (optional)
Returns:
DatasetDict with 'train' and 'validation' keys
"""
print(f"[1/6] Loading dataset: {dataset_id}")
raw_dataset = load_dataset(dataset_id, split="train")
print(f" Loaded {len(raw_dataset):,} examples")
print(f"[2/6] Exact deduplication on '{instruction_field}'")
deduped = exact_dedup(raw_dataset, text_field=instruction_field)
print(f"[3/6] Quality filtering")
filtered = deduped.filter(
lambda ex: quality_filter(ex,
response_field=response_field,
instruction_field=instruction_field)
)
print(f" After filtering: {len(filtered):,} examples")
print(f"[4/6] Length filtering (max ~{max_total_tokens} tokens)")
max_chars = max_total_tokens * 4 # chars per token estimate
length_filtered = filtered.filter(
lambda ex: (
len(ex.get(instruction_field, "")) +
len(ex.get(response_field, "")) +
len(ex.get(input_field or "", ""))
) <= max_chars and len(ex.get(response_field, "")) >= min_response_chars
)
print(f" After length filtering: {len(length_filtered):,} examples")
print(f"[5/6] Creating train/validation split (train={train_size:.0%})")
splits = create_splits(
length_filtered,
train_size=train_size,
val_size=1.0 - train_size,
seed=seed,
stratify_field=category_field
)
print(f" Train: {len(splits['train']):,}, Validation: {len(splits['validation']):,}")
if save_dir:
print(f"[6/6] Saving to disk: {save_dir}")
os.makedirs(save_dir, exist_ok=True)
splits.save_to_disk(save_dir)
print(f" Saved successfully")
else:
print(f"[6/6] Skipping save (no save_dir provided)")
print(f"\nDataset preparation complete!")
print(f" Final: {len(splits['train']):,} train + {len(splits['validation']):,} val")
print(f" Original: {len(raw_dataset):,} β Final: {len(splits['train']) + len(splits['validation']):,} ({(len(splits['train']) + len(splits['validation']))/len(raw_dataset)*100:.1f}% retained)")
return splits
# Run the complete pipeline on Dolly
print("Running complete dataset preparation pipeline...")
print()
final_dataset = prepare_dataset(
dataset_id="databricks/databricks-dolly-15k",
instruction_field="instruction",
response_field="response",
input_field="context",
category_field="category",
min_response_chars=30,
max_total_tokens=2048,
train_size=0.9,
seed=42,
save_dir=f"{SAVE_DIR}/final_processed"
)
16. Key Takeaways
What You Learned in This Notebook
Data quality beats data quantity: 1,000 excellent examples outperform 100,000 mediocre ones, as the LIMA paper demonstrated. Always manually inspect your data.
Three dataset formats for different objectives:
Instruction-response (Alpaca): for teaching instruction following
Conversations (ChatML): for multi-turn chat
Preference pairs: for DPO alignment
Deduplication is essential: exact hash dedup plus optional fuzzy dedup prevents your model from memorizing specific examples.
Always filter by token length before training. Examples that exceed max_seq_length are silently truncated, which causes training instability.
Use tokenizer.apply_chat_template() and never manually construct chat templates; each model family has unique special tokens that must be exact.
Stratified splits preserve category balance between train and validation sets, which is critical for accurate evaluation.
Document everything: save metadata with your dataset so you can reproduce it later.
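To make the length-filtering takeaway concrete: before training, count tokens with the model's own tokenizer instead of the 4-chars-per-token estimate used in this notebook. A small sketch; it only assumes `tokenizer` exposes an `encode` method, and the field names follow the Dolly schema used here:

```python
def within_token_budget(example, tokenizer, max_tokens=2048,
                        fields=("instruction", "context", "response")):
    """True if the concatenated fields fit within max_tokens, counted exactly."""
    text = "\n\n".join(example.get(f) or "" for f in fields)
    return len(tokenizer.encode(text)) <= max_tokens

# Usage with a real tokenizer (downloads the model's tokenizer files):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
# dataset = dataset.filter(lambda ex: within_token_budget(ex, tok))
```

Exact counting is slower than the character heuristic, so a common pattern is to filter roughly by characters first and run the exact check only near the boundary.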
Common Pitfalls to Avoid
Forgetting to set labels to -100 for prompt tokens (you should only train on the response)
Using the wrong chat template for your model
Not deduplicating (leads to memorization)
Evaluating on training data
Including PII (personally identifiable information) in training data
Not shuffling before splitting
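The first pitfall deserves a sketch. When building training tensors, labels for prompt tokens are set to -100 (the index PyTorch's cross-entropy loss ignores) so the loss is computed only on the response. `prompt_ids` and `response_ids` below stand in for already-tokenized ID lists:

```python
IGNORE_INDEX = -100  # positions with this label are skipped by cross-entropy loss

def build_labels(prompt_ids, response_ids):
    """Train only on the response: mask prompt positions with -100."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

ids, labels = build_labels([10, 11, 12], [20, 21])
print(ids)     # [10, 11, 12, 20, 21]
print(labels)  # [-100, -100, -100, 20, 21]
```

Trainers such as TRL's SFTTrainer can apply this masking for you, but knowing what it does makes loss curves much easier to interpret.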
Next Steps
You now have a clean, formatted dataset ready for fine-tuning.
Next: 02_supervised_finetuning.ipynb
You will learn how to:
Set up TRL's SFTTrainer for instruction fine-tuning
Configure training hyperparameters
Monitor training with W&B (Weights & Biases)
Run your first full fine-tuning job
Save and evaluate the fine-tuned model
Zero-to-AI Curriculum – Phase 12: LLM Fine-tuning
Last updated: February 2026