Phase 12: LLM Fine-tuning – Start Here

Welcome to Phase 12 of the Zero-to-AI curriculum. This module covers LLM Fine-tuning, one of the most powerful techniques for adapting large language models to specific tasks, domains, and behaviors.

By the end of this phase, you will be able to:

  • Decide when fine-tuning is the right approach vs. prompt engineering or RAG

  • Understand the full landscape of fine-tuning methods (SFT, LoRA, QLoRA, DPO, RLHF)

  • Prepare high-quality datasets for fine-tuning

  • Train a fine-tuned model using efficient techniques like LoRA and QLoRA

  • Evaluate and deploy your fine-tuned models

Prerequisites: Python, basic ML concepts, familiarity with Hugging Face transformers. Complete Phases 1–11 first.

1. What is LLM Fine-tuning?

Pre-training is the process of training a model on massive amounts of text data to learn general language patterns. This is what creates models like GPT-4, Llama 3.3, and Qwen 2.5. It costs millions of dollars and requires thousands of GPUs.

Fine-tuning starts from a pre-trained model and continues training on a smaller, task-specific dataset. This allows the model to:

  • Learn specialized domain knowledge (medical, legal, finance)

  • Follow a specific format or style consistently

  • Adopt a particular persona or tone

  • Refuse harmful requests (alignment/safety fine-tuning)

  • Improve performance on a narrow task (code generation, SQL, etc.)

The Two-Stage Mental Model

Stage 1: Pre-training (done for you)
  Massive web data → Foundation Model
  Cost: $1M–$100M+, weeks of compute

Stage 2: Fine-tuning (what you do)
  Your labeled data → Specialized Model
  Cost: $10–$1000, hours of compute

Types of Fine-tuning

| Type | What it does | Common Use Case |
|------|--------------|-----------------|
| Instruction Fine-tuning (SFT) | Teaches the model to follow instructions | General-purpose assistants |
| Domain Adaptation | Injects domain-specific knowledge | Medical/legal/finance models |
| Task-specific Fine-tuning | Optimizes for one task | SQL generation, classification |
| Alignment (DPO/RLHF) | Teaches human preferences | Safety, tone, helpfulness |
| Continual Pre-training | Adds new knowledge to base weights | Keeping models current |

2. When to Fine-tune vs. Other Approaches

Fine-tuning is powerful but not always the right tool. Use this decision framework:

The Three Approaches Compared

| Approach | Setup Cost | Latency | Privacy | Consistency | Best When |
|----------|------------|---------|---------|-------------|-----------|
| Prompt Engineering | Very low | Low | Medium | Medium | Exploring capability, prototyping |
| RAG | Medium | Medium | High | High | Factual Q&A, up-to-date info needed |
| Fine-tuning | High | Low | Very High | Very High | Style/format/behavior, no internet access, high volume |

Decision Tree

START
│
├── Can a good prompt solve this? ──YES──> Use Prompt Engineering
│         │
│         NO
│         │
├── Does the task require up-to-date or private documents? ──YES──> Use RAG
│         │
│         NO
│         │
├── Do you need consistent style/format/behavior at scale? ──YES──┐
│         │                                                        │
│         NO                                                       ▼
│         │                                               FINE-TUNING
├── Is data privacy critical (can't send data to API)? ──YES──────┘
│         │
│         NO
│         │
└── Is inference cost at scale a major concern? ──YES──> FINE-TUNING (smaller model)
          │
          NO
          │
          └──> Try Prompt Engineering or RAG first

Green Flags for Fine-tuning

  • You have 500+ high-quality labeled examples

  • You need a very specific output format every time

  • Prompt engineering gives inconsistent results

  • You need to run the model on private/sensitive data

  • You need to reduce inference costs (smaller fine-tuned model > large prompted model)

  • You need a specific persona or communication style consistently

Red Flags (Don't Fine-tune Yet)

  • You have fewer than 100 examples

  • You haven't tried few-shot prompting first

  • The task requires real-time knowledge (use RAG)

  • You need to update behavior frequently (retraining is expensive)

  • The base model already does the task well with prompting

3. Fine-tuning Methods: The Full Landscape

3.1 Supervised Fine-tuning (SFT)

The most basic form of fine-tuning. You provide (input, output) pairs and train the model to predict the output given the input using standard cross-entropy loss.

How it works:

  • Prepare a dataset of instruction-response pairs

  • Train the model with teacher forcing (predict next token given previous tokens)

  • All model weights are updated (full fine-tuning) OR only a small adapter (PEFT)

When to use: Starting point for any fine-tuning project. Always do SFT before DPO/RLHF.

3.2 LoRA (Low-Rank Adaptation)

Instead of updating all model weights (which for a 7B model requires roughly 56GB of VRAM once gradients and optimizer states are included), LoRA freezes the original weights and adds small trainable rank-decomposition matrices alongside them.

Key insight: Weight updates during fine-tuning have a low intrinsic rank. So instead of updating a full matrix W (e.g., 4096×4096 ≈ 16M parameters), you learn two small matrices B (4096×r) and A (r×4096), where r is typically 8–64, and use their product BA as the update.

Original: W (frozen)
LoRA adds: W + BA  (only A and B are trained)
Trainable params: 2 × d × r  vs  d × d
Example: r=16, d=4096 → 131K vs 16M (99% reduction!)
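The arithmetic in the block above can be checked with a few lines of Python. This is a sketch for intuition only; the helper name is ours, and d and r stand for the hidden size and LoRA rank:

```python
def lora_trainable_params(d: int, r: int) -> int:
    """Trainable parameters for one d×d weight with a rank-r LoRA adapter:
    B is (d × r) and A is (r × d), so 2 * d * r in total."""
    return 2 * d * r

d, r = 4096, 16
full = d * d                          # parameters in the full weight matrix
lora = lora_trainable_params(d, r)    # parameters in the adapter
print(f"Full matrix:  {full:,} params")
print(f"LoRA adapter: {lora:,} params")
print(f"Reduction:    {100 * (1 - lora / full):.1f}%")  # → 99.2%
```

Doubling r doubles the adapter size but leaves it a tiny fraction of the frozen matrix, which is why raising the rank is cheap in absolute terms.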

Key hyperparameters:

  • r (rank): 8–64, higher = more expressive but more memory

  • lora_alpha: scaling factor, often set to r or 2r

  • target_modules: which layers to apply LoRA to (attention Q,K,V,O matrices)

  • lora_dropout: regularization, typically 0.05–0.1

When to use: GPU with 16GB+ VRAM, 7B–13B models.

3.3 QLoRA (Quantized LoRA)

QLoRA combines LoRA with 4-bit quantization of the base model weights. This dramatically reduces VRAM requirements:

Full fine-tune 7B model: ~56GB VRAM
LoRA 7B model:           ~28GB VRAM
QLoRA 7B model:          ~6-8GB VRAM  ← fits on a single consumer GPU!

How it works:

  1. Load base model in 4-bit (NF4 quantization via bitsandbytes)

  2. Add LoRA adapters in full precision (16-bit)

  3. Compute gradients in 16-bit but keep base weights in 4-bit

Quality trade-off: Typically a 1–3% performance drop vs. full fine-tuning. Often acceptable for the 8× memory savings.

When to use: Consumer GPU (RTX 3090/4090), Colab, Kaggle. 7B–70B models.
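The three steps above map onto a short configuration sketch with transformers and peft. This is illustrative only: it needs a GPU plus the libraries listed in section 7, and the model name is a placeholder for any causal LM you have access to.

```python
# QLoRA configuration sketch: 4-bit base weights + 16-bit LoRA adapters.
# Assumes transformers, peft, bitsandbytes, and accelerate are installed.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # step 1: load base model in 4-bit
    bnb_4bit_quant_type="nf4",              # NF4 quantization via bitsandbytes
    bnb_4bit_compute_dtype=torch.bfloat16,  # step 3: compute in 16-bit
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",              # placeholder model name
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(                   # step 2: full-precision adapters
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # adapters are a tiny fraction of total
```

The resulting model object can be handed directly to TRL's SFTTrainer, which is covered in notebook 04.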

3.4 DPO (Direct Preference Optimization)

After SFT, the model can generate responses but may not know which responses humans prefer. DPO trains on preference data (chosen vs. rejected response pairs) to align the model with human preferences.

Dataset format:

{
  "prompt": "Explain quantum entanglement",
  "chosen": "Quantum entanglement is a phenomenon where...",
  "rejected": "idk lol its complicated science stuff"
}

Why DPO over RLHF: DPO is simpler (no reward model, no PPO optimizer), more stable, and achieves competitive results. It's the current industry standard for alignment fine-tuning.

When to use: After SFT, when you want to improve helpfulness, reduce harmful outputs, or enforce a specific tone/style.
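To see the mechanism, here is the standard DPO objective sketched in pure Python (the formula is from the DPO paper, not this notebook; the log-probability values below are made up for illustration):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO objective: -log sigmoid(beta * margin), where the margin measures
    how much more the policy prefers 'chosen' over 'rejected' than the
    frozen reference (SFT) model does."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Before training, policy == reference, so the margin is 0 and loss = log 2.
before = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# After some training: chosen logprob up, rejected logprob down -> lower loss.
after = dpo_loss(-8.0, -12.0, -10.0, -10.0)
print(f"{before:.4f} -> {after:.4f}")
```

The loss falls as the policy widens its preference for the chosen response, while the reference terms keep it anchored to the SFT model.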

3.5 RLHF (Reinforcement Learning from Human Feedback)

The original alignment technique used by OpenAI for InstructGPT/ChatGPT. More complex than DPO but more flexible:

  1. Train a reward model on preference pairs (predicts which response humans prefer)

  2. Run PPO (Proximal Policy Optimization) to maximize the reward model's score

  3. KL penalty prevents the model from drifting too far from the SFT baseline

When to use: Large-scale alignment projects, when you need a reusable reward model, research settings.

Method Comparison Summary

| Method | VRAM (7B) | Data Needed | Complexity | Quality |
|--------|-----------|-------------|------------|---------|
| Full SFT | 56GB | 1K–100K | Low | Best |
| LoRA | 28GB | 1K–100K | Low | Great |
| QLoRA | 6–8GB | 1K–100K | Low | Good |
| DPO | 28GB (LoRA) | 500–10K pairs | Medium | Great |
| RLHF | 80GB+ | 10K+ pairs + reward model | High | Great |

4. Key Concepts Reference

PEFT (Parameter-Efficient Fine-Tuning)

An umbrella term for techniques that update only a small fraction of model parameters instead of all weights. LoRA, QLoRA, prefix tuning, and prompt tuning all fall under PEFT. The peft library from Hugging Face is the standard implementation.

Adapters

Small trainable modules inserted into a frozen pre-trained model. LoRA is the most popular adapter approach. After training, adapters can be merged back into base weights (zero inference overhead) or kept separate (swap adapters at runtime for different tasks).

Instruction Tuning

Fine-tuning a model to follow natural language instructions. The dataset consists of (instruction, response) pairs, often augmented with a system prompt. This is what turns a raw base model (next-token predictor) into an assistant model (instruction follower). Examples: Alpaca, FLAN, Dolly.

Chat Templates

Standardized formats that structure multi-turn conversations. Each model family has its own template:

# Llama 3.x
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
Hello!<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

# ChatML (Qwen, etc.)
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
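For intuition, the ChatML layout above can be produced by a tiny formatter. This hand-rolled version is for illustration only; in real training code, always use the tokenizer's own tokenizer.apply_chat_template(), which knows each model family's exact template and special tokens:

```python
def to_chatml(messages):
    """Render a list of {'role', 'content'} dicts into ChatML, ending with
    an open assistant turn for the model to complete. Illustration only;
    prefer tokenizer.apply_chat_template() in practice."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
             for m in messages]
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

Getting the template wrong (missing end-of-turn tokens, wrong role markers) is one of the most common causes of broken fine-tunes, which is why delegating to the tokenizer is the safe default.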

Alignment

The process of making a model's behavior match human values and intentions. Alignment fine-tuning (DPO/RLHF) reduces harmful outputs, improves helpfulness, and enforces desired communication styles. Misaligned models may be technically capable but unsafe or unhelpful.

Catastrophic Forgetting

A risk where fine-tuning on new data causes the model to forget previously learned capabilities. Mitigated by:

  • Using LoRA/PEFT (base weights remain frozen)

  • Including diverse data in the fine-tuning set

  • Using lower learning rates

  • Regularization techniques (EWC, replay buffers)

Tokenization and Special Tokens

Each model uses a different tokenizer and different special tokens (<|eot_id|>, <|im_end|>, </s>, etc.). Always use the tokenizer that came with the model. When fine-tuning, ensure labels are set to -100 for prompt tokens so the model only learns to predict the response.
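The -100 masking rule looks like this in practice (the helper name and token ids here are made up for illustration; -100 is PyTorch's default ignore_index for cross-entropy loss):

```python
IGNORE_INDEX = -100  # tokens with this label are skipped by cross-entropy loss

def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids into labels, but mask the prompt portion so the loss
    is computed only on the response tokens."""
    return [IGNORE_INDEX] * prompt_len + input_ids[prompt_len:]

# Hypothetical token ids: 4 prompt tokens followed by 3 response tokens.
input_ids = [101, 2054, 2003, 1029, 7592, 2088, 102]
labels = mask_prompt_labels(input_ids, prompt_len=4)
print(labels)  # [-100, -100, -100, -100, 7592, 2088, 102]
```

Trainers like TRL's SFTTrainer can apply this masking for you (e.g., via completion-only collators), but it is worth understanding what happens underneath.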

Gradient Checkpointing

A memory optimization that recomputes activations during the backward pass instead of storing them. Reduces VRAM usage by ~40% at the cost of ~20% more compute. Almost always enabled during fine-tuning.

Flash Attention

An optimized attention implementation (FlashAttention-2) that dramatically reduces memory usage and increases speed for long sequences. Enable with attn_implementation="flash_attention_2" in Hugging Face models.

6. Hardware Requirements Guide

GPU Memory Requirements

Rough rule: Model memory in GB ≈ (parameters in billions × bytes per param)

Full precision (fp32):  7B × 4 bytes = 28GB  (rare for inference)
Half precision (bf16):  7B × 2 bytes = 14GB  (standard)
8-bit quantization:     7B × 1 byte  =  7GB
4-bit quantization:     7B × 0.5 byte = ~4GB  (+ overhead ≈ 6GB total)

For fine-tuning, add:
  + Gradients:  ~same as model size (for trainable params)
  + Optimizer:  ~2× trainable params (Adam states)
  + Activations: depends on batch size and sequence length
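The rough rule above is easy to turn into a back-of-envelope helper (the function name is ours; these are weight-memory estimates only, excluding gradients, optimizer states, and activations):

```python
def model_vram_gb(params_billion: float, bytes_per_param: float) -> float:
    """Back-of-envelope weight memory: parameters (in billions) multiplied
    by bytes per parameter gives GB, since 1B params * 1 byte ≈ 1 GB."""
    return params_billion * bytes_per_param

# Weight memory for a 7B model at different precisions.
for name, bytes_pp in [("fp32", 4), ("bf16", 2), ("int8", 1), ("4-bit", 0.5)]:
    print(f"7B model in {name:<5}: ~{model_vram_gb(7, bytes_pp):.1f} GB weights")
```

Remember this covers weights only; for training, the gradient, optimizer, and activation terms listed above come on top.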

Practical VRAM Guide

| VRAM | GPU Examples | What You Can Fine-tune |
|------|--------------|------------------------|
| 8GB | RTX 3070/3080 | QLoRA on 7B, LoRA on 1–3B |
| 16GB | A4000, T4 (Colab) | QLoRA on 7B–13B, LoRA on 7B |
| 24GB | RTX 3090/4090, A5000 | LoRA on 7B–13B, QLoRA on 30B |
| 40GB | A100-40GB | LoRA on 30B, QLoRA on 70B |
| 80GB | A100-80GB, H100 | LoRA on 70B, full fine-tune on 7B |
| 2×80GB | 2× H100 | Full fine-tune on 13B–30B |
| 8×80GB | 8× H100 | Full fine-tune on 70B |

Free and Low-Cost Options

| Platform | GPU / Cost | Time Limit | Best For |
|----------|------------|------------|----------|
| Google Colab (free) | T4 16GB | ~4h/session | Learning, small experiments |
| Google Colab Pro | A100 40GB | ~24h/session | Serious training |
| Kaggle Notebooks | T4 16GB ×2 | 30h/week | Medium experiments |
| Lambda Cloud | A100 from $1.10/h | No limit | Production training |
| Vast.ai | RTX 4090 from $0.30/h | No limit | Cost-effective training |
| RunPod | A100 from $1.50/h | No limit | Production training |

Memory Optimization Checklist

When you're running out of VRAM, try these in order:

  1. Reduce batch size (use gradient accumulation to maintain effective batch size)

  2. Enable gradient checkpointing (gradient_checkpointing=True)

  3. Use QLoRA instead of LoRA (4-bit quantization)

  4. Reduce sequence length (max_seq_length)

  5. Use Flash Attention 2 (attn_implementation="flash_attention_2")

  6. Use bf16=True instead of fp32

  7. Reduce LoRA rank (r=8 instead of r=64)

  8. Use paged AdamW optimizer (optim="paged_adamw_8bit")
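Step 1 of the checklist relies on a simple identity: gradient accumulation trades activation VRAM for extra forward/backward passes while keeping the effective batch size the same. A minimal sketch (the function name is ours):

```python
def effective_batch_size(per_device_batch: int,
                         grad_accum_steps: int,
                         num_gpus: int = 1) -> int:
    """Examples seen per optimizer update: gradients are accumulated over
    grad_accum_steps micro-batches on each GPU before a single step."""
    return per_device_batch * grad_accum_steps * num_gpus

# Halving the per-device batch and doubling accumulation keeps the
# effective batch (and thus the training dynamics) roughly unchanged,
# while roughly halving activation memory.
print(effective_batch_size(4, 4))  # 16
print(effective_batch_size(2, 8))  # 16
```

In Hugging Face trainers these map to per_device_train_batch_size and gradient_accumulation_steps, as in the hyperparameter defaults in section 9.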

7. Learning Path Through Phase 12

This module is structured as a progressive sequence. Complete notebooks in order:

Notebook Sequence

| # | Notebook | Topic | Prerequisites |
|---|----------|-------|---------------|
| 00 | 00_START_HERE.ipynb | Overview (this notebook) | Phases 1–11 |
| 01 | 01_dataset_preparation.ipynb | Data cleaning, formatting, splits | Basic Python, pandas |
| 02 | 02_supervised_finetuning.ipynb | Full SFT with TRL's SFTTrainer | 01 complete |
| 03 | 03_lora_basics.ipynb | LoRA theory and practice | 02 complete |
| 04 | 04_qlora_training.ipynb | QLoRA on consumer hardware | 03 complete |
| 05 | 05_dpo_alignment.ipynb | DPO preference learning | 02 complete |
| 06 | 06_evaluation.ipynb | Evaluating fine-tuned models | 02–04 complete |
| 07 | 07_deployment.ipynb | Merging, quantizing, serving | 06 complete |

Estimated Time Investment

Understanding (notebooks 00-01):    2–4 hours
First fine-tune (notebooks 02-03):  4–8 hours (includes training time)
QLoRA & DPO (notebooks 04-05):      4–6 hours
Evaluation & Deployment (06-07):    3–5 hours

Total: 13–23 hours depending on experiments

Key Libraries You Will Use

transformers   # Model loading, tokenization, training
peft           # LoRA, QLoRA, and other PEFT methods
trl            # SFTTrainer, DPOTrainer, PPO (built on top of transformers)
datasets       # Dataset loading, processing, formatting
bitsandbytes   # 4-bit and 8-bit quantization for QLoRA
accelerate     # Multi-GPU and mixed precision training
wandb          # Experiment tracking and visualization

8. Quick Environment Check

Before starting any fine-tuning work, verify that all required libraries are installed and your GPU (if any) has sufficient VRAM. The cells below check Python, PyTorch, transformers, PEFT, TRL, and related packages. They also report GPU specifications and recommend which fine-tuning methods your hardware supports. If you are on Google Colab or Kaggle, the free-tier T4 GPU (16GB) is sufficient for QLoRA on 7B models.

# Cell 1: Install core fine-tuning libraries
# Run this if you are in a fresh environment (e.g., Colab)

# !pip install -q transformers datasets peft trl bitsandbytes accelerate
# !pip install -q torch --index-url https://download.pytorch.org/whl/cu121

print("Uncomment the lines above if you need to install dependencies.")
print("For local environments, see the README for setup instructions.")
# Cell 2: Check library versions

import importlib
import sys

print(f"Python version: {sys.version.split()[0]}")
print()

libraries = [
    ("torch", "torch"),
    ("transformers", "transformers"),
    ("peft", "peft"),
    ("trl", "trl"),
    ("datasets", "datasets"),
    ("bitsandbytes", "bitsandbytes"),
    ("accelerate", "accelerate"),
    ("pandas", "pandas"),
    ("numpy", "numpy"),
]

print(f"{'Library':<20} {'Version':<15} {'Status'}")
print("-" * 50)

for lib_name, import_name in libraries:
    try:
        module = importlib.import_module(import_name)
        version = getattr(module, "__version__", "unknown")
        print(f"{lib_name:<20} {version:<15} OK")
    except ImportError:
        print(f"{lib_name:<20} {'NOT FOUND':<15} MISSING - run pip install {lib_name}")
# Cell 3: Check GPU availability and memory

import torch

print("=" * 50)
print("GPU / Hardware Report")
print("=" * 50)

# Check CUDA
cuda_available = torch.cuda.is_available()
print(f"CUDA available:     {cuda_available}")

if cuda_available:
    num_gpus = torch.cuda.device_count()
    print(f"Number of GPUs:     {num_gpus}")
    print()
    for i in range(num_gpus):
        props = torch.cuda.get_device_properties(i)
        total_mem = props.total_memory / 1024**3
        print(f"  GPU {i}: {props.name}")
        print(f"    Total VRAM:     {total_mem:.1f} GB")
        print(f"    CUDA Capability: {props.major}.{props.minor}")
        # Check what you can run
        if total_mem >= 40:
            print(f"    Can run:        LoRA on 30B, QLoRA on 70B")
        elif total_mem >= 24:
            print(f"    Can run:        LoRA on 7B-13B, QLoRA on 30B")
        elif total_mem >= 16:
            print(f"    Can run:        QLoRA on 7B-13B, LoRA on 7B")
        elif total_mem >= 8:
            print(f"    Can run:        QLoRA on 7B, LoRA on 1-3B")
        else:
            print(f"    Can run:        QLoRA on 1-3B only")
else:
    print()
    print("No GPU detected. You can still:")
    print("  - Run this notebook (CPU only)")
    print("  - Use Google Colab for GPU training (free T4)")
    print("  - Use Kaggle Notebooks (free T4 x2)")

# Check MPS (Apple Silicon)
if hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    print()
    print("Apple Silicon MPS: Available")
    print("  Note: MPS supports fine-tuning but bitsandbytes (QLoRA) is not supported.")
    print("  Use LoRA without quantization on Apple Silicon.")

print()
print("PyTorch version:", torch.__version__)
print("CUDA version:", torch.version.cuda if torch.version.cuda else "N/A")
# Cell 4: Quick sanity check - verify transformers and datasets work

from transformers import AutoTokenizer
from datasets import load_dataset
import warnings
warnings.filterwarnings("ignore")

print("Testing transformers tokenizer...")
try:
    # Load a tiny tokenizer (no model weights, just tokenizer files)
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    text = "Fine-tuning LLMs is powerful!"
    tokens = tokenizer(text, return_tensors="pt")
    print(f"  Input:  '{text}'")
    print(f"  Tokens: {tokens['input_ids'].tolist()[0]}")
    print(f"  Length: {tokens['input_ids'].shape[1]} tokens")
    print("  transformers: OK")
except Exception as e:
    print(f"  transformers: ERROR - {e}")

print()
print("Testing datasets library...")
try:
    # Load a tiny slice of a public dataset
    dataset = load_dataset("databricks/databricks-dolly-15k", split="train[:5]")
    print(f"  Loaded sample dataset: {len(dataset)} examples")
    print(f"  Columns: {dataset.column_names}")
    print(f"  First example category: {dataset[0]['category']}")
    print("  datasets: OK")
except Exception as e:
    print(f"  datasets: ERROR - {e}")
    print("  (This may be a network issue. Try again or check your connection.)")

print()
print("Environment check complete!")
# Cell 5: Check PEFT and TRL availability

print("Testing PEFT (LoRA library)...")
try:
    from peft import LoraConfig, TaskType, get_peft_model
    # Create a sample LoRA config
    config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        bias="none",
    )
    print(f"  LoRA config created: r={config.r}, alpha={config.lora_alpha}")
    print(f"  Target modules: {config.target_modules}")
    print("  peft: OK")
except Exception as e:
    print(f"  peft: ERROR - {e}")

print()
print("Testing TRL (training library)...")
try:
    from trl import SFTConfig, DPOConfig
    sft_config = SFTConfig(
        output_dir="/tmp/test",
        max_seq_length=512,
        num_train_epochs=1,
    )
    print(f"  SFTConfig created: max_seq_length={sft_config.max_seq_length}")
    print("  trl: OK")
except Exception as e:
    print(f"  trl: ERROR - {e}")

print()
print("All checks complete. You are ready to start fine-tuning!")

9. Fine-tuning Best Practices

Data Quality Rules

DO:

  • Start with 500–2000 high-quality examples and scale from there

  • Manually review at least 50–100 examples before training

  • Match your training data format exactly to how the model will be used

  • Include diverse examples covering edge cases

  • Use the model's own chat template (from the tokenizer)

  • Set labels to -100 for prompt tokens (only train on response tokens)

DON'T:

  • Use data scraped without cleaning (HTML tags, formatting artifacts, duplicates)

  • Mix response formats (some with markdown, some without)

  • Use extremely long or extremely short responses in the same batch

  • Include personally identifiable information (PII) in training data

  • Trust data quality without inspection

Training Hyperparameters

Safe starting defaults for QLoRA fine-tuning:

learning_rate = 2e-4          # Higher than full fine-tune is OK for LoRA
num_train_epochs = 2-3        # Rarely need more than 3 epochs
per_device_train_batch_size = 2  # Start small, increase if VRAM allows
gradient_accumulation_steps = 8  # Effective batch = 2 × 8 = 16
max_seq_length = 2048          # Match to your data; longer = more VRAM
lr_scheduler_type = "cosine"   # Or "linear"
warmup_ratio = 0.05            # 5% warmup steps
weight_decay = 0.01

Evaluation Tips

  • Always hold out 10–20% of data for validation

  • Monitor training loss AND validation loss (watch for overfitting)

  • Manually inspect model outputs, not just metrics

  • Use the same evaluation prompts throughout training for comparability

  • Compare against the base model and against GPT-4 if possible
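The first tip (holding out validation data) deserves care: split with a fixed seed so results are reproducible, and deduplicate first so near-identical examples don't leak across the boundary. A minimal sketch (the helper name is ours):

```python
import random

def train_val_split(examples, val_fraction=0.1, seed=42):
    """Shuffle with a fixed seed, then hold out the last val_fraction of
    examples for validation. Deduplicate before calling this so duplicates
    cannot leak between train and validation."""
    shuffled = examples[:]                      # don't mutate the caller's list
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[:-n_val], shuffled[-n_val:]

data = [{"id": i} for i in range(100)]
train, val = train_val_split(data, val_fraction=0.1)
print(len(train), len(val))  # 90 10
```

For Hugging Face datasets, the built-in dataset.train_test_split(test_size=..., seed=...) does the same job; notebook 01 covers it in detail.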

10. Key Takeaways

What You Learned in This Notebook

  1. Fine-tuning adapts pre-trained models to specific tasks without training from scratch, a massive cost and time saving.

  2. Choose your approach wisely: Prompt engineering first, RAG for knowledge-intensive tasks, fine-tuning for behavior/style/format at scale.

  3. LoRA and QLoRA have democratized fine-tuning: you can now fine-tune a 7B model on a single consumer GPU with QLoRA.

  4. SFT → DPO is the standard pipeline: First teach the model to follow instructions (SFT), then align it with preferences (DPO).

  5. Data quality > data quantity: 1000 excellent examples beat 100,000 mediocre ones.

  6. Top 2026 models to fine-tune: Qwen 2.5 (multilingual), Phi-4 (efficiency), Llama 3.3 (general), Gemma 2 (research), DeepSeek R1 (reasoning).

Next Steps

You are ready to move on to the next notebook:

Next: 01_dataset_preparation.ipynb

You will learn how to:

  • Find and load datasets from Hugging Face Hub

  • Clean and filter data for quality

  • Format datasets for different model families

  • Create your own custom fine-tuning datasets

  • Split data into train/validation sets correctly

Zero-to-AI Curriculum – Phase 12: LLM Fine-tuning
Last updated: February 2026