Phase 12: LLM Fine-tuning – Start Here
Welcome to Phase 12 of the Zero-to-AI curriculum. This module covers LLM fine-tuning: one of the most powerful techniques for adapting large language models to specific tasks, domains, and behaviors.
By the end of this phase, you will be able to:
Decide when fine-tuning is the right approach vs. prompt engineering or RAG
Understand the full landscape of fine-tuning methods (SFT, LoRA, QLoRA, DPO, RLHF)
Prepare high-quality datasets for fine-tuning
Train a fine-tuned model using efficient techniques like LoRA and QLoRA
Evaluate and deploy your fine-tuned models
Prerequisites: Python, basic ML concepts, familiarity with Hugging Face transformers. Complete Phases 1–11 first.
1. What is LLM Fine-tuning?
Pre-training is the process of training a model on massive amounts of text data to learn general language patterns. This is what creates models like GPT-4, Llama 3.3, and Qwen 2.5. It costs millions of dollars and requires thousands of GPUs.
Fine-tuning starts from a pre-trained model and continues training on a smaller, task-specific dataset. This allows the model to:
Learn specialized domain knowledge (medical, legal, finance)
Follow a specific format or style consistently
Adopt a particular persona or tone
Refuse harmful requests (alignment/safety fine-tuning)
Improve performance on a narrow task (code generation, SQL, etc.)
The Two-Stage Mental Model
Stage 1: Pre-training (done for you)
Massive web data → Foundation Model
Cost: $1M–$100M+, weeks of compute
Stage 2: Fine-tuning (what you do)
Your labeled data → Specialized Model
Cost: $10–$1000, hours of compute
Types of Fine-tuning

| Type | What it does | Common Use Case |
|---|---|---|
| Instruction Fine-tuning (SFT) | Teaches the model to follow instructions | General-purpose assistants |
| Domain Adaptation | Injects domain-specific knowledge | Medical/legal/finance models |
| Task-specific Fine-tuning | Optimizes for one task | SQL generation, classification |
| Alignment (DPO/RLHF) | Teaches human preferences | Safety, tone, helpfulness |
| Continual Pre-training | Adds new knowledge to base weights | Keeping models current |
2. When to Fine-tune vs. Other Approaches
Fine-tuning is powerful but not always the right tool. Use this decision framework:
The Three Approaches Compared

| Approach | Setup Cost | Latency | Privacy | Consistency | Best When |
|---|---|---|---|---|---|
| Prompt Engineering | Very low | Low | Medium | Medium | Exploring capability, prototyping |
| RAG | Medium | Medium | High | High | Factual Q&A, up-to-date info needed |
| Fine-tuning | High | Low | Very High | Very High | Style/format/behavior, no internet access, high volume |
Decision Tree

START
 │
 ├── Can a good prompt solve this? ──YES──> Use Prompt Engineering
 │        │
 │        NO
 │        ▼
 ├── Does the task require up-to-date or private documents? ──YES──> Use RAG
 │        │
 │        NO
 │        ▼
 ├── Do you need consistent style/format/behavior at scale? ──YES──┐
 │        │                                                        │
 │        NO                                                       ▼
 │        ▼                                                  FINE-TUNING
 ├── Is data privacy critical (can't send data to API)? ──YES──────┘
 │        │
 │        NO
 │        ▼
 └── Is inference cost at scale a major concern? ──YES──> FINE-TUNING (smaller model)
          │
          NO
          ▼
          └──> Try Prompt Engineering or RAG first
Green Flags for Fine-tuning
You have 500+ high-quality labeled examples
You need a very specific output format every time
Prompt engineering gives inconsistent results
You need to run the model on private/sensitive data
You need to reduce inference costs (smaller fine-tuned model > large prompted model)
You need a specific persona or communication style consistently
Red Flags (Don't Fine-tune Yet)
You have fewer than 100 examples
You haven't tried few-shot prompting first
The task requires real-time knowledge (use RAG)
You need to update behavior frequently (retraining is expensive)
The base model already does the task well with prompting
3. Fine-tuning Methods: The Full Landscape
3.1 Supervised Fine-tuning (SFT)
The most basic form of fine-tuning. You provide (input, output) pairs and train the model to predict the output given the input using standard cross-entropy loss.
How it works:
Prepare a dataset of instruction-response pairs
Train the model with teacher forcing (predict next token given previous tokens)
All model weights are updated (full fine-tuning) OR only a small adapter (PEFT)
When to use: Starting point for any fine-tuning project. Always do SFT before DPO/RLHF.
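The "teacher forcing with cross-entropy loss" objective in step 2 can be made concrete with a toy sketch. The per-position next-token distributions below are illustrative, not from a real model:

```python
import math

def next_token_nll(step_probs, targets):
    """Mean negative log-likelihood of the target tokens under the model's
    predicted next-token distributions (teacher forcing: each step is
    conditioned on the *true* previous tokens, not the model's own samples)."""
    losses = [-math.log(probs[tok]) for probs, tok in zip(step_probs, targets)]
    return sum(losses) / len(losses)

# Toy example: two prediction steps over a tiny vocabulary
step_probs = [
    {"the": 0.5, "a": 0.5},      # distribution after the prompt
    {"cat": 0.25, "dog": 0.75},  # distribution after "the"
]
print(round(next_token_nll(step_probs, ["the", "dog"]), 4))  # 0.4904
```

SFT minimizes exactly this quantity, averaged over the response tokens of every training example.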
3.2 LoRA (Low-Rank Adaptation)
Instead of updating all model weights (which would require 14GB+ for a 7B model), LoRA freezes the original weights and adds small trainable rank-decomposition matrices alongside them.
Key insight: Weight updates during fine-tuning have a low intrinsic rank. So instead of learning a full update to a matrix W (e.g., 4096×4096 = 16M parameters), you learn two small matrices A (4096×r) and B (r×4096), where r is typically 8–64.
Original: W (frozen)
LoRA adds: W + BA (only A and B are trained)
Trainable params: 2 × d × r vs d × d
Example: r=16, d=4096 → 131K vs 16M (99% reduction!)
Key hyperparameters:
r (rank): 8–64; higher = more expressive but more memory
lora_alpha: scaling factor, often set to r or 2r
target_modules: which layers to apply LoRA to (attention Q, K, V, O matrices)
lora_dropout: regularization, typically 0.05–0.1
When to use: GPU with 16GB+ VRAM, 7B–13B models.
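The parameter arithmetic above is easy to verify directly; a small sketch using the same numbers (d=4096, r=16):

```python
def lora_trainable_params(d: int, r: int) -> int:
    """Parameters in the two low-rank factors A (d x r) and B (r x d)."""
    return 2 * d * r

d, r = 4096, 16
full = d * d                          # full weight update: 16,777,216 params
lora = lora_trainable_params(d, r)    # 131,072 params
print(f"{lora:,} vs {full:,} -> {1 - lora / full:.1%} reduction")
# 131,072 vs 16,777,216 -> 99.2% reduction
```

Note that r trades expressiveness against memory linearly: doubling r doubles the trainable parameters.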
3.3 QLoRA (Quantized LoRA)
QLoRA combines LoRA with 4-bit quantization of the base model weights. This dramatically reduces VRAM requirements:
Full fine-tune 7B model: ~56GB VRAM
LoRA 7B model: ~28GB VRAM
QLoRA 7B model: ~6–8GB VRAM → fits on a single consumer GPU!
How it works:
Load the base model in 4-bit (NF4 quantization via bitsandbytes)
Add LoRA adapters in full precision (16-bit)
Compute gradients in 16-bit but keep base weights in 4-bit
Quality trade-off: Typically a 1–3% performance drop vs. full fine-tuning. Often acceptable for the 8× memory savings.
When to use: Consumer GPU (RTX 3090/4090), Colab, Kaggle. 7B–70B models.
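Putting the three steps together, a typical QLoRA setup with transformers, bitsandbytes, and peft looks roughly like this. This is a sketch, assuming a CUDA GPU and those libraries installed; the model ID and LoRA settings are illustrative defaults, not the only valid choices:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Step 1: load the frozen base model in 4-bit NF4
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # 16-bit compute over 4-bit weights
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Step 2: add 16-bit LoRA adapters on top of the quantized weights
lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters are trainable
```

Step 3 (16-bit gradients over 4-bit weights) then happens automatically inside the trainer; notebook 04 walks through the full training loop.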
3.4 DPO (Direct Preference Optimization)
After SFT, the model can generate responses but may not know which responses humans prefer. DPO trains on preference data (chosen vs. rejected response pairs) to align the model with human preferences.
Dataset format:
{
"prompt": "Explain quantum entanglement",
"chosen": "Quantum entanglement is a phenomenon where...",
"rejected": "idk lol its complicated science stuff"
}
Why DPO over RLHF: DPO is simpler (no reward model, no PPO optimizer), more stable, and achieves competitive results. It's the current industry standard for alignment fine-tuning.
When to use: After SFT, when you want to improve helpfulness, reduce harmful outputs, or enforce a specific tone/style.
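Under the hood, DPO's per-example loss needs only four sequence log-probabilities: the trained policy and a frozen reference model, each scored on the chosen and rejected responses. A minimal sketch of the formula (the log-probability values below are illustrative):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * margin), where the margin measures how much the
    policy has shifted toward 'chosen' over 'rejected', relative to the
    frozen reference model."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Before training, policy == reference, so margin = 0 and loss = ln 2
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # 0.6931
# As the policy favors the chosen response more, the loss falls
print(round(dpo_loss(-8.0, -14.0, -10.0, -12.0), 4))
```

beta plays the role of the KL penalty in RLHF: larger values keep the policy closer to the reference.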
3.5 RLHF (Reinforcement Learning from Human Feedback)
The original alignment technique used by OpenAI for InstructGPT/ChatGPT. More complex than DPO but more flexible:
Train a reward model on preference pairs (predicts which response humans prefer)
Run PPO (Proximal Policy Optimization) to maximize the reward modelโs score
KL penalty prevents the model from drifting too far from the SFT baseline
When to use: Large-scale alignment projects, when you need a reusable reward model, research settings.
Method Comparison Summary

| Method | VRAM (7B) | Data Needed | Complexity | Quality |
|---|---|---|---|---|
| Full SFT | 56GB | 1K–100K | Low | Best |
| LoRA | 28GB | 1K–100K | Low | Great |
| QLoRA | 6–8GB | 1K–100K | Low | Good |
| DPO | 28GB (LoRA) | 500–10K pairs | Medium | Great |
| RLHF | 80GB+ | 10K+ pairs + reward model | High | Great |
4. Key Concepts Reference
PEFT (Parameter-Efficient Fine-Tuning)
An umbrella term for techniques that update only a small fraction of model parameters instead of all weights. LoRA, QLoRA, prefix tuning, and prompt tuning all fall under PEFT. The peft library from Hugging Face is the standard implementation.
Adapters
Small trainable modules inserted into a frozen pre-trained model. LoRA is the most popular adapter approach. After training, adapters can be merged back into base weights (zero inference overhead) or kept separate (swap adapters at runtime for different tasks).
Instruction Tuning
Fine-tuning a model to follow natural language instructions. The dataset consists of (instruction, response) pairs, often augmented with a system prompt. This is what turns a raw base model (next-token predictor) into an assistant model (instruction follower). Examples: Alpaca, FLAN, Dolly.
Chat Templates
Standardized formats that structure multi-turn conversations. Each model family has its own template:
# Llama 3.x
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
Hello!<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
# ChatML (Qwen, etc.)
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
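In practice you should always use `tokenizer.apply_chat_template(...)` from the model's own tokenizer rather than hand-building strings, but a toy formatter for the ChatML layout above makes the structure explicit (a sketch for illustration only):

```python
def to_chatml(messages):
    """Render a message list into the ChatML layout, ending with an open
    assistant turn ready for generation."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

Fine-tuning with the wrong template (or the wrong special tokens) is one of the most common silent failure modes, which is why each model family ships its template with its tokenizer.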
Alignment
The process of making a model's behavior match human values and intentions. Alignment fine-tuning (DPO/RLHF) reduces harmful outputs, improves helpfulness, and enforces desired communication styles. Misaligned models may be technically capable but unsafe or unhelpful.
Catastrophic Forgetting
A risk where fine-tuning on new data causes the model to forget previously learned capabilities. Mitigated by:
Using LoRA/PEFT (base weights remain frozen)
Including diverse data in the fine-tuning set
Using lower learning rates
Regularization techniques (EWC, replay buffers)
Tokenization and Special Tokens
Each model uses a different tokenizer and different special tokens (<|eot_id|>, <|im_end|>, </s>, etc.). Always use the tokenizer that came with the model. When fine-tuning, ensure labels are set to -100 for prompt tokens so the model only learns to predict the response.
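The -100 masking can be sketched in a few lines. This is a toy illustration with made-up token ids; in real pipelines the data collator (e.g. TRL's completion-only masking) does this for you:

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def mask_prompt_labels(input_ids, prompt_len):
    """Labels for causal-LM training: ignore the prompt, learn the response."""
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])

# Toy token ids: first 3 tokens are the prompt, the rest are the response
print(mask_prompt_labels([101, 7592, 2088, 345, 678, 9], prompt_len=3))
# [-100, -100, -100, 345, 678, 9]
```

Without this mask the model also trains on predicting its own prompts, which wastes capacity and can degrade instruction following.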
Gradient Checkpointing
A memory optimization that recomputes activations during the backward pass instead of storing them. Reduces VRAM usage by ~40% at the cost of ~20% more compute. Almost always enabled during fine-tuning.
Flash Attention
An optimized attention implementation (FlashAttention-2) that dramatically reduces memory usage and increases speed for long sequences. Enable with attn_implementation="flash_attention_2" in Hugging Face models.
5. 2026 Recommended Models
These are the top open-source models to fine-tune in 2026, balancing capability, efficiency, and community support.
Top Picks by Use Case
| Model | Size Options | VRAM (QLoRA) | Strengths | HuggingFace ID |
|---|---|---|---|---|
| Qwen 2.5 | 0.5B–72B | 4–40GB | Multilingual, coding, math, long context (128K) | |
| Llama 3.3 | 70B | 35GB | Strong reasoning, English, open license | |
| Phi-4 | 14B | 10GB | Exceptional quality per parameter, great for fine-tuning | |
| Gemma 2 | 2B, 9B, 27B | 2–20GB | Google-quality, efficient, research-friendly | |
| DeepSeek R1 | 1.5B–70B | 2–40GB | Reasoning/CoT specialist, math, code | |
| Mistral | 7B, 22B | 6–18GB | Fast, efficient, strong baseline | |
Recommendations by Hardware
Free tier (Colab T4, 16GB): Qwen 2.5-7B, Phi-4-14B (with QLoRA), Gemma 2-9B
Consumer GPU (RTX 4090, 24GB): Any 7B–14B with LoRA, 30B+ with QLoRA
Professional GPU (A100 40GB): 70B with QLoRA, 30B with full LoRA
Multi-GPU: 70B full fine-tune, 405B+ with FSDP/DeepSpeed
Model Sizes and What They're Good For
1B–3B: Fast inference, edge/mobile, simple tasks, very cheap to fine-tune
7B–9B: Sweet spot! Good quality, affordable fine-tuning, 1 consumer GPU
13B–14B: Great quality, still fine-tunable on 1 GPU with QLoRA
30B–35B: Near-frontier quality, needs multi-GPU or professional GPU
70B+: Frontier quality, expensive, use LoRA to make feasible
Recommended starting point: Qwen/Qwen2.5-7B-Instruct: excellent quality, multilingual, long context, and very fine-tuning-friendly.
6. Hardware Requirements Guide
GPU Memory Requirements
Rough rule: Model memory in GB ≈ parameters in billions × bytes per param
Full precision (fp32): 7B × 4 bytes = 28GB (rare for inference)
Half precision (bf16): 7B × 2 bytes = 14GB (standard)
8-bit quantization: 7B × 1 byte = 7GB
4-bit quantization: 7B × 0.5 byte = ~4GB (+ overhead ≈ 6GB total)
For fine-tuning, add:
+ Gradients: ~same as model size (for trainable params)
+ Optimizer states: ~2× trainable params (Adam moments)
+ Activations: depends on batch size and sequence length
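The rule of thumb above can be wrapped in a tiny estimator. These numbers cover only the weights; real usage adds framework overhead, gradients, optimizer states, and activations as listed:

```python
def weights_vram_gb(params_billions, bytes_per_param):
    """Rough VRAM (GB) needed just to hold the model weights."""
    return params_billions * bytes_per_param

# The four precisions from the rule above, for a 7B model
for name, nbytes in [("fp32", 4), ("bf16", 2), ("int8", 1), ("4-bit", 0.5)]:
    print(f"7B in {name}: ~{weights_vram_gb(7, nbytes):g} GB")
```

For a quick sanity check before renting a GPU: weights plus fine-tuning overhead should fit in roughly 80% of the card's VRAM, leaving headroom for activations.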
Practical VRAM Guide

| VRAM | GPU Examples | What You Can Fine-tune |
|---|---|---|
| 8GB | RTX 3070/3080, T4 (Colab) | QLoRA on 7B, LoRA on 1–3B |
| 16GB | RTX 3080 Ti, A4000, T4 (Colab Pro) | QLoRA on 7B–13B, LoRA on 7B |
| 24GB | RTX 3090/4090, A5000 | LoRA on 7B–13B, QLoRA on 30B |
| 40GB | A100-40GB | LoRA on 30B, QLoRA on 70B |
| 80GB | A100-80GB, H100 | LoRA on 70B, full fine-tune on 7B |
| 2×80GB | 2× H100 | Full fine-tune on 13B–30B |
| 8×80GB | 8× H100 | Full fine-tune on 70B |
Free and Low-Cost Options

| Platform | Free VRAM | Time Limit | Best For |
|---|---|---|---|
| Google Colab (free) | T4 16GB | ~4h/session | Learning, small experiments |
| Google Colab Pro | A100 40GB | ~24h/session | Serious training |
| Kaggle Notebooks | T4 16GB ×2 | 30h/week | Medium experiments |
| Lambda Cloud | A100 from $1.10/h | No limit | Production training |
| | RTX 4090 from $0.30/h | No limit | Cost-effective training |
| RunPod | A100 from $1.50/h | No limit | Production training |
Memory Optimization Checklist
When you're running out of VRAM, try these in order:
1. Reduce batch size (use gradient accumulation to maintain effective batch size)
2. Enable gradient checkpointing (gradient_checkpointing=True)
3. Use QLoRA instead of LoRA (4-bit quantization)
4. Reduce sequence length (max_seq_length)
5. Use Flash Attention 2 (attn_implementation="flash_attention_2")
6. Use bf16=True instead of fp32
7. Reduce LoRA rank (r=8 instead of r=64)
8. Use the paged AdamW optimizer (optim="paged_adamw_8bit")
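As a concrete starting point, several checklist items map onto training arguments roughly like this. This is a sketch: the keyword names follow common transformers/TRL conventions, so verify them against your installed versions; the values are illustrative:

```python
# Low-memory settings corresponding to the checklist above
low_memory_kwargs = dict(
    per_device_train_batch_size=1,   # smallest batch...
    gradient_accumulation_steps=16,  # ...accumulated to an effective batch of 16
    gradient_checkpointing=True,     # recompute activations in backward pass
    max_seq_length=1024,             # shorter sequences, less activation memory
    bf16=True,                       # half precision instead of fp32
    optim="paged_adamw_8bit",        # paged 8-bit optimizer states
)
# QLoRA, Flash Attention, and LoRA rank are configured on the model /
# LoraConfig side rather than in the trainer arguments.
effective_batch = (low_memory_kwargs["per_device_train_batch_size"]
                   * low_memory_kwargs["gradient_accumulation_steps"])
print(effective_batch)  # 16
```

Gradient accumulation is the key lever here: it trades wall-clock time for memory while keeping the effective batch size (and therefore training dynamics) unchanged.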
7. Learning Path Through Phase 12
This module is structured as a progressive sequence. Complete notebooks in order:
Notebook Sequence

| # | Notebook | Topic | Prerequisites |
|---|---|---|---|
| 00 | | Overview (this notebook) | Phases 1–11 |
| 01 | | Data cleaning, formatting, splits | Basic Python, pandas |
| 02 | | Full SFT with TRL's SFTTrainer | 01 complete |
| 03 | | LoRA theory and practice | 02 complete |
| 04 | | QLoRA on consumer hardware | 03 complete |
| 05 | | DPO preference learning | 02 complete |
| 06 | | Evaluating fine-tuned models | 02–04 complete |
| 07 | | Merging, quantizing, serving | 06 complete |
Estimated Time Investment
Understanding (notebooks 00–01): 2–4 hours
First fine-tune (notebooks 02–03): 4–8 hours (includes training time)
QLoRA & DPO (notebooks 04–05): 4–6 hours
Evaluation & Deployment (06–07): 3–5 hours
Total: 13–23 hours depending on experiments
Key Libraries You Will Use
transformers # Model loading, tokenization, training
peft # LoRA, QLoRA, and other PEFT methods
trl # SFTTrainer, DPOTrainer, PPO (built on top of transformers)
datasets # Dataset loading, processing, formatting
bitsandbytes # 4-bit and 8-bit quantization for QLoRA
accelerate # Multi-GPU and mixed precision training
wandb # Experiment tracking and visualization
8. Quick Environment Check
Before starting any fine-tuning work, verify that all required libraries are installed and your GPU (if any) has sufficient VRAM. The cells below check Python, PyTorch, transformers, PEFT, TRL, and related packages. They also report GPU specifications and recommend which fine-tuning methods your hardware supports. If you are on Google Colab or Kaggle, the free-tier T4 GPU (16GB) is sufficient for QLoRA on 7B models.
# Cell 1: Install core fine-tuning libraries
# Run this if you are in a fresh environment (e.g., Colab)
# !pip install -q transformers datasets peft trl bitsandbytes accelerate
# !pip install -q torch --index-url https://download.pytorch.org/whl/cu121
print("Uncomment the lines above if you need to install dependencies.")
print("For local environments, see the README for setup instructions.")
# Cell 2: Check library versions
import importlib
import sys
print(f"Python version: {sys.version.split()[0]}")
print()
libraries = [
("torch", "torch"),
("transformers", "transformers"),
("peft", "peft"),
("trl", "trl"),
("datasets", "datasets"),
("bitsandbytes", "bitsandbytes"),
("accelerate", "accelerate"),
("pandas", "pandas"),
("numpy", "numpy"),
]
print(f"{'Library':<20} {'Version':<15} {'Status'}")
print("-" * 50)
for lib_name, import_name in libraries:
try:
module = importlib.import_module(import_name)
version = getattr(module, "__version__", "unknown")
print(f"{lib_name:<20} {version:<15} OK")
except ImportError:
print(f"{lib_name:<20} {'NOT FOUND':<15} MISSING - run pip install {lib_name}")
# Cell 3: Check GPU availability and memory
import torch
print("=" * 50)
print("GPU / Hardware Report")
print("=" * 50)
# Check CUDA
cuda_available = torch.cuda.is_available()
print(f"CUDA available: {cuda_available}")
if cuda_available:
num_gpus = torch.cuda.device_count()
print(f"Number of GPUs: {num_gpus}")
print()
for i in range(num_gpus):
props = torch.cuda.get_device_properties(i)
total_mem = props.total_memory / 1024**3
print(f" GPU {i}: {props.name}")
print(f" Total VRAM: {total_mem:.1f} GB")
print(f" CUDA Capability: {props.major}.{props.minor}")
# Check what you can run
if total_mem >= 40:
print(f" Can run: LoRA on 30B, QLoRA on 70B")
elif total_mem >= 24:
print(f" Can run: LoRA on 7B-13B, QLoRA on 30B")
elif total_mem >= 16:
print(f" Can run: QLoRA on 7B-13B, LoRA on 7B")
elif total_mem >= 8:
print(f" Can run: QLoRA on 7B, LoRA on 1-3B")
else:
print(f" Can run: QLoRA on 1-3B only")
else:
print()
print("No GPU detected. You can still:")
print(" - Run this notebook (CPU only)")
print(" - Use Google Colab for GPU training (free T4)")
print(" - Use Kaggle Notebooks (free T4 x2)")
# Check MPS (Apple Silicon)
if hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
print()
print("Apple Silicon MPS: Available")
print(" Note: MPS supports fine-tuning but bitsandbytes (QLoRA) is not supported.")
print(" Use LoRA without quantization on Apple Silicon.")
print()
print("PyTorch version:", torch.__version__)
print("CUDA version:", torch.version.cuda if torch.version.cuda else "N/A")
# Cell 4: Quick sanity check - verify transformers and datasets work
from transformers import AutoTokenizer
from datasets import load_dataset
import warnings
warnings.filterwarnings("ignore")
print("Testing transformers tokenizer...")
try:
# Load a tiny tokenizer (no model weights, just tokenizer files)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "Fine-tuning LLMs is powerful!"
tokens = tokenizer(text, return_tensors="pt")
print(f" Input: '{text}'")
print(f" Tokens: {tokens['input_ids'].tolist()[0]}")
print(f" Length: {tokens['input_ids'].shape[1]} tokens")
print(" transformers: OK")
except Exception as e:
print(f" transformers: ERROR - {e}")
print()
print("Testing datasets library...")
try:
# Load a tiny slice of a public dataset
dataset = load_dataset("databricks/databricks-dolly-15k", split="train[:5]")
print(f" Loaded sample dataset: {len(dataset)} examples")
print(f" Columns: {dataset.column_names}")
print(f" First example category: {dataset[0]['category']}")
print(" datasets: OK")
except Exception as e:
print(f" datasets: ERROR - {e}")
print(" (This may be a network issue. Try again or check your connection.)")
print()
print("Environment check complete!")
# Cell 5: Check PEFT and TRL availability
print("Testing PEFT (LoRA library)...")
try:
from peft import LoraConfig, TaskType, get_peft_model
# Create a sample LoRA config
config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
bias="none",
)
print(f" LoRA config created: r={config.r}, alpha={config.lora_alpha}")
print(f" Target modules: {config.target_modules}")
print(" peft: OK")
except Exception as e:
print(f" peft: ERROR - {e}")
print()
print("Testing TRL (training library)...")
try:
from trl import SFTConfig, DPOConfig
sft_config = SFTConfig(
output_dir="/tmp/test",
max_seq_length=512,
num_train_epochs=1,
)
print(f" SFTConfig created: max_seq_length={sft_config.max_seq_length}")
print(" trl: OK")
except Exception as e:
print(f" trl: ERROR - {e}")
print()
print("All checks complete. You are ready to start fine-tuning!")
9. Fine-tuning Best Practices
Data Quality Rules
DO:
Start with 500–2000 high-quality examples and scale from there
Manually review at least 50–100 examples before training
Match your training data format exactly to how the model will be used
Include diverse examples covering edge cases
Use the model's own chat template (from the tokenizer)
Set labels to -100 for prompt tokens (only train on response tokens)
DON'T:
Use data scraped without cleaning (HTML tags, formatting artifacts, duplicates)
Mix response formats (some with markdown, some without)
Use extremely long or extremely short responses in the same batch
Include personally identifiable information (PII) in training data
Trust data quality without inspection
Training Hyperparameters
Safe starting defaults for QLoRA fine-tuning:
learning_rate = 2e-4               # Higher than full fine-tune is OK for LoRA
num_train_epochs = 3               # 2-3 epochs; rarely need more
per_device_train_batch_size = 2    # Start small, increase if VRAM allows
gradient_accumulation_steps = 8    # Effective batch = 2 x 8 = 16
max_seq_length = 2048              # Match to your data; longer = more VRAM
lr_scheduler_type = "cosine"       # Or "linear"
warmup_ratio = 0.05                # 5% warmup steps
weight_decay = 0.01
Evaluation Tips
Always hold out 10–20% of data for validation
Monitor training loss AND validation loss (watch for overfitting)
Manually inspect model outputs, not just metrics
Use the same evaluation prompts throughout training for comparability
Compare against the base model and against GPT-4 if possible
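A held-out split is easy to get wrong, for example by shuffling differently on each run so the validation set leaks into training. A minimal deterministic sketch (the function name and fractions are illustrative):

```python
import random

def train_val_split(examples, val_fraction=0.1, seed=42):
    """Deterministic shuffle-and-split so the validation set stays the
    same across runs and experiments."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    n_val = max(1, int(len(indices) * val_fraction))
    val = [examples[i] for i in indices[:n_val]]
    train = [examples[i] for i in indices[n_val:]]
    return train, val

train, val = train_val_split([f"example-{i}" for i in range(100)])
print(len(train), len(val))  # 90 10
```

Fixing the seed keeps validation loss comparable across hyperparameter runs; `datasets.Dataset.train_test_split` offers the same guarantee via its `seed` argument.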
10. Key Takeaways
What You Learned in This Notebook
Fine-tuning adapts pre-trained models to specific tasks without training from scratch, a massive cost and time saving.
Choose your approach wisely: prompt engineering first, RAG for knowledge-intensive tasks, fine-tuning for behavior/style/format at scale.
LoRA and QLoRA have democratized fine-tuning: you can now fine-tune a 7B model on a single consumer GPU with QLoRA.
SFT → DPO is the standard pipeline: first teach the model to follow instructions (SFT), then align it with preferences (DPO).
Data quality > data quantity: 1000 excellent examples beat 100,000 mediocre ones.
Top 2026 models to fine-tune: Qwen 2.5 (multilingual), Phi-4 (efficiency), Llama 3.3 (general), Gemma 2 (research), DeepSeek R1 (reasoning).
Next Steps
You are ready to move on to the next notebook:
Next: 01_dataset_preparation.ipynb
You will learn how to:
Find and load datasets from Hugging Face Hub
Clean and filter data for quality
Format datasets for different model families
Create your own custom fine-tuning datasets
Split data into train/validation sets correctly
Zero-to-AI Curriculum – Phase 12: LLM Fine-tuning
Last updated: February 2026