Phase 12: LLM Fine-tuning – Start Here

Welcome to Phase 12 of the Zero-to-AI curriculum. This module covers LLM Fine-tuning, one of the most powerful techniques for adapting large language models to specific tasks, domains, and behaviors.

By the end of this phase, you will be able to:

  • Decide when fine-tuning is the right approach vs. prompt engineering or RAG

  • Understand the full landscape of fine-tuning methods (SFT, LoRA, QLoRA, DPO, RLHF)

  • Prepare high-quality datasets for fine-tuning

  • Train a fine-tuned model using efficient techniques like LoRA and QLoRA

  • Evaluate and deploy your fine-tuned models

Prerequisites: Python, basic ML concepts, familiarity with Hugging Face transformers. Complete Phases 1–11 first.

1. What is LLM Fine-tuning?

Pre-training is the process of training a model on massive amounts of text data to learn general language patterns. This is what creates models like GPT-4, Llama 3.3, and Qwen 2.5. It costs millions of dollars and requires thousands of GPUs.

Fine-tuning starts from a pre-trained model and continues training on a smaller, task-specific dataset. This allows the model to:

  • Learn specialized domain knowledge (medical, legal, finance)

  • Follow a specific format or style consistently

  • Adopt a particular persona or tone

  • Refuse harmful requests (alignment/safety fine-tuning)

  • Improve performance on a narrow task (code generation, SQL, etc.)

The Two-Stage Mental Model

Stage 1: Pre-training (done for you)
  Massive web data → Foundation Model
  Cost: $1M–$100M+, weeks of compute

Stage 2: Fine-tuning (what you do)
  Your labeled data → Specialized Model
  Cost: $10–$1000, hours of compute

Types of Fine-tuning

| Type | What it does | Common Use Case |
|------|--------------|-----------------|
| Instruction Fine-tuning (SFT) | Teaches the model to follow instructions | General-purpose assistants |
| Domain Adaptation | Injects domain-specific knowledge | Medical/legal/finance models |
| Task-specific Fine-tuning | Optimizes for one task | SQL generation, classification |
| Alignment (DPO/RLHF) | Teaches human preferences | Safety, tone, helpfulness |
| Continual Pre-training | Adds new knowledge to base weights | Keeping models current |

2. When to Fine-tune vs. Other Approaches

Fine-tuning is powerful but not always the right tool. Use this decision framework:

The Three Approaches Compared

| Approach | Setup Cost | Latency | Privacy | Consistency | Best When |
|----------|------------|---------|---------|-------------|-----------|
| Prompt Engineering | Very low | Low | Medium | Medium | Exploring capability, prototyping |
| RAG | Medium | Medium | High | High | Factual Q&A, up-to-date info needed |
| Fine-tuning | High | Low | Very High | Very High | Style/format/behavior, no internet access, high volume |

Decision Tree

START
│
├── Can a good prompt solve this? ──YES──> Use Prompt Engineering
│         │
│         NO
│         │
├── Does the task require up-to-date or private documents? ──YES──> Use RAG
│         │
│         NO
│         │
├── Do you need consistent style/format/behavior at scale? ──YES──┐
│         │                                                        │
│         NO                                                       ▼
│         │                                               FINE-TUNING
├── Is data privacy critical (can't send data to API)? ──YES──────┘
│         │
│         NO
│         │
└── Is inference cost at scale a major concern? ──YES──> FINE-TUNING (smaller model)
          │
          NO
          │
          └──> Try Prompt Engineering or RAG first

Green Flags for Fine-tuning

  • You have 500+ high-quality labeled examples

  • You need a very specific output format every time

  • Prompt engineering gives inconsistent results

  • You need to run the model on private/sensitive data

  • You need to reduce inference costs (smaller fine-tuned model > large prompted model)

  • You need a specific persona or communication style consistently

Red Flags (Don't Fine-tune Yet)

  • You have fewer than 100 examples

  • You haven't tried few-shot prompting first

  • The task requires real-time knowledge (use RAG)

  • You need to update behavior frequently (retraining is expensive)

  • The base model already does the task well with prompting

3. Fine-tuning Methods: The Full Landscape

3.1 Supervised Fine-tuning (SFT)

The most basic form of fine-tuning. You provide (input, output) pairs and train the model to predict the output given the input using standard cross-entropy loss.

How it works:

  • Prepare a dataset of instruction-response pairs

  • Train the model with teacher forcing (predict next token given previous tokens)

  • All model weights are updated (full fine-tuning) OR only a small adapter (PEFT)

When to use: Starting point for any fine-tuning project. Always do SFT before DPO/RLHF.

3.2 LoRA (Low-Rank Adaptation)

Instead of updating all model weights (which for a 7B model requires roughly 56GB of VRAM once gradients and optimizer states are included), LoRA freezes the original weights and adds small trainable rank-decomposition matrices alongside them.

Key insight: Weight updates during fine-tuning have a low intrinsic rank. So instead of updating a full matrix W (e.g., 4096×4096 ≈ 16M parameters), you learn two small matrices B (4096×r) and A (r×4096), where r is typically 8–64, and use their product BA as the update.

Original: W (frozen)
LoRA adds: W + BA  (only A and B are trained)
Trainable params: 2 × d × r  vs  d × d
Example: r=16, d=4096 → 131K vs 16M (99% reduction!)
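The arithmetic in the block above can be checked with a few lines of Python. This is a sketch for intuition only; the helper name is ours, and d and r stand for the hidden size and LoRA rank:

```python
def lora_trainable_params(d: int, r: int) -> int:
    """Trainable parameters for one d×d weight with a rank-r LoRA adapter:
    B is (d × r) and A is (r × d), so 2 * d * r in total."""
    return 2 * d * r

d, r = 4096, 16
full = d * d                          # parameters in the full weight matrix
lora = lora_trainable_params(d, r)    # parameters in the adapter
print(f"Full matrix:  {full:,} params")
print(f"LoRA adapter: {lora:,} params")
print(f"Reduction:    {100 * (1 - lora / full):.1f}%")  # → 99.2%
```

Doubling r doubles the adapter size but leaves it a tiny fraction of the frozen matrix, which is why raising the rank is cheap in absolute terms.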

Key hyperparameters:

  • r (rank): 8–64, higher = more expressive but more memory

  • lora_alpha: scaling factor, often set to r or 2r

  • target_modules: which layers to apply LoRA to (attention Q,K,V,O matrices)

  • lora_dropout: regularization, typically 0.05–0.1

When to use: GPU with 16GB+ VRAM, 7B–13B models.

3.3 QLoRA (Quantized LoRA)

QLoRA combines LoRA with 4-bit quantization of the base model weights. This dramatically reduces VRAM requirements:

Full fine-tune 7B model: ~56GB VRAM
LoRA 7B model:           ~28GB VRAM
QLoRA 7B model:          ~6-8GB VRAM  ← fits on a single consumer GPU!

How it works:

  1. Load base model in 4-bit (NF4 quantization via bitsandbytes)

  2. Add LoRA adapters in full precision (16-bit)

  3. Compute gradients in 16-bit but keep base weights in 4-bit

Quality trade-off: Typically a 1–3% performance drop vs. full fine-tuning. Often acceptable for the 8× memory savings.

When to use: Consumer GPU (RTX 3090/4090), Colab, Kaggle. 7B–70B models.
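The three steps above map onto a short configuration sketch with transformers and peft. This is illustrative only: it needs a GPU plus the libraries listed in section 7, and the model name is a placeholder for any causal LM you have access to.

```python
# QLoRA configuration sketch: 4-bit base weights + 16-bit LoRA adapters.
# Assumes transformers, peft, bitsandbytes, and accelerate are installed.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # step 1: load base model in 4-bit
    bnb_4bit_quant_type="nf4",              # NF4 quantization via bitsandbytes
    bnb_4bit_compute_dtype=torch.bfloat16,  # step 3: compute in 16-bit
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",              # placeholder model name
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(                   # step 2: full-precision adapters
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # adapters are a tiny fraction of total
```

The resulting model object can be handed directly to TRL's SFTTrainer, which is covered in notebook 04.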

3.4 DPO (Direct Preference Optimization)

After SFT, the model can generate responses but may not know which responses humans prefer. DPO trains on preference data (chosen vs. rejected response pairs) to align the model with human preferences.

Dataset format:

{
  "prompt": "Explain quantum entanglement",
  "chosen": "Quantum entanglement is a phenomenon where...",
  "rejected": "idk lol its complicated science stuff"
}

Why DPO over RLHF: DPO is simpler (no reward model, no PPO optimizer), more stable, and achieves competitive results. It's the current industry standard for alignment fine-tuning.

When to use: After SFT, when you want to improve helpfulness, reduce harmful outputs, or enforce a specific tone/style.
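To see the mechanism, here is the standard DPO objective sketched in pure Python (the formula is from the DPO paper, not this notebook; the log-probability values below are made up for illustration):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO objective: -log sigmoid(beta * margin), where the margin measures
    how much more the policy prefers 'chosen' over 'rejected' than the
    frozen reference (SFT) model does."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Before training, policy == reference, so the margin is 0 and loss = log 2.
before = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# After some training: chosen logprob up, rejected logprob down -> lower loss.
after = dpo_loss(-8.0, -12.0, -10.0, -10.0)
print(f"{before:.4f} -> {after:.4f}")
```

The loss falls as the policy widens its preference for the chosen response, while the reference terms keep it anchored to the SFT model.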

3.5 RLHF (Reinforcement Learning from Human Feedback)

The original alignment technique used by OpenAI for InstructGPT/ChatGPT. More complex than DPO but more flexible:

  1. Train a reward model on preference pairs (predicts which response humans prefer)

  2. Run PPO (Proximal Policy Optimization) to maximize the reward model's score

  3. KL penalty prevents the model from drifting too far from the SFT baseline

When to use: Large-scale alignment projects, when you need a reusable reward model, research settings.

Method Comparison Summary

| Method | VRAM (7B) | Data Needed | Complexity | Quality |
|--------|-----------|-------------|------------|---------|
| Full SFT | 56GB | 1K–100K | Low | Best |
| LoRA | 28GB | 1K–100K | Low | Great |
| QLoRA | 6–8GB | 1K–100K | Low | Good |
| DPO | 28GB (LoRA) | 500–10K pairs | Medium | Great |
| RLHF | 80GB+ | 10K+ pairs + reward model | High | Great |

4. Key Concepts Reference

PEFT (Parameter-Efficient Fine-Tuning)

An umbrella term for techniques that update only a small fraction of model parameters instead of all weights. LoRA, QLoRA, prefix tuning, and prompt tuning all fall under PEFT. The peft library from Hugging Face is the standard implementation.

Adapters

Small trainable modules inserted into a frozen pre-trained model. LoRA is the most popular adapter approach. After training, adapters can be merged back into base weights (zero inference overhead) or kept separate (swap adapters at runtime for different tasks).

Instruction Tuning

Fine-tuning a model to follow natural language instructions. The dataset consists of (instruction, response) pairs, often augmented with a system prompt. This is what turns a raw base model (next-token predictor) into an assistant model (instruction follower). Examples: Alpaca, FLAN, Dolly.

Chat Templates

Standardized formats that structure multi-turn conversations. Each model family has its own template:

# Llama 3.x
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
Hello!<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

# ChatML (Qwen, etc.)
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
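For intuition, the ChatML layout above can be produced by a tiny formatter. This hand-rolled version is for illustration only; in real training code, always use the tokenizer's own tokenizer.apply_chat_template(), which knows each model family's exact template and special tokens:

```python
def to_chatml(messages):
    """Render a list of {'role', 'content'} dicts into ChatML, ending with
    an open assistant turn for the model to complete. Illustration only;
    prefer tokenizer.apply_chat_template() in practice."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
             for m in messages]
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

Getting the template wrong (missing end-of-turn tokens, wrong role markers) is one of the most common causes of broken fine-tunes, which is why delegating to the tokenizer is the safe default.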

Alignment

The process of making a model's behavior match human values and intentions. Alignment fine-tuning (DPO/RLHF) reduces harmful outputs, improves helpfulness, and enforces desired communication styles. Misaligned models may be technically capable but unsafe or unhelpful.

Catastrophic Forgetting

A risk where fine-tuning on new data causes the model to forget previously learned capabilities. Mitigated by:

  • Using LoRA/PEFT (base weights remain frozen)

  • Including diverse data in the fine-tuning set

  • Using lower learning rates

  • Regularization techniques (EWC, replay buffers)

Tokenization and Special Tokens

Each model uses a different tokenizer and different special tokens (<|eot_id|>, <|im_end|>, </s>, etc.). Always use the tokenizer that came with the model. When fine-tuning, ensure labels are set to -100 for prompt tokens so the model only learns to predict the response.
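The -100 masking rule looks like this in practice (the helper name and token ids here are made up for illustration; -100 is PyTorch's default ignore_index for cross-entropy loss):

```python
IGNORE_INDEX = -100  # tokens with this label are skipped by cross-entropy loss

def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids into labels, but mask the prompt portion so the loss
    is computed only on the response tokens."""
    return [IGNORE_INDEX] * prompt_len + input_ids[prompt_len:]

# Hypothetical token ids: 4 prompt tokens followed by 3 response tokens.
input_ids = [101, 2054, 2003, 1029, 7592, 2088, 102]
labels = mask_prompt_labels(input_ids, prompt_len=4)
print(labels)  # [-100, -100, -100, -100, 7592, 2088, 102]
```

Trainers like TRL's SFTTrainer can apply this masking for you (e.g., via completion-only collators), but it is worth understanding what happens underneath.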

Gradient Checkpointing

A memory optimization that recomputes activations during the backward pass instead of storing them. Reduces VRAM usage by ~40% at the cost of ~20% more compute. Almost always enabled during fine-tuning.

Flash Attention

An optimized attention implementation (FlashAttention-2) that dramatically reduces memory usage and increases speed for long sequences. Enable with attn_implementation="flash_attention_2" in Hugging Face models.

6. Hardware Requirements Guide

GPU Memory Requirements

Rough rule: Model memory in GB ≈ (parameters in billions × bytes per param)

Full precision (fp32):  7B × 4 bytes = 28GB  (rare for inference)
Half precision (bf16):  7B × 2 bytes = 14GB  (standard)
8-bit quantization:     7B × 1 byte  =  7GB
4-bit quantization:     7B × 0.5 byte = ~4GB  (+ overhead ≈ 6GB total)

For fine-tuning, add:
  + Gradients:  ~same as model size (for trainable params)
  + Optimizer:  ~2× trainable params (Adam states)
  + Activations: depends on batch size and sequence length
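The rough rule above is easy to turn into a back-of-envelope helper (the function name is ours; these are weight-memory estimates only, excluding gradients, optimizer states, and activations):

```python
def model_vram_gb(params_billion: float, bytes_per_param: float) -> float:
    """Back-of-envelope weight memory: parameters (in billions) multiplied
    by bytes per parameter gives GB, since 1B params * 1 byte ≈ 1 GB."""
    return params_billion * bytes_per_param

# Weight memory for a 7B model at different precisions.
for name, bytes_pp in [("fp32", 4), ("bf16", 2), ("int8", 1), ("4-bit", 0.5)]:
    print(f"7B model in {name:<5}: ~{model_vram_gb(7, bytes_pp):.1f} GB weights")
```

Remember this covers weights only; for training, the gradient, optimizer, and activation terms listed above come on top.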

Practical VRAM Guide

| VRAM | GPU Examples | What You Can Fine-tune |
|------|--------------|------------------------|
| 8GB | RTX 3070/3080 | QLoRA on 7B, LoRA on 1–3B |
| 16GB | A4000, T4 (Colab) | QLoRA on 7B–13B, LoRA on 7B |
| 24GB | RTX 3090/4090, A5000 | LoRA on 7B–13B, QLoRA on 30B |
| 40GB | A100-40GB | LoRA on 30B, QLoRA on 70B |
| 80GB | A100-80GB, H100 | LoRA on 70B, full fine-tune on 7B |
| 2×80GB | 2× H100 | Full fine-tune on 13B–30B |
| 8×80GB | 8× H100 | Full fine-tune on 70B |

Free and Low-Cost Options

| Platform | GPU / Cost | Time Limit | Best For |
|----------|------------|------------|----------|
| Google Colab (free) | T4 16GB | ~4h/session | Learning, small experiments |
| Google Colab Pro | A100 40GB | ~24h/session | Serious training |
| Kaggle Notebooks | T4 16GB ×2 | 30h/week | Medium experiments |
| Lambda Cloud | A100 from $1.10/h | No limit | Production training |
| Vast.ai | RTX 4090 from $0.30/h | No limit | Cost-effective training |
| RunPod | A100 from $1.50/h | No limit | Production training |

Memory Optimization Checklist

When you're running out of VRAM, try these in order:

  1. Reduce batch size (use gradient accumulation to maintain effective batch size)

  2. Enable gradient checkpointing (gradient_checkpointing=True)

  3. Use QLoRA instead of LoRA (4-bit quantization)

  4. Reduce sequence length (max_seq_length)

  5. Use Flash Attention 2 (attn_implementation="flash_attention_2")

  6. Use bf16=True instead of fp32

  7. Reduce LoRA rank (r=8 instead of r=64)

  8. Use paged AdamW optimizer (optim="paged_adamw_8bit")
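Step 1 of the checklist relies on a simple identity: gradient accumulation trades activation VRAM for extra forward/backward passes while keeping the effective batch size the same. A minimal sketch (the function name is ours):

```python
def effective_batch_size(per_device_batch: int,
                         grad_accum_steps: int,
                         num_gpus: int = 1) -> int:
    """Examples seen per optimizer update: gradients are accumulated over
    grad_accum_steps micro-batches on each GPU before a single step."""
    return per_device_batch * grad_accum_steps * num_gpus

# Halving the per-device batch and doubling accumulation keeps the
# effective batch (and thus the training dynamics) roughly unchanged,
# while roughly halving activation memory.
print(effective_batch_size(4, 4))  # 16
print(effective_batch_size(2, 8))  # 16
```

In Hugging Face trainers these map to per_device_train_batch_size and gradient_accumulation_steps, as in the hyperparameter defaults in section 9.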

7. Learning Path Through Phase 12

This module is structured as a progressive sequence. Complete notebooks in order:

Notebook Sequence

| # | Notebook | Topic | Prerequisites |
|---|----------|-------|---------------|
| 00 | 00_START_HERE.ipynb | Overview (this notebook) | Phases 1–11 |
| 01 | 01_dataset_preparation.ipynb | Data cleaning, formatting, splits | Basic Python, pandas |
| 02 | 02_supervised_finetuning.ipynb | Full SFT with TRL's SFTTrainer | 01 complete |
| 03 | 03_lora_basics.ipynb | LoRA theory and practice | 02 complete |
| 04 | 04_qlora_training.ipynb | QLoRA on consumer hardware | 03 complete |
| 05 | 05_dpo_alignment.ipynb | DPO preference learning | 02 complete |
| 06 | 06_evaluation.ipynb | Evaluating fine-tuned models | 02–04 complete |
| 07 | 07_deployment.ipynb | Merging, quantizing, serving | 06 complete |

Estimated Time Investment

Understanding (notebooks 00-01):    2–4 hours
First fine-tune (notebooks 02-03):  4–8 hours (includes training time)
QLoRA & DPO (notebooks 04-05):      4–6 hours
Evaluation & Deployment (06-07):    3–5 hours

Total: 13–23 hours depending on experiments

Key Libraries You Will Use

transformers   # Model loading, tokenization, training
peft           # LoRA, QLoRA, and other PEFT methods
trl            # SFTTrainer, DPOTrainer, PPO (built on top of transformers)
datasets       # Dataset loading, processing, formatting
bitsandbytes   # 4-bit and 8-bit quantization for QLoRA
accelerate     # Multi-GPU and mixed precision training
wandb          # Experiment tracking and visualization

8. Quick Environment Check

Before starting any fine-tuning work, verify that all required libraries are installed and your GPU (if any) has sufficient VRAM. The cells below check Python, PyTorch, transformers, PEFT, TRL, and related packages. They also report GPU specifications and recommend which fine-tuning methods your hardware supports. If you are on Google Colab or Kaggle, the free-tier T4 GPU (16GB) is sufficient for QLoRA on 7B models.

# Cell 1: Install core fine-tuning libraries
# Run this if you are in a fresh environment (e.g., Colab)

# !pip install -q transformers datasets peft trl bitsandbytes accelerate
# !pip install -q torch --index-url https://download.pytorch.org/whl/cu121

print("Uncomment the lines above if you need to install dependencies.")
print("For local environments, see the README for setup instructions.")
# Cell 2: Check library versions

import importlib
import sys

print(f"Python version: {sys.version.split()[0]}")
print()

libraries = [
    ("torch", "torch"),
    ("transformers", "transformers"),
    ("peft", "peft"),
    ("trl", "trl"),
    ("datasets", "datasets"),
    ("bitsandbytes", "bitsandbytes"),
    ("accelerate", "accelerate"),
    ("pandas", "pandas"),
    ("numpy", "numpy"),
]

print(f"{'Library':<20} {'Version':<15} {'Status'}")
print("-" * 50)

for lib_name, import_name in libraries:
    try:
        module = importlib.import_module(import_name)
        version = getattr(module, "__version__", "unknown")
        print(f"{lib_name:<20} {version:<15} OK")
    except ImportError:
        print(f"{lib_name:<20} {'NOT FOUND':<15} MISSING - run pip install {lib_name}")
# Cell 3: Check GPU availability and memory

import torch

print("=" * 50)
print("GPU / Hardware Report")
print("=" * 50)

# Check CUDA
cuda_available = torch.cuda.is_available()
print(f"CUDA available:     {cuda_available}")

if cuda_available:
    num_gpus = torch.cuda.device_count()
    print(f"Number of GPUs:     {num_gpus}")
    print()
    for i in range(num_gpus):
        props = torch.cuda.get_device_properties(i)
        total_mem = props.total_memory / 1024**3
        print(f"  GPU {i}: {props.name}")
        print(f"    Total VRAM:     {total_mem:.1f} GB")
        print(f"    CUDA Capability: {props.major}.{props.minor}")
        # Check what you can run
        if total_mem >= 40:
            print(f"    Can run:        LoRA on 30B, QLoRA on 70B")
        elif total_mem >= 24:
            print(f"    Can run:        LoRA on 7B-13B, QLoRA on 30B")
        elif total_mem >= 16:
            print(f"    Can run:        QLoRA on 7B-13B, LoRA on 7B")
        elif total_mem >= 8:
            print(f"    Can run:        QLoRA on 7B, LoRA on 1-3B")
        else:
            print(f"    Can run:        QLoRA on 1-3B only")
else:
    print()
    print("No GPU detected. You can still:")
    print("  - Run this notebook (CPU only)")
    print("  - Use Google Colab for GPU training (free T4)")
    print("  - Use Kaggle Notebooks (free T4 x2)")

# Check MPS (Apple Silicon)
if hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    print()
    print("Apple Silicon MPS: Available")
    print("  Note: MPS supports fine-tuning but bitsandbytes (QLoRA) is not supported.")
    print("  Use LoRA without quantization on Apple Silicon.")

print()
print("PyTorch version:", torch.__version__)
print("CUDA version:", torch.version.cuda if torch.version.cuda else "N/A")
# Cell 4: Quick sanity check - verify transformers and datasets work

from transformers import AutoTokenizer
from datasets import load_dataset
import warnings
warnings.filterwarnings("ignore")

print("Testing transformers tokenizer...")
try:
    # Load a tiny tokenizer (no model weights, just tokenizer files)
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    text = "Fine-tuning LLMs is powerful!"
    tokens = tokenizer(text, return_tensors="pt")
    print(f"  Input:  '{text}'")
    print(f"  Tokens: {tokens['input_ids'].tolist()[0]}")
    print(f"  Length: {tokens['input_ids'].shape[1]} tokens")
    print("  transformers: OK")
except Exception as e:
    print(f"  transformers: ERROR - {e}")

print()
print("Testing datasets library...")
try:
    # Load a tiny slice of a public dataset
    dataset = load_dataset("databricks/databricks-dolly-15k", split="train[:5]")
    print(f"  Loaded sample dataset: {len(dataset)} examples")
    print(f"  Columns: {dataset.column_names}")
    print(f"  First example category: {dataset[0]['category']}")
    print("  datasets: OK")
except Exception as e:
    print(f"  datasets: ERROR - {e}")
    print("  (This may be a network issue. Try again or check your connection.)")

print()
print("Environment check complete!")
# Cell 5: Check PEFT and TRL availability

print("Testing PEFT (LoRA library)...")
try:
    from peft import LoraConfig, TaskType, get_peft_model
    # Create a sample LoRA config
    config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        bias="none",
    )
    print(f"  LoRA config created: r={config.r}, alpha={config.lora_alpha}")
    print(f"  Target modules: {config.target_modules}")
    print("  peft: OK")
except Exception as e:
    print(f"  peft: ERROR - {e}")

print()
print("Testing TRL (training library)...")
try:
    from trl import SFTConfig, DPOConfig
    sft_config = SFTConfig(
        output_dir="/tmp/test",
        max_seq_length=512,
        num_train_epochs=1,
    )
    print(f"  SFTConfig created: max_seq_length={sft_config.max_seq_length}")
    print("  trl: OK")
except Exception as e:
    print(f"  trl: ERROR - {e}")

print()
print("All checks complete. You are ready to start fine-tuning!")

9. Fine-tuning Best Practices

Data Quality Rules

DO:

  • Start with 500–2000 high-quality examples and scale from there

  • Manually review at least 50–100 examples before training

  • Match your training data format exactly to how the model will be used

  • Include diverse examples covering edge cases

  • Use the model's own chat template (from the tokenizer)

  • Set labels to -100 for prompt tokens (only train on response tokens)

DON'T:

  • Use data scraped without cleaning (HTML tags, formatting artifacts, duplicates)

  • Mix response formats (some with markdown, some without)

  • Use extremely long or extremely short responses in the same batch

  • Include personally identifiable information (PII) in training data

  • Trust data quality without inspection

Training Hyperparameters

Safe starting defaults for QLoRA fine-tuning:

learning_rate = 2e-4          # Higher than full fine-tune is OK for LoRA
num_train_epochs = 2-3        # Rarely need more than 3 epochs
per_device_train_batch_size = 2  # Start small, increase if VRAM allows
gradient_accumulation_steps = 8  # Effective batch = 2 × 8 = 16
max_seq_length = 2048          # Match to your data; longer = more VRAM
lr_scheduler_type = "cosine"   # Or "linear"
warmup_ratio = 0.05            # 5% warmup steps
weight_decay = 0.01

Evaluation Tips

  • Always hold out 10–20% of data for validation

  • Monitor training loss AND validation loss (watch for overfitting)

  • Manually inspect model outputs, not just metrics

  • Use the same evaluation prompts throughout training for comparability

  • Compare against the base model and against GPT-4 if possible
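The first tip (holding out validation data) deserves care: split with a fixed seed so results are reproducible, and deduplicate first so near-identical examples don't leak across the boundary. A minimal sketch (the helper name is ours):

```python
import random

def train_val_split(examples, val_fraction=0.1, seed=42):
    """Shuffle with a fixed seed, then hold out the last val_fraction of
    examples for validation. Deduplicate before calling this so duplicates
    cannot leak between train and validation."""
    shuffled = examples[:]                      # don't mutate the caller's list
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[:-n_val], shuffled[-n_val:]

data = [{"id": i} for i in range(100)]
train, val = train_val_split(data, val_fraction=0.1)
print(len(train), len(val))  # 90 10
```

For Hugging Face datasets, the built-in dataset.train_test_split(test_size=..., seed=...) does the same job; notebook 01 covers it in detail.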

10. Key Takeaways

What You Learned in This Notebook

  1. Fine-tuning adapts pre-trained models to specific tasks without training from scratch, a massive cost and time saving.

  2. Choose your approach wisely: Prompt engineering first, RAG for knowledge-intensive tasks, fine-tuning for behavior/style/format at scale.

  3. LoRA and QLoRA have democratized fine-tuning: you can now fine-tune a 7B model on a single consumer GPU with QLoRA.

  4. SFT → DPO is the standard pipeline: First teach the model to follow instructions (SFT), then align it with preferences (DPO).

  5. Data quality > data quantity: 1000 excellent examples beat 100,000 mediocre ones.

  6. Top 2026 models to fine-tune: Qwen 2.5 (multilingual), Phi-4 (efficiency), Llama 3.3 (general), Gemma 2 (research), DeepSeek R1 (reasoning).

Next Steps

You are ready to move on to the next notebook:

Next: 01_dataset_preparation.ipynb

You will learn how to:

  • Find and load datasets from Hugging Face Hub

  • Clean and filter data for quality

  • Format datasets for different model families

  • Create your own custom fine-tuning datasets

  • Split data into train/validation sets correctly

Zero-to-AI Curriculum – Phase 12: LLM Fine-tuning
Last updated: February 2026