Supervised Fine-Tuning (SFT): Complete Workflow
Phase 12 - LLM Fine-tuning | Notebook 02
What You Will Learn
What supervised fine-tuning is and when to use it over prompting
Full fine-tuning vs parameter-efficient fine-tuning (PEFT) trade-offs
Setting up SFTTrainer from the TRL library
Training on Alpaca-style instruction-following datasets
Monitoring training with loss curves
Gradient accumulation and mixed-precision training
Using Flash Attention 2 for faster training
Saving and loading checkpoints
Complete working pipeline: data → trained model → inference
Troubleshooting common training failures
Model used: Qwen/Qwen2.5-1.5B-Instruct
Dataset: tatsu-lab/alpaca (52K instruction-following examples)
Estimated GPU: 16GB+ for full SFT in BF16; 8GB+ with gradient checkpointing
SFT Pipeline Overview

Raw Data             Formatted Data            Fine-tuned Model
─────────            ──────────────            ────────────────
{instruction,   ──►  <|system|>           ──►  Adapted weights
 input,              Follow instructions       Lower loss on
 output}             <|user|>                  target tasks
                     {instruction}             Consistent
                     <|assistant|>             output format
                     {output}

SFT teaches a model to follow a specific style, format, and task; it does NOT add new knowledge. For new knowledge, use RAG.
# Install required libraries
# Run this cell once, then restart the kernel
!pip install -q "transformers>=4.40.0" "trl>=0.8.6" "peft>=0.10.0" "datasets>=2.18.0" "accelerate>=0.29.0" "bitsandbytes>=0.43.0" matplotlib torch
1. What Is SFT and When Should You Use It?
Supervised Fine-Tuning (SFT) is the process of continuing to train a pre-trained language model on a labeled dataset of (input, desired_output) pairs. The model has already learned general language understanding from pre-training; SFT steers it toward a specific behavior or task.
Decision Framework

| Situation | Recommended Approach |
|---|---|
| Task solvable with a good prompt | Prompt engineering (no training cost) |
| Need consistent output format | Few-shot prompting or SFT |
| Need specialized style (tone, domain jargon) | SFT |
| Have 100–100K labeled examples | SFT |
| Need to inject domain knowledge | RAG (retrieval-augmented generation) |
| Need aligned / safe behavior | RLHF or DPO (after SFT) |
| Limited GPU memory | QLoRA (see Notebook 04) |
What SFT Changes

SFT optimizes the causal language modeling loss on your dataset:

\[
\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t=1}^{|y|} \log p_\theta\left(y_t \mid x, y_{<t}\right)
\]

where:

\(x\) = the instruction/input tokens (context)
\(y_t\) = the expected output token at position \(t\)
Only the output tokens contribute to the loss (the prompt tokens are masked)
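A minimal sketch of how prompt-token masking typically works in practice: label positions covering the prompt are set to -100, the index that Hugging Face models ignore in their cross-entropy loss. The token IDs below are made up for illustration.

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss


def build_labels(prompt_ids, response_ids):
    """Labels for causal-LM SFT: mask the prompt, learn the response."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


# Toy example with made-up token IDs
input_ids, labels = build_labels(prompt_ids=[5, 8, 13], response_ids=[21, 34, 2])
print(input_ids)  # [5, 8, 13, 21, 34, 2]
print(labels)     # [-100, -100, -100, 21, 34, 2]
```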
Common SFT Use Cases

Instruction following: teach a base model to follow commands politely
Domain adaptation: medical notes, legal documents, code in a specific style
Output format: always respond in JSON, always cite sources
Language/dialect: fine-tune for a specific language not well-covered in pre-training
Persona: customer support bot with a specific brand voice
2. Full Fine-Tuning vs Parameter-Efficient Fine-Tuning (PEFT)

| Method | Trainable Params | VRAM (7B model) | Speed | Quality | Best For |
|---|---|---|---|---|---|
| Full SFT | 100% (7B) | ~56 GB | Slowest | Highest | Large compute budgets |
| LoRA (r=16) | ~0.5% (35M) | ~16 GB | 3-5x faster | Very close to full | Most practical cases |
| QLoRA (r=16) | ~0.5% (35M) | ~6 GB | 2-4x faster | Slight quality gap | Consumer GPUs |
| Prefix Tuning | <0.1% | ~14 GB | Fast | Lower | Simple tasks |
| Adapters | ~1-3% | ~16 GB | Fast | Good | Multi-task learning |
| IA3 | <0.1% | ~14 GB | Fastest | Moderate | Low-resource tasks |
This notebook demonstrates full SFT (all parameters updated) on a 1.5B model, which fits in 8-12 GB VRAM in BF16. For larger models, use LoRA (Notebook 03) or QLoRA (Notebook 04).
Key insight: Full SFT is only worth it if you have a large, high-quality dataset (>10K examples) AND sufficient compute. For most production use cases, LoRA achieves 95-99% of full SFT quality at 10x lower cost.
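To see where a "~0.5% trainable" figure comes from, here is a back-of-the-envelope estimator. The exact count depends on the rank and on which weight matrices you target; the layer shapes below are hypothetical, not taken from any specific model.

```python
def lora_param_count(layer_shapes, r=16):
    """Rough count of LoRA adapter parameters: each adapted weight
    W (d_out x d_in) gains two low-rank factors A (r x d_in) and
    B (d_out x r), i.e. r * (d_in + d_out) extra parameters."""
    return sum(r * (d_in + d_out) for d_out, d_in in layer_shapes)


# Hypothetical 7B-class model: 32 layers, hidden size 4096,
# LoRA applied to two 4096x4096 projections per layer.
shapes = [(4096, 4096)] * 2 * 32
added = lora_param_count(shapes, r=16)
print(f"LoRA params: {added / 1e6:.1f}M ({added / 7e9 * 100:.3f}% of 7B)")
```

Targeting more matrices (all attention and MLP projections) raises the count toward the ~35M shown in the table above.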
# Core imports and environment verification
import os
import sys
import json
import math
import warnings
from pathlib import Path
from typing import Dict, List, Optional
import torch
import transformers
import trl
import datasets
import peft
warnings.filterwarnings("ignore")
print("=" * 55)
print("Environment Check")
print("=" * 55)
print(f"Python : {sys.version.split()[0]}")
print(f"PyTorch : {torch.__version__}")
print(f"Transformers : {transformers.__version__}")
print(f"TRL : {trl.__version__}")
print(f"PEFT : {peft.__version__}")
print(f"Datasets : {datasets.__version__}")
print()
# GPU check
if torch.cuda.is_available():
device = "cuda"
gpu_name = torch.cuda.get_device_name(0)
vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
print(f"GPU : {gpu_name}")
print(f"VRAM : {vram_gb:.1f} GB")
print(f"BF16 supported : {torch.cuda.is_bf16_supported()}")
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
device = "mps"
print("GPU : Apple Silicon (MPS)")
print("Note: BF16 not fully supported on MPS; will use FP32")
else:
device = "cpu"
print("GPU : Not available β running on CPU (very slow)")
print(f"\nSelected device : {device}")
# Determine optimal dtype
if device == "cuda" and torch.cuda.is_bf16_supported():
TORCH_DTYPE = torch.bfloat16
DTYPE_STR = "bfloat16"
else:
TORCH_DTYPE = torch.float16
DTYPE_STR = "float16"
print(f"Training dtype : {DTYPE_STR}")
print("=" * 55)
3. Dataset Preparation: Alpaca Format
The most widely used instruction-tuning format is Alpaca, originally from Stanford. Each example has three fields:
{
"instruction": "Describe the structure of an atom.",
"input": "",
"output": "An atom consists of a nucleus containing protons and neutrons..."
}
When input is non-empty, it provides additional context:
{
"instruction": "Translate the following sentence to French.",
"input": "The weather is beautiful today.",
"output": "Le temps est magnifique aujourd'hui."
}
Formatting Strategies
You must convert raw Alpaca fields into a single text string that the model sees. Two common templates:
Alpaca Prompt Template (classic):
### Instruction:
{instruction}
### Input:
{input}
### Response:
{output}
ChatML Template (modern; used by Qwen and many recent chat models):
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{instruction}\n{input}<|im_end|>
<|im_start|>assistant
{output}<|im_end|>
Critical: Always use the chat template that matches your base model. Mismatched templates reduce quality significantly.
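As a sketch of what a ChatML-style template expands to, here is a minimal pure-Python formatter. In real code you should call `tokenizer.apply_chat_template`, which applies the model's own template; this version is only for illustrating the string layout.

```python
def to_chatml(instruction: str, inp: str, output: str,
              system: str = "You are a helpful assistant.") -> str:
    """Render one Alpaca example in ChatML layout (illustrative only)."""
    user = f"{instruction}\n{inp}" if inp else instruction
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n{output}<|im_end|>"
    )


text = to_chatml("Translate to French.", "Good morning.", "Bonjour.")
print(text)
```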
Loss Masking

SFTTrainer can mask the prompt tokens (for example via TRL's completion-only data collator) so the model only learns to predict the response tokens. This keeps the model from "wasting" capacity learning to predict instruction text it already knows.
from datasets import load_dataset, DatasetDict
# Load the Stanford Alpaca dataset (52K instruction examples)
print("Loading tatsu-lab/alpaca dataset...")
raw_dataset = load_dataset("tatsu-lab/alpaca", split="train")
print(f"Total examples: {len(raw_dataset):,}")
print(f"Columns: {raw_dataset.column_names}")
print()
# Preview a few examples
print("Sample examples:")
print("-" * 60)
for i, example in enumerate(raw_dataset.select(range(3))):
print(f"\n[Example {i+1}]")
    instr, inp, out = example["instruction"], example["input"], example["output"]
    print(f"  instruction : {instr[:80] + '...' if len(instr) > 80 else instr}")
    print(f"  input       : {inp[:60] if inp else '(empty)'}")
    print(f"  output      : {out[:80] + '...' if len(out) > 80 else out}")
# Formatting function: converts Alpaca fields → single training string
# We use the Alpaca prompt template (compatible with most models)
ALPACA_PROMPT_WITH_INPUT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Input:
{input}
### Response:
{output}"""
ALPACA_PROMPT_WITHOUT_INPUT = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Response:
{output}"""
def format_alpaca(example: Dict) -> Dict:
"""Format a single Alpaca example into a training text string."""
if example.get("input", "").strip():
text = ALPACA_PROMPT_WITH_INPUT.format(
instruction=example["instruction"],
input=example["input"],
output=example["output"],
)
else:
text = ALPACA_PROMPT_WITHOUT_INPUT.format(
instruction=example["instruction"],
output=example["output"],
)
return {"text": text}
# Apply formatting to the full dataset
formatted_dataset = raw_dataset.map(
format_alpaca,
remove_columns=raw_dataset.column_names, # drop original columns
desc="Formatting Alpaca examples",
)
# Split into train / validation
split = formatted_dataset.train_test_split(test_size=0.02, seed=42)
train_dataset = split["train"]
eval_dataset = split["test"]
print(f"Train examples : {len(train_dataset):,}")
print(f"Eval examples : {len(eval_dataset):,}")
print()
print("Sample formatted text:")
print("-" * 60)
print(train_dataset[0]["text"][:500])
print("...")
# Analyze token length distribution to choose max_seq_length
# This helps avoid truncating too many examples or padding wastefully
from transformers import AutoTokenizer
import matplotlib
matplotlib.use("Agg") # non-interactive backend for notebook
import matplotlib.pyplot as plt
import numpy as np
MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct"
print(f"Loading tokenizer: {MODEL_ID}")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
# Set pad token (required for batch training)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
print(f"Set pad_token = eos_token ({tokenizer.eos_token!r})")
print(f"Vocabulary size : {tokenizer.vocab_size:,}")
print(f"Max model length: {tokenizer.model_max_length}")
print()
# Sample 2000 examples to estimate lengths (faster than full dataset)
sample_size = min(2000, len(train_dataset))
sample_texts = train_dataset.select(range(sample_size))["text"]
lengths = [
len(tokenizer.encode(t, add_special_tokens=True))
for t in sample_texts
]
lengths_arr = np.array(lengths)
print(f"Token length statistics (n={sample_size}):")
print(f" min : {lengths_arr.min()}")
print(f" mean : {lengths_arr.mean():.1f}")
print(f" median : {np.median(lengths_arr):.1f}")
print(f" p95 : {np.percentile(lengths_arr, 95):.1f}")
print(f" p99 : {np.percentile(lengths_arr, 99):.1f}")
print(f" max : {lengths_arr.max()}")
print()
p95 = int(np.percentile(lengths_arr, 95))
MAX_SEQ_LENGTH = min(2048, max(512, p95 + 64)) # round up, cap at 2048
print(f"Recommended max_seq_length: {MAX_SEQ_LENGTH}")
print(f"(covers {(lengths_arr <= MAX_SEQ_LENGTH).mean() * 100:.1f}% of examples without truncation)")
# Plot distribution
fig, ax = plt.subplots(figsize=(9, 4))
ax.hist(lengths_arr, bins=60, color="steelblue", edgecolor="white", alpha=0.85)
ax.axvline(MAX_SEQ_LENGTH, color="firebrick", linestyle="--", linewidth=2,
label=f"max_seq_length = {MAX_SEQ_LENGTH}")
ax.axvline(np.median(lengths_arr), color="darkorange", linestyle="-", linewidth=2,
label=f"median = {np.median(lengths_arr):.0f}")
ax.set_xlabel("Token count per example", fontsize=12)
ax.set_ylabel("Number of examples", fontsize=12)
ax.set_title("Token Length Distribution - Alpaca Dataset", fontsize=14)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("token_length_distribution.png", dpi=120)
plt.show()
print("Saved: token_length_distribution.png")
4. Loading the Base Model

Memory Requirements at Different Precisions

A rough rule of thumb: N billion parameters ≈ N × multiplier GB VRAM
| Precision | Bytes/param | 1.5B model | 7B model | 13B model | 70B model |
|---|---|---|---|---|---|
| FP32 | 4 bytes | 6 GB | 28 GB | 52 GB | 280 GB |
| BF16/FP16 | 2 bytes | 3 GB | 14 GB | 26 GB | 140 GB |
| INT8 | 1 byte | 1.5 GB | 7 GB | 13 GB | 70 GB |
| NF4 (QLoRA) | 0.5 bytes | 0.75 GB | 3.5 GB | 6.5 GB | 35 GB |
For training, you also need memory for:
Gradients: same size as the model
Optimizer states (AdamW): 2× the model size (momentum + variance)
Activations: depends on batch size and sequence length
Total VRAM for full SFT in BF16 (7B model): ~14 (model) + 14 (gradients) + 28 (optimizer) = ~56 GB
This is why we use a 1.5B model for full SFT demonstrations and QLoRA for larger models.
Gradient Checkpointing
Gradient checkpointing trades compute for memory. Instead of storing all intermediate activations, it recomputes them during the backward pass. This reduces activation memory by ~5-10x at the cost of ~20% slower training. Always enable it for full SFT.
from transformers import AutoModelForCausalLM, AutoTokenizer
def get_gpu_memory_gb() -> float:
"""Return currently allocated GPU memory in GB."""
if torch.cuda.is_available():
return torch.cuda.memory_allocated() / 1e9
return 0.0
def count_parameters(model) -> Dict:
"""Count total and trainable parameters."""
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
return {
"total": total,
"trainable": trainable,
"pct": trainable / total * 100,
}
print(f"Loading model: {MODEL_ID}")
print(f"Dtype: {DTYPE_STR}")
mem_before = get_gpu_memory_gb()
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID,
torch_dtype=TORCH_DTYPE, # BF16 or FP16 depending on GPU
device_map="auto", # automatically place layers across available GPUs
trust_remote_code=True, # required for Qwen models
attn_implementation="eager", # use "flash_attention_2" if flash-attn installed
)
mem_after = get_gpu_memory_gb()
# Enable gradient checkpointing to save activation memory
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
# Enable input requires_grad for gradient checkpointing compatibility
model.enable_input_require_grads()
params = count_parameters(model)
print()
print(f"Model loaded successfully")
print(f" Total parameters : {params['total']:,} ({params['total']/1e9:.2f}B)")
print(f" Trainable params : {params['trainable']:,} ({params['pct']:.1f}%)")
print(f" GPU memory (model) : {mem_after - mem_before:.2f} GB")
print(f" Model dtype : {next(model.parameters()).dtype}")
print(f" Device map : {model.hf_device_map if hasattr(model, 'hf_device_map') else device}")
5. Training Hyperparameter Guide
Choosing the right hyperparameters is critical. Here is a practical guide:
Learning Rate
| Scenario | Suggested LR |
|---|---|
| Full SFT (all weights) | 1e-5 to 5e-5 |
| LoRA adapters only | 1e-4 to 3e-4 |
| Very large dataset (>100K) | lower end of the range |
| Small dataset (<1K) | lower end, with fewer epochs |

Too high: loss explodes, NaN gradients
Too low: very slow convergence, may never reach good performance
Rule: start with 2e-5 for full SFT; tune if needed
Batch Size and Gradient Accumulation

The effective batch size = per_device_train_batch_size × gradient_accumulation_steps × num_gpus
Target effective batch size: 32-128 for instruction tuning.
8 GB GPU → per_device=1, grad_accum=32 → effective=32
16 GB GPU → per_device=4, grad_accum=8 → effective=32
80 GB GPU → per_device=16, grad_accum=2 → effective=32
Warmup

Always use a warmup period (5-10% of total steps). This prevents early training instability when gradients are large and noisy.

LR Scheduler

cosine: best general choice; smoothly decays LR to near zero
linear: simple alternative, slightly worse
constant_with_warmup: good for short runs
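The warmup-plus-cosine schedule described above can be computed by hand; this sketch mirrors the shape the trainer's scheduler produces (linear warmup to the peak LR, then cosine decay to zero):

```python
import math


def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_ratio=0.05):
    """Linear warmup to peak_lr, then cosine decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))


total = 1000
for s in (0, 50, 500, 1000):
    print(f"step {s:4d}: lr = {lr_at_step(s, total):.2e}")
```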
Epochs vs Steps
For large datasets (>10K examples), 1-3 epochs is usually enough. Over-training causes catastrophic forgetting (the model loses general knowledge). Monitor validation loss and stop when it starts increasing.
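In transformers you can attach an EarlyStoppingCallback to automate this; the underlying rule is simple enough to sketch in a few lines (patience value is illustrative):

```python
def should_stop(eval_losses, patience=3):
    """Early-stopping check: stop once eval loss has not improved
    for `patience` consecutive evaluations."""
    if len(eval_losses) <= patience:
        return False
    best = min(eval_losses[:-patience])
    return all(loss >= best for loss in eval_losses[-patience:])


history = [1.40, 1.21, 1.10, 1.12, 1.13, 1.15]
print(should_stop(history))  # True: no improvement in the last 3 evals
```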
Weight Decay
Use weight_decay=0.01 or 0.1 to regularize. Do not apply to bias/norm parameters (Hugging Face handles this automatically).
Max Gradient Norm

max_grad_norm=1.0 clips gradients to prevent spikes. If you see NaN losses, try 0.3.
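Global-norm clipping rescales all gradients together when their combined L2 norm exceeds the threshold; this is the same rule `torch.nn.utils.clip_grad_norm_` applies, sketched here on plain floats:

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients so their global L2 norm is <= max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm)     # 5.0 before clipping
print(clipped)  # [0.6, 0.8] -> new global norm is 1.0
```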
# Demonstrate effective batch size calculation
print("Effective Batch Size Calculator")
print("=" * 50)
configs = [
{"vram": "4 GB", "per_device": 1, "grad_accum": 32, "n_gpus": 1},
{"vram": "8 GB", "per_device": 2, "grad_accum": 16, "n_gpus": 1},
{"vram": "16 GB", "per_device": 4, "grad_accum": 8, "n_gpus": 1},
{"vram": "24 GB", "per_device": 8, "grad_accum": 4, "n_gpus": 1},
    {"vram": "2×24 GB", "per_device": 8, "grad_accum": 4, "n_gpus": 2},
    {"vram": "8×80 GB", "per_device": 32, "grad_accum": 1, "n_gpus": 8},
]
print(f"{'Hardware':<12} {'per_device':>10} {'grad_accum':>10} {'n_gpus':>7} {'effective_bs':>13}")
print("-" * 60)
for c in configs:
effective = c["per_device"] * c["grad_accum"] * c["n_gpus"]
print(f"{c['vram']:<12} {c['per_device']:>10} {c['grad_accum']:>10} {c['n_gpus']:>7} {effective:>13}")
print()
print("Mixed precision comparison:")
print("-" * 40)
model_params = 1.5e9 # 1.5B parameters
for dtype, bytes_per_param in [("FP32", 4), ("BF16/FP16", 2), ("INT8", 1), ("NF4", 0.5)]:
model_gb = model_params * bytes_per_param / 1e9
grad_gb = model_params * bytes_per_param / 1e9 # same as model (full SFT)
optim_gb = model_params * 8 / 1e9 # AdamW: 2 FP32 states
total_gb = model_gb + grad_gb + optim_gb
    print(f"{dtype:<10}: model={model_gb:.1f}GB grad={grad_gb:.1f}GB optim={optim_gb:.1f}GB total≈{total_gb:.1f}GB")
6. Flash Attention 2

Flash Attention 2 is a memory-efficient, IO-aware exact attention algorithm by Tri Dao (2023). It:
Reduces attention memory from O(N²) to O(N) in VRAM
Achieves 2-4x speedup over standard attention for long sequences
Is mathematically equivalent (exact, not approximate)
Requires: Ampere+ GPU (RTX 3000s, A100, H100), CUDA 11.6+
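To make the O(N²) cost concrete, here is a rough estimate of the attention-score matrix that standard attention materializes and FlashAttention avoids storing (the head count and batch size below are illustrative, not tied to any particular model):

```python
def naive_attn_scores_gb(seq_len, n_heads=12, batch=1, bytes_per=2):
    """Memory for the full N x N attention-score matrix (per layer)
    that standard attention materializes in BF16/FP16."""
    return batch * n_heads * seq_len * seq_len * bytes_per / 1e9


for n in (512, 2048, 8192):
    print(f"seq_len={n:5d}: {naive_attn_scores_gb(n):.3f} GB of scores per layer")
```

The quadratic growth is why the speedup and memory savings grow with sequence length in the table below.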
When to Use It

| Sequence Length | Speedup | Memory Savings |
|---|---|---|
| 512 tokens | ~1.2x | ~20% |
| 1024 tokens | ~1.8x | ~40% |
| 2048 tokens | ~2.5x | ~60% |
| 4096+ tokens | ~4x+ | ~75%+ |
Installation
pip install flash-attn --no-build-isolation
# Or via conda:
conda install -c conda-forge flash-attn
Compilation takes 10-15 minutes on first install.
# Check if Flash Attention 2 is available and demonstrate how to enable it
def check_flash_attention() -> bool:
"""Check if flash_attn is installed and the GPU supports it."""
try:
import flash_attn
fa_version = flash_attn.__version__
        if not torch.cuda.is_available():
            print("Flash Attention requires a CUDA GPU - not available.")
            return False
        major, minor = torch.cuda.get_device_capability()
        if major < 8:  # Ampere is SM80 (compute capability 8.0)
            print(f"Flash Attention requires an Ampere+ GPU (SM80+). Current: SM{major}{minor}")
            return False
        print(f"Flash Attention {fa_version} available - GPU is compatible.")
        return True
return True
except ImportError:
print("flash_attn not installed. Install with: pip install flash-attn --no-build-isolation")
return False
USE_FLASH_ATTN = check_flash_attention()
ATTN_IMPL = "flash_attention_2" if USE_FLASH_ATTN else "eager"
print(f"\nUsing attention implementation: {ATTN_IMPL}")
print()
print("To load a model with Flash Attention 2:")
print("-" * 50)
print("""
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID,
torch_dtype=torch.bfloat16,
device_map="auto",
attn_implementation="flash_attention_2", # <-- enable FA2
)
# Note: Flash Attention 2 requires bfloat16 or float16 (not float32)
# Note: Not all model architectures support FA2 (check HF docs)
""")
7. Setting Up SFTTrainer

SFTTrainer is the Hugging Face TRL library's specialized trainer for supervised fine-tuning. It extends the base Trainer with:
Prompt masking (so instruction tokens do not contribute to the loss)
Sequence packing (combining short examples to fill the context window, for 2-3x throughput)
Dataset formatting utilities
LoRA/QLoRA integration
Built-in logging of training/eval loss
Key Parameters

| Parameter | Description | Typical Value |
|---|---|---|
| dataset_text_field | Column name containing the formatted text | "text" |
| max_seq_length | Truncate sequences longer than this | 512-2048 |
| packing | Pack multiple short examples into one | True |
| dataset_num_proc | CPU workers for tokenization | 4 |
| num_train_epochs | Number of passes over the data | 1-3 |
| per_device_train_batch_size | Batch size per GPU | 1-8 |
| gradient_accumulation_steps | Steps before optimizer update | 8-32 |
| learning_rate | Peak LR (cosine schedule) | 2e-5 |
| warmup_ratio | Fraction of steps for warmup | 0.05 |
| bf16 | Use BF16 mixed precision | True (Ampere+) |
| fp16 | Use FP16 mixed precision | True (older GPUs) |
| save_steps | Save checkpoint every N steps | 100-500 |
| logging_steps | Log metrics every N steps | 10-50 |
from transformers import TrainingArguments
from trl import SFTTrainer, SFTConfig
OUTPUT_DIR = "./sft-qwen2.5-1.5b-alpaca"
# -----------------------------------------------------------------------
# Training arguments β each parameter is documented
# -----------------------------------------------------------------------
training_args = SFTConfig(
# --- Output ---
output_dir=OUTPUT_DIR, # where to save checkpoints + logs
run_name="qwen2.5-sft-alpaca", # experiment name for logging
# --- Dataset ---
max_seq_length=MAX_SEQ_LENGTH, # truncate sequences beyond this
dataset_text_field="text", # column in dataset containing text
    packing=True,                          # pack short examples -> higher throughput
dataset_num_proc=4, # CPU workers for tokenization
# --- Batch size / accumulation ---
per_device_train_batch_size=2, # per-GPU batch size (lower if OOM)
per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,         # effective batch = 2 × 8 = 16
# --- Optimizer ---
optim="adamw_torch_fused", # fused AdamW (faster on CUDA)
learning_rate=2e-5, # peak LR for full SFT
weight_decay=0.01, # L2 regularization
max_grad_norm=1.0, # gradient clipping threshold
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-8,
# --- LR schedule ---
num_train_epochs=1, # 1 epoch for demo; use 2-3 for production
lr_scheduler_type="cosine", # cosine decay (recommended)
warmup_ratio=0.05, # 5% of steps for warmup
# --- Mixed precision ---
bf16=torch.cuda.is_available() and torch.cuda.is_bf16_supported(),
fp16=torch.cuda.is_available() and not torch.cuda.is_bf16_supported(),
# --- Memory optimizations ---
gradient_checkpointing=True, # trade compute for memory
gradient_checkpointing_kwargs={"use_reentrant": False},
dataloader_pin_memory=True, # faster data transfer to GPU
dataloader_num_workers=2,
# --- Checkpointing ---
save_strategy="steps",
save_steps=200,
save_total_limit=3, # keep only last 3 checkpoints
load_best_model_at_end=True, # load checkpoint with best eval loss
# --- Evaluation ---
eval_strategy="steps",
eval_steps=200,
# --- Logging ---
logging_steps=10,
logging_first_step=True,
report_to="none", # change to "wandb" or "tensorboard" for tracking
# --- Reproducibility ---
seed=42,
data_seed=42,
)
print("SFTConfig created successfully.")
print(f" Output dir : {training_args.output_dir}")
print(f" Epochs : {training_args.num_train_epochs}")
print(f" Batch (per device) : {training_args.per_device_train_batch_size}")
print(f" Grad accumulation : {training_args.gradient_accumulation_steps}")
print(f" Effective batch : {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f" Learning rate : {training_args.learning_rate}")
print(f" LR scheduler : {training_args.lr_scheduler_type}")
print(f" BF16 : {training_args.bf16}")
print(f" FP16 : {training_args.fp16}")
print(f" Grad checkpointing : {training_args.gradient_checkpointing}")
print(f" Max seq length : {training_args.max_seq_length}")
# Initialize SFTTrainer
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
)
# Estimate how many optimizer steps one epoch will take
effective_bs = (training_args.per_device_train_batch_size
                * training_args.gradient_accumulation_steps)
print(f"Steps per epoch (approx.): {len(train_dataset) // effective_bs:,} (fewer with packing)")
# Show model summary
print()
print("SFTTrainer initialized.")
print(f" Train dataset size : {len(train_dataset):,}")
print(f" Eval dataset size : {len(eval_dataset):,}")
print(f" Trainable params : {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
print()
print("Ready to train. Run trainer.train() to start.")
8. Training and Monitoring Loss

What to Watch During Training
Good training looks like:
Train loss decreases smoothly from ~2.5 to ~1.0 (for instruction tuning)
Eval loss closely tracks train loss (small gap = good generalization)
No NaN losses or sudden spikes
Warning signs:

| Symptom | Likely Cause | Fix |
|---|---|---|
| Loss goes NaN | LR too high, mixed precision issue | Lower LR, switch FP16 → BF16 |
| Loss stuck at ~2.5 | LR too low, bad data format | Increase LR, check formatting |
| Eval loss rises while train falls | Overfitting | Fewer epochs, more data, early stopping |
| OOM error | Batch size too large | Reduce batch size, enable gradient checkpointing |
| Very slow | No mixed precision | Enable bf16=True or fp16=True |
Interpreting Loss Values

For causal LM on instruction data:
Initial loss: ~2.0-3.0 (depends on model and data)
After 1 epoch: ~1.0-1.5 (converging)
After 3 epochs: ~0.8-1.2 (well-trained)
Below 0.5: likely memorizing training data (overfit)
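A handy way to build intuition for these numbers: perplexity = exp(loss) is the model's effective per-token branching factor, so a loss of 2.5 means the model is roughly as uncertain as choosing among 12 equally likely tokens.

```python
import math


def perplexity(ce_loss: float) -> float:
    """Perplexity = exp(cross-entropy loss): the model's effective
    branching factor per token."""
    return math.exp(ce_loss)


for loss in (2.5, 1.2, 0.5):
    print(f"loss={loss:.1f} -> perplexity={perplexity(loss):.1f}")
```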
import time
print("Starting training...")
print("Note: On CPU this will be very slow. Use a GPU for meaningful training.")
print("=" * 55)
mem_before_train = get_gpu_memory_gb()
start_time = time.time()
# Run training
train_result = trainer.train()
elapsed = time.time() - start_time
mem_after_train = get_gpu_memory_gb()
print()
print("Training complete!")
print(f" Total time : {elapsed/60:.1f} min")
print(f" Train loss (final) : {train_result.training_loss:.4f}")
print(f" Samples/sec : {train_result.metrics.get('train_samples_per_second', 'N/A')}")
print(f" GPU memory used : {mem_after_train:.2f} GB")
# Plot training and evaluation loss curves
def extract_loss_history(trainer_state):
"""Extract train and eval loss from trainer log history."""
train_steps, train_losses = [], []
eval_steps, eval_losses = [], []
for entry in trainer_state.log_history:
if "loss" in entry and "eval_loss" not in entry:
train_steps.append(entry["step"])
train_losses.append(entry["loss"])
if "eval_loss" in entry:
eval_steps.append(entry["step"])
eval_losses.append(entry["eval_loss"])
return train_steps, train_losses, eval_steps, eval_losses
train_steps, train_losses, eval_steps, eval_losses = extract_loss_history(trainer.state)
if train_losses:
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(train_steps, train_losses, color="steelblue", linewidth=1.5,
alpha=0.8, label="Train loss")
if eval_losses:
ax.plot(eval_steps, eval_losses, color="firebrick", linewidth=2.0,
marker="o", markersize=5, label="Eval loss")
# Smoothed train loss (moving average)
if len(train_losses) > 10:
window = max(1, len(train_losses) // 20)
smoothed = np.convolve(train_losses, np.ones(window)/window, mode="valid")
smooth_steps = train_steps[window-1:]
ax.plot(smooth_steps, smoothed, color="navy", linewidth=2.5,
label=f"Train loss (smoothed, w={window})")
ax.set_xlabel("Training Steps", fontsize=12)
ax.set_ylabel("Cross-Entropy Loss", fontsize=12)
    ax.set_title("SFT Training - Loss Curves", fontsize=14)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("sft_loss_curves.png", dpi=120)
plt.show()
print("Saved: sft_loss_curves.png")
else:
print("No loss history available (training may not have completed).")
print("Generating synthetic loss curve for illustration...")
# Illustrative synthetic loss curve
steps = np.arange(1, 501)
train_loss_synth = 2.5 * np.exp(-steps / 150) + 0.9 + 0.1 * np.random.randn(500) * np.exp(-steps / 300)
eval_loss_synth = 2.4 * np.exp(-steps[::20] / 150) + 0.95 + 0.05 * np.random.randn(25)
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(steps, train_loss_synth, color="steelblue", linewidth=1.2, alpha=0.7, label="Train loss")
ax.plot(steps[::20], eval_loss_synth, color="firebrick", linewidth=2.0,
marker="o", markersize=5, label="Eval loss")
ax.set_xlabel("Training Steps", fontsize=12)
ax.set_ylabel("Cross-Entropy Loss", fontsize=12)
    ax.set_title("SFT Training - Loss Curves (Illustrative)", fontsize=14)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("sft_loss_curves.png", dpi=120)
plt.show()
9. Saving Checkpoints

What Gets Saved

When you call trainer.save_model(), the following files are written:
sft-qwen2.5-1.5b-alpaca/
├── config.json               # model architecture config
├── tokenizer.json            # tokenizer vocabulary
├── tokenizer_config.json
├── special_tokens_map.json
├── model.safetensors         # model weights (safetensors format)
│     OR
├── pytorch_model.bin         # model weights (older PyTorch format)
└── training_args.bin         # training configuration
Checkpoint vs Final Model

Checkpoint (checkpoint-200/, checkpoint-400/): intermediate saves during training; includes optimizer state for resuming.
Final model (root output dir): the final weights after all training; smaller because optimizer state is not included.
Resuming from Checkpoint
trainer.train(resume_from_checkpoint="./output_dir/checkpoint-200")
# or to resume from the latest checkpoint automatically:
trainer.train(resume_from_checkpoint=True)
import os
# Save the final model and tokenizer
print("Saving fine-tuned model...")
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
# Save training metrics
metrics = train_result.metrics
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()
# List saved files
print()
print(f"Files saved to: {OUTPUT_DIR}/")
saved_files = sorted(os.listdir(OUTPUT_DIR))
for f in saved_files:
full_path = os.path.join(OUTPUT_DIR, f)
if os.path.isfile(full_path):
size_mb = os.path.getsize(full_path) / 1e6
print(f" {f:<45} {size_mb:>8.1f} MB")
else:
print(f" {f}/ (directory)")
print()
total_mb = sum(
os.path.getsize(os.path.join(OUTPUT_DIR, f)) / 1e6
for f in os.listdir(OUTPUT_DIR)
if os.path.isfile(os.path.join(OUTPUT_DIR, f))
)
print(f"Total model size: {total_mb:.0f} MB ({total_mb/1024:.2f} GB)")
# Load the fine-tuned model and run inference
from transformers import AutoModelForCausalLM, AutoTokenizer
print("Loading fine-tuned model for inference...")
ft_tokenizer = AutoTokenizer.from_pretrained(OUTPUT_DIR, trust_remote_code=True)
ft_model = AutoModelForCausalLM.from_pretrained(
OUTPUT_DIR,
torch_dtype=TORCH_DTYPE,
device_map="auto",
trust_remote_code=True,
)
ft_model.eval()
def generate_response(model, tokenizer, instruction: str, input_text: str = "",
max_new_tokens: int = 256, temperature: float = 0.7) -> str:
"""Generate a response using the fine-tuned model."""
if input_text.strip():
prompt = ALPACA_PROMPT_WITH_INPUT.format(
instruction=instruction,
input=input_text,
output="", # empty β model will complete this
)
else:
prompt = ALPACA_PROMPT_WITHOUT_INPUT.format(
instruction=instruction,
output="",
)
# Remove the trailing empty "### Response:" text to get clean prompt
prompt = prompt.rstrip()
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
input_len = inputs["input_ids"].shape[1]
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
temperature=temperature,
do_sample=temperature > 0,
top_p=0.9,
repetition_penalty=1.1,
eos_token_id=tokenizer.eos_token_id,
pad_token_id=tokenizer.pad_token_id,
)
# Decode only the newly generated tokens
new_tokens = outputs[0][input_len:]
return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
# Test with a few examples
test_cases = [
{"instruction": "Explain what photosynthesis is in simple terms.", "input": ""},
{"instruction": "Write a haiku about programming.", "input": ""},
    {"instruction": "Convert this temperature from Celsius to Fahrenheit.", "input": "100°C"},
]
print("=" * 60)
print("Fine-tuned Model Inference")
print("=" * 60)
for i, tc in enumerate(test_cases):
print(f"\n[Test {i+1}]")
print(f"Instruction: {tc['instruction']}")
if tc['input']:
print(f"Input: {tc['input']}")
response = generate_response(ft_model, ft_tokenizer, tc["instruction"], tc["input"])
print(f"Response: {response}")
print("-" * 60)
10. Troubleshooting Common Issues

CUDA Out of Memory (OOM)
RuntimeError: CUDA out of memory. Tried to allocate X GB.
Solutions (try in order):
1. Reduce per_device_train_batch_size to 1
2. Increase gradient_accumulation_steps to compensate
3. Enable gradient_checkpointing=True
4. Reduce max_seq_length (e.g., 512 instead of 2048)
5. Use a smaller model or switch to QLoRA (see Notebook 04)
# Quick OOM debug: clear cache and check memory
torch.cuda.empty_cache()
print(torch.cuda.memory_summary())
Loss is NaN
{'loss': nan, 'learning_rate': ...}
Solutions:
Lower learning rate by 10x
Add gradient clipping: max_grad_norm=0.3
Switch from FP16 to BF16 (FP16 has a smaller dynamic range)
Check for bad data (empty outputs, very long sequences)
Slow Training

Solutions:
Enable BF16/FP16 mixed precision
Set dataloader_num_workers=4
Enable dataloader_pin_memory=True
Use optim="adamw_torch_fused"
Install Flash Attention 2
Enable sequence packing (packing=True)
Model Generates Garbage After Training
Solutions:
Check prompt template matches training format exactly
Verify pad_token is set correctly
Ensure EOS token is included at end of outputs in training data
Check if model was over-trained (too many epochs)
Verify dataset quality: bad data produces bad output
Training Loss Not Decreasing
Solutions:
Check learning rate is not too low
Verify data formatting is correct (outputs not empty)
Check that tokenizer chat template matches model
Ensure gradient checkpointing is compatible with your model version
ImportError / Version Conflicts
# Known compatible versions (as of early 2025)
pip install transformers==4.40.0 trl==0.8.6 peft==0.10.0 \
accelerate==0.29.3 bitsandbytes==0.43.1 datasets==2.18.0
# Free GPU memory after training (useful in notebooks)
import gc
def free_memory(*objects):
"""Delete objects and clear GPU cache."""
for obj in objects:
del obj
gc.collect()
if torch.cuda.is_available():
torch.cuda.empty_cache()
torch.cuda.synchronize()
print("GPU memory freed.")
if torch.cuda.is_available():
allocated = torch.cuda.memory_allocated() / 1e9
reserved = torch.cuda.memory_reserved() / 1e9
print(f" Allocated: {allocated:.2f} GB")
print(f" Reserved : {reserved:.2f} GB")
# Uncomment to free memory:
# free_memory(model, trainer, ft_model, ft_tokenizer)
print("Memory management utilities ready.")
print("Call free_memory(model, trainer) when you are done to reclaim VRAM.")
Summary: Complete SFT Pipeline
Step 1: Load raw dataset
load_dataset("tatsu-lab/alpaca")
Step 2: Format with prompt template
format_alpaca(example) → {"text": "..."}
Step 3: Analyze token lengths
Choose max_seq_length to cover 95th percentile
Step 4: Load model in BF16
AutoModelForCausalLM.from_pretrained(..., torch_dtype=bfloat16)
model.gradient_checkpointing_enable()
Step 5: Configure SFTConfig
learning_rate=2e-5, gradient_accumulation_steps=8
bf16=True, packing=True
Step 6: Create SFTTrainer and train
trainer = SFTTrainer(model, tokenizer, args, train_dataset, eval_dataset)
trainer.train()
Step 7: Monitor loss curves
Watch for convergence, avoid overfitting
Step 8: Save and test
trainer.save_model(output_dir)
Generate sample responses
Key Takeaways

Full SFT updates all model weights: powerful but memory-intensive
Gradient checkpointing is essential for training on limited VRAM
Mixed precision (BF16) cuts memory nearly in half with no quality loss
Gradient accumulation lets you achieve large effective batch sizes on small GPUs
Sequence packing dramatically improves training throughput (2-3x)
Monitor eval loss and stop training when it starts to rise
Flash Attention 2 gives a 2-4x speedup for long sequences on Ampere+ GPUs
Next Steps

Notebook 03 (LoRA Basics): reduce trainable parameters to ~0.5% for similar quality
Notebook 04 (QLoRA): fine-tune 7B+ models on a single consumer GPU
Notebook 05 (Instruction Tuning): curate better instruction datasets
Notebook 07 (RLHF/DPO): align model behavior with human preferences