Supervised Fine-Tuning (SFT) - Complete Workflow

Phase 12 - LLM Fine-tuning | Notebook 02

What You Will Learn

  • What supervised fine-tuning is and when to use it over prompting

  • Full fine-tuning vs parameter-efficient fine-tuning (PEFT) trade-offs

  • Setting up SFTTrainer from the TRL library

  • Training on alpaca-style instruction-following datasets

  • Monitoring training with loss curves

  • Gradient accumulation and mixed-precision training

  • Using Flash Attention 2 for faster training

  • Saving and loading checkpoints

  • Complete working pipeline: data → trained model → inference

  • Troubleshooting common training failures

Model used: Qwen/Qwen2.5-1.5B-Instruct
Dataset: tatsu-lab/alpaca (52K instruction-following examples)
Estimated GPU: 16GB+ for full SFT in BF16; 8GB+ with gradient checkpointing

SFT Pipeline Overview

Raw Data                Formatted Data           Fine-tuned Model
─────────               ──────────────           ────────────────
{instruction,    ──►   <|system|>          ──►   Adapted weights
 input,                Follow instructions        Lower loss on
 output}               <|user|>                   target tasks
                        {instruction}              Consistent
                        <|assistant|>              output format
                        {output}

SFT teaches a model to follow a specific style, format, and task; it does NOT add new knowledge. For new knowledge, use RAG.

# Install required libraries
# Run this cell once, then restart the kernel
# (version specifiers must be quoted, or the shell treats ">" as a redirect)
!pip install -q "transformers>=4.40.0" "trl>=0.8.6" "peft>=0.10.0" "datasets>=2.18.0" "accelerate>=0.29.0" "bitsandbytes>=0.43.0" matplotlib torch

1. What Is SFT and When Should You Use It?

Supervised Fine-Tuning (SFT) is the process of continuing to train a pre-trained language model on a labeled dataset of (input, desired_output) pairs. The model has already learned general language understanding from pre-training; SFT steers it toward a specific behavior or task.

Decision Framework

Situation                                      Recommended Approach
────────────────────────────────────────────   ─────────────────────────────────────
Task solvable with a good prompt               Prompt engineering (no training cost)
Need consistent output format                  Few-shot prompting or SFT
Need specialized style (tone, domain jargon)   SFT
Have 100–100K labeled examples                 SFT
Need to inject domain knowledge                RAG (retrieval-augmented generation)
Need aligned / safe behavior                   RLHF or DPO (after SFT)
Limited GPU memory                             QLoRA (see Notebook 04)

What SFT Changes

SFT optimizes the causal language modeling loss on your dataset:

\[\mathcal{L}_{SFT} = -\sum_{t} \log P_{\theta}(y_t \mid x, y_{<t})\]
  • \(x\) = the instruction/input tokens (context)

  • \(y_t\) = the expected output tokens

  • Only the output tokens contribute to the loss (the prompt tokens are masked)
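The loss above can be made concrete with a toy example in pure Python. The token probabilities here are hypothetical, not model outputs; the point is that only response positions enter the sum:

```python
import math

# Toy sequence: prompt tokens followed by response tokens.
# probs[i] is a hypothetical model probability of the correct token at position i.
tokens      = ["###", "Instruction:", "Say", "hi", "###", "Response:", "Hello", "!"]
probs       = [0.9,   0.8,            0.7,   0.6,  0.9,   0.8,         0.5,     0.4]
is_response = [False, False,          False, False, False, False,      True,    True]

# SFT loss: negative log-likelihood summed over response positions only.
# Prompt positions are masked (label -100 in Hugging Face conventions).
loss = -sum(math.log(p) for p, m in zip(probs, is_response) if m)
mean_loss = loss / sum(is_response)
print(f"summed NLL over response tokens : {loss:.4f}")
print(f"mean loss (what trainers report): {mean_loss:.4f}")
```

Note that trainers report the mean over unmasked positions, so the number you see in logs is comparable across batches of different lengths.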

Common SFT Use Cases

  1. Instruction following: teach a base model to follow commands politely

  2. Domain adaptation: medical notes, legal documents, code in a specific style

  3. Output format: always respond in JSON, always cite sources

  4. Language/dialect: fine-tune for a specific language not well-covered in pre-training

  5. Persona: customer support bot with specific brand voice

2. Full Fine-Tuning vs Parameter-Efficient Fine-Tuning (PEFT)

Method          Trainable Params   VRAM (7B model)   Speed         Quality              Best For
─────────────   ────────────────   ───────────────   ───────────   ──────────────────   ─────────────────────
Full SFT        100% (7B)          ~56 GB            Slowest       Highest              Large compute budgets
LoRA (r=16)     ~0.5% (35M)        ~16 GB            3-5x faster   Very close to full   Most practical cases
QLoRA (r=16)    ~0.5% (35M)        ~6 GB             2-4x faster   Slight quality gap   Consumer GPUs
Prefix Tuning   <0.1%              ~14 GB            Fast          Lower                Simple tasks
Adapters        ~1–3%              ~16 GB            Fast          Good                 Multi-task learning
IA3             <0.1%              ~14 GB            Fastest       Moderate             Low-resource tasks

This notebook demonstrates full SFT (all parameters updated) on a 1.5B model, which fits in 8–12 GB VRAM in BF16. For larger models, use LoRA (Notebook 03) or QLoRA (Notebook 04).

Key insight: Full SFT is only worth it if you have a large, high-quality dataset (>10K examples) AND sufficient compute. For most production use cases, LoRA achieves 95-99% of full SFT quality at 10x lower cost.

# Core imports and environment verification
import os
import sys
import json
import math
import warnings
from pathlib import Path
from typing import Dict, List, Optional

import torch
import transformers
import trl
import datasets
import peft

warnings.filterwarnings("ignore")

print("=" * 55)
print("Environment Check")
print("=" * 55)
print(f"Python          : {sys.version.split()[0]}")
print(f"PyTorch         : {torch.__version__}")
print(f"Transformers    : {transformers.__version__}")
print(f"TRL             : {trl.__version__}")
print(f"PEFT            : {peft.__version__}")
print(f"Datasets        : {datasets.__version__}")
print()

# GPU check
if torch.cuda.is_available():
    device = "cuda"
    gpu_name  = torch.cuda.get_device_name(0)
    vram_gb   = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU             : {gpu_name}")
    print(f"VRAM            : {vram_gb:.1f} GB")
    print(f"BF16 supported  : {torch.cuda.is_bf16_supported()}")
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    device = "mps"
    print("GPU             : Apple Silicon (MPS)")
    print("Note: BF16 not fully supported on MPS; will use FP32")
else:
    device = "cpu"
    print("GPU             : Not available - running on CPU (very slow)")

print(f"\nSelected device : {device}")

# Determine training dtype: BF16 on Ampere+ CUDA, FP16 on older CUDA GPUs,
# FP32 elsewhere (half precision is poorly supported on CPU/MPS)
if device == "cuda" and torch.cuda.is_bf16_supported():
    TORCH_DTYPE = torch.bfloat16
    DTYPE_STR = "bfloat16"
elif device == "cuda":
    TORCH_DTYPE = torch.float16
    DTYPE_STR = "float16"
else:
    TORCH_DTYPE = torch.float32
    DTYPE_STR = "float32"

print(f"Training dtype  : {DTYPE_STR}")
print("=" * 55)

3. Dataset Preparation - Alpaca Format

The most widely used instruction-tuning format is Alpaca, originally from Stanford. Each example has three fields:

{
  "instruction": "Describe the structure of an atom.",
  "input": "",
  "output": "An atom consists of a nucleus containing protons and neutrons..."
}

When input is non-empty, it provides additional context:

{
  "instruction": "Translate the following sentence to French.",
  "input": "The weather is beautiful today.",
  "output": "Le temps est magnifique aujourd'hui."
}

Formatting Strategies

You must convert raw Alpaca fields into a single text string that the model sees. Two common templates:

Alpaca Prompt Template (classic):

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}

ChatML Template (modern; used by Qwen and many other recent chat models):

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{instruction}\n{input}<|im_end|>
<|im_start|>assistant
{output}<|im_end|>

Critical: Always use the chat template that matches your base model. Mismatched templates reduce quality significantly.
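In practice the safest route is tokenizer.apply_chat_template, which applies the model's own template. As a sketch of what the Qwen-style ChatML string looks like when assembled by hand (build_chatml is an illustrative helper, not a library API):

```python
def build_chatml(instruction: str, input_text: str = "", output: str = "",
                 system: str = "You are a helpful assistant.") -> str:
    """Illustrative helper: assemble a ChatML-formatted training string."""
    user_content = f"{instruction}\n{input_text}" if input_text else instruction
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user_content}<|im_end|>\n"
        f"<|im_start|>assistant\n{output}<|im_end|>"
    )

text = build_chatml("Translate to French.", "Good morning.", "Bonjour.")
print(text)
```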

Loss Masking

By default, SFTTrainer computes the loss over every token in the sequence, prompt included. To train only on the response, pass a DataCollatorForCompletionOnlyLM (configured with your response template, e.g. "### Response:") so the prompt tokens are masked out. Masking keeps the model from "wasting" capacity on re-predicting instruction text it is given anyway.
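In Hugging Face conventions, masking means setting prompt positions in the labels to -100, which cross-entropy ignores. A minimal sketch of that mechanism over plain token-id lists (the token ids are made up for illustration):

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def mask_prompt_labels(input_ids, response_start):
    """Copy input_ids into labels, masking everything before the response."""
    labels = list(input_ids)
    for i in range(response_start):
        labels[i] = IGNORE_INDEX
    return labels

# Toy example: 5 prompt tokens followed by 3 response tokens
input_ids = [101, 2023, 2003, 1037, 3231, 7592, 999, 102]
labels = mask_prompt_labels(input_ids, response_start=5)
print(labels)  # first 5 positions masked with -100
```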

from datasets import load_dataset, DatasetDict

# Load the Stanford Alpaca dataset (52K instruction examples)
print("Loading tatsu-lab/alpaca dataset...")
raw_dataset = load_dataset("tatsu-lab/alpaca", split="train")
print(f"Total examples: {len(raw_dataset):,}")
print(f"Columns: {raw_dataset.column_names}")
print()

# Preview a few examples
print("Sample examples:")
print("-" * 60)
for i, example in enumerate(raw_dataset.select(range(3))):
    print(f"\n[Example {i+1}]")
    instr, inp, out = example["instruction"], example["input"], example["output"]
    print(f"  instruction : {instr[:80] + '...' if len(instr) > 80 else instr}")
    print(f"  input       : {inp[:60] if inp else '(empty)'}")
    print(f"  output      : {out[:80] + '...' if len(out) > 80 else out}")
# Formatting function: converts Alpaca fields β†’ single training string
# We use the Alpaca prompt template (compatible with most models)

ALPACA_PROMPT_WITH_INPUT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""

ALPACA_PROMPT_WITHOUT_INPUT = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
{output}"""


def format_alpaca(example: Dict) -> Dict:
    """Format a single Alpaca example into a training text string."""
    if example.get("input", "").strip():
        text = ALPACA_PROMPT_WITH_INPUT.format(
            instruction=example["instruction"],
            input=example["input"],
            output=example["output"],
        )
    else:
        text = ALPACA_PROMPT_WITHOUT_INPUT.format(
            instruction=example["instruction"],
            output=example["output"],
        )
    return {"text": text}


# Apply formatting to the full dataset
formatted_dataset = raw_dataset.map(
    format_alpaca,
    remove_columns=raw_dataset.column_names,  # drop original columns
    desc="Formatting Alpaca examples",
)

# Split into train / validation
split = formatted_dataset.train_test_split(test_size=0.02, seed=42)
train_dataset = split["train"]
eval_dataset  = split["test"]

print(f"Train examples : {len(train_dataset):,}")
print(f"Eval examples  : {len(eval_dataset):,}")
print()
print("Sample formatted text:")
print("-" * 60)
print(train_dataset[0]["text"][:500])
print("...")
# Analyze token length distribution to choose max_seq_length
# This helps avoid truncating too many examples or padding wastefully

from transformers import AutoTokenizer
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for notebook
import matplotlib.pyplot as plt
import numpy as np

MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct"

print(f"Loading tokenizer: {MODEL_ID}")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Set pad token (required for batch training)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id
    print(f"Set pad_token = eos_token ({tokenizer.eos_token!r})")

print(f"Vocabulary size : {tokenizer.vocab_size:,}")
print(f"Max model length: {tokenizer.model_max_length}")
print()

# Sample 2000 examples to estimate lengths (faster than full dataset)
sample_size = min(2000, len(train_dataset))
sample_texts = train_dataset.select(range(sample_size))["text"]

lengths = [
    len(tokenizer.encode(t, add_special_tokens=True))
    for t in sample_texts
]

lengths_arr = np.array(lengths)
print(f"Token length statistics (n={sample_size}):")
print(f"  min    : {lengths_arr.min()}")
print(f"  mean   : {lengths_arr.mean():.1f}")
print(f"  median : {np.median(lengths_arr):.1f}")
print(f"  p95    : {np.percentile(lengths_arr, 95):.1f}")
print(f"  p99    : {np.percentile(lengths_arr, 99):.1f}")
print(f"  max    : {lengths_arr.max()}")
print()

p95 = int(np.percentile(lengths_arr, 95))
MAX_SEQ_LENGTH = min(2048, max(512, p95 + 64))  # round up, cap at 2048
print(f"Recommended max_seq_length: {MAX_SEQ_LENGTH}")
print(f"(covers {(lengths_arr <= MAX_SEQ_LENGTH).mean() * 100:.1f}% of examples without truncation)")

# Plot distribution
fig, ax = plt.subplots(figsize=(9, 4))
ax.hist(lengths_arr, bins=60, color="steelblue", edgecolor="white", alpha=0.85)
ax.axvline(MAX_SEQ_LENGTH, color="firebrick", linestyle="--", linewidth=2,
           label=f"max_seq_length = {MAX_SEQ_LENGTH}")
ax.axvline(np.median(lengths_arr), color="darkorange", linestyle="-", linewidth=2,
           label=f"median = {np.median(lengths_arr):.0f}")
ax.set_xlabel("Token count per example", fontsize=12)
ax.set_ylabel("Number of examples", fontsize=12)
ax.set_title("Token Length Distribution - Alpaca Dataset", fontsize=14)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("token_length_distribution.png", dpi=120)
plt.show()
print("Saved: token_length_distribution.png")

4. Loading the Base Model

Memory Requirements at Different Precisions

A rough rule of thumb: an N-billion-parameter model needs about N × (bytes per parameter) GB of VRAM for the weights alone.

Precision     Bytes/param   1.5B model   7B model   13B model   70B model
───────────   ───────────   ──────────   ────────   ─────────   ─────────
FP32          4 bytes       6 GB         28 GB      52 GB       280 GB
BF16/FP16     2 bytes       3 GB         14 GB      26 GB       140 GB
INT8          1 byte        1.5 GB       7 GB       13 GB       70 GB
NF4 (QLoRA)   0.5 bytes     0.75 GB      3.5 GB     6.5 GB      35 GB

For training, you also need memory for:

  • Gradients: same size as the model

  • Optimizer states (AdamW): momentum + variance, roughly 2–4× the model size depending on whether the states are kept in BF16 or FP32

  • Activations: depends on batch size and sequence length

Total VRAM for full SFT in BF16 (7B model): ~14 (model) + 14 (gradients) + 28 (optimizer, BF16 states) ≈ 56 GB, and more if the optimizer states are kept in FP32

This is why we use a 1.5B model for full SFT demonstrations and QLoRA for larger models.

Gradient Checkpointing

Gradient checkpointing trades compute for memory. Instead of storing all intermediate activations, it recomputes them during the backward pass. This reduces activation memory by ~5-10x at the cost of ~20% slower training. Always enable it for full SFT.

from transformers import AutoModelForCausalLM, AutoTokenizer

def get_gpu_memory_gb() -> float:
    """Return currently allocated GPU memory in GB."""
    if torch.cuda.is_available():
        return torch.cuda.memory_allocated() / 1e9
    return 0.0

def count_parameters(model) -> Dict:
    """Count total and trainable parameters."""
    total      = sum(p.numel() for p in model.parameters())
    trainable  = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return {
        "total":     total,
        "trainable": trainable,
        "pct":       trainable / total * 100,
    }


print(f"Loading model: {MODEL_ID}")
print(f"Dtype: {DTYPE_STR}")
mem_before = get_gpu_memory_gb()

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=TORCH_DTYPE,          # BF16 or FP16 depending on GPU
    device_map="auto",                 # automatically place layers across available GPUs
    trust_remote_code=True,            # required for Qwen models
    attn_implementation="eager",       # use "flash_attention_2" if flash-attn installed
)

mem_after = get_gpu_memory_gb()

# Enable gradient checkpointing to save activation memory
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})

# Enable input requires_grad for gradient checkpointing compatibility
model.enable_input_require_grads()

params = count_parameters(model)

print()
print(f"Model loaded successfully")
print(f"  Total parameters   : {params['total']:,} ({params['total']/1e9:.2f}B)")
print(f"  Trainable params   : {params['trainable']:,} ({params['pct']:.1f}%)")
print(f"  GPU memory (model) : {mem_after - mem_before:.2f} GB")
print(f"  Model dtype        : {next(model.parameters()).dtype}")
print(f"  Device map         : {model.hf_device_map if hasattr(model, 'hf_device_map') else device}")

5. Training Hyperparameter Guide

Choosing the right hyperparameters is critical. Here is a practical guide:

Learning Rate

Scenario                     Suggested LR
──────────────────────────   ─────────────────────────────────
Full SFT (all weights)       1e-5 to 3e-5
LoRA adapters only           1e-4 to 3e-4
Very large dataset (>100K)   5e-6 to 1e-5
Small dataset (<1K)          5e-6 (lower to avoid overfitting)

  • Too high: loss explodes, NaN gradients

  • Too low: very slow convergence, may never reach good performance

  • Rule: start with 2e-5 for full SFT; tune if needed

Batch Size and Gradient Accumulation

The effective batch size = per_device_train_batch_size × gradient_accumulation_steps × num_gpus

Target effective batch size: 32–128 for instruction tuning.

8 GB GPU  → per_device=1, grad_accum=32  → effective=32
16 GB GPU → per_device=4, grad_accum=8   → effective=32
80 GB GPU → per_device=16, grad_accum=2  → effective=32

Warmup

Always use a warmup period (5-10% of total steps). This prevents early training instability when gradients are large and noisy.

LR Scheduler

  • cosine: best general choice; smoothly decays LR to near zero

  • linear: simple alternative, slightly worse

  • constant_with_warmup: good for short runs
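The warmup-plus-cosine shape is simple enough to sketch in pure Python (this mirrors the math behind transformers' get_cosine_schedule_with_warmup; the lr_at_step helper is illustrative):

```python
import math

def lr_at_step(step, total_steps, warmup_steps, peak_lr=2e-5):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total, warmup = 1000, 50  # warmup_ratio=0.05 -> 5% of 1000 steps
for s in [0, 25, 50, 500, 1000]:
    print(f"step {s:>4}: lr = {lr_at_step(s, total, warmup):.2e}")
```

The LR ramps linearly to the peak at step 50, then decays along a half cosine to zero at the final step.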

Epochs vs Steps

For large datasets (>10K examples), 1-3 epochs is usually enough. Over-training causes catastrophic forgetting (the model loses general knowledge). Monitor validation loss and stop when it starts increasing.
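The "stop when validation loss starts increasing" rule is what transformers' EarlyStoppingCallback automates; the underlying logic is just a patience counter, sketched here in plain Python (should_stop is an illustrative helper):

```python
def should_stop(eval_losses, patience=3, min_delta=0.0):
    """Return True once eval loss has failed to improve for `patience` evals."""
    best = float("inf")
    bad_evals = 0
    for loss in eval_losses:
        if loss < best - min_delta:   # improvement: reset the counter
            best = loss
            bad_evals = 0
        else:                         # no improvement this eval
            bad_evals += 1
            if bad_evals >= patience:
                return True
    return False

print(should_stop([2.1, 1.6, 1.3, 1.25, 1.24]))       # still improving
print(should_stop([2.1, 1.6, 1.3, 1.35, 1.4, 1.38]))  # 3 non-improving evals
```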

Weight Decay

Use weight_decay=0.01 or 0.1 to regularize. Do not apply to bias/norm parameters (Hugging Face handles this automatically).

Max Gradient Norm

Set max_grad_norm=1.0 to clip gradients and prevent spikes. If you see NaN losses, try 0.3.
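Global-norm clipping (what max_grad_norm controls, and what torch.nn.utils.clip_grad_norm_ implements over parameter tensors) rescales all gradients by one factor when their combined L2 norm exceeds the threshold. A sketch over a plain list of values:

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale grads so their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return grads
    scale = max_norm / total_norm
    return [g * scale for g in grads]

grads = [3.0, 4.0]                        # global norm = 5.0
clipped = clip_by_global_norm(grads, max_norm=1.0)
print(clipped)                            # direction preserved, norm scaled to 1.0
```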

# Demonstrate effective batch size calculation
print("Effective Batch Size Calculator")
print("=" * 50)

configs = [
    {"vram": "4 GB",  "per_device": 1, "grad_accum": 32, "n_gpus": 1},
    {"vram": "8 GB",  "per_device": 2, "grad_accum": 16, "n_gpus": 1},
    {"vram": "16 GB", "per_device": 4, "grad_accum": 8,  "n_gpus": 1},
    {"vram": "24 GB", "per_device": 8, "grad_accum": 4,  "n_gpus": 1},
    {"vram": "2×24 GB", "per_device": 8,  "grad_accum": 4,  "n_gpus": 2},
    {"vram": "8×80 GB", "per_device": 32, "grad_accum": 1,  "n_gpus": 8},
]

print(f"{'Hardware':<12} {'per_device':>10} {'grad_accum':>10} {'n_gpus':>7} {'effective_bs':>13}")
print("-" * 60)
for c in configs:
    effective = c["per_device"] * c["grad_accum"] * c["n_gpus"]
    print(f"{c['vram']:<12} {c['per_device']:>10} {c['grad_accum']:>10} {c['n_gpus']:>7} {effective:>13}")

print()
print("Mixed precision comparison:")
print("-" * 40)

model_params = 1.5e9  # 1.5B parameters
# Note: full SFT actually runs in FP32/BF16/FP16; the INT8/NF4 rows illustrate
# weight storage only (quantized training updates LoRA adapters instead).
for dtype, bytes_per_param in [("FP32", 4), ("BF16/FP16", 2), ("INT8", 1), ("NF4", 0.5)]:
    model_gb   = model_params * bytes_per_param / 1e9
    grad_gb    = model_params * bytes_per_param / 1e9  # gradients match model dtype (full SFT)
    optim_gb   = model_params * 8 / 1e9                # AdamW: 2 FP32 states (4 bytes each)
    total_gb   = model_gb + grad_gb + optim_gb
    print(f"{dtype:<10}: model={model_gb:.1f}GB  grad={grad_gb:.1f}GB  optim={optim_gb:.1f}GB  total≈{total_gb:.1f}GB")

6. Flash Attention 2

Flash Attention 2 is a memory-efficient, IO-aware exact attention algorithm by Tri Dao (2023). It:

  • Reduces attention memory from O(N²) to O(N) in VRAM

  • Achieves 2-4x speedup over standard attention for long sequences

  • Is mathematically equivalent (exact, not approximate)

  • Requires: Ampere+ GPU (RTX 30-series, A100, H100), CUDA 11.6+

When to Use It

Sequence Length   Speedup   Memory Savings
───────────────   ───────   ──────────────
512 tokens        ~1.2x     ~20%
1024 tokens       ~1.8x     ~40%
2048 tokens       ~2.5x     ~60%
4096+ tokens      ~4x+      ~75%+

Installation

pip install flash-attn --no-build-isolation
# Or via conda:
conda install -c conda-forge flash-attn

Compilation takes 10-15 minutes on first install.

# Check if Flash Attention 2 is available and demonstrate how to enable it

def check_flash_attention() -> bool:
    """Check if flash_attn is installed and the GPU supports it."""
    try:
        import flash_attn
        fa_version = flash_attn.__version__
        if not torch.cuda.is_available():
            print("Flash Attention requires a CUDA GPU - not available.")
            return False
        major, minor = torch.cuda.get_device_capability()
        if major < 8:  # Ampere is SM80
            print(f"Flash Attention requires an Ampere+ GPU (SM80+). Current: SM{major}{minor}")
            return False
            return False
        print(f"Flash Attention {fa_version} available - GPU is compatible.")
        return True
    except ImportError:
        print("flash_attn not installed. Install with: pip install flash-attn --no-build-isolation")
        return False


USE_FLASH_ATTN = check_flash_attention()
ATTN_IMPL = "flash_attention_2" if USE_FLASH_ATTN else "eager"
print(f"\nUsing attention implementation: {ATTN_IMPL}")

print()
print("To load a model with Flash Attention 2:")
print("-" * 50)
print("""
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",  # <-- enable FA2
)

# Note: Flash Attention 2 requires bfloat16 or float16 (not float32)
# Note: Not all model architectures support FA2 (check HF docs)
""")

7. Setting Up SFTTrainer

SFTTrainer is the Hugging Face TRL library's specialized trainer for supervised fine-tuning. It extends the base Trainer with:

  • Optional response masking via DataCollatorForCompletionOnlyLM (so instruction tokens do not contribute to the loss)

  • Sequence packing (combine short examples to fill the context window, for 2-3x throughput)

  • Dataset formatting utilities

  • LoRA/QLoRA integration

  • Built-in logging of training/eval loss
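Sequence packing can be sketched as a greedy bin fill: concatenate tokenized examples until the next one would overflow max_seq_length, then start a new sequence. (TRL's implementation differs in details, e.g. it may split examples across boundaries; pack_greedy below is an illustrative helper.)

```python
def pack_greedy(example_lengths, max_seq_length):
    """Greedily group example token counts into packed sequences."""
    packed, current = [], 0
    for n in example_lengths:
        n = min(n, max_seq_length)            # oversized examples get truncated
        if current + n > max_seq_length:      # next example would overflow
            packed.append(current)            # close the current sequence
            current = 0
        current += n
    if current:
        packed.append(current)
    return packed

lengths = [180, 120, 300, 90, 512, 60, 240, 210]
packed = pack_greedy(lengths, max_seq_length=512)
print(f"{len(lengths)} examples -> {len(packed)} packed sequences: {packed}")
```

Fewer, fuller sequences mean fewer padding tokens, which is where the throughput gain comes from.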

Key Parameters

Parameter                     Description                                 Typical Value
───────────────────────────   ─────────────────────────────────────────   ───────────────────────
dataset_text_field            Column name containing the formatted text   "text"
max_seq_length                Truncate sequences longer than this         512–2048
packing                       Pack multiple short examples into one       True for short examples
dataset_num_proc              CPU workers for tokenization                4
num_train_epochs              Number of passes over the data              1–3
per_device_train_batch_size   Batch size per GPU                          1–8
gradient_accumulation_steps   Steps before optimizer update               8–32
learning_rate                 Peak LR (cosine schedule)                   2e-5
warmup_ratio                  Fraction of steps for warmup                0.05
bf16                          Use BF16 mixed precision                    True (Ampere+)
fp16                          Use FP16 mixed precision                    True (older GPUs)
save_steps                    Save checkpoint every N steps               100–500
logging_steps                 Log metrics every N steps                   10–50

from trl import SFTTrainer, SFTConfig

OUTPUT_DIR = "./sft-qwen2.5-1.5b-alpaca"

# -----------------------------------------------------------------------
# Training arguments β€” each parameter is documented
# -----------------------------------------------------------------------
training_args = SFTConfig(
    # --- Output ---
    output_dir=OUTPUT_DIR,                 # where to save checkpoints + logs
    run_name="qwen2.5-sft-alpaca",         # experiment name for logging

    # --- Dataset ---
    max_seq_length=MAX_SEQ_LENGTH,         # truncate sequences beyond this
    dataset_text_field="text",             # column in dataset containing text
    packing=True,                          # pack short examples β†’ higher throughput
    dataset_num_proc=4,                    # CPU workers for tokenization

    # --- Batch size / accumulation ---
    per_device_train_batch_size=2,         # per-GPU batch size (lower if OOM)
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,         # effective batch = 2 × 8 = 16

    # --- Optimizer ---
    optim="adamw_torch_fused",             # fused AdamW (faster on CUDA)
    learning_rate=2e-5,                    # peak LR for full SFT
    weight_decay=0.01,                     # L2 regularization
    max_grad_norm=1.0,                     # gradient clipping threshold
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,

    # --- LR schedule ---
    num_train_epochs=1,                    # 1 epoch for demo; use 2-3 for production
    lr_scheduler_type="cosine",            # cosine decay (recommended)
    warmup_ratio=0.05,                     # 5% of steps for warmup

    # --- Mixed precision ---
    bf16=torch.cuda.is_available() and torch.cuda.is_bf16_supported(),
    fp16=torch.cuda.is_available() and not torch.cuda.is_bf16_supported(),

    # --- Memory optimizations ---
    gradient_checkpointing=True,           # trade compute for memory
    gradient_checkpointing_kwargs={"use_reentrant": False},
    dataloader_pin_memory=True,            # faster data transfer to GPU
    dataloader_num_workers=2,

    # --- Checkpointing ---
    save_strategy="steps",
    save_steps=200,
    save_total_limit=3,                    # keep only last 3 checkpoints
    load_best_model_at_end=True,           # load checkpoint with best eval loss

    # --- Evaluation ---
    eval_strategy="steps",
    eval_steps=200,

    # --- Logging ---
    logging_steps=10,
    logging_first_step=True,
    report_to="none",                      # change to "wandb" or "tensorboard" for tracking

    # --- Reproducibility ---
    seed=42,
    data_seed=42,
)

print("SFTConfig created successfully.")
print(f"  Output dir          : {training_args.output_dir}")
print(f"  Epochs              : {training_args.num_train_epochs}")
print(f"  Batch (per device)  : {training_args.per_device_train_batch_size}")
print(f"  Grad accumulation   : {training_args.gradient_accumulation_steps}")
print(f"  Effective batch     : {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"  Learning rate       : {training_args.learning_rate}")
print(f"  LR scheduler        : {training_args.lr_scheduler_type}")
print(f"  BF16                : {training_args.bf16}")
print(f"  FP16                : {training_args.fp16}")
print(f"  Grad checkpointing  : {training_args.gradient_checkpointing}")
print(f"  Max seq length      : {training_args.max_seq_length}")
# Initialize SFTTrainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

# Estimate optimizer steps for one epoch (fewer in practice when packing merges examples)
effective_bs = (training_args.per_device_train_batch_size
                * training_args.gradient_accumulation_steps)
print(f"Estimated steps/epoch: {len(train_dataset) // effective_bs:,}")

# Show model summary
print()
print("SFTTrainer initialized.")
print(f"  Train dataset size  : {len(train_dataset):,}")
print(f"  Eval dataset size   : {len(eval_dataset):,}")
print(f"  Trainable params    : {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
print()
print("Ready to train. Run trainer.train() to start.")

8. Training and Monitoring Loss

What to Watch During Training

Good training looks like:

  • Train loss decreases smoothly from ~2.5 to ~1.0 (for instruction tuning)

  • Eval loss closely tracks train loss (small gap = good generalization)

  • No NaN losses or sudden spikes

Warning signs:

Symptom                             Likely Cause                         Fix
─────────────────────────────────   ──────────────────────────────────   ────────────────────────────────────────────────
Loss goes NaN                       LR too high, mixed precision issue   Lower LR, switch FP16 → BF16
Loss stuck at ~2.5                  LR too low, bad data format          Increase LR, check formatting
Eval loss rises while train falls   Overfitting                          Fewer epochs, more data, stronger regularization
OOM error                           Batch size too large                 Reduce batch size, enable gradient checkpointing
Very slow                           No mixed precision                   Enable bf16=True or fp16=True

Interpreting Loss Values

For causal LM on instruction data:

  • Initial loss: ~2.0–3.0 (depends on model and data)

  • After 1 epoch: ~1.0–1.5 (converging)

  • After 3 epochs: ~0.8–1.2 (well-trained)

  • Below 0.5: Likely memorizing training data (overfit)
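Cross-entropy loss maps directly to perplexity (ppl = e^loss), which can be easier to interpret: roughly how many tokens the model is torn between at each step. Converting the milestones above:

```python
import math

# Convert the loss milestones above into perplexity (ppl = e^loss)
for stage, loss in [("initial", 2.5), ("after 1 epoch", 1.2),
                    ("after 3 epochs", 1.0), ("overfit territory", 0.5)]:
    ppl = math.exp(loss)
    print(f"{stage:<18} loss={loss:.1f}  perplexity={ppl:.2f}")
```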

import time

print("Starting training...")
print("Note: On CPU this will be very slow. Use a GPU for meaningful training.")
print("=" * 55)

mem_before_train = get_gpu_memory_gb()
start_time = time.time()

# Run training
train_result = trainer.train()

elapsed = time.time() - start_time
mem_after_train = get_gpu_memory_gb()

print()
print("Training complete!")
print(f"  Total time          : {elapsed/60:.1f} min")
print(f"  Train loss (final)  : {train_result.training_loss:.4f}")
print(f"  Samples/sec         : {train_result.metrics.get('train_samples_per_second', 'N/A')}")
print(f"  GPU memory used     : {mem_after_train:.2f} GB")
# Plot training and evaluation loss curves

def extract_loss_history(trainer_state):
    """Extract train and eval loss from trainer log history."""
    train_steps, train_losses = [], []
    eval_steps,  eval_losses  = [], []

    for entry in trainer_state.log_history:
        if "loss" in entry and "eval_loss" not in entry:
            train_steps.append(entry["step"])
            train_losses.append(entry["loss"])
        if "eval_loss" in entry:
            eval_steps.append(entry["step"])
            eval_losses.append(entry["eval_loss"])

    return train_steps, train_losses, eval_steps, eval_losses


train_steps, train_losses, eval_steps, eval_losses = extract_loss_history(trainer.state)

if train_losses:
    fig, ax = plt.subplots(figsize=(10, 5))

    ax.plot(train_steps, train_losses, color="steelblue", linewidth=1.5,
            alpha=0.8, label="Train loss")

    if eval_losses:
        ax.plot(eval_steps, eval_losses, color="firebrick", linewidth=2.0,
                marker="o", markersize=5, label="Eval loss")

    # Smoothed train loss (moving average)
    if len(train_losses) > 10:
        window = max(1, len(train_losses) // 20)
        smoothed = np.convolve(train_losses, np.ones(window)/window, mode="valid")
        smooth_steps = train_steps[window-1:]
        ax.plot(smooth_steps, smoothed, color="navy", linewidth=2.5,
                label=f"Train loss (smoothed, w={window})")

    ax.set_xlabel("Training Steps", fontsize=12)
    ax.set_ylabel("Cross-Entropy Loss", fontsize=12)
    ax.set_title("SFT Training - Loss Curves", fontsize=14)
    ax.legend(fontsize=11)
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig("sft_loss_curves.png", dpi=120)
    plt.show()
    print("Saved: sft_loss_curves.png")
else:
    print("No loss history available (training may not have completed).")
    print("Generating synthetic loss curve for illustration...")

    # Illustrative synthetic loss curve
    steps = np.arange(1, 501)
    train_loss_synth = 2.5 * np.exp(-steps / 150) + 0.9 + 0.1 * np.random.randn(500) * np.exp(-steps / 300)
    eval_loss_synth  = 2.4 * np.exp(-steps[::20] / 150) + 0.95 + 0.05 * np.random.randn(25)

    fig, ax = plt.subplots(figsize=(10, 5))
    ax.plot(steps, train_loss_synth, color="steelblue", linewidth=1.2, alpha=0.7, label="Train loss")
    ax.plot(steps[::20], eval_loss_synth, color="firebrick", linewidth=2.0,
            marker="o", markersize=5, label="Eval loss")
    ax.set_xlabel("Training Steps", fontsize=12)
    ax.set_ylabel("Cross-Entropy Loss", fontsize=12)
    ax.set_title("SFT Training - Loss Curves (Illustrative)", fontsize=14)
    ax.legend(fontsize=11)
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig("sft_loss_curves.png", dpi=120)
    plt.show()

9. Saving Checkpoints

What Gets Saved

When you call trainer.save_model(), the following files are written:

sft-qwen2.5-1.5b-alpaca/
├── config.json                  # model architecture config
├── tokenizer.json               # tokenizer vocabulary
├── tokenizer_config.json
├── special_tokens_map.json
├── model.safetensors            # model weights (safetensors format)
│   OR
├── pytorch_model.bin            # model weights (older PyTorch format)
└── training_args.bin            # training configuration

Checkpoint vs Final Model

  • Checkpoint (checkpoint-200/, checkpoint-400/): Intermediate saves during training. Includes optimizer state for resuming.

  • Final model (root output dir): The final weights after all training. Smaller because optimizer state is not included.

Resuming from Checkpoint

trainer.train(resume_from_checkpoint="./output_dir/checkpoint-200")
# or to resume from the latest checkpoint automatically:
trainer.train(resume_from_checkpoint=True)
import os

# Save the final model and tokenizer
print("Saving fine-tuned model...")
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

# Save training metrics
metrics = train_result.metrics
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()

# List saved files
print()
print(f"Files saved to: {OUTPUT_DIR}/")
saved_files = sorted(os.listdir(OUTPUT_DIR))
for f in saved_files:
    full_path = os.path.join(OUTPUT_DIR, f)
    if os.path.isfile(full_path):
        size_mb = os.path.getsize(full_path) / 1e6
        print(f"  {f:<45} {size_mb:>8.1f} MB")
    else:
        print(f"  {f}/ (directory)")

print()
total_mb = sum(
    os.path.getsize(os.path.join(OUTPUT_DIR, f)) / 1e6
    for f in os.listdir(OUTPUT_DIR)
    if os.path.isfile(os.path.join(OUTPUT_DIR, f))
)
print(f"Total model size: {total_mb:.0f} MB ({total_mb/1000:.2f} GB)")
# Load the fine-tuned model and run inference

from transformers import AutoModelForCausalLM, AutoTokenizer

print("Loading fine-tuned model for inference...")
ft_tokenizer = AutoTokenizer.from_pretrained(OUTPUT_DIR, trust_remote_code=True)
ft_model = AutoModelForCausalLM.from_pretrained(
    OUTPUT_DIR,
    torch_dtype=TORCH_DTYPE,
    device_map="auto",
    trust_remote_code=True,
)
ft_model.eval()


def generate_response(model, tokenizer, instruction: str, input_text: str = "",
                       max_new_tokens: int = 256, temperature: float = 0.7) -> str:
    """Generate a response using the fine-tuned model."""
    if input_text.strip():
        prompt = ALPACA_PROMPT_WITH_INPUT.format(
            instruction=instruction,
            input=input_text,
            output="",  # empty β€” model will complete this
        )
    else:
        prompt = ALPACA_PROMPT_WITHOUT_INPUT.format(
            instruction=instruction,
            output="",
        )
    # The empty output slot leaves a trailing "### Response:" header followed by
    # whitespace; strip the whitespace so the header acts as the generation cue
    prompt = prompt.rstrip()

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    input_len = inputs["input_ids"].shape[1]

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            do_sample=temperature > 0,
            top_p=0.9,
            repetition_penalty=1.1,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
        )

    # Decode only the newly generated tokens
    new_tokens = outputs[0][input_len:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()


# Test with a few examples
test_cases = [
    {"instruction": "Explain what photosynthesis is in simple terms.", "input": ""},
    {"instruction": "Write a haiku about programming.", "input": ""},
    {"instruction": "Convert this temperature from Celsius to Fahrenheit.", "input": "100Β°C"},
]

print("=" * 60)
print("Fine-tuned Model Inference")
print("=" * 60)

for i, tc in enumerate(test_cases):
    print(f"\n[Test {i+1}]")
    print(f"Instruction: {tc['instruction']}")
    if tc['input']:
        print(f"Input: {tc['input']}")
    response = generate_response(ft_model, ft_tokenizer, tc["instruction"], tc["input"])
    print(f"Response: {response}")
    print("-" * 60)

10. Troubleshooting Common IssuesΒΆ

CUDA Out of Memory (OOM)ΒΆ

RuntimeError: CUDA out of memory. Tried to allocate X GB.

Solutions (try in order):

  1. Reduce per_device_train_batch_size to 1

  2. Increase gradient_accumulation_steps to compensate

  3. Enable gradient_checkpointing=True

  4. Reduce max_seq_length (e.g., 512 instead of 2048)

  5. Use a smaller model or switch to QLoRA (see Notebook 04)

# Quick OOM debug: clear cache and check memory
torch.cuda.empty_cache()
print(torch.cuda.memory_summary())
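Items 1 and 2 go together: reducing the batch size and raising accumulation by the same factor trades peak memory for more forward passes, while the optimizer still sees the same effective batch per update. A quick sanity check (plain Python, no library assumptions):

```python
def effective_batch_size(per_device_bs: int, grad_accum_steps: int, num_gpus: int = 1) -> int:
    """Effective (global) batch size the optimizer sees per weight update."""
    return per_device_bs * grad_accum_steps * num_gpus

# Before the OOM fix: 8 examples per step on one GPU
print(effective_batch_size(8, 1))   # 8
# After: batch size 1, accumulation 8 -- same effective batch, far less peak memory
print(effective_batch_size(1, 8))   # 8
```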

Loss is NaNΒΆ

{'loss': nan, 'learning_rate': ...}

Solutions:

  1. Lower learning rate by 10x

  2. Add gradient clipping: max_grad_norm=0.3

  3. Switch from FP16 to BF16 (FP16 has smaller dynamic range)

  4. Check for bad data (empty outputs, very long sequences)
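For item 4, a quick scan over alpaca-style records flags the usual suspects. The `instruction`/`output` keys match this notebook's dataset; the `max_chars` cutoff is an arbitrary illustration, not a recommended value:

```python
def find_bad_examples(dataset, max_chars: int = 8000):
    """Return (index, reason) pairs for empty outputs or extremely long examples."""
    bad = []
    for i, ex in enumerate(dataset):
        if not ex.get("output", "").strip():
            bad.append((i, "empty output"))
        elif len(ex.get("instruction", "")) + len(ex.get("output", "")) > max_chars:
            bad.append((i, "too long"))
    return bad

sample = [
    {"instruction": "Say hi.", "output": "Hi!"},
    {"instruction": "Broken example", "output": "   "},  # whitespace-only output
]
print(find_bad_examples(sample))  # [(1, 'empty output')]
```

Drop or repair the flagged examples before retrying the run.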

Slow TrainingΒΆ

Solutions:

  1. Enable BF16/FP16 mixed precision

  2. Set dataloader_num_workers=4

  3. Enable dataloader_pin_memory=True

  4. Use optim="adamw_torch_fused"

  5. Install Flash Attention 2

  6. Enable sequence packing (packing=True)
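Item 6 pays off because padding each short example up to `max_seq_length` wastes most of the batch on pad tokens. The sketch below uses a deliberately naive model of packing (pure concatenation into fixed-length blocks, ignoring the EOS separators TRL inserts), just to show the scale of the saving:

```python
import math

def padding_fraction(lengths, max_seq_len):
    """Fraction of pad tokens when each example is padded to max_seq_len."""
    total_slots = len(lengths) * max_seq_len
    real = sum(min(l, max_seq_len) for l in lengths)
    return 1 - real / total_slots

def packed_padding_fraction(lengths, max_seq_len):
    """Fraction of pad tokens after naive concatenation into fixed-length blocks."""
    real = sum(lengths)
    blocks = math.ceil(real / max_seq_len)
    return 1 - real / (blocks * max_seq_len)

lengths = [180, 320, 95, 410, 60, 240]  # token counts per example
print(f"per-example padding: {padding_fraction(lengths, 1024):.0%}")   # 79%
print(f"packed padding:      {packed_padding_fraction(lengths, 1024):.0%}")  # 36%
```

With typical alpaca-length examples, most of the throughput gain quoted in this notebook comes from no longer computing attention over pad tokens.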

Model Generates Garbage After TrainingΒΆ

Solutions:

  1. Check prompt template matches training format exactly

  2. Verify pad_token is set correctly

  3. Ensure EOS token is included at end of outputs in training data

  4. Check if model was over-trained (too many epochs)

  5. Verify dataset quality β€” bad data = bad output
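Item 3 above is easy to verify mechanically before training. A minimal check (the `<|endoftext|>` string is an assumption for illustration; use your own `tokenizer.eos_token`):

```python
def ends_with_eos(formatted_text: str, eos_token: str) -> bool:
    """Check that a formatted training example terminates with the EOS token."""
    return formatted_text.rstrip().endswith(eos_token)

eos = "<|endoftext|>"  # assumption: substitute tokenizer.eos_token for your model
good = "### Instruction:\nSay hi.\n\n### Response:\nHi!<|endoftext|>"
bad  = "### Instruction:\nSay hi.\n\n### Response:\nHi!"
print(ends_with_eos(good, eos), ends_with_eos(bad, eos))  # True False
```

Without a terminal EOS, the model never learns where a response ends and tends to ramble until `max_new_tokens` is exhausted.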

Training Loss Not DecreasingΒΆ

Solutions:

  1. Check learning rate is not too low

  2. Verify data formatting is correct (outputs not empty)

  3. Check that tokenizer chat template matches model

  4. Ensure gradient checkpointing is compatible with your model version

ImportError / Version ConflictsΒΆ

# Known compatible versions (as of early 2025)
!pip install transformers==4.40.0 trl==0.8.6 peft==0.10.0 \
             accelerate==0.29.3 bitsandbytes==0.43.1 datasets==2.18.0
# Free GPU memory after training (useful in notebooks)
import gc

def free_memory(*objects):
    """Drop references and clear the GPU cache.

    Note: `del` here only removes this function's own references. The objects
    are reclaimed only once the notebook-level variables are gone too, so also
    run e.g. `del model, trainer` in your namespace.
    """
    for obj in objects:
        del obj  # drops this function's reference only
    del objects
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()
    print("GPU memory freed.")
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1e9
        reserved  = torch.cuda.memory_reserved() / 1e9
        print(f"  Allocated: {allocated:.2f} GB")
        print(f"  Reserved : {reserved:.2f} GB")


# Uncomment to free memory:
# free_memory(model, trainer, ft_model, ft_tokenizer)

print("Memory management utilities ready.")
print("Call free_memory(model, trainer) when you are done to reclaim VRAM.")

Summary β€” Complete SFT PipelineΒΆ

Step 1: Load raw dataset
   load_dataset("tatsu-lab/alpaca")
   
Step 2: Format with prompt template
   format_alpaca(example) β†’ {"text": "..."}
   
Step 3: Analyze token lengths
   Choose max_seq_length to cover 95th percentile
   
Step 4: Load model in BF16
   AutoModelForCausalLM.from_pretrained(..., torch_dtype=bfloat16)
   model.gradient_checkpointing_enable()
   
Step 5: Configure SFTConfig
   learning_rate=2e-5, gradient_accumulation_steps=8
   bf16=True, packing=True
   
Step 6: Create SFTTrainer and train
   trainer = SFTTrainer(model=model, args=config, tokenizer=tokenizer,
                        train_dataset=train_dataset, eval_dataset=eval_dataset)
   trainer.train()
   
Step 7: Monitor loss curves
   Watch for convergence, avoid overfitting
   
Step 8: Save and test
   trainer.save_model(output_dir)
   Generate sample responses

Key TakeawaysΒΆ

  1. Full SFT updates all model weights β€” powerful but memory-intensive

  2. Gradient checkpointing is essential for training on limited VRAM

  3. Mixed precision (BF16) cuts memory nearly in half with no quality loss

  4. Gradient accumulation lets you achieve large effective batch sizes on small GPUs

  5. Sequence packing dramatically improves training throughput (2-3x)

  6. Monitor eval loss β€” stop training when it starts to rise

  7. Flash Attention 2 gives 2-4x speedup for long sequences on Ampere+ GPUs

Next StepsΒΆ

  • Notebook 03 β€” LoRA Basics: reduce trainable parameters to ~0.5% for similar quality

  • Notebook 04 β€” QLoRA: fine-tune 7B+ models on a single consumer GPU

  • Notebook 05 β€” Instruction Tuning: curate better instruction datasets

  • Notebook 07 β€” RLHF/DPO: align model behavior with human preferences
