Supervised Fine-Tuning (SFT) - Complete Workflow

Phase 12 - LLM Fine-tuning | Notebook 02

What You Will Learn

  • What supervised fine-tuning is and when to use it over prompting

  • Full fine-tuning vs parameter-efficient fine-tuning (PEFT) trade-offs

  • Setting up SFTTrainer from the TRL library

  • Training on alpaca-style instruction-following datasets

  • Monitoring training with loss curves

  • Gradient accumulation and mixed-precision training

  • Using Flash Attention 2 for faster training

  • Saving and loading checkpoints

  • Complete working pipeline: data → trained model → inference

  • Troubleshooting common training failures

Model used: Qwen/Qwen2.5-1.5B-Instruct
Dataset: tatsu-lab/alpaca (52K instruction-following examples)
Estimated GPU: 16GB+ for full SFT in BF16; 8GB+ with gradient checkpointing

SFT Pipeline Overview

Raw Data                Formatted Data           Fine-tuned Model
─────────               ──────────────           ────────────────
{instruction,    ──►   <|system|>          ──►   Adapted weights
 input,                Follow instructions        Lower loss on
 output}               <|user|>                   target tasks
                        {instruction}              Consistent
                        <|assistant|>              output format
                        {output}

SFT teaches a model to follow a specific style, format, and task; it does NOT add new knowledge. For new knowledge, use RAG.

# Install required libraries
# Run this cell once, then restart the kernel
# (version specifiers must be quoted, or the shell treats ">" as a redirect)
!pip install -q "transformers>=4.40.0" "trl>=0.8.6" "peft>=0.10.0" "datasets>=2.18.0" "accelerate>=0.29.0" "bitsandbytes>=0.43.0" matplotlib torch

1. What Is SFT and When Should You Use It?

Supervised Fine-Tuning (SFT) is the process of continuing to train a pre-trained language model on a labeled dataset of (input, desired_output) pairs. The model has already learned general language understanding from pre-training; SFT steers it toward a specific behavior or task.

Decision Framework

Situation                                      Recommended Approach
────────────────────────────────────────────   ─────────────────────────────────────
Task solvable with a good prompt               Prompt engineering (no training cost)
Need consistent output format                  Few-shot prompting or SFT
Need specialized style (tone, domain jargon)   SFT
Have 100–100K labeled examples                 SFT
Need to inject domain knowledge                RAG (retrieval-augmented generation)
Need aligned / safe behavior                   RLHF or DPO (after SFT)
Limited GPU memory                             QLoRA (see Notebook 04)

What SFT Changes

SFT optimizes the causal language modeling loss on your dataset:

\[\mathcal{L}_{SFT} = -\sum_{t} \log P_{\theta}(y_t \mid x, y_{<t})\]
  • \(x\) = the instruction/input tokens (context)

  • \(y_t\) = the expected output tokens

  • Only the output tokens contribute to the loss (the prompt tokens are masked)
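The loss above can be made concrete with a toy example in pure Python. The token probabilities here are hypothetical, not model outputs; the point is that only response positions enter the sum:

```python
import math

# Toy sequence: prompt tokens followed by response tokens.
# probs[i] is a hypothetical model probability of the correct token at position i.
tokens      = ["###", "Instruction:", "Say", "hi", "###", "Response:", "Hello", "!"]
probs       = [0.9,   0.8,            0.7,   0.6,  0.9,   0.8,         0.5,     0.4]
is_response = [False, False,          False, False, False, False,      True,    True]

# SFT loss: negative log-likelihood summed over response positions only.
# Prompt positions are masked (label -100 in Hugging Face conventions).
loss = -sum(math.log(p) for p, m in zip(probs, is_response) if m)
mean_loss = loss / sum(is_response)
print(f"summed NLL over response tokens : {loss:.4f}")
print(f"mean loss (what trainers report): {mean_loss:.4f}")
```

Note that trainers report the mean over unmasked positions, so the number you see in logs is comparable across batches of different lengths.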

Common SFT Use Cases

  1. Instruction following: teach a base model to follow commands politely

  2. Domain adaptation: medical notes, legal documents, code in a specific style

  3. Output format: always respond in JSON, always cite sources

  4. Language/dialect: fine-tune for a specific language not well-covered in pre-training

  5. Persona: customer support bot with specific brand voice

2. Full Fine-Tuning vs Parameter-Efficient Fine-Tuning (PEFT)

Method          Trainable Params   VRAM (7B model)   Speed         Quality              Best For
─────────────   ────────────────   ───────────────   ───────────   ──────────────────   ─────────────────────
Full SFT        100% (7B)          ~56 GB            Slowest       Highest              Large compute budgets
LoRA (r=16)     ~0.5% (35M)        ~16 GB            3-5x faster   Very close to full   Most practical cases
QLoRA (r=16)    ~0.5% (35M)        ~6 GB             2-4x faster   Slight quality gap   Consumer GPUs
Prefix Tuning   <0.1%              ~14 GB            Fast          Lower                Simple tasks
Adapters        ~1–3%              ~16 GB            Fast          Good                 Multi-task learning
IA3             <0.1%              ~14 GB            Fastest       Moderate             Low-resource tasks

This notebook demonstrates full SFT (all parameters updated) on a 1.5B model, which fits in 8–12 GB VRAM in BF16. For larger models, use LoRA (Notebook 03) or QLoRA (Notebook 04).

Key insight: Full SFT is only worth it if you have a large, high-quality dataset (>10K examples) AND sufficient compute. For most production use cases, LoRA achieves 95-99% of full SFT quality at 10x lower cost.

# Core imports and environment verification
import os
import sys
import json
import math
import warnings
from pathlib import Path
from typing import Dict, List, Optional

import torch
import transformers
import trl
import datasets
import peft

warnings.filterwarnings("ignore")

print("=" * 55)
print("Environment Check")
print("=" * 55)
print(f"Python          : {sys.version.split()[0]}")
print(f"PyTorch         : {torch.__version__}")
print(f"Transformers    : {transformers.__version__}")
print(f"TRL             : {trl.__version__}")
print(f"PEFT            : {peft.__version__}")
print(f"Datasets        : {datasets.__version__}")
print()

# GPU check
if torch.cuda.is_available():
    device = "cuda"
    gpu_name  = torch.cuda.get_device_name(0)
    vram_gb   = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU             : {gpu_name}")
    print(f"VRAM            : {vram_gb:.1f} GB")
    print(f"BF16 supported  : {torch.cuda.is_bf16_supported()}")
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    device = "mps"
    print("GPU             : Apple Silicon (MPS)")
    print("Note: BF16 not fully supported on MPS; will use FP32")
else:
    device = "cpu"
    print("GPU             : Not available - running on CPU (very slow)")

print(f"\nSelected device : {device}")

# Determine training dtype: BF16 on Ampere+ CUDA, FP16 on older CUDA GPUs,
# FP32 elsewhere (half precision is poorly supported on CPU/MPS)
if device == "cuda" and torch.cuda.is_bf16_supported():
    TORCH_DTYPE = torch.bfloat16
    DTYPE_STR = "bfloat16"
elif device == "cuda":
    TORCH_DTYPE = torch.float16
    DTYPE_STR = "float16"
else:
    TORCH_DTYPE = torch.float32
    DTYPE_STR = "float32"

print(f"Training dtype  : {DTYPE_STR}")
print("=" * 55)

3. Dataset Preparation - Alpaca Format

The most widely used instruction-tuning format is Alpaca, originally from Stanford. Each example has three fields:

{
  "instruction": "Describe the structure of an atom.",
  "input": "",
  "output": "An atom consists of a nucleus containing protons and neutrons..."
}

When input is non-empty, it provides additional context:

{
  "instruction": "Translate the following sentence to French.",
  "input": "The weather is beautiful today.",
  "output": "Le temps est magnifique aujourd'hui."
}

Formatting Strategies

You must convert raw Alpaca fields into a single text string that the model sees. Two common templates:

Alpaca Prompt Template (classic):

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}

ChatML Template (modern; used by Qwen and many other recent chat models):

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{instruction}\n{input}<|im_end|>
<|im_start|>assistant
{output}<|im_end|>

Critical: Always use the chat template that matches your base model. Mismatched templates reduce quality significantly.
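In practice the safest route is tokenizer.apply_chat_template, which applies the model's own template. As a sketch of what the Qwen-style ChatML string looks like when assembled by hand (build_chatml is an illustrative helper, not a library API):

```python
def build_chatml(instruction: str, input_text: str = "", output: str = "",
                 system: str = "You are a helpful assistant.") -> str:
    """Illustrative helper: assemble a ChatML-formatted training string."""
    user_content = f"{instruction}\n{input_text}" if input_text else instruction
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user_content}<|im_end|>\n"
        f"<|im_start|>assistant\n{output}<|im_end|>"
    )

text = build_chatml("Translate to French.", "Good morning.", "Bonjour.")
print(text)
```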

Loss Masking

By default, SFTTrainer computes the loss over every token in the sequence, prompt included. To train only on the response, pass a DataCollatorForCompletionOnlyLM (configured with your response template, e.g. "### Response:") so the prompt tokens are masked out. Masking keeps the model from "wasting" capacity on re-predicting instruction text it is given anyway.
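In Hugging Face conventions, masking means setting prompt positions in the labels to -100, which cross-entropy ignores. A minimal sketch of that mechanism over plain token-id lists (the token ids are made up for illustration):

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def mask_prompt_labels(input_ids, response_start):
    """Copy input_ids into labels, masking everything before the response."""
    labels = list(input_ids)
    for i in range(response_start):
        labels[i] = IGNORE_INDEX
    return labels

# Toy example: 5 prompt tokens followed by 3 response tokens
input_ids = [101, 2023, 2003, 1037, 3231, 7592, 999, 102]
labels = mask_prompt_labels(input_ids, response_start=5)
print(labels)  # first 5 positions masked with -100
```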

from datasets import load_dataset, DatasetDict

# Load the Stanford Alpaca dataset (52K instruction examples)
print("Loading tatsu-lab/alpaca dataset...")
raw_dataset = load_dataset("tatsu-lab/alpaca", split="train")
print(f"Total examples: {len(raw_dataset):,}")
print(f"Columns: {raw_dataset.column_names}")
print()

# Preview a few examples
print("Sample examples:")
print("-" * 60)
for i, example in enumerate(raw_dataset.select(range(3))):
    print(f"\n[Example {i+1}]")
    instr, inp, out = example["instruction"], example["input"], example["output"]
    print(f"  instruction : {instr[:80] + '...' if len(instr) > 80 else instr}")
    print(f"  input       : {inp[:60] if inp else '(empty)'}")
    print(f"  output      : {out[:80] + '...' if len(out) > 80 else out}")
# Formatting function: converts Alpaca fields β†’ single training string
# We use the Alpaca prompt template (compatible with most models)

ALPACA_PROMPT_WITH_INPUT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""

ALPACA_PROMPT_WITHOUT_INPUT = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
{output}"""


def format_alpaca(example: Dict) -> Dict:
    """Format a single Alpaca example into a training text string."""
    if example.get("input", "").strip():
        text = ALPACA_PROMPT_WITH_INPUT.format(
            instruction=example["instruction"],
            input=example["input"],
            output=example["output"],
        )
    else:
        text = ALPACA_PROMPT_WITHOUT_INPUT.format(
            instruction=example["instruction"],
            output=example["output"],
        )
    return {"text": text}


# Apply formatting to the full dataset
formatted_dataset = raw_dataset.map(
    format_alpaca,
    remove_columns=raw_dataset.column_names,  # drop original columns
    desc="Formatting Alpaca examples",
)

# Split into train / validation
split = formatted_dataset.train_test_split(test_size=0.02, seed=42)
train_dataset = split["train"]
eval_dataset  = split["test"]

print(f"Train examples : {len(train_dataset):,}")
print(f"Eval examples  : {len(eval_dataset):,}")
print()
print("Sample formatted text:")
print("-" * 60)
print(train_dataset[0]["text"][:500])
print("...")
# Analyze token length distribution to choose max_seq_length
# This helps avoid truncating too many examples or padding wastefully

from transformers import AutoTokenizer
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for notebook
import matplotlib.pyplot as plt
import numpy as np

MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct"

print(f"Loading tokenizer: {MODEL_ID}")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Set pad token (required for batch training)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id
    print(f"Set pad_token = eos_token ({tokenizer.eos_token!r})")

print(f"Vocabulary size : {tokenizer.vocab_size:,}")
print(f"Max model length: {tokenizer.model_max_length}")
print()

# Sample 2000 examples to estimate lengths (faster than full dataset)
sample_size = min(2000, len(train_dataset))
sample_texts = train_dataset.select(range(sample_size))["text"]

lengths = [
    len(tokenizer.encode(t, add_special_tokens=True))
    for t in sample_texts
]

lengths_arr = np.array(lengths)
print(f"Token length statistics (n={sample_size}):")
print(f"  min    : {lengths_arr.min()}")
print(f"  mean   : {lengths_arr.mean():.1f}")
print(f"  median : {np.median(lengths_arr):.1f}")
print(f"  p95    : {np.percentile(lengths_arr, 95):.1f}")
print(f"  p99    : {np.percentile(lengths_arr, 99):.1f}")
print(f"  max    : {lengths_arr.max()}")
print()

p95 = int(np.percentile(lengths_arr, 95))
MAX_SEQ_LENGTH = min(2048, max(512, p95 + 64))  # round up, cap at 2048
print(f"Recommended max_seq_length: {MAX_SEQ_LENGTH}")
print(f"(covers {(lengths_arr <= MAX_SEQ_LENGTH).mean() * 100:.1f}% of examples without truncation)")

# Plot distribution
fig, ax = plt.subplots(figsize=(9, 4))
ax.hist(lengths_arr, bins=60, color="steelblue", edgecolor="white", alpha=0.85)
ax.axvline(MAX_SEQ_LENGTH, color="firebrick", linestyle="--", linewidth=2,
           label=f"max_seq_length = {MAX_SEQ_LENGTH}")
ax.axvline(np.median(lengths_arr), color="darkorange", linestyle="-", linewidth=2,
           label=f"median = {np.median(lengths_arr):.0f}")
ax.set_xlabel("Token count per example", fontsize=12)
ax.set_ylabel("Number of examples", fontsize=12)
ax.set_title("Token Length Distribution - Alpaca Dataset", fontsize=14)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("token_length_distribution.png", dpi=120)
plt.show()
print("Saved: token_length_distribution.png")

4. Loading the Base Model

Memory Requirements at Different Precisions

A rough rule of thumb: an N-billion-parameter model needs about N × (bytes per parameter) GB of VRAM for the weights alone.

Precision     Bytes/param   1.5B model   7B model   13B model   70B model
───────────   ───────────   ──────────   ────────   ─────────   ─────────
FP32          4 bytes       6 GB         28 GB      52 GB       280 GB
BF16/FP16     2 bytes       3 GB         14 GB      26 GB       140 GB
INT8          1 byte        1.5 GB       7 GB       13 GB       70 GB
NF4 (QLoRA)   0.5 bytes     0.75 GB      3.5 GB     6.5 GB      35 GB

For training, you also need memory for:

  • Gradients: same size as the model

  • Optimizer states (AdamW): momentum + variance, roughly 2–4× the model size depending on whether the states are kept in BF16 or FP32

  • Activations: depends on batch size and sequence length

Total VRAM for full SFT in BF16 (7B model): ~14 (model) + 14 (gradients) + 28 (optimizer, BF16 states) ≈ 56 GB, and more if the optimizer states are kept in FP32

This is why we use a 1.5B model for full SFT demonstrations and QLoRA for larger models.

Gradient Checkpointing

Gradient checkpointing trades compute for memory. Instead of storing all intermediate activations, it recomputes them during the backward pass. This reduces activation memory by ~5-10x at the cost of ~20% slower training. Always enable it for full SFT.

from transformers import AutoModelForCausalLM, AutoTokenizer

def get_gpu_memory_gb() -> float:
    """Return currently allocated GPU memory in GB."""
    if torch.cuda.is_available():
        return torch.cuda.memory_allocated() / 1e9
    return 0.0

def count_parameters(model) -> Dict:
    """Count total and trainable parameters."""
    total      = sum(p.numel() for p in model.parameters())
    trainable  = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return {
        "total":     total,
        "trainable": trainable,
        "pct":       trainable / total * 100,
    }


print(f"Loading model: {MODEL_ID}")
print(f"Dtype: {DTYPE_STR}")
mem_before = get_gpu_memory_gb()

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=TORCH_DTYPE,          # BF16 or FP16 depending on GPU
    device_map="auto",                 # automatically place layers across available GPUs
    trust_remote_code=True,            # required for Qwen models
    attn_implementation="eager",       # use "flash_attention_2" if flash-attn installed
)

mem_after = get_gpu_memory_gb()

# Enable gradient checkpointing to save activation memory
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})

# Enable input requires_grad for gradient checkpointing compatibility
model.enable_input_require_grads()

params = count_parameters(model)

print()
print(f"Model loaded successfully")
print(f"  Total parameters   : {params['total']:,} ({params['total']/1e9:.2f}B)")
print(f"  Trainable params   : {params['trainable']:,} ({params['pct']:.1f}%)")
print(f"  GPU memory (model) : {mem_after - mem_before:.2f} GB")
print(f"  Model dtype        : {next(model.parameters()).dtype}")
print(f"  Device map         : {model.hf_device_map if hasattr(model, 'hf_device_map') else device}")

5. Training Hyperparameter Guide

Choosing the right hyperparameters is critical. Here is a practical guide:

Learning Rate

Scenario                     Suggested LR
──────────────────────────   ─────────────────────────────────
Full SFT (all weights)       1e-5 to 3e-5
LoRA adapters only           1e-4 to 3e-4
Very large dataset (>100K)   5e-6 to 1e-5
Small dataset (<1K)          5e-6 (lower to avoid overfitting)

  • Too high: loss explodes, NaN gradients

  • Too low: very slow convergence, may never reach good performance

  • Rule: start with 2e-5 for full SFT; tune if needed

Batch Size and Gradient Accumulation

The effective batch size = per_device_train_batch_size × gradient_accumulation_steps × num_gpus

Target effective batch size: 32–128 for instruction tuning.

8 GB GPU  → per_device=1, grad_accum=32  → effective=32
16 GB GPU → per_device=4, grad_accum=8   → effective=32
80 GB GPU → per_device=16, grad_accum=2  → effective=32

Warmup

Always use a warmup period (5-10% of total steps). This prevents early training instability when gradients are large and noisy.

LR Scheduler

  • cosine: best general choice; smoothly decays LR to near zero

  • linear: simple alternative, slightly worse

  • constant_with_warmup: good for short runs
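The warmup-plus-cosine shape is simple enough to sketch in pure Python (this mirrors the math behind transformers' get_cosine_schedule_with_warmup; the lr_at_step helper is illustrative):

```python
import math

def lr_at_step(step, total_steps, warmup_steps, peak_lr=2e-5):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total, warmup = 1000, 50  # warmup_ratio=0.05 -> 5% of 1000 steps
for s in [0, 25, 50, 500, 1000]:
    print(f"step {s:>4}: lr = {lr_at_step(s, total, warmup):.2e}")
```

The LR ramps linearly to the peak at step 50, then decays along a half cosine to zero at the final step.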

Epochs vs Steps

For large datasets (>10K examples), 1-3 epochs is usually enough. Over-training causes catastrophic forgetting (the model loses general knowledge). Monitor validation loss and stop when it starts increasing.
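The "stop when validation loss starts increasing" rule is what transformers' EarlyStoppingCallback automates; the underlying logic is just a patience counter, sketched here in plain Python (should_stop is an illustrative helper):

```python
def should_stop(eval_losses, patience=3, min_delta=0.0):
    """Return True once eval loss has failed to improve for `patience` evals."""
    best = float("inf")
    bad_evals = 0
    for loss in eval_losses:
        if loss < best - min_delta:   # improvement: reset the counter
            best = loss
            bad_evals = 0
        else:                         # no improvement this eval
            bad_evals += 1
            if bad_evals >= patience:
                return True
    return False

print(should_stop([2.1, 1.6, 1.3, 1.25, 1.24]))       # still improving
print(should_stop([2.1, 1.6, 1.3, 1.35, 1.4, 1.38]))  # 3 non-improving evals
```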

Weight Decay

Use weight_decay=0.01 or 0.1 to regularize. Do not apply to bias/norm parameters (Hugging Face handles this automatically).

Max Gradient Norm

Set max_grad_norm=1.0 to clip gradients and prevent spikes. If you see NaN losses, try 0.3.
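Global-norm clipping (what max_grad_norm controls, and what torch.nn.utils.clip_grad_norm_ implements over parameter tensors) rescales all gradients by one factor when their combined L2 norm exceeds the threshold. A sketch over a plain list of values:

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale grads so their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return grads
    scale = max_norm / total_norm
    return [g * scale for g in grads]

grads = [3.0, 4.0]                        # global norm = 5.0
clipped = clip_by_global_norm(grads, max_norm=1.0)
print(clipped)                            # direction preserved, norm scaled to 1.0
```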

# Demonstrate effective batch size calculation
print("Effective Batch Size Calculator")
print("=" * 50)

configs = [
    {"vram": "4 GB",  "per_device": 1, "grad_accum": 32, "n_gpus": 1},
    {"vram": "8 GB",  "per_device": 2, "grad_accum": 16, "n_gpus": 1},
    {"vram": "16 GB", "per_device": 4, "grad_accum": 8,  "n_gpus": 1},
    {"vram": "24 GB", "per_device": 8, "grad_accum": 4,  "n_gpus": 1},
    {"vram": "2×24 GB", "per_device": 8,  "grad_accum": 4,  "n_gpus": 2},
    {"vram": "8×80 GB", "per_device": 32, "grad_accum": 1,  "n_gpus": 8},
]

print(f"{'Hardware':<12} {'per_device':>10} {'grad_accum':>10} {'n_gpus':>7} {'effective_bs':>13}")
print("-" * 60)
for c in configs:
    effective = c["per_device"] * c["grad_accum"] * c["n_gpus"]
    print(f"{c['vram']:<12} {c['per_device']:>10} {c['grad_accum']:>10} {c['n_gpus']:>7} {effective:>13}")

print()
print("Mixed precision comparison:")
print("-" * 40)

model_params = 1.5e9  # 1.5B parameters
# Note: full SFT actually runs in FP32/BF16/FP16; the INT8/NF4 rows illustrate
# weight storage only (quantized training updates LoRA adapters instead).
for dtype, bytes_per_param in [("FP32", 4), ("BF16/FP16", 2), ("INT8", 1), ("NF4", 0.5)]:
    model_gb   = model_params * bytes_per_param / 1e9
    grad_gb    = model_params * bytes_per_param / 1e9  # gradients match model dtype (full SFT)
    optim_gb   = model_params * 8 / 1e9                # AdamW: 2 FP32 states (4 bytes each)
    total_gb   = model_gb + grad_gb + optim_gb
    print(f"{dtype:<10}: model={model_gb:.1f}GB  grad={grad_gb:.1f}GB  optim={optim_gb:.1f}GB  total≈{total_gb:.1f}GB")

6. Flash Attention 2

Flash Attention 2 is a memory-efficient, IO-aware exact attention algorithm by Tri Dao (2023). It:

  • Reduces attention memory from O(N²) to O(N) in VRAM

  • Achieves 2-4x speedup over standard attention for long sequences

  • Is mathematically equivalent (exact, not approximate)

  • Requires: Ampere+ GPU (RTX 30-series, A100, H100), CUDA 11.6+

When to Use It

Sequence Length   Speedup   Memory Savings
───────────────   ───────   ──────────────
512 tokens        ~1.2x     ~20%
1024 tokens       ~1.8x     ~40%
2048 tokens       ~2.5x     ~60%
4096+ tokens      ~4x+      ~75%+

Installation

pip install flash-attn --no-build-isolation
# Or via conda:
conda install -c conda-forge flash-attn

Compilation takes 10-15 minutes on first install.

# Check if Flash Attention 2 is available and demonstrate how to enable it

def check_flash_attention() -> bool:
    """Check if flash_attn is installed and the GPU supports it."""
    try:
        import flash_attn
        fa_version = flash_attn.__version__
        if not torch.cuda.is_available():
            print("Flash Attention requires a CUDA GPU - not available.")
            return False
        major, minor = torch.cuda.get_device_capability()
        if major < 8:  # Ampere is SM80
            print(f"Flash Attention requires an Ampere+ GPU (SM80+). Current: SM{major}{minor}")
            return False
            return False
        print(f"Flash Attention {fa_version} available - GPU is compatible.")
        return True
    except ImportError:
        print("flash_attn not installed. Install with: pip install flash-attn --no-build-isolation")
        return False


USE_FLASH_ATTN = check_flash_attention()
ATTN_IMPL = "flash_attention_2" if USE_FLASH_ATTN else "eager"
print(f"\nUsing attention implementation: {ATTN_IMPL}")

print()
print("To load a model with Flash Attention 2:")
print("-" * 50)
print("""
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",  # <-- enable FA2
)

# Note: Flash Attention 2 requires bfloat16 or float16 (not float32)
# Note: Not all model architectures support FA2 (check HF docs)
""")

7. Setting Up SFTTrainer

SFTTrainer is the Hugging Face TRL library's specialized trainer for supervised fine-tuning. It extends the base Trainer with:

  • Optional response masking via DataCollatorForCompletionOnlyLM (so instruction tokens do not contribute to the loss)

  • Sequence packing (combine short examples to fill the context window, for 2-3x throughput)

  • Dataset formatting utilities

  • LoRA/QLoRA integration

  • Built-in logging of training/eval loss
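Sequence packing can be sketched as a greedy bin fill: concatenate tokenized examples until the next one would overflow max_seq_length, then start a new sequence. (TRL's implementation differs in details, e.g. it may split examples across boundaries; pack_greedy below is an illustrative helper.)

```python
def pack_greedy(example_lengths, max_seq_length):
    """Greedily group example token counts into packed sequences."""
    packed, current = [], 0
    for n in example_lengths:
        n = min(n, max_seq_length)            # oversized examples get truncated
        if current + n > max_seq_length:      # next example would overflow
            packed.append(current)            # close the current sequence
            current = 0
        current += n
    if current:
        packed.append(current)
    return packed

lengths = [180, 120, 300, 90, 512, 60, 240, 210]
packed = pack_greedy(lengths, max_seq_length=512)
print(f"{len(lengths)} examples -> {len(packed)} packed sequences: {packed}")
```

Fewer, fuller sequences mean fewer padding tokens, which is where the throughput gain comes from.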

Key Parameters

Parameter                     Description                                 Typical Value
───────────────────────────   ─────────────────────────────────────────   ───────────────────────
dataset_text_field            Column name containing the formatted text   "text"
max_seq_length                Truncate sequences longer than this         512–2048
packing                       Pack multiple short examples into one       True for short examples
dataset_num_proc              CPU workers for tokenization                4
num_train_epochs              Number of passes over the data              1–3
per_device_train_batch_size   Batch size per GPU                          1–8
gradient_accumulation_steps   Steps before optimizer update               8–32
learning_rate                 Peak LR (cosine schedule)                   2e-5
warmup_ratio                  Fraction of steps for warmup                0.05
bf16                          Use BF16 mixed precision                    True (Ampere+)
fp16                          Use FP16 mixed precision                    True (older GPUs)
save_steps                    Save checkpoint every N steps               100–500
logging_steps                 Log metrics every N steps                   10–50

from trl import SFTTrainer, SFTConfig

OUTPUT_DIR = "./sft-qwen2.5-1.5b-alpaca"

# -----------------------------------------------------------------------
# Training arguments β€” each parameter is documented
# -----------------------------------------------------------------------
training_args = SFTConfig(
    # --- Output ---
    output_dir=OUTPUT_DIR,                 # where to save checkpoints + logs
    run_name="qwen2.5-sft-alpaca",         # experiment name for logging

    # --- Dataset ---
    max_seq_length=MAX_SEQ_LENGTH,         # truncate sequences beyond this
    dataset_text_field="text",             # column in dataset containing text
    packing=True,                          # pack short examples β†’ higher throughput
    dataset_num_proc=4,                    # CPU workers for tokenization

    # --- Batch size / accumulation ---
    per_device_train_batch_size=2,         # per-GPU batch size (lower if OOM)
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,         # effective batch = 2 × 8 = 16

    # --- Optimizer ---
    optim="adamw_torch_fused",             # fused AdamW (faster on CUDA)
    learning_rate=2e-5,                    # peak LR for full SFT
    weight_decay=0.01,                     # L2 regularization
    max_grad_norm=1.0,                     # gradient clipping threshold
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,

    # --- LR schedule ---
    num_train_epochs=1,                    # 1 epoch for demo; use 2-3 for production
    lr_scheduler_type="cosine",            # cosine decay (recommended)
    warmup_ratio=0.05,                     # 5% of steps for warmup

    # --- Mixed precision ---
    bf16=torch.cuda.is_available() and torch.cuda.is_bf16_supported(),
    fp16=torch.cuda.is_available() and not torch.cuda.is_bf16_supported(),

    # --- Memory optimizations ---
    gradient_checkpointing=True,           # trade compute for memory
    gradient_checkpointing_kwargs={"use_reentrant": False},
    dataloader_pin_memory=True,            # faster data transfer to GPU
    dataloader_num_workers=2,

    # --- Checkpointing ---
    save_strategy="steps",
    save_steps=200,
    save_total_limit=3,                    # keep only last 3 checkpoints
    load_best_model_at_end=True,           # load checkpoint with best eval loss

    # --- Evaluation ---
    eval_strategy="steps",
    eval_steps=200,

    # --- Logging ---
    logging_steps=10,
    logging_first_step=True,
    report_to="none",                      # change to "wandb" or "tensorboard" for tracking

    # --- Reproducibility ---
    seed=42,
    data_seed=42,
)

print("SFTConfig created successfully.")
print(f"  Output dir          : {training_args.output_dir}")
print(f"  Epochs              : {training_args.num_train_epochs}")
print(f"  Batch (per device)  : {training_args.per_device_train_batch_size}")
print(f"  Grad accumulation   : {training_args.gradient_accumulation_steps}")
print(f"  Effective batch     : {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"  Learning rate       : {training_args.learning_rate}")
print(f"  LR scheduler        : {training_args.lr_scheduler_type}")
print(f"  BF16                : {training_args.bf16}")
print(f"  FP16                : {training_args.fp16}")
print(f"  Grad checkpointing  : {training_args.gradient_checkpointing}")
print(f"  Max seq length      : {training_args.max_seq_length}")
# Initialize SFTTrainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

# Estimate optimizer steps for one epoch (fewer in practice when packing merges examples)
effective_bs = (training_args.per_device_train_batch_size
                * training_args.gradient_accumulation_steps)
print(f"Estimated steps/epoch: {len(train_dataset) // effective_bs:,}")

# Show model summary
print()
print("SFTTrainer initialized.")
print(f"  Train dataset size  : {len(train_dataset):,}")
print(f"  Eval dataset size   : {len(eval_dataset):,}")
print(f"  Trainable params    : {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
print()
print("Ready to train. Run trainer.train() to start.")

8. Training and Monitoring Loss

What to Watch During Training

Good training looks like:

  • Train loss decreases smoothly from ~2.5 to ~1.0 (for instruction tuning)

  • Eval loss closely tracks train loss (small gap = good generalization)

  • No NaN losses or sudden spikes

Warning signs:

Symptom                             Likely Cause                         Fix
─────────────────────────────────   ──────────────────────────────────   ────────────────────────────────────────────────
Loss goes NaN                       LR too high, mixed precision issue   Lower LR, switch FP16 → BF16
Loss stuck at ~2.5                  LR too low, bad data format          Increase LR, check formatting
Eval loss rises while train falls   Overfitting                          Fewer epochs, more data, stronger regularization
OOM error                           Batch size too large                 Reduce batch size, enable gradient checkpointing
Very slow                           No mixed precision                   Enable bf16=True or fp16=True

Interpreting Loss Values

For causal LM on instruction data:

  • Initial loss: ~2.0–3.0 (depends on model and data)

  • After 1 epoch: ~1.0–1.5 (converging)

  • After 3 epochs: ~0.8–1.2 (well-trained)

  • Below 0.5: Likely memorizing training data (overfit)
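Cross-entropy loss maps directly to perplexity (ppl = e^loss), which can be easier to interpret: roughly how many tokens the model is torn between at each step. Converting the milestones above:

```python
import math

# Convert the loss milestones above into perplexity (ppl = e^loss)
for stage, loss in [("initial", 2.5), ("after 1 epoch", 1.2),
                    ("after 3 epochs", 1.0), ("overfit territory", 0.5)]:
    ppl = math.exp(loss)
    print(f"{stage:<18} loss={loss:.1f}  perplexity={ppl:.2f}")
```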

import time

print("Starting training...")
print("Note: On CPU this will be very slow. Use a GPU for meaningful training.")
print("=" * 55)

mem_before_train = get_gpu_memory_gb()
start_time = time.time()

# Run training
train_result = trainer.train()

elapsed = time.time() - start_time
mem_after_train = get_gpu_memory_gb()

print()
print("Training complete!")
print(f"  Total time          : {elapsed/60:.1f} min")
print(f"  Train loss (final)  : {train_result.training_loss:.4f}")
print(f"  Samples/sec         : {train_result.metrics.get('train_samples_per_second', 'N/A')}")
print(f"  GPU memory used     : {mem_after_train:.2f} GB")
# Plot training and evaluation loss curves

def extract_loss_history(trainer_state):
    """Extract train and eval loss from trainer log history."""
    train_steps, train_losses = [], []
    eval_steps,  eval_losses  = [], []

    for entry in trainer_state.log_history:
        if "loss" in entry and "eval_loss" not in entry:
            train_steps.append(entry["step"])
            train_losses.append(entry["loss"])
        if "eval_loss" in entry:
            eval_steps.append(entry["step"])
            eval_losses.append(entry["eval_loss"])

    return train_steps, train_losses, eval_steps, eval_losses


train_steps, train_losses, eval_steps, eval_losses = extract_loss_history(trainer.state)

if train_losses:
    fig, ax = plt.subplots(figsize=(10, 5))

    ax.plot(train_steps, train_losses, color="steelblue", linewidth=1.5,
            alpha=0.8, label="Train loss")

    if eval_losses:
        ax.plot(eval_steps, eval_losses, color="firebrick", linewidth=2.0,
                marker="o", markersize=5, label="Eval loss")

    # Smoothed train loss (moving average)
    if len(train_losses) > 10:
        window = max(1, len(train_losses) // 20)
        smoothed = np.convolve(train_losses, np.ones(window)/window, mode="valid")
        smooth_steps = train_steps[window-1:]
        ax.plot(smooth_steps, smoothed, color="navy", linewidth=2.5,
                label=f"Train loss (smoothed, w={window})")

    ax.set_xlabel("Training Steps", fontsize=12)
    ax.set_ylabel("Cross-Entropy Loss", fontsize=12)
    ax.set_title("SFT Training - Loss Curves", fontsize=14)
    ax.legend(fontsize=11)
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig("sft_loss_curves.png", dpi=120)
    plt.show()
    print("Saved: sft_loss_curves.png")
else:
    print("No loss history available (training may not have completed).")
    print("Generating synthetic loss curve for illustration...")

    # Illustrative synthetic loss curve
    steps = np.arange(1, 501)
    train_loss_synth = 2.5 * np.exp(-steps / 150) + 0.9 + 0.1 * np.random.randn(500) * np.exp(-steps / 300)
    eval_loss_synth  = 2.4 * np.exp(-steps[::20] / 150) + 0.95 + 0.05 * np.random.randn(25)

    fig, ax = plt.subplots(figsize=(10, 5))
    ax.plot(steps, train_loss_synth, color="steelblue", linewidth=1.2, alpha=0.7, label="Train loss")
    ax.plot(steps[::20], eval_loss_synth, color="firebrick", linewidth=2.0,
            marker="o", markersize=5, label="Eval loss")
    ax.set_xlabel("Training Steps", fontsize=12)
    ax.set_ylabel("Cross-Entropy Loss", fontsize=12)
    ax.set_title("SFT Training - Loss Curves (Illustrative)", fontsize=14)
    ax.legend(fontsize=11)
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig("sft_loss_curves.png", dpi=120)
    plt.show()

9. Saving Checkpoints

What Gets Saved

When you call trainer.save_model(), the following files are written:

sft-qwen2.5-1.5b-alpaca/
├── config.json                  # model architecture config
├── tokenizer.json               # tokenizer vocabulary
├── tokenizer_config.json
├── special_tokens_map.json
├── model.safetensors            # model weights (safetensors format)
│   OR
├── pytorch_model.bin            # model weights (older PyTorch format)
└── training_args.bin            # training configuration

Checkpoint vs Final Model

  • Checkpoint (checkpoint-200/, checkpoint-400/): Intermediate saves during training. Includes optimizer state for resuming.

  • Final model (root output dir): The final weights after all training. Smaller because optimizer state is not included.

Resuming from Checkpoint

trainer.train(resume_from_checkpoint="./output_dir/checkpoint-200")
# or to resume from the latest checkpoint automatically:
trainer.train(resume_from_checkpoint=True)
import os

# Save the final model and tokenizer
print("Saving fine-tuned model...")
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

# Save training metrics
metrics = train_result.metrics
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()

# List saved files
print()
print(f"Files saved to: {OUTPUT_DIR}/")
saved_files = sorted(os.listdir(OUTPUT_DIR))
for f in saved_files:
    full_path = os.path.join(OUTPUT_DIR, f)
    if os.path.isfile(full_path):
        size_mb = os.path.getsize(full_path) / 1e6
        print(f"  {f:<45} {size_mb:>8.1f} MB")
    else:
        print(f"  {f}/ (directory)")

print()
total_mb = sum(
    os.path.getsize(os.path.join(OUTPUT_DIR, f)) / 1e6
    for f in os.listdir(OUTPUT_DIR)
    if os.path.isfile(os.path.join(OUTPUT_DIR, f))
)
print(f"Total model size: {total_mb:.0f} MB ({total_mb/1000:.2f} GB)")
# Load the fine-tuned model and run inference

from transformers import AutoModelForCausalLM, AutoTokenizer

print("Loading fine-tuned model for inference...")
ft_tokenizer = AutoTokenizer.from_pretrained(OUTPUT_DIR, trust_remote_code=True)
ft_model = AutoModelForCausalLM.from_pretrained(
    OUTPUT_DIR,
    torch_dtype=TORCH_DTYPE,
    device_map="auto",
    trust_remote_code=True,
)
ft_model.eval()


def generate_response(model, tokenizer, instruction: str, input_text: str = "",
                       max_new_tokens: int = 256, temperature: float = 0.7) -> str:
    """Generate a response using the fine-tuned model."""
    if input_text.strip():
        prompt = ALPACA_PROMPT_WITH_INPUT.format(
            instruction=instruction,
            input=input_text,
            output="",  # empty β€” model will complete this
        )
    else:
        prompt = ALPACA_PROMPT_WITHOUT_INPUT.format(
            instruction=instruction,
            output="",
        )
    # The empty output slot leaves a trailing "### Response:" header followed by
    # whitespace; strip the whitespace so the header acts as the generation cue
    prompt = prompt.rstrip()

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    input_len = inputs["input_ids"].shape[1]

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            do_sample=temperature > 0,
            top_p=0.9,
            repetition_penalty=1.1,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
        )

    # Decode only the newly generated tokens
    new_tokens = outputs[0][input_len:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()


# Test with a few examples
test_cases = [
    {"instruction": "Explain what photosynthesis is in simple terms.", "input": ""},
    {"instruction": "Write a haiku about programming.", "input": ""},
    {"instruction": "Convert this temperature from Celsius to Fahrenheit.", "input": "100Β°C"},
]

print("=" * 60)
print("Fine-tuned Model Inference")
print("=" * 60)

for i, tc in enumerate(test_cases):
    print(f"\n[Test {i+1}]")
    print(f"Instruction: {tc['instruction']}")
    if tc['input']:
        print(f"Input: {tc['input']}")
    response = generate_response(ft_model, ft_tokenizer, tc["instruction"], tc["input"])
    print(f"Response: {response}")
    print("-" * 60)

10. Troubleshooting Common IssuesΒΆ

CUDA Out of Memory (OOM)ΒΆ

RuntimeError: CUDA out of memory. Tried to allocate X GB.

Solutions (try in order):

  1. Reduce per_device_train_batch_size to 1

  2. Increase gradient_accumulation_steps to compensate

  3. Enable gradient_checkpointing=True

  4. Reduce max_seq_length (e.g., 512 instead of 2048)

  5. Use a smaller model or switch to QLoRA (see Notebook 04)

# Quick OOM debug: clear cache and check memory
torch.cuda.empty_cache()
print(torch.cuda.memory_summary())
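Items 1 and 2 go together: reducing the batch size and raising accumulation by the same factor trades peak memory for more forward passes, while the optimizer still sees the same effective batch per update. A quick sanity check (plain Python, no library assumptions):

```python
def effective_batch_size(per_device_bs: int, grad_accum_steps: int, num_gpus: int = 1) -> int:
    """Effective (global) batch size the optimizer sees per weight update."""
    return per_device_bs * grad_accum_steps * num_gpus

# Before the OOM fix: 8 examples per step on one GPU
print(effective_batch_size(8, 1))   # 8
# After: batch size 1, accumulation 8 -- same effective batch, far less peak memory
print(effective_batch_size(1, 8))   # 8
```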

Loss is NaNΒΆ

{'loss': nan, 'learning_rate': ...}

Solutions:

  1. Lower learning rate by 10x

  2. Add gradient clipping: max_grad_norm=0.3

  3. Switch from FP16 to BF16 (FP16 has smaller dynamic range)

  4. Check for bad data (empty outputs, very long sequences)
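For item 4, a quick scan over alpaca-style records flags the usual suspects. The `instruction`/`output` keys match this notebook's dataset; the `max_chars` cutoff is an arbitrary illustration, not a recommended value:

```python
def find_bad_examples(dataset, max_chars: int = 8000):
    """Return (index, reason) pairs for empty outputs or extremely long examples."""
    bad = []
    for i, ex in enumerate(dataset):
        if not ex.get("output", "").strip():
            bad.append((i, "empty output"))
        elif len(ex.get("instruction", "")) + len(ex.get("output", "")) > max_chars:
            bad.append((i, "too long"))
    return bad

sample = [
    {"instruction": "Say hi.", "output": "Hi!"},
    {"instruction": "Broken example", "output": "   "},  # whitespace-only output
]
print(find_bad_examples(sample))  # [(1, 'empty output')]
```

Drop or repair the flagged examples before retrying the run.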

Slow TrainingΒΆ

Solutions:

  1. Enable BF16/FP16 mixed precision

  2. Set dataloader_num_workers=4

  3. Enable dataloader_pin_memory=True

  4. Use optim="adamw_torch_fused"

  5. Install Flash Attention 2

  6. Enable sequence packing (packing=True)
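Item 6 pays off because padding each short example up to `max_seq_length` wastes most of the batch on pad tokens. The sketch below uses a deliberately naive model of packing (pure concatenation into fixed-length blocks, ignoring the EOS separators TRL inserts), just to show the scale of the saving:

```python
import math

def padding_fraction(lengths, max_seq_len):
    """Fraction of pad tokens when each example is padded to max_seq_len."""
    total_slots = len(lengths) * max_seq_len
    real = sum(min(l, max_seq_len) for l in lengths)
    return 1 - real / total_slots

def packed_padding_fraction(lengths, max_seq_len):
    """Fraction of pad tokens after naive concatenation into fixed-length blocks."""
    real = sum(lengths)
    blocks = math.ceil(real / max_seq_len)
    return 1 - real / (blocks * max_seq_len)

lengths = [180, 320, 95, 410, 60, 240]  # token counts per example
print(f"per-example padding: {padding_fraction(lengths, 1024):.0%}")   # 79%
print(f"packed padding:      {packed_padding_fraction(lengths, 1024):.0%}")  # 36%
```

With typical alpaca-length examples, most of the throughput gain quoted in this notebook comes from no longer computing attention over pad tokens.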

Model Generates Garbage After TrainingΒΆ

Solutions:

  1. Check prompt template matches training format exactly

  2. Verify pad_token is set correctly

  3. Ensure EOS token is included at end of outputs in training data

  4. Check if model was over-trained (too many epochs)

  5. Verify dataset quality β€” bad data = bad output
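Item 3 above is easy to verify mechanically before training. A minimal check (the `<|endoftext|>` string is an assumption for illustration; use your own `tokenizer.eos_token`):

```python
def ends_with_eos(formatted_text: str, eos_token: str) -> bool:
    """Check that a formatted training example terminates with the EOS token."""
    return formatted_text.rstrip().endswith(eos_token)

eos = "<|endoftext|>"  # assumption: substitute tokenizer.eos_token for your model
good = "### Instruction:\nSay hi.\n\n### Response:\nHi!<|endoftext|>"
bad  = "### Instruction:\nSay hi.\n\n### Response:\nHi!"
print(ends_with_eos(good, eos), ends_with_eos(bad, eos))  # True False
```

Without a terminal EOS, the model never learns where a response ends and tends to ramble until `max_new_tokens` is exhausted.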

Training Loss Not DecreasingΒΆ

Solutions:

  1. Check learning rate is not too low

  2. Verify data formatting is correct (outputs not empty)

  3. Check that tokenizer chat template matches model

  4. Ensure gradient checkpointing is compatible with your model version

ImportError / Version ConflictsΒΆ

# Known compatible versions (as of early 2025)
!pip install transformers==4.40.0 trl==0.8.6 peft==0.10.0 \
             accelerate==0.29.3 bitsandbytes==0.43.1 datasets==2.18.0
# Free GPU memory after training (useful in notebooks)
import gc

def free_memory(*objects):
    """Drop references and clear the GPU cache.

    Note: `del` here only removes this function's own references. The objects
    are reclaimed only once the notebook-level variables are gone too, so also
    run e.g. `del model, trainer` in your namespace.
    """
    for obj in objects:
        del obj  # drops this function's reference only
    del objects
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()
    print("GPU memory freed.")
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1e9
        reserved  = torch.cuda.memory_reserved() / 1e9
        print(f"  Allocated: {allocated:.2f} GB")
        print(f"  Reserved : {reserved:.2f} GB")


# Uncomment to free memory:
# free_memory(model, trainer, ft_model, ft_tokenizer)

print("Memory management utilities ready.")
print("Call free_memory(model, trainer) when you are done to reclaim VRAM.")

Summary β€” Complete SFT PipelineΒΆ

Step 1: Load raw dataset
   load_dataset("tatsu-lab/alpaca")
   
Step 2: Format with prompt template
   format_alpaca(example) β†’ {"text": "..."}
   
Step 3: Analyze token lengths
   Choose max_seq_length to cover 95th percentile
   
Step 4: Load model in BF16
   AutoModelForCausalLM.from_pretrained(..., torch_dtype=bfloat16)
   model.gradient_checkpointing_enable()
   
Step 5: Configure SFTConfig
   learning_rate=2e-5, gradient_accumulation_steps=8
   bf16=True, packing=True
   
Step 6: Create SFTTrainer and train
   trainer = SFTTrainer(model=model, args=config, tokenizer=tokenizer,
                        train_dataset=train_dataset, eval_dataset=eval_dataset)
   trainer.train()
   
Step 7: Monitor loss curves
   Watch for convergence, avoid overfitting
   
Step 8: Save and test
   trainer.save_model(output_dir)
   Generate sample responses

Key TakeawaysΒΆ

  1. Full SFT updates all model weights β€” powerful but memory-intensive

  2. Gradient checkpointing is essential for training on limited VRAM

  3. Mixed precision (BF16) cuts memory nearly in half with no quality loss

  4. Gradient accumulation lets you achieve large effective batch sizes on small GPUs

  5. Sequence packing dramatically improves training throughput (2-3x)

  6. Monitor eval loss β€” stop training when it starts to rise

  7. Flash Attention 2 gives 2-4x speedup for long sequences on Ampere+ GPUs

Next StepsΒΆ

  • Notebook 03 β€” LoRA Basics: reduce trainable parameters to ~0.5% for similar quality

  • Notebook 04 β€” QLoRA: fine-tune 7B+ models on a single consumer GPU

  • Notebook 05 β€” Instruction Tuning: curate better instruction datasets

  • Notebook 07 β€” RLHF/DPO: align model behavior with human preferences
