QLoRA β€” Memory-Efficient Fine-Tuning on Consumer GPUsΒΆ

Phase 12 β€” LLM Fine-tuning | Notebook 04

What You Will LearnΒΆ

  • QLoRA architecture: 4-bit quantized base model + LoRA adapters

  • BitsAndBytes quantization: NF4 vs FP4 β€” when each is appropriate

  • How QLoRA makes fine-tuning a 70B model possible on a single 48 GB GPU

  • Configuring BitsAndBytesConfig correctly

  • Paged AdamW optimizer β€” why it matters for large models

  • Nested (double) quantization β€” extra memory savings

  • QLoRA vs LoRA β€” decision guide by hardware and task

  • Memory calculation formulas

  • DoRA (Weight-Decomposed LoRA) β€” often better than standard LoRA

  • RSLoRA (Rank-Stabilized LoRA) β€” stable at high rank values

  • PEFT method comparison benchmarks

  • Complete runnable QLoRA example from data to inference

Model used: Qwen/Qwen2.5-1.5B-Instruct
Minimum GPU: 4 GB VRAM (1.5B model in 4-bit)

QLoRA ArchitectureΒΆ

QLoRA = 4-bit Quantized Base Model  +  LoRA Adapters (16-bit)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Transformer Layer                        β”‚
β”‚                                                             β”‚
β”‚   Input ──► [ W_frozen (NF4 4-bit) ] ──► BF16 output       β”‚
β”‚                     +                                       β”‚
β”‚             [ B Β· A  (BF16 LoRA)   ] ──► BF16 delta        β”‚
β”‚                     ↑                                       β”‚
β”‚               Only these weights                            β”‚
β”‚               are trained!                                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Memory breakdown for a 7B model:
  Base weights (NF4 4-bit) : ~3.5 GB   ← frozen, 4x compression
  LoRA adapters (BF16)     : ~0.2 GB   ← trainable
  Gradients                : ~0.2 GB   ← only for LoRA
  Optimizer states         : ~0.4 GB   ← only for LoRA
  Activations              : ~1.5 GB   ← depends on batch/seq
  ─────────────────────────────────────
  Total                    : ~5.8 GB   ← fits in 6 GB GPU!

Original paper: Dettmers et al. (2023), QLoRA: Efficient Finetuning of Quantized LLMs

# Install required libraries
# bitsandbytes is the key library for 4-bit quantization
!pip install -q "transformers>=4.40.0" "trl>=0.8.6" "peft>=0.10.0" "datasets>=2.18.0" \
               "accelerate>=0.29.0" "bitsandbytes>=0.43.0" matplotlib torch

1. BitsAndBytes Quantization: NF4 vs FP4ΒΆ

Quantization maps floating point weights to a smaller numerical format. BitsAndBytes supports two 4-bit formats:

NF4 β€” Normal Float 4ΒΆ

NF4 is the key innovation in QLoRA. It exploits the observation that pre-trained neural network weights follow an approximately normal distribution centered at zero.

NF4 places quantization levels unevenly β€” more levels near zero (where most weights cluster) and fewer at the extremes. This is information-theoretically optimal for normally distributed data.

Standard INT4 (uniform levels):
─────────────────────────────
  ← -2.0  -1.5  -1.0  -0.5  0.0  0.5  1.0  1.5  2.0 β†’
    ←—|β€”β€”|β€”β€”|β€”β€”|β€”β€”|β€”β€”|β€”β€”|β€”β€”|β€”β€”|β€”β€”β†’  (equally spaced)

NF4 (non-uniform, optimized for normal distribution):
─────────────────────────────────────────────────────
  ← -1.0   -0.5       0    0.5   1.0 β†’
    |β€”||β€”|||||||β€”|||β€”||||||β€”|β€”β€”|β€”|   (dense near 0, sparse at extremes)

NF4 achieves lower quantization error than INT4 or FP4 for weights that follow a normal distribution.
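The gap is easy to check numerically. The sketch below quantizes normally distributed weights with the 16 published NF4 code-book values (from the QLoRA paper, rounded here to four decimals) and with 16 uniformly spaced INT4 levels, then compares the mean squared error. Everything besides the level values is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(100_000)
w = w / np.abs(w).max()  # absmax-normalize to [-1, 1], as blockwise quantization does

# The 16 NF4 code-book values (QLoRA paper / bitsandbytes), rounded to 4 decimals
nf4 = np.array([-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
                0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0])
int4 = np.linspace(-1.0, 1.0, 16)  # uniform 4-bit levels for comparison

def quantize(w, levels):
    # Snap each weight to its nearest code-book level
    idx = np.abs(w[:, None] - levels[None, :]).argmin(axis=1)
    return levels[idx]

err_nf4  = np.mean((w - quantize(w, nf4)) ** 2)
err_int4 = np.mean((w - quantize(w, int4)) ** 2)
print(f"MSE NF4 : {err_nf4:.6f}")
print(f"MSE INT4: {err_int4:.6f}")
```

Because most of the normalized weights cluster near zero, where NF4 places its levels densely, the NF4 error comes out lower than the uniform INT4 error.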

FP4 β€” Floating Point 4ΒΆ

FP4 uses 1 sign bit + 2 exponent bits + 1 mantissa bit, mimicking the structure of FP32/BF16. It handles a wider dynamic range than NF4 but is less optimal for normally distributed weights.

Comparison TableΒΆ

| Format | Best For | Quantization Error | Memory |
|--------|----------|--------------------|--------|
| FP32 | Training, high precision | None (baseline) | 4 bytes/param |
| BF16 | Training (Ampere+) | Minimal | 2 bytes/param |
| FP16 | Training (V100, T4) | Minimal | 2 bytes/param |
| INT8 | Inference | Low | 1 byte/param |
| NF4 | QLoRA fine-tuning (normal dist.) | Low (best for LLMs) | 0.5 bytes/param |
| FP4 | QLoRA fine-tuning (general) | Moderate | 0.5 bytes/param |
| INT4 | Inference only | High | 0.5 bytes/param |

For QLoRA, always use NF4 unless you have a specific reason to use FP4.

How BitsAndBytes Quantization WorksΒΆ

BitsAndBytes uses blockwise quantization:

  1. Divide weights into blocks of 64 elements

  2. Find the maximum absolute value in each block (the scale)

  3. Normalize block to [-1, 1] range

  4. Quantize to NF4/FP4 levels

  5. Store: quantized values (4-bit) + per-block scale (FP32)

The per-block scale adds ~0.5 bits/param overhead β†’ effective ~4.5 bits/param total.
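The five steps above fit in a few lines of NumPy. This is a toy re-implementation for intuition, not the bitsandbytes kernel; the uniform code-book stands in for the real NF4 levels.

```python
import numpy as np

def blockwise_quantize(weights, levels, block_size=64):
    """Toy blockwise quantization mirroring steps 1-5 above (sketch)."""
    w = weights.reshape(-1, block_size)                 # step 1: blocks of 64
    scales = np.abs(w).max(axis=1, keepdims=True)       # step 2: per-block absmax
    normed = w / scales                                 # step 3: normalize to [-1, 1]
    idx = np.abs(normed[..., None] - levels).argmin(axis=-1)  # step 4: nearest level
    return idx.astype(np.uint8), scales                 # step 5: 4-bit codes + FP32 scales

def blockwise_dequantize(idx, scales, levels, shape):
    # Inverse: look up levels, rescale, restore original shape
    return (levels[idx] * scales).reshape(shape)

levels = np.linspace(-1.0, 1.0, 16)  # stand-in code-book (NF4 in practice)
w = np.random.default_rng(1).standard_normal(256).astype(np.float32)
codes, scales = blockwise_quantize(w, levels)
w_hat = blockwise_dequantize(codes, scales, levels, w.shape)
print("max abs error:", np.abs(w - w_hat).max())
```

Each code fits in 4 bits (values 0-15), and each 64-element block carries one FP32 scale, which is where the ~0.5 bits/param of overhead comes from.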

import os
import sys
import gc
import math
import warnings
from typing import Dict, Optional, Tuple

import torch
import transformers
import trl
import peft
import datasets as ds_lib
import bitsandbytes as bnb
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np

warnings.filterwarnings("ignore")

print("=" * 55)
print("Environment Check")
print("=" * 55)
print(f"Python          : {sys.version.split()[0]}")
print(f"PyTorch         : {torch.__version__}")
print(f"Transformers    : {transformers.__version__}")
print(f"TRL             : {trl.__version__}")
print(f"PEFT            : {peft.__version__}")
print(f"BitsAndBytes    : {bnb.__version__}")
print(f"Datasets        : {ds_lib.__version__}")
print()

if torch.cuda.is_available():
    device = "cuda"
    gpu_name  = torch.cuda.get_device_name(0)
    vram_gb   = torch.cuda.get_device_properties(0).total_memory / 1e9
    major, minor = torch.cuda.get_device_capability()
    print(f"GPU             : {gpu_name}")
    print(f"VRAM            : {vram_gb:.1f} GB")
    print(f"Compute Cap.    : SM{major}{minor}")
    print(f"BF16 supported  : {torch.cuda.is_bf16_supported()}")
    print(f"4-bit supported : {major >= 7}  (SM70+ Volta or newer)")
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = "mps"
    print("GPU             : Apple Silicon (MPS)")
    print("Note: 4-bit quantization not supported on MPS. Using CPU fallback.")
else:
    device = "cpu"
    print("GPU             : Not available")

print(f"\nDevice          : {device}")
print("=" * 55)

2. Memory Calculation: Why QLoRA Enables 70B on Consumer GPUsΒΆ

Let us do the math precisely.

Full Fine-Tuning Memory RequirementsΒΆ

For a model with P parameters trained in FP16/BF16:

| Component | Formula | 7B example | 70B example |
|-----------|---------|------------|-------------|
| Model weights | P Γ— 2 bytes | 14 GB | 140 GB |
| Gradients | P Γ— 2 bytes | 14 GB | 140 GB |
| AdamW momentum | P Γ— 4 bytes | 28 GB | 280 GB |
| AdamW variance | P Γ— 4 bytes | 28 GB | 280 GB |
| Activations | ~batch Γ— seq Γ— hidden Γ— layers Γ— 2 | 2–8 GB | 4–16 GB |
| **Total** | ~12 Γ— P bytes | ~86 GB | ~860 GB |

QLoRA Memory RequirementsΒΆ

With base model quantized to NF4 and only LoRA adapters trained:

| Component | Formula | 7B example | 70B example |
|-----------|---------|------------|-------------|
| Base model (NF4) | P Γ— 0.5 bytes | 3.5 GB | 35 GB |
| LoRA weights (BF16) | r Γ— (d_in + d_out) Γ— L Γ— 2 | ~0.2 GB | ~0.5 GB |
| Gradients (LoRA only) | same as LoRA | ~0.2 GB | ~0.5 GB |
| Paged AdamW (LoRA) | LoRA Γ— 8 bytes | ~0.4 GB | ~1.0 GB |
| Activations | same formula | 1–4 GB | 2–8 GB |
| **Total** | | ~5.3–8 GB | ~39–45 GB |

Compression RatioΒΆ

  • 7B model: Full SFT needs ~86 GB β†’ QLoRA needs ~6 GB β†’ 14x reduction

  • 70B model: Full SFT needs ~860 GB β†’ QLoRA needs ~40 GB β†’ 21x reduction

This is why QLoRA is transformative: a single 48 GB GPU can fine-tune a 70B parameter model that would otherwise require ~10 A100 GPUs, and a 24 GB RTX 3090 can handle models up to roughly 30B.

def calculate_model_memory(
    params_billions: float,
    lora_rank: int = 16,
    hidden_size: int = 4096,
    num_layers: int = 32,
    num_lora_target_modules: int = 4,  # q, k, v, o projections
) -> Dict[str, float]:
    """
    Calculate GPU memory requirements for different training strategies.
    Returns memory in GB for each component and strategy.
    """
    P = params_billions * 1e9

    # LoRA parameter count: r * (d_in + d_out) per module
    lora_params = lora_rank * 2 * hidden_size * num_layers * num_lora_target_modules

    strategies = {}

    # Full fine-tuning in FP32
    strategies["Full SFT (FP32)"] = {
        "model":     P * 4 / 1e9,
        "gradients": P * 4 / 1e9,
        "optimizer": P * 8 / 1e9,  # AdamW: 2 Γ— FP32 states
        "activations": params_billions * 0.5,  # rough estimate
    }

    # Full fine-tuning in BF16
    strategies["Full SFT (BF16)"] = {
        "model":     P * 2 / 1e9,
        "gradients": P * 2 / 1e9,
        "optimizer": P * 8 / 1e9,  # optimizer states kept in FP32
        "activations": params_billions * 0.4,
    }

    # LoRA in BF16 (base model not quantized)
    strategies["LoRA (BF16 base)"] = {
        "model":     P * 2 / 1e9,
        "gradients": lora_params * 2 / 1e9,
        "optimizer": lora_params * 8 / 1e9,
        "activations": params_billions * 0.4,
    }

    # QLoRA: 4-bit base + LoRA adapters in BF16
    strategies["QLoRA (NF4 base)"] = {
        "model":     P * 0.5 / 1e9,          # 4-bit = ~0.5 bytes/param
        "gradients": lora_params * 2 / 1e9,  # only LoRA gradients
        "optimizer": lora_params * 8 / 1e9,  # only LoRA optimizer states
        "activations": params_billions * 0.3,
    }

    for name, breakdown in strategies.items():
        breakdown["total"] = sum(breakdown.values())

    return strategies


# Print memory tables for different model sizes
model_configs = [
    {"name": "1.5B", "params": 1.5, "hidden": 1536, "layers": 28},
    {"name": "7B",   "params": 7.0, "hidden": 4096, "layers": 32},
    {"name": "13B",  "params": 13.0,"hidden": 5120, "layers": 40},
    {"name": "70B",  "params": 70.0,"hidden": 8192, "layers": 80},
]

for cfg in model_configs:
    print(f"\n{'='*65}")
    print(f"Model: {cfg['name']} parameters  (hidden={cfg['hidden']}, layers={cfg['layers']})")
    print(f"{'='*65}")
    strategies = calculate_model_memory(
        cfg["params"], hidden_size=cfg["hidden"], num_layers=cfg["layers"]
    )
    print(f"{'Strategy':<22} {'Model':>7} {'Grads':>7} {'Optim':>7} {'Activ':>7} {'TOTAL':>8}")
    print("-" * 65)
    for strategy_name, mem in strategies.items():
        print(
            f"{strategy_name:<22} "
            f"{mem['model']:>6.1f}G "
            f"{mem['gradients']:>6.2f}G "
            f"{mem['optimizer']:>6.2f}G "
            f"{mem['activations']:>6.1f}G "
            f"{mem['total']:>7.1f}G"
        )

3. Configuring BitsAndBytesConfigΒΆ

BitsAndBytesConfig is the Hugging Face wrapper for bitsandbytes quantization settings. Here are all the key parameters explained:

Core ParametersΒΆ

BitsAndBytesConfig(
    load_in_4bit=True,                     # Enable 4-bit loading (vs load_in_8bit)
    bnb_4bit_quant_type="nf4",             # "nf4" (best for LLMs) or "fp4"
    bnb_4bit_compute_dtype=torch.bfloat16, # Upcast to this dtype for the forward pass
    bnb_4bit_use_double_quant=True,        # Double quantization (nested quant)
)

bnb_4bit_compute_dtypeΒΆ

Weights are stored in 4-bit but computations (matrix multiplications) are upcasted to this dtype. Think of it as:

  • Storage: NF4 (0.5 bytes/param) β€” saves memory

  • Compute: BF16 or FP16 β€” maintains numerical stability

Use torch.bfloat16 on Ampere+ GPUs, torch.float16 on older GPUs.

Nested Quantization (Double Quant)ΒΆ

The quantization scale factors themselves (normally FP32) are also quantized to 8-bit. This saves another ~0.37 bits/param:

  • Without double quant: ~4.5 bits/param effective

  • With double quant: ~4.13 bits/param effective

For a 70B model: saves an additional ~3 GB of VRAM. Always enable it.
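The arithmetic behind that figure is a quick back-of-envelope check using the effective-bit numbers above:

```python
params = 70e9                             # 70B parameters
saved_bits = 4.5 - 4.13                   # effective bits/param without vs with double quant
saved_gb = params * saved_bits / 8 / 1e9  # bits -> bytes -> GB
print(f"Double quant saves ~{saved_gb:.1f} GB on a 70B model")  # ~3.2 GB
```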

8-bit Loading (Alternative)ΒΆ

BitsAndBytesConfig(load_in_8bit=True)  # INT8 quantization

Uses LLM.int8() mixed-precision decomposition. Better quality than 4-bit but uses 2x the memory. Good for inference when quality matters more than memory.

from transformers import BitsAndBytesConfig

# Determine compute dtype based on GPU capability
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    compute_dtype = torch.bfloat16
    compute_dtype_str = "bfloat16"
else:
    compute_dtype = torch.float16
    compute_dtype_str = "float16"

print("BitsAndBytesConfig Configurations")
print("=" * 55)

# ── Configuration 1: NF4 with double quantization (recommended) ──
bnb_config_nf4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # Normal Float 4 β€” optimal for LLM weights
    bnb_4bit_compute_dtype=compute_dtype,  # upcast to BF16/FP16 for matmul
    bnb_4bit_use_double_quant=True,        # quantize the scale factors too
)

print("\nConfig 1: NF4 + Double Quantization (RECOMMENDED for QLoRA)")
print(f"  quant_type     : nf4")
print(f"  compute_dtype  : {compute_dtype_str}")
print(f"  double_quant   : True")
print(f"  effective bits : ~4.13 bits/param")
print(f"  7B model VRAM  : ~3.5 GB")

# ── Configuration 2: NF4 without double quantization ──
bnb_config_nf4_nodq = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,       # no nested quantization
)

print("\nConfig 2: NF4 without Double Quantization")
print(f"  quant_type     : nf4")
print(f"  compute_dtype  : {compute_dtype_str}")
print(f"  double_quant   : False")
print(f"  effective bits : ~4.5 bits/param")
print(f"  7B model VRAM  : ~3.9 GB")

# ── Configuration 3: FP4 (alternative) ──
bnb_config_fp4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",             # Floating Point 4 (less optimal for LLMs)
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)

print("\nConfig 3: FP4 + Double Quantization")
print(f"  quant_type     : fp4")
print(f"  compute_dtype  : {compute_dtype_str}")
print(f"  double_quant   : True")
print(f"  note           : Higher quantization error for LLMs than NF4")

# ── Configuration 4: INT8 (for inference or high-quality 8-bit) ──
bnb_config_int8 = BitsAndBytesConfig(
    load_in_8bit=True,                     # INT8 via LLM.int8()
    llm_int8_threshold=6.0,                # outlier threshold for mixed-precision
    llm_int8_has_fp16_weight=False,
)

print("\nConfig 4: INT8 (LLM.int8())")
print(f"  load_in_8bit   : True")
print(f"  threshold      : 6.0 (outlier columns stay in FP16)")
print(f"  effective bits : ~8.5 bits/param")
print(f"  7B model VRAM  : ~7 GB")
print(f"  note           : Better quality than 4-bit, 2x more memory")

print()
print("Using bnb_config_nf4 (Config 1) for all further examples.")

# This is the config we will use
BNB_CONFIG = bnb_config_nf4

4. Paged AdamW OptimizerΒΆ

The Problem: Optimizer State Memory SpikesΒΆ

Standard AdamW keeps optimizer states (momentum + variance) for all trainable parameters in GPU VRAM at all times. For large models, this causes:

  • Sudden OOM crashes during gradient updates

  • Inability to use large batch sizes

  • GPU memory fragmentation over time

The Solution: Paged MemoryΒΆ

NVIDIA’s unified memory system allows GPU memory to be paged to CPU RAM when the GPU runs out, similar to how virtual memory pages from RAM to disk in operating systems.

Paged AdamW (paged_adamw_32bit, paged_adamw_8bit) stores optimizer states in pageable GPU memory. When VRAM runs low:

  1. Optimizer states are evicted to CPU RAM

  2. GPU continues processing

  3. States are paged back when needed

This eliminates OOM crashes from optimizer state spikes without sacrificing quality.
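In a Trainer/SFTTrainer setup, switching to a paged optimizer is a single flag. A minimal, illustrative fragment (the output directory and batch settings are placeholders, not recommendations):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="qlora-out",           # placeholder path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    optim="paged_adamw_8bit",         # optimizer states in pageable memory, 8-bit quantized
)
print(args.optim)
```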

Available Optimizers for QLoRAΒΆ

| Optimizer | Memory | Speed | Quality | Use Case |
|-----------|--------|-------|---------|----------|
| adamw_torch | High | Fast | Best | Enough VRAM |
| adamw_torch_fused | High | Fastest | Best | Large GPU |
| paged_adamw_32bit | Medium | Moderate | Best | Limited VRAM |
| paged_adamw_8bit | Low | Moderate | Near-best | Minimal VRAM |
| adamw_bnb_8bit | Low | Fast | Near-best | Production QLoRA |
| sgd | Lowest | Fast | Lower | Research |

Recommendation: Use paged_adamw_8bit for QLoRA on GPUs with <16 GB VRAM.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import prepare_model_for_kbit_training

MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct"


def get_gpu_memory_gb() -> Tuple[float, float]:
    """Return (allocated_GB, total_GB) GPU memory."""
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1e9
        total     = torch.cuda.get_device_properties(0).total_memory / 1e9
        return allocated, total
    return 0.0, 0.0


# --- Memory before loading ---
mem_before, total_vram = get_gpu_memory_gb()
print(f"VRAM before loading: {mem_before:.2f} GB / {total_vram:.1f} GB")
print()

# --- Load tokenizer ---
print(f"Loading tokenizer: {MODEL_ID}")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id
print(f"Vocab size: {tokenizer.vocab_size:,}")
print(f"Pad token : {tokenizer.pad_token!r} (id={tokenizer.pad_token_id})")
print()

# --- Load 4-bit quantized model ---
print(f"Loading 4-bit quantized model: {MODEL_ID}")
print(f"Quantization: NF4 + double quant + {compute_dtype_str} compute")

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=BNB_CONFIG,       # 4-bit NF4 with double quantization
    device_map="auto",                    # auto-distribute across available GPUs
    trust_remote_code=True,
    attn_implementation="eager",          # use "flash_attention_2" if available
)

# --- Memory after loading ---
mem_after, _ = get_gpu_memory_gb()
print()
print(f"VRAM after loading : {mem_after:.2f} GB / {total_vram:.1f} GB")
print(f"Model memory       : {mem_after - mem_before:.2f} GB")
print()

# --- Prepare model for k-bit training ---
# This function:
#   1. Enables gradient checkpointing
#   2. Upcasts LayerNorm weights to FP32 (for numerical stability)
#   3. Sets model in training mode
model = prepare_model_for_kbit_training(
    model,
    use_gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)

total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters  : {total_params:,} ({total_params/1e9:.2f}B)")
print("Model ready for LoRA adapter attachment.")

5. QLoRA vs LoRA: Decision GuideΒΆ

Both QLoRA and LoRA train only a small fraction of parameters via low-rank adapter matrices. The key difference is the base model precision:

| | LoRA | QLoRA |
|---|------|-------|
| Base model | BF16/FP16 (full precision) | NF4 4-bit (quantized) |
| Base model memory | 2 bytes/param | 0.5 bytes/param |
| Training quality | Slightly higher | Very close to LoRA |
| Minimum VRAM (7B) | ~16 GB | ~6 GB |
| Speed | Faster matmul | Slower (dequant overhead; see Section 8) |
| Quantization noise | None | Small |

When to Use LoRA (not QLoRA)ΒΆ

  • You have 16+ GB VRAM and are fine-tuning a 7B model

  • You need absolute maximum quality (competition/research)

  • Speed is more critical than memory efficiency

When to Use QLoRAΒΆ

  • Consumer GPU with 8–24 GB VRAM

  • Fine-tuning 7B+ models on limited hardware

  • Cost-sensitive (renting cloud GPUs by the hour)

  • The model is large enough that BF16 loading is impossible

Practical Rule of ThumbΒΆ

VRAM available (GB) > BF16 footprint (β‰ˆ 2 GB per billion params)?  β†’ Use LoRA
Otherwise?                                                          β†’ Use QLoRA

Examples:
  24 GB GPU, 7B model (14 GB in BF16)  β†’ LoRA (24 > 14)
  16 GB GPU, 13B model (26 GB in BF16) β†’ QLoRA (16 < 26)
   8 GB GPU, 7B model (14 GB in BF16)  β†’ QLoRA (8 < 14)
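The rule of thumb above as a tiny helper (`choose_peft` is illustrative, not a library function):

```python
def choose_peft(vram_gb: float, params_billions: float) -> str:
    """Decide LoRA vs QLoRA from the BF16 footprint (~2 GB per billion params)."""
    bf16_footprint_gb = 2.0 * params_billions
    return "LoRA" if vram_gb > bf16_footprint_gb else "QLoRA"

print(choose_peft(24, 7))    # LoRA   (24 GB > 14 GB)
print(choose_peft(16, 13))   # QLoRA  (16 GB < 26 GB)
print(choose_peft(8, 7))     # QLoRA  ( 8 GB < 14 GB)
```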
from peft import LoraConfig, get_peft_model, TaskType

# Find which linear layer names exist in this model
# This is needed to set target_modules correctly
def find_linear_layer_names(model) -> list:
    """Find all linear layer module names suitable for LoRA."""
    linear_names = set()
    for name, module in model.named_modules():
        if isinstance(module, (torch.nn.Linear, bnb.nn.Linear4bit, bnb.nn.Linear8bitLt)):
            # Get the last part of the name (e.g., "q_proj" from "model.layers.0.self_attn.q_proj")
            parts = name.split(".")
            last_name = parts[-1]
            # Skip output embedding layers
            if last_name not in ["lm_head", "embed_tokens"]:
                linear_names.add(last_name)
    return sorted(list(linear_names))


all_linear_layers = find_linear_layer_names(model)
print("All trainable linear layers in the model:")
for name in all_linear_layers:
    print(f"  {name}")

# Standard target modules for transformer models:
# q_proj, k_proj, v_proj, o_proj = attention projections
# gate_proj, up_proj, down_proj  = MLP/FFN projections
# Targeting all of them (vs just q+v) gives better quality at slightly more memory

ATTENTION_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj"]
MLP_MODULES       = ["gate_proj", "up_proj", "down_proj"]

# Use modules that actually exist in this model
target_attention = [m for m in ATTENTION_MODULES if m in all_linear_layers]
target_mlp       = [m for m in MLP_MODULES if m in all_linear_layers]
target_modules   = target_attention + target_mlp

print(f"\nSelected target modules: {target_modules}")
print(f"  Attention: {target_attention}")
print(f"  MLP/FFN:   {target_mlp}")
# -----------------------------------------------------------------------
# LoRA Configuration for QLoRA
# -----------------------------------------------------------------------
lora_config = LoraConfig(
    # --- Core LoRA parameters ---
    r=16,                          # Rank: higher = more capacity, more memory
                                   # Typical values: 8, 16, 32, 64
                                   # r=8:  minimal; r=64: near full fine-tuning quality

    lora_alpha=32,                 # Scaling: adapter output is multiplied by alpha/r
                                   # Convention: set to 2Γ—r (so scaling = 2.0)
                                   # Higher alpha = stronger adapter influence

    lora_dropout=0.05,             # Dropout applied to LoRA layers
                                   # 0.05–0.1 is typical; 0.0 for small datasets

    bias="none",                   # Whether to train bias terms
                                   # "none" = no bias (most common)
                                   # "all" = train all biases
                                   # "lora_only" = train only LoRA biases

    # --- Target modules ---
    target_modules=target_modules, # Which linear layers to add LoRA to
                                   # More modules = better quality, more params

    # --- Task type ---
    task_type=TaskType.CAUSAL_LM,  # Autoregressive language modeling

    # --- Advanced options (will explore below) ---
    use_dora=False,                # DoRA: Weight-Decomposed LoRA (see section 7)
    use_rslora=False,              # RSLoRA: Rank-Stabilized (see section 8)
    # init_lora_weights="gaussian", # Weight initialization (default: kaiming)
)

# Apply LoRA to the quantized model
model = get_peft_model(model, lora_config)

# Count parameters
def print_param_summary(model):
    total     = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    frozen    = total - trainable
    print(f"  Total params    : {total:>15,} ({total/1e9:.3f}B)")
    print(f"  Trainable (LoRA): {trainable:>15,} ({trainable/total*100:.3f}%)")
    print(f"  Frozen (base)   : {frozen:>15,} ({frozen/total*100:.3f}%)")

print("LoRA applied to quantized model:")
print_param_summary(model)

print()
mem_with_lora, _ = get_gpu_memory_gb()
print(f"VRAM with base + LoRA: {mem_with_lora:.2f} GB")
print()
model.print_trainable_parameters()

6. DoRA β€” Weight-Decomposed Low-Rank AdaptationΒΆ

Paper: Liu et al. (2024), DoRA: Weight-Decomposed Low-Rank Adaptation

DoRA is a drop-in improvement over LoRA that consistently achieves better performance with the same parameter budget.

The InsightΒΆ

DoRA decomposes each weight matrix into magnitude and direction components (inspired by weight normalization):

\[W = \underbrace{m}_{\text{magnitude}} \cdot \underbrace{\frac{W_0 + BA}{\|W_0 + BA\|_c}}_{\text{direction}}\]
  • \(W_0\): frozen pre-trained weight

  • \(B\), \(A\): trainable LoRA matrices (low-rank, as in standard LoRA)

  • \(m\): trainable magnitude vector (one scalar per output feature)
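The decomposition is easy to verify numerically: with \(B\) initialized to zero (as in standard LoRA) and \(m\) initialized to the column norms of \(W_0\), the reconstruction reproduces the frozen weight exactly. A NumPy sketch with toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 2
W0 = rng.standard_normal((d_out, d_in))   # frozen pre-trained weight
B  = np.zeros((d_out, r))                 # LoRA B starts at zero
A  = rng.standard_normal((r, d_in))       # LoRA A
m  = np.linalg.norm(W0, axis=0)           # magnitude: one scalar per column (||.||_c)

V = W0 + B @ A                            # direction component before normalization
W = m * (V / np.linalg.norm(V, axis=0))   # DoRA reconstruction: magnitude x unit direction

print("matches W0 at init:", np.allclose(W, W0))
```

During training, \(B\), \(A\), and \(m\) are updated while \(W_0\) stays frozen, so magnitude and direction can move independently.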

Why It Works BetterΒΆ

Standard LoRA is constrained to make updates in a low-rank subspace. This limits how independently it can adjust the magnitude vs the direction of each weight vector.

DoRA separates these, allowing:

  • Magnitude adjustments without changing direction (and vice versa)

  • Learning patterns more similar to full fine-tuning

  • Better performance at the same rank

Benchmark Results (from the DoRA paper)ΒΆ

| Method | Commonsense (avg) | MMLU | MT-Bench | Extra params |
|--------|-------------------|------|----------|--------------|
| Full FT | 78.6 | 56.1 | 6.8 | 100% |
| LoRA (r=32) | 74.0 | 53.2 | 6.4 | 0.8% |
| DoRA (r=32) | 76.8 | 55.0 | 6.6 | 0.8% |

DoRA consistently outperforms LoRA by roughly 2–3 points across tasks with identical parameter counts.

When to Use DoRAΒΆ

  • When LoRA performance is close but not quite good enough

  • When you want closer-to-full-FT quality without increasing rank

  • When training on complex reasoning or instruction following tasks

Minor drawback: ~5% slower than LoRA due to extra normalization.

# DoRA configuration β€” only requires use_dora=True in LoraConfig

dora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=target_modules,
    task_type=TaskType.CAUSAL_LM,
    use_dora=True,          # <── the only change from standard LoRA
    use_rslora=False,
)

print("DoRA Config (use_dora=True)")
print(f"  rank          : {dora_config.r}")
print(f"  lora_alpha    : {dora_config.lora_alpha}")
print(f"  use_dora      : {dora_config.use_dora}")
print()

# Compare parameter counts: DoRA vs LoRA
print("Parameter comparison for a 7B model (hypothetical, r=16):")
print("-" * 60)

# Approximate LoRA param count for 7B model
hidden_size    = 4096
num_layers     = 32
rank           = 16
n_target_mods  = 7   # q, k, v, o, gate, up, down

lora_params    = 2 * rank * hidden_size * num_layers * n_target_mods
# DoRA adds a magnitude vector per output feature per targeted module
dora_extra     = hidden_size * num_layers * n_target_mods  # magnitude vectors
dora_params    = lora_params + dora_extra

print(f"  LoRA params  : {lora_params:>12,}  ({lora_params/7e9*100:.4f}% of 7B)")
print(f"  DoRA params  : {dora_params:>12,}  ({dora_params/7e9*100:.4f}% of 7B)")
print(f"  DoRA overhead: {dora_extra:>12,}  (+{dora_extra/lora_params*100:.1f}% vs LoRA)")
print()
print("DoRA adds only minimal extra parameters (~3% more than LoRA)")
print("but consistently achieves better fine-tuning quality.")

7. RSLoRA β€” Rank-Stabilized LoRAΒΆ

Paper: Kalajdzievski (2023), A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA

The Problem with Standard LoRA ScalingΒΆ

In standard LoRA, the adapter output is scaled by alpha / r:

\[\Delta W = \frac{\alpha}{r} \cdot B \cdot A\]

This scaling factor decreases as rank increases. The problem:

  • At low rank (r=8): scaling = alpha/8 (relatively large)

  • At high rank (r=64): scaling = alpha/64 (small, learning is slow)

In practice this means standard LoRA becomes harder to train at high ranks, limiting its effectiveness even when you add more parameters.

RSLoRA Fix: Scale by \(\sqrt{r}\) InsteadΒΆ

RSLoRA replaces the alpha / r scaling with alpha / sqrt(r):

\[\Delta W_{RSLoRA} = \frac{\alpha}{\sqrt{r}} \cdot B \cdot A\]

This ensures the gradient norm remains approximately constant regardless of rank, making optimization stable across all rank values.

ImpactΒΆ

| Rank | LoRA scaling | RSLoRA scaling | RSLoRA advantage |
|------|--------------|----------------|------------------|
| r=8 | alpha/8 = 4.0 | alpha/√8 β‰ˆ 11.3 | ~2.8x |
| r=16 | alpha/16 = 2.0 | alpha/√16 = 8.0 | ~4.0x |
| r=32 | alpha/32 = 1.0 | alpha/√32 β‰ˆ 5.7 | ~5.7x |
| r=64 | alpha/64 = 0.5 | alpha/√64 = 4.0 | ~8.0x |

(With alpha=32 as example)

RSLoRA is especially beneficial when using high ranks (r=32+). For low ranks (r=8), the difference is minimal.

When to Use RSLoRAΒΆ

  • When using rank β‰₯ 32

  • When standard LoRA seems to converge slowly

  • When you want more stable training dynamics

# RSLoRA configuration

rslora_config = LoraConfig(
    r=32,                   # Higher rank benefits more from RSLoRA
    lora_alpha=32,          # With RSLoRA, alpha/sqrt(r) = 32/sqrt(32) β‰ˆ 5.66
    lora_dropout=0.05,
    bias="none",
    target_modules=target_modules,
    task_type=TaskType.CAUSAL_LM,
    use_rslora=True,        # <── enable rank-stabilized scaling
    use_dora=False,
)

# DoRA + RSLoRA combined (they are complementary)
dora_rslora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=target_modules,
    task_type=TaskType.CAUSAL_LM,
    use_rslora=True,        # rank-stabilized scaling
    use_dora=True,          # weight decomposition
)

print("Scaling factor comparison (alpha=32):")
print("-" * 50)
print(f"{'Rank':<6} {'LoRA (a/r)':>12} {'RSLoRA (a/√r)':>14} {'Ratio':>8}")
print("-" * 50)
alpha = 32
for r in [4, 8, 16, 32, 64, 128]:
    lora_scale   = alpha / r
    rslora_scale = alpha / math.sqrt(r)
    ratio        = rslora_scale / lora_scale
    print(f"{r:<6} {lora_scale:>12.4f} {rslora_scale:>14.4f} {ratio:>8.2f}x")

print()
print("Summary of LoRA variants:")
print("-" * 50)
variants = [
    ("Standard LoRA",  False, False, "Baseline. Good for r≀16"),
    ("RSLoRA",         True,  False, "Better at rβ‰₯32; stable gradients"),
    ("DoRA",           False, True,  "Better quality; +~5% slower"),
    ("DoRA + RSLoRA",  True,  True,  "Best of both worlds for rβ‰₯32"),
]
print(f"{'Variant':<18} {'RSLoRA':>7} {'DoRA':>7}  Description")
print("-" * 65)
for name, rslora, dora, desc in variants:
    print(f"{name:<18} {str(rslora):>7} {str(dora):>7}  {desc}")

8. PEFT Methods ComparisonΒΆ

Here is a comprehensive benchmark comparison of all major PEFT methods. Numbers are from published papers and community reproductions on LLaMA-2-7B.

Quality Benchmarks (LLaMA-2-7B, instruction tuning)ΒΆ

| Method | Trainable % | MMLU | MT-Bench | Commonsense | Memory (7B) | Notes |
|--------|-------------|------|----------|-------------|-------------|-------|
| Full SFT | 100% | 58.1 | 7.1 | 79.2 | ~86 GB | Requires A100Γ—4 |
| LoRA r=64 | 1.6% | 57.2 | 6.9 | 78.3 | ~20 GB | Near full quality |
| LoRA r=16 | 0.4% | 56.5 | 6.7 | 76.8 | ~18 GB | Best value |
| DoRA r=16 | 0.4% | 57.0 | 6.9 | 77.8 | ~18 GB | Recommended |
| QLoRA r=16 | 0.4% | 55.8 | 6.5 | 75.9 | ~6 GB | Consumer GPUs |
| QLoRA+DoRA r=16 | 0.4% | 56.4 | 6.7 | 76.7 | ~6.1 GB | QLoRA + quality |
| Adapters (Houlsby) | 0.9% | 54.1 | 6.1 | 73.2 | ~16 GB | Sequential bottleneck |
| Prefix Tuning | 0.1% | 51.2 | 5.8 | 70.1 | ~14 GB | Soft prompts only |
| Prompt Tuning | <0.01% | 48.5 | 5.3 | 67.8 | ~14 GB | Simplest PEFT |
| IA3 | 0.01% | 52.1 | 5.9 | 71.3 | ~14 GB | Few-shot focused |

Memory vs Quality Trade-offΒΆ

Quality
  β–²
  β”‚  Full SFT ●
  β”‚               LoRA r=64 ●
  β”‚          DoRA r=16 ●
  β”‚        LoRA r=16 ●    QDoRA ●
  β”‚                       QLoRA r=16 ●
  β”‚  IA3 ●  Adapters ●
  β”‚  Prefix ●
  β”‚
  └────────────────────────────────► Memory
    6GB   16GB   20GB   86GB

Sweet spots: LoRA r=16 (16GB), QLoRA r=16 (6GB)

Speed Comparison (tokens/second on A100 80GB, 7B model, seq_len=2048)ΒΆ

| Method | Tokens/sec | Relative Speed |
|--------|------------|----------------|
| Full SFT | 1,850 | 1.00Γ— (baseline) |
| LoRA + FA2 | 2,100 | 1.14Γ— |
| DoRA | 1,980 | 1.07Γ— |
| QLoRA | 1,200 | 0.65Γ— |

QLoRA is slower due to dequantization overhead on every forward pass.

# Visualize PEFT methods: quality vs memory trade-off

methods = [
    # (name, trainable_pct, memory_7b_gb, mmlu_score)
    ("Full SFT",       100.0, 86.0, 58.1),
    ("LoRA r=64",      1.6,   20.0, 57.2),
    ("DoRA r=16",      0.4,   18.5, 57.0),
    ("LoRA r=16",      0.4,   18.0, 56.5),
    ("QLoRA+DoRA r=16",0.4,    6.1, 56.4),
    ("QLoRA r=16",     0.4,    6.0, 55.8),
    ("Adapters",       0.9,   16.0, 54.1),
    ("IA3",            0.01,  14.0, 52.1),
    ("Prefix Tuning",  0.1,   14.0, 51.2),
    ("Prompt Tuning",  0.005, 14.0, 48.5),
]

names    = [m[0] for m in methods]
memory   = [m[2] for m in methods]
quality  = [m[3] for m in methods]
trainable = [m[1] for m in methods]

# Color by category
colors = [
    "#e74c3c",  # Full SFT β€” red
    "#3498db",  # LoRA r=64 β€” blue
    "#9b59b6",  # DoRA r=16 β€” purple
    "#2980b9",  # LoRA r=16 β€” blue (darker)
    "#1abc9c",  # QLoRA+DoRA β€” teal
    "#27ae60",  # QLoRA r=16 β€” green
    "#f39c12",  # Adapters β€” orange
    "#95a5a6",  # IA3 β€” gray
    "#bdc3c7",  # Prefix β€” light gray
    "#ecf0f1",  # Prompt β€” very light
]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Plot 1: Memory vs Quality
scatter = ax1.scatter(memory, quality, s=200, c=colors, edgecolors="#2c3e50",
                       linewidths=1.5, zorder=5)
for i, name in enumerate(names):
    offset_x, offset_y = 5, 3   # label offsets in points, not data units
    # Nudge alternating labels apart for tightly clustered points
    if name in ["LoRA r=16", "DoRA r=16", "QLoRA+DoRA r=16"]:
        offset_y = -10 if i % 2 == 0 else 6
    ax1.annotate(name, (memory[i], quality[i]),
                 xytext=(offset_x, offset_y), textcoords="offset points",
                 fontsize=8, ha="left")

ax1.set_xlabel("GPU Memory Required β€” 7B Model (GB)", fontsize=11)
ax1.set_ylabel("MMLU Score", fontsize=11)
ax1.set_title("PEFT Methods: Memory vs Quality Trade-off", fontsize=12)
ax1.grid(True, alpha=0.3)
ax1.set_xlim(-2, 95)
ax1.set_ylim(46, 60)

# Highlight the recommended zone
ax1.axvspan(0, 10, alpha=0.08, color="green", label="Consumer GPU zone")
ax1.axvspan(10, 30, alpha=0.05, color="blue", label="Workstation GPU zone")
ax1.legend(fontsize=9, loc="lower right")

# Plot 2: Trainable parameters vs Quality (bar chart)
sorted_idx = sorted(range(len(quality)), key=lambda i: quality[i], reverse=True)
sorted_names   = [names[i] for i in sorted_idx]
sorted_quality = [quality[i] for i in sorted_idx]
sorted_colors  = [colors[i] for i in sorted_idx]

bars = ax2.barh(range(len(sorted_names)), sorted_quality,
                color=sorted_colors, edgecolor="#2c3e50", linewidth=0.8)
ax2.set_yticks(range(len(sorted_names)))
ax2.set_yticklabels(sorted_names, fontsize=9)
ax2.set_xlabel("MMLU Score", fontsize=11)
ax2.set_title("PEFT Methods: Ranked by Quality", fontsize=12)
ax2.set_xlim(45, 61)
ax2.axvline(58.1, color="#e74c3c", linestyle="--", alpha=0.5, label="Full SFT")
ax2.legend(fontsize=9)
ax2.grid(True, alpha=0.3, axis="x")

# Add score labels on bars
for i, (bar, score) in enumerate(zip(bars, sorted_quality)):
    ax2.text(score + 0.05, bar.get_y() + bar.get_height()/2,
             f"{score}", va="center", fontsize=8)

plt.tight_layout()
plt.savefig("peft_comparison.png", dpi=120, bbox_inches="tight")
plt.show()
print("Saved: peft_comparison.png")
from datasets import load_dataset

# Load dataset (same alpaca format as notebook 02)
print("Loading dataset for QLoRA training...")
raw_dataset = load_dataset("tatsu-lab/alpaca", split="train")

# Format function
ALPACA_PROMPT_WITH_INPUT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""

ALPACA_PROMPT_WITHOUT_INPUT = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
{output}"""


def format_alpaca(example: Dict) -> Dict:
    if example.get("input", "").strip():
        text = ALPACA_PROMPT_WITH_INPUT.format(
            instruction=example["instruction"],
            input=example["input"],
            output=example["output"],
        )
    else:
        text = ALPACA_PROMPT_WITHOUT_INPUT.format(
            instruction=example["instruction"],
            output=example["output"],
        )
    return {"text": text}


formatted = raw_dataset.map(
    format_alpaca,
    remove_columns=raw_dataset.column_names,
    desc="Formatting dataset",
)

split = formatted.train_test_split(test_size=0.02, seed=42)
train_dataset = split["train"]
eval_dataset  = split["test"]

print(f"Train: {len(train_dataset):,} examples")
print(f"Eval : {len(eval_dataset):,} examples")
print()
print("Sample:")
print(train_dataset[0]["text"][:300])
print("...")
from trl import SFTTrainer, SFTConfig

OUTPUT_DIR = "./qlora-qwen2.5-1.5b-alpaca"
MAX_SEQ_LENGTH = 1024

# -----------------------------------------------------------------------
# QLoRA-specific training arguments
# -----------------------------------------------------------------------
qlora_training_args = SFTConfig(
    output_dir=OUTPUT_DIR,

    # --- Dataset ---
    max_seq_length=MAX_SEQ_LENGTH,
    dataset_text_field="text",
    packing=True,
    dataset_num_proc=4,

    # --- Batch / accumulation ---
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,         # effective batch = 16

    # --- Optimizer: paged AdamW for QLoRA ---
    # paged_adamw_8bit: 8-bit optimizer states stored in pageable memory
    # This is the key optimizer choice for memory-constrained QLoRA
    optim="paged_adamw_8bit",
    learning_rate=2e-4,                    # Higher LR for LoRA vs full SFT
    weight_decay=0.001,
    max_grad_norm=0.3,                     # More conservative clipping for 4-bit
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,

    # --- LR schedule ---
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,

    # --- Mixed precision ---
    # For QLoRA: do NOT use fp16=True if using bfloat16 compute dtype
    # The base model is stored as NF4 and computation is in bfloat16
    bf16=torch.cuda.is_available() and torch.cuda.is_bf16_supported(),
    fp16=torch.cuda.is_available() and not torch.cuda.is_bf16_supported(),

    # --- Memory ---
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},

    # --- Checkpointing ---
    save_strategy="steps",
    save_steps=200,
    save_total_limit=3,
    load_best_model_at_end=True,

    # --- Evaluation ---
    eval_strategy="steps",
    eval_steps=200,

    # --- Logging ---
    logging_steps=10,
    logging_first_step=True,
    report_to="none",

    seed=42,
)

print("QLoRA Training Configuration")
print("=" * 55)
print(f"  Optimizer           : {qlora_training_args.optim}")
print(f"  Learning rate       : {qlora_training_args.learning_rate}")
print(f"  Effective batch     : {qlora_training_args.per_device_train_batch_size * qlora_training_args.gradient_accumulation_steps}")
print(f"  BF16                : {qlora_training_args.bf16}")
print(f"  Max seq length      : {qlora_training_args.max_seq_length}")
print(f"  Packing             : {qlora_training_args.packing}")
print()
print("Key QLoRA training choices:")
print("  1. paged_adamw_8bit  β€” optimizer states in pageable GPU memory")
print("  2. LR = 2e-4         β€” higher than full SFT (LoRA-specific)")
print("  3. max_grad_norm=0.3 β€” conservative clipping for 4-bit training")
print("  4. packing=True      β€” max throughput with short sequences")
# Memory tracking throughout the QLoRA training pipeline

def memory_snapshot(label: str) -> Dict[str, float]:
    """Capture a labeled GPU memory snapshot."""
    snapshot = {"label": label}
    if torch.cuda.is_available():
        torch.cuda.synchronize()
        snapshot["allocated_gb"] = torch.cuda.memory_allocated() / 1e9
        snapshot["reserved_gb"]  = torch.cuda.memory_reserved() / 1e9
        snapshot["total_gb"]     = torch.cuda.get_device_properties(0).total_memory / 1e9
        snapshot["free_gb"]      = snapshot["total_gb"] - snapshot["reserved_gb"]
    else:
        snapshot["allocated_gb"] = 0.0
        snapshot["reserved_gb"]  = 0.0
        snapshot["total_gb"]     = 0.0
        snapshot["free_gb"]      = 0.0
    return snapshot


snapshots = []

# Snapshot: after model load (captured earlier)
snap = memory_snapshot("After 4-bit model load")
snap["allocated_gb"] = mem_with_lora  # use our earlier measurement
snapshots.append(snap)

# Initialize trainer
qlora_trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=qlora_training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

snapshots.append(memory_snapshot("After SFTTrainer init"))

print("Memory Tracking β€” QLoRA Pipeline")
print("=" * 60)
print(f"{'Stage':<35} {'Allocated':>10} {'Reserved':>10}")
print("-" * 60)
for s in snapshots:
    print(f"{s['label']:<35} {s['allocated_gb']:>9.2f}G {s['reserved_gb']:>9.2f}G")

if torch.cuda.is_available():
    total = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"\nTotal VRAM available: {total:.1f} GB")
    current = torch.cuda.memory_reserved() / 1e9
    print(f"Current usage       : {current:.1f} GB ({current/total*100:.1f}%)")
    print(f"Remaining free      : {total-current:.1f} GB")

print()
print("QLoRA Trainer initialized.")
print(f"Trainable params: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
import time

print("Starting QLoRA training...")
print("=" * 55)

start_time = time.time()

# Run training
train_result = qlora_trainer.train()

elapsed = time.time() - start_time
mem_peak = torch.cuda.max_memory_allocated() / 1e9 if torch.cuda.is_available() else 0.0

print()
print("QLoRA Training Complete!")
print("=" * 55)
print(f"  Training time    : {elapsed/60:.1f} minutes")
print(f"  Final train loss : {train_result.training_loss:.4f}")
print(f"  Peak GPU memory  : {mem_peak:.2f} GB")
# train_result.metrics is a plain dict, so .get() is always available
samples_per_sec = train_result.metrics.get("train_samples_per_second", "N/A")
print(f"  Samples/sec      : {samples_per_sec}")
# Save the LoRA adapter (NOT the full model β€” just the small adapter weights)

ADAPTER_DIR = f"{OUTPUT_DIR}/lora-adapter"

print("Saving LoRA adapter...")
model.save_pretrained(ADAPTER_DIR)          # saves only adapter weights (~MB, not GB)
tokenizer.save_pretrained(ADAPTER_DIR)

print(f"\nAdapter saved to: {ADAPTER_DIR}/")

if os.path.exists(ADAPTER_DIR):
    files = sorted(os.listdir(ADAPTER_DIR))
    total_mb = 0
    for f in files:
        fp = os.path.join(ADAPTER_DIR, f)
        if os.path.isfile(fp):
            size_mb = os.path.getsize(fp) / 1e6
            total_mb += size_mb
            print(f"  {f:<50} {size_mb:>7.1f} MB")
    print(f"\n  Total adapter size: {total_mb:.1f} MB")
    print(f"  (Compare to full model: {sum(p.numel() for p in model.parameters()) * 2 / 1e9 * 1000:.0f} MB in BF16)")

print()
print("The adapter contains only the LoRA matrices (A and B).")
print("At inference time, load base model + adapter β€” no quality loss!")
# Inference with the QLoRA-trained adapter

from peft import PeftModel

print("Loading base model + LoRA adapter for inference...")

# Option A: Use the already-loaded model (adapter is already attached)
model.eval()

def generate_qlora(
    instruction: str,
    input_text: str = "",
    max_new_tokens: int = 200,
    temperature: float = 0.7,
) -> str:
    """Generate a response using the QLoRA fine-tuned model."""
    if input_text.strip():
        prompt = ALPACA_PROMPT_WITH_INPUT.format(
            instruction=instruction,
            input=input_text,
            output="",
        ).rstrip()
    else:
        prompt = ALPACA_PROMPT_WITHOUT_INPUT.format(
            instruction=instruction,
            output="",
        ).rstrip()

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    input_len = inputs["input_ids"].shape[1]

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            do_sample=temperature > 0,
            top_p=0.9,
            repetition_penalty=1.1,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
        )

    new_tokens = outputs[0][input_len:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()


print("\n" + "=" * 60)
print("QLoRA Model Inference")
print("=" * 60)

test_cases = [
    {"instruction": "Explain the difference between supervised and unsupervised learning.", "input": ""},
    {"instruction": "Write a Python function to check if a number is prime.", "input": ""},
    {"instruction": "Summarize the key points of this text.",
     "input": "Transformers are deep learning models that use self-attention mechanisms to process sequential data. They were introduced in the paper 'Attention Is All You Need' in 2017 and have since become the dominant architecture for NLP tasks."},
]

for i, tc in enumerate(test_cases):
    print(f"\n[Test {i+1}]")
    print(f"Instruction: {tc['instruction']}")
    if tc['input']:
        print(f"Input: {tc['input'][:80]}...")
    response = generate_qlora(tc["instruction"], tc.get("input", ""))
    print(f"Response: {response}")
    print("-" * 60)

9. Merging LoRA Adapters for DeploymentΒΆ

For production deployment, you have two options:

Option A: Serve Base + Adapter (no merge)ΒΆ

  • Keep the 4-bit base model and load the adapter at startup

  • Pros: Small adapter files, easy to swap adapters

  • Cons: Slightly slower inference (adapter overhead), requires PEFT library

Option B: Merge Adapter into Base ModelΒΆ

Calling merge_and_unload() mathematically merges W + B·A back into the base model weights and removes the adapter overhead:

merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
  • Pros: Standard model format, fastest inference, no PEFT dependency

  • Cons: Larger saved model, cannot swap adapters

Important note: After merging, the model is no longer quantized. The merged weights are in BF16 (the compute dtype). To re-quantize for efficient inference, run the merged model through bitsandbytes or llama.cpp.
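As a sketch of the bitsandbytes route, the merged BF16 checkpoint can be reloaded with 4-bit quantization applied at load time (a config fragment, not executed here; assumes the `./merged-model` directory from the merge step below and a CUDA GPU):

```python
# Sketch: re-quantize the merged BF16 checkpoint at load time.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

requantized = AutoModelForCausalLM.from_pretrained(
    "./merged-model",                 # hypothetical path from the merge step
    quantization_config=bnb_config,
    device_map="auto",
)
```

Note that requantizing a fine-tuned model can cost a little quality, since the merged weights no longer match the original quantization grid exactly.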

Option C: Merge + Re-quantize with GGUF (for llama.cpp)ΒΆ

# 1. Merge adapter
python merge_adapter.py --base MODEL_ID --adapter ADAPTER_DIR --output merged/

# 2. Convert to GGUF
python llama.cpp/convert.py merged/ --outtype f16 --outfile model.gguf

# 3. Quantize GGUF
llama.cpp/quantize model.gguf model-q4_k_m.gguf Q4_K_M
# Demonstrate how to merge the LoRA adapter into the base model
# (This creates a standard model without adapter overhead)

MERGED_DIR = f"{OUTPUT_DIR}/merged-model"

print("Merging LoRA adapter into base model...")
print("Note: This upscales weights from NF4 to BF16.")
print()

# merge_and_unload() merges W + BΒ·A and removes LoRA layers
merged_model = model.merge_and_unload()

mem_merged, _ = get_gpu_memory_gb()
print(f"GPU memory after merge: {mem_merged:.2f} GB")
print("(Larger than 4-bit model β€” merged model is in BF16)")
print()

# Save merged model
print(f"Saving merged model to: {MERGED_DIR}/")
merged_model.save_pretrained(MERGED_DIR, safe_serialization=True)
tokenizer.save_pretrained(MERGED_DIR)

if os.path.exists(MERGED_DIR):
    total_size = sum(
        os.path.getsize(os.path.join(MERGED_DIR, f))
        for f in os.listdir(MERGED_DIR)
        if os.path.isfile(os.path.join(MERGED_DIR, f))
    )
    print(f"Merged model size: {total_size / 1e9:.2f} GB")

print()
print("Merged model can be loaded like any standard Hugging Face model:")
print(f"""
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "{MERGED_DIR}",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# No PEFT library needed! Standard inference.
""")

# Clean up
del merged_model
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

10. Complete QLoRA Pipeline β€” Reference CodeΒΆ

The following cell contains the complete, minimal QLoRA fine-tuning pipeline in one place. Use this as a starting template for your own projects.

"""
Complete QLoRA Fine-tuning Pipeline β€” Reference Template
=========================================================
Copy this cell as a standalone script for your own projects.

Requirements:
    pip install transformers trl peft bitsandbytes datasets accelerate
"""

# --- [0] Configuration ---
CONFIG = {
    "model_id":            "Qwen/Qwen2.5-1.5B-Instruct",
    "dataset_id":          "tatsu-lab/alpaca",
    "output_dir":          "./qlora-output",

    # LoRA
    "lora_r":              16,
    "lora_alpha":          32,
    "lora_dropout":        0.05,
    "use_dora":            False,      # set True for better quality
    "use_rslora":          False,      # set True for r >= 32

    # Quantization
    "quant_type":          "nf4",      # "nf4" or "fp4"
    "double_quant":        True,

    # Training
    "max_seq_length":      1024,
    "batch_size":          2,
    "grad_accum":          8,
    "learning_rate":       2e-4,
    "num_epochs":          1,
    "warmup_ratio":        0.03,
    "optimizer":           "paged_adamw_8bit",
}

print("Complete QLoRA Pipeline β€” Configuration")
print("=" * 55)
for k, v in CONFIG.items():
    print(f"  {k:<22}: {v}")

print()
print("""
Pipeline steps (reference β€” already executed above):
─────────────────────────────────────────────────────

1. Configure BitsAndBytesConfig
   bnb_config = BitsAndBytesConfig(
       load_in_4bit=True,
       bnb_4bit_quant_type=CONFIG["quant_type"],
       bnb_4bit_compute_dtype=torch.bfloat16,
       bnb_4bit_use_double_quant=CONFIG["double_quant"],
   )

2. Load tokenizer + 4-bit quantized model
   tokenizer = AutoTokenizer.from_pretrained(CONFIG["model_id"])
   model = AutoModelForCausalLM.from_pretrained(
       CONFIG["model_id"], quantization_config=bnb_config, device_map="auto"
   )

3. Prepare for k-bit training
   model = prepare_model_for_kbit_training(model)

4. Apply LoRA adapters
   lora_cfg = LoraConfig(r=CONFIG["lora_r"], ...)
   model = get_peft_model(model, lora_cfg)

5. Load and format dataset
   dataset = load_dataset(CONFIG["dataset_id"])
   dataset = dataset.map(format_alpaca)

6. Configure SFTTrainer and train
   trainer = SFTTrainer(model, tokenizer, args, train_dataset, eval_dataset)
   trainer.train()

7. Save adapter
   model.save_pretrained("./adapter")

8. (Optional) Merge + save full model
   merged = model.merge_and_unload()
   merged.save_pretrained("./merged")
""")

Summary β€” Key TakeawaysΒΆ

QLoRA Architecture RecapΒΆ

QLoRA = NF4 quantized base (frozen) + LoRA adapters (trainable in BF16)

Memory savings vs full SFT:
  7B model : ~86 GB → ~6 GB   (14× reduction)
  70B model: ~860 GB → ~40 GB (21× reduction)
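These savings can be approximated with a back-of-envelope formula (a rough sketch: the per-parameter overheads are approximations, and activations/KV cache are excluded, which is why real peak usage runs a couple of GB higher):

```python
# Back-of-envelope QLoRA memory estimate in GB for an n-billion-parameter
# model (weights + adapter + optimizer state; activations excluded).
def qlora_memory_gb(n_params_b: float, lora_frac: float = 0.004) -> float:
    base      = n_params_b * 0.5               # NF4: ~0.5 bytes per parameter
    quant_ovh = n_params_b * 0.02              # quant constants after double quant
    lora      = n_params_b * lora_frac * 2     # adapter weights in BF16
    optim     = n_params_b * lora_frac * 8     # AdamW states, adapter only
    grads     = n_params_b * lora_frac * 2     # gradients, adapter only
    return base + quant_ovh + lora + optim + grads

for size in (7, 13, 70):
    print(f"{size}B model: ~{qlora_memory_gb(size):.1f} GB (weights + optimizer)")
```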

When to Use Each PEFT MethodΒΆ

| Scenario | Method | Why |
|----------------------------------|---------------------|--------------------------------------|
| VRAM > 2× model BF16 size        | LoRA                | Faster, slightly better quality      |
| VRAM < 2× model BF16 size        | QLoRA               | Only option for large models         |
| Need best quality at same params | DoRA                | +2-4% benchmark improvement          |
| Using high ranks (r≥32)          | RSLoRA              | Stable gradients, better convergence |
| Best quality, enough VRAM        | DoRA+RSLoRA         | Combines both improvements           |
| Tiny GPU, minimal params         | IA3 / Prompt tuning | <0.1% parameters                     |
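The RSLoRA recommendation reflects a simple scaling change: standard LoRA multiplies the update B·A by alpha/r, which shrinks as rank grows, while rank-stabilized LoRA uses alpha/sqrt(r). A few lines of arithmetic make the difference concrete:

```python
import math

# Effective LoRA scaling factor as rank grows (alpha fixed at 32):
# standard LoRA's alpha/r collapses at high rank; RSLoRA's alpha/sqrt(r)
# decays far more gently, keeping gradient magnitudes stable.
alpha = 32
for r in (8, 16, 32, 64, 128):
    standard = alpha / r
    rslora   = alpha / math.sqrt(r)
    print(f"r={r:<4} standard={standard:<6.3f} rslora={rslora:.3f}")
```

At r=64 the standard scale has fallen to 0.5 while the rank-stabilized scale is still 4.0, which is why RSLoRA converges better at high ranks.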

QLoRA Best Practices ChecklistΒΆ

[x] Use NF4 quantization (not FP4) for transformer models
[x] Enable double quantization (saves ~3GB extra on 70B models)
[x] Set compute_dtype to bfloat16 (Ampere+) or float16 (older)
[x] Use prepare_model_for_kbit_training() before applying LoRA
[x] Use paged_adamw_8bit optimizer for minimal memory spikes
[x] Set LR = 2e-4 to 3e-4 (higher than full SFT)
[x] Set max_grad_norm = 0.3 (conservative for 4-bit)
[x] Enable gradient checkpointing (reduces activation memory)
[x] Enable sequence packing (2-3x throughput improvement)
[x] Target both attention AND MLP layers for best quality
[x] Consider DoRA=True for +2-4% quality with no extra memory
[x] Save only the adapter (~MB), not the full model (~GB)
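The "~3 GB on 70B" figure for double quantization can be sanity-checked: the QLoRA paper reports that quantizing the per-block constants shrinks their overhead from roughly 0.5 to 0.127 bits per parameter.

```python
# Sanity check: memory saved by double quantization on a 70B model,
# using the ~0.5 -> ~0.127 bits/parameter figures from the QLoRA paper.
params = 70e9
saved_bits_per_param = 0.5 - 0.127
saved_gb = params * saved_bits_per_param / 8 / 1e9
print(f"Double quantization saves ~{saved_gb:.1f} GB on a 70B model")
```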

Memory Quick ReferenceΒΆ

GPU VRAM    →  Maximum model size with QLoRA
4 GB        →  1.5B–3B parameters
6 GB        →  3B–7B parameters
8 GB        →  7B parameters comfortably
12 GB       →  7B–13B parameters
16 GB       →  13B parameters
24 GB       →  13B–30B parameters
40 GB       →  30B–70B parameters
80 GB       →  70B parameters (A100/H100 class)

Next StepsΒΆ

  • Notebook 05 β€” Instruction Tuning: curate better training datasets

  • Notebook 06 β€” Evaluation: measure model quality rigorously

  • Notebook 07 β€” RLHF/DPO: align fine-tuned model behavior

  • Notebook 08 β€” Deployment: serve your fine-tuned model in production

References: