QLoRA β€” Memory-Efficient Fine-Tuning on Consumer GPUsΒΆ

Phase 12 β€” LLM Fine-tuning | Notebook 04

What You Will LearnΒΆ

  • QLoRA architecture: 4-bit quantized base model + LoRA adapters

  • BitsAndBytes quantization: NF4 vs FP4 β€” when each is appropriate

  • How QLoRA makes fine-tuning a 70B model possible on a single 48 GB GPU

  • Configuring BitsAndBytesConfig correctly

  • Paged AdamW optimizer β€” why it matters for large models

  • Nested (double) quantization β€” extra memory savings

  • QLoRA vs LoRA β€” decision guide by hardware and task

  • Memory calculation formulas

  • DoRA (Weight-Decomposed LoRA) β€” often better than standard LoRA

  • RSLoRA (Rank-Stabilized LoRA) β€” stable at high rank values

  • PEFT method comparison benchmarks

  • Complete runnable QLoRA example from data to inference

Model used: Qwen/Qwen2.5-1.5B-Instruct
Minimum GPU: 4 GB VRAM (1.5B model in 4-bit)

QLoRA ArchitectureΒΆ

QLoRA = 4-bit Quantized Base Model  +  LoRA Adapters (16-bit)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Transformer Layer                        β”‚
β”‚                                                             β”‚
β”‚   Input ──► [ W_frozen (NF4 4-bit) ] ──► BF16 output       β”‚
β”‚                     +                                       β”‚
β”‚             [ B Β· A  (BF16 LoRA)   ] ──► BF16 delta        β”‚
β”‚                     ↑                                       β”‚
β”‚               Only these weights                            β”‚
β”‚               are trained!                                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Memory breakdown for a 7B model:
  Base weights (NF4 4-bit) : ~3.5 GB   ← frozen, 4x compression
  LoRA adapters (BF16)     : ~0.2 GB   ← trainable
  Gradients                : ~0.2 GB   ← only for LoRA
  Optimizer states         : ~0.4 GB   ← only for LoRA
  Activations              : ~1.5 GB   ← depends on batch/seq
  ─────────────────────────────────────
  Total                    : ~5.8 GB   ← fits in 6 GB GPU!

Original paper: Dettmers et al. (2023), QLoRA: Efficient Finetuning of Quantized LLMs

# Install required libraries
# bitsandbytes is the key library for 4-bit quantization
!pip install -q "transformers>=4.40.0" "trl>=0.8.6" "peft>=0.10.0" "datasets>=2.18.0" \
               "accelerate>=0.29.0" "bitsandbytes>=0.43.0" matplotlib torch

1. BitsAndBytes Quantization: NF4 vs FP4ΒΆ

Quantization maps floating point weights to a smaller numerical format. BitsAndBytes supports two 4-bit formats:

NF4 β€” Normal Float 4ΒΆ

NF4 is the key innovation in QLoRA. It exploits the observation that pre-trained neural network weights follow an approximately normal distribution centered at zero.

NF4 places quantization levels unevenly β€” more levels near zero (where most weights cluster) and fewer at the extremes. This is information-theoretically optimal for normally distributed data.

Standard INT4 (uniform levels):
─────────────────────────────
  ← -2.0  -1.5  -1.0  -0.5  0.0  0.5  1.0  1.5  2.0 β†’
    ←—|β€”β€”|β€”β€”|β€”β€”|β€”β€”|β€”β€”|β€”β€”|β€”β€”|β€”β€”|β€”β€”β†’  (equally spaced)

NF4 (non-uniform, optimized for normal distribution):
─────────────────────────────────────────────────────
  ← -1.0   -0.5       0    0.5   1.0 β†’
    |β€”||β€”|||||||β€”|||β€”||||||β€”|β€”β€”|β€”|   (dense near 0, sparse at extremes)

NF4 achieves lower quantization error than INT4 or FP4 for weights that follow a normal distribution.
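The gap is easy to check numerically. The sketch below quantizes normally distributed weights with the 16 published NF4 code-book values (from the QLoRA paper, rounded here to four decimals) and with 16 uniformly spaced INT4 levels, then compares the mean squared error. Everything besides the level values is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(100_000)
w = w / np.abs(w).max()  # absmax-normalize to [-1, 1], as blockwise quantization does

# The 16 NF4 code-book values (QLoRA paper / bitsandbytes), rounded to 4 decimals
nf4 = np.array([-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
                0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0])
int4 = np.linspace(-1.0, 1.0, 16)  # uniform 4-bit levels for comparison

def quantize(w, levels):
    # Snap each weight to its nearest code-book level
    idx = np.abs(w[:, None] - levels[None, :]).argmin(axis=1)
    return levels[idx]

err_nf4  = np.mean((w - quantize(w, nf4)) ** 2)
err_int4 = np.mean((w - quantize(w, int4)) ** 2)
print(f"MSE NF4 : {err_nf4:.6f}")
print(f"MSE INT4: {err_int4:.6f}")
```

Because most of the normalized weights cluster near zero, where NF4 places its levels densely, the NF4 error comes out lower than the uniform INT4 error.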

FP4 β€” Floating Point 4ΒΆ

FP4 uses 1 sign bit + 2 exponent bits + 1 mantissa bit, mimicking the structure of FP32/BF16. It handles a wider dynamic range than NF4 but is less optimal for normally distributed weights.

Comparison TableΒΆ

| Format | Best For | Quantization Error | Memory |
|--------|----------|--------------------|--------|
| FP32 | Training, high precision | None (baseline) | 4 bytes/param |
| BF16 | Training (Ampere+) | Minimal | 2 bytes/param |
| FP16 | Training (V100, T4) | Minimal | 2 bytes/param |
| INT8 | Inference | Low | 1 byte/param |
| NF4 | QLoRA fine-tuning (normal dist.) | Low (best for LLMs) | 0.5 bytes/param |
| FP4 | QLoRA fine-tuning (general) | Moderate | 0.5 bytes/param |
| INT4 | Inference only | High | 0.5 bytes/param |

For QLoRA, always use NF4 unless you have a specific reason to use FP4.

How BitsAndBytes Quantization WorksΒΆ

BitsAndBytes uses blockwise quantization:

  1. Divide weights into blocks of 64 elements

  2. Find the maximum absolute value in each block (the scale)

  3. Normalize block to [-1, 1] range

  4. Quantize to NF4/FP4 levels

  5. Store: quantized values (4-bit) + per-block scale (FP32)

The per-block scale adds ~0.5 bits/param overhead β†’ effective ~4.5 bits/param total.
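The five steps above fit in a few lines of NumPy. This is a toy re-implementation for intuition, not the bitsandbytes kernel; the uniform code-book stands in for the real NF4 levels.

```python
import numpy as np

def blockwise_quantize(weights, levels, block_size=64):
    """Toy blockwise quantization mirroring steps 1-5 above (sketch)."""
    w = weights.reshape(-1, block_size)                 # step 1: blocks of 64
    scales = np.abs(w).max(axis=1, keepdims=True)       # step 2: per-block absmax
    normed = w / scales                                 # step 3: normalize to [-1, 1]
    idx = np.abs(normed[..., None] - levels).argmin(axis=-1)  # step 4: nearest level
    return idx.astype(np.uint8), scales                 # step 5: 4-bit codes + FP32 scales

def blockwise_dequantize(idx, scales, levels, shape):
    # Inverse: look up levels, rescale, restore original shape
    return (levels[idx] * scales).reshape(shape)

levels = np.linspace(-1.0, 1.0, 16)  # stand-in code-book (NF4 in practice)
w = np.random.default_rng(1).standard_normal(256).astype(np.float32)
codes, scales = blockwise_quantize(w, levels)
w_hat = blockwise_dequantize(codes, scales, levels, w.shape)
print("max abs error:", np.abs(w - w_hat).max())
```

Each code fits in 4 bits (values 0-15), and each 64-element block carries one FP32 scale, which is where the ~0.5 bits/param of overhead comes from.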

import os
import sys
import gc
import math
import warnings
from typing import Dict, Optional, Tuple

import torch
import transformers
import trl
import peft
import datasets as ds_lib
import bitsandbytes as bnb
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np

warnings.filterwarnings("ignore")

print("=" * 55)
print("Environment Check")
print("=" * 55)
print(f"Python          : {sys.version.split()[0]}")
print(f"PyTorch         : {torch.__version__}")
print(f"Transformers    : {transformers.__version__}")
print(f"TRL             : {trl.__version__}")
print(f"PEFT            : {peft.__version__}")
print(f"BitsAndBytes    : {bnb.__version__}")
print(f"Datasets        : {ds_lib.__version__}")
print()

if torch.cuda.is_available():
    device = "cuda"
    gpu_name  = torch.cuda.get_device_name(0)
    vram_gb   = torch.cuda.get_device_properties(0).total_memory / 1e9
    major, minor = torch.cuda.get_device_capability()
    print(f"GPU             : {gpu_name}")
    print(f"VRAM            : {vram_gb:.1f} GB")
    print(f"Compute Cap.    : SM{major}{minor}")
    print(f"BF16 supported  : {torch.cuda.is_bf16_supported()}")
    print(f"4-bit supported : {major >= 7}  (SM70+ Volta or newer)")
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = "mps"
    print("GPU             : Apple Silicon (MPS)")
    print("Note: 4-bit quantization not supported on MPS. Using CPU fallback.")
else:
    device = "cpu"
    print("GPU             : Not available")

print(f"\nDevice          : {device}")
print("=" * 55)

2. Memory Calculation: Why QLoRA Enables 70B on Consumer GPUsΒΆ

Let us do the math precisely.

Full Fine-Tuning Memory RequirementsΒΆ

For a model with P parameters trained in FP16/BF16:

| Component | Formula | 7B example | 70B example |
|-----------|---------|------------|-------------|
| Model weights | P Γ— 2 bytes | 14 GB | 140 GB |
| Gradients | P Γ— 2 bytes | 14 GB | 140 GB |
| AdamW momentum | P Γ— 4 bytes | 28 GB | 280 GB |
| AdamW variance | P Γ— 4 bytes | 28 GB | 280 GB |
| Activations | ~batch Γ— seq Γ— hidden Γ— layers Γ— 2 | 2–8 GB | 4–16 GB |
| **Total** | ~12 Γ— P bytes | ~86 GB | ~860 GB |

QLoRA Memory RequirementsΒΆ

With base model quantized to NF4 and only LoRA adapters trained:

| Component | Formula | 7B example | 70B example |
|-----------|---------|------------|-------------|
| Base model (NF4) | P Γ— 0.5 bytes | 3.5 GB | 35 GB |
| LoRA weights (BF16) | r Γ— (d_in + d_out) Γ— L Γ— 2 | ~0.2 GB | ~0.5 GB |
| Gradients (LoRA only) | same as LoRA | ~0.2 GB | ~0.5 GB |
| Paged AdamW (LoRA) | LoRA Γ— 8 bytes | ~0.4 GB | ~1.0 GB |
| Activations | same formula | 1–4 GB | 2–8 GB |
| **Total** | | ~5.3–8 GB | ~39–45 GB |

Compression RatioΒΆ

  • 7B model: Full SFT needs ~86 GB β†’ QLoRA needs ~6 GB β†’ 14x reduction

  • 70B model: Full SFT needs ~860 GB β†’ QLoRA needs ~40 GB β†’ 21x reduction

This is why QLoRA is transformative: a single 48 GB GPU can fine-tune a 70B parameter model that would otherwise require ~10 A100 GPUs, and a 24 GB RTX 3090 can handle models up to roughly 30B.

def calculate_model_memory(
    params_billions: float,
    lora_rank: int = 16,
    hidden_size: int = 4096,
    num_layers: int = 32,
    num_lora_target_modules: int = 4,  # q, k, v, o projections
) -> Dict[str, float]:
    """
    Calculate GPU memory requirements for different training strategies.
    Returns memory in GB for each component and strategy.
    """
    P = params_billions * 1e9

    # LoRA parameter count: r * (d_in + d_out) per module
    lora_params = lora_rank * 2 * hidden_size * num_layers * num_lora_target_modules

    strategies = {}

    # Full fine-tuning in FP32
    strategies["Full SFT (FP32)"] = {
        "model":     P * 4 / 1e9,
        "gradients": P * 4 / 1e9,
        "optimizer": P * 8 / 1e9,  # AdamW: 2 Γ— FP32 states
        "activations": params_billions * 0.5,  # rough estimate
    }

    # Full fine-tuning in BF16
    strategies["Full SFT (BF16)"] = {
        "model":     P * 2 / 1e9,
        "gradients": P * 2 / 1e9,
        "optimizer": P * 8 / 1e9,  # optimizer states kept in FP32
        "activations": params_billions * 0.4,
    }

    # LoRA in BF16 (base model not quantized)
    strategies["LoRA (BF16 base)"] = {
        "model":     P * 2 / 1e9,
        "gradients": lora_params * 2 / 1e9,
        "optimizer": lora_params * 8 / 1e9,
        "activations": params_billions * 0.4,
    }

    # QLoRA: 4-bit base + LoRA adapters in BF16
    strategies["QLoRA (NF4 base)"] = {
        "model":     P * 0.5 / 1e9,          # 4-bit = ~0.5 bytes/param
        "gradients": lora_params * 2 / 1e9,  # only LoRA gradients
        "optimizer": lora_params * 8 / 1e9,  # only LoRA optimizer states
        "activations": params_billions * 0.3,
    }

    for name, breakdown in strategies.items():
        breakdown["total"] = sum(breakdown.values())

    return strategies


# Print memory tables for different model sizes
model_configs = [
    {"name": "1.5B", "params": 1.5, "hidden": 1536, "layers": 28},
    {"name": "7B",   "params": 7.0, "hidden": 4096, "layers": 32},
    {"name": "13B",  "params": 13.0,"hidden": 5120, "layers": 40},
    {"name": "70B",  "params": 70.0,"hidden": 8192, "layers": 80},
]

for cfg in model_configs:
    print(f"\n{'='*65}")
    print(f"Model: {cfg['name']} parameters  (hidden={cfg['hidden']}, layers={cfg['layers']})")
    print(f"{'='*65}")
    strategies = calculate_model_memory(
        cfg["params"], hidden_size=cfg["hidden"], num_layers=cfg["layers"]
    )
    print(f"{'Strategy':<22} {'Model':>7} {'Grads':>7} {'Optim':>7} {'Activ':>7} {'TOTAL':>8}")
    print("-" * 65)
    for strategy_name, mem in strategies.items():
        print(
            f"{strategy_name:<22} "
            f"{mem['model']:>6.1f}G "
            f"{mem['gradients']:>6.2f}G "
            f"{mem['optimizer']:>6.2f}G "
            f"{mem['activations']:>6.1f}G "
            f"{mem['total']:>7.1f}G"
        )

3. Configuring BitsAndBytesConfigΒΆ

BitsAndBytesConfig is the Hugging Face wrapper for bitsandbytes quantization settings. Here are all the key parameters explained:

Core ParametersΒΆ

BitsAndBytesConfig(
    load_in_4bit=True,                     # Enable 4-bit loading (vs load_in_8bit)
    bnb_4bit_quant_type="nf4",             # "nf4" (best for LLMs) or "fp4"
    bnb_4bit_compute_dtype=torch.bfloat16, # Upcast to this dtype for the forward pass
    bnb_4bit_use_double_quant=True,        # Double quantization (nested quant)
)

bnb_4bit_compute_dtypeΒΆ

Weights are stored in 4-bit but computations (matrix multiplications) are upcasted to this dtype. Think of it as:

  • Storage: NF4 (0.5 bytes/param) β€” saves memory

  • Compute: BF16 or FP16 β€” maintains numerical stability

Use torch.bfloat16 on Ampere+ GPUs, torch.float16 on older GPUs.

Nested Quantization (Double Quant)ΒΆ

The quantization scale factors themselves (normally FP32) are also quantized to 8-bit. This saves another ~0.37 bits/param:

  • Without double quant: ~4.5 bits/param effective

  • With double quant: ~4.13 bits/param effective

For a 70B model: saves an additional ~3 GB of VRAM. Always enable it.
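The arithmetic behind that figure is a quick back-of-envelope check using the effective-bit numbers above:

```python
params = 70e9                             # 70B parameters
saved_bits = 4.5 - 4.13                   # effective bits/param without vs with double quant
saved_gb = params * saved_bits / 8 / 1e9  # bits -> bytes -> GB
print(f"Double quant saves ~{saved_gb:.1f} GB on a 70B model")  # ~3.2 GB
```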

8-bit Loading (Alternative)ΒΆ

BitsAndBytesConfig(load_in_8bit=True)  # INT8 quantization

Uses LLM.int8() mixed-precision decomposition. Better quality than 4-bit but uses 2x the memory. Good for inference when quality matters more than memory.

from transformers import BitsAndBytesConfig

# Determine compute dtype based on GPU capability
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    compute_dtype = torch.bfloat16
    compute_dtype_str = "bfloat16"
else:
    compute_dtype = torch.float16
    compute_dtype_str = "float16"

print("BitsAndBytesConfig Configurations")
print("=" * 55)

# ── Configuration 1: NF4 with double quantization (recommended) ──
bnb_config_nf4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # Normal Float 4 β€” optimal for LLM weights
    bnb_4bit_compute_dtype=compute_dtype,  # upcast to BF16/FP16 for matmul
    bnb_4bit_use_double_quant=True,        # quantize the scale factors too
)

print("\nConfig 1: NF4 + Double Quantization (RECOMMENDED for QLoRA)")
print(f"  quant_type     : nf4")
print(f"  compute_dtype  : {compute_dtype_str}")
print(f"  double_quant   : True")
print(f"  effective bits : ~4.13 bits/param")
print(f"  7B model VRAM  : ~3.5 GB")

# ── Configuration 2: NF4 without double quantization ──
bnb_config_nf4_nodq = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,       # no nested quantization
)

print("\nConfig 2: NF4 without Double Quantization")
print(f"  quant_type     : nf4")
print(f"  compute_dtype  : {compute_dtype_str}")
print(f"  double_quant   : False")
print(f"  effective bits : ~4.5 bits/param")
print(f"  7B model VRAM  : ~3.9 GB")

# ── Configuration 3: FP4 (alternative) ──
bnb_config_fp4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",             # Floating Point 4 (less optimal for LLMs)
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)

print("\nConfig 3: FP4 + Double Quantization")
print(f"  quant_type     : fp4")
print(f"  compute_dtype  : {compute_dtype_str}")
print(f"  double_quant   : True")
print(f"  note           : Higher quantization error for LLMs than NF4")

# ── Configuration 4: INT8 (for inference or high-quality 8-bit) ──
bnb_config_int8 = BitsAndBytesConfig(
    load_in_8bit=True,                     # INT8 via LLM.int8()
    llm_int8_threshold=6.0,                # outlier threshold for mixed-precision
    llm_int8_has_fp16_weight=False,
)

print("\nConfig 4: INT8 (LLM.int8())")
print(f"  load_in_8bit   : True")
print(f"  threshold      : 6.0 (outlier columns stay in FP16)")
print(f"  effective bits : ~8.5 bits/param")
print(f"  7B model VRAM  : ~7 GB")
print(f"  note           : Better quality than 4-bit, 2x more memory")

print()
print("Using bnb_config_nf4 (Config 1) for all further examples.")

# This is the config we will use
BNB_CONFIG = bnb_config_nf4

4. Paged AdamW OptimizerΒΆ

The Problem: Optimizer State Memory SpikesΒΆ

Standard AdamW keeps optimizer states (momentum + variance) for all trainable parameters in GPU VRAM at all times. For large models, this causes:

  • Sudden OOM crashes during gradient updates

  • Inability to use large batch sizes

  • GPU memory fragmentation over time

The Solution: Paged MemoryΒΆ

NVIDIA’s unified memory system allows GPU memory to be paged to CPU RAM when the GPU runs out, similar to how virtual memory pages from RAM to disk in operating systems.

Paged AdamW (paged_adamw_32bit, paged_adamw_8bit) stores optimizer states in pageable GPU memory. When VRAM runs low:

  1. Optimizer states are evicted to CPU RAM

  2. GPU continues processing

  3. States are paged back when needed

This eliminates OOM crashes from optimizer state spikes without sacrificing quality.
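In a Trainer/SFTTrainer setup, switching to a paged optimizer is a single flag. A minimal, illustrative fragment (the output directory and batch settings are placeholders, not recommendations):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="qlora-out",           # placeholder path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    optim="paged_adamw_8bit",         # optimizer states in pageable memory, 8-bit quantized
)
print(args.optim)
```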

Available Optimizers for QLoRAΒΆ

| Optimizer | Memory | Speed | Quality | Use Case |
|-----------|--------|-------|---------|----------|
| adamw_torch | High | Fast | Best | Enough VRAM |
| adamw_torch_fused | High | Fastest | Best | Large GPU |
| paged_adamw_32bit | Medium | Moderate | Best | Limited VRAM |
| paged_adamw_8bit | Low | Moderate | Near-best | Minimal VRAM |
| adamw_bnb_8bit | Low | Fast | Near-best | Production QLoRA |
| sgd | Lowest | Fast | Lower | Research |

Recommendation: Use paged_adamw_8bit for QLoRA on GPUs with <16 GB VRAM.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import prepare_model_for_kbit_training

MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct"


def get_gpu_memory_gb() -> Tuple[float, float]:
    """Return (allocated_GB, total_GB) GPU memory."""
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1e9
        total     = torch.cuda.get_device_properties(0).total_memory / 1e9
        return allocated, total
    return 0.0, 0.0


# --- Memory before loading ---
mem_before, total_vram = get_gpu_memory_gb()
print(f"VRAM before loading: {mem_before:.2f} GB / {total_vram:.1f} GB")
print()

# --- Load tokenizer ---
print(f"Loading tokenizer: {MODEL_ID}")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id
print(f"Vocab size: {tokenizer.vocab_size:,}")
print(f"Pad token : {tokenizer.pad_token!r} (id={tokenizer.pad_token_id})")
print()

# --- Load 4-bit quantized model ---
print(f"Loading 4-bit quantized model: {MODEL_ID}")
print(f"Quantization: NF4 + double quant + {compute_dtype_str} compute")

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=BNB_CONFIG,       # 4-bit NF4 with double quantization
    device_map="auto",                    # auto-distribute across available GPUs
    trust_remote_code=True,
    attn_implementation="eager",          # use "flash_attention_2" if available
)

# --- Memory after loading ---
mem_after, _ = get_gpu_memory_gb()
print()
print(f"VRAM after loading : {mem_after:.2f} GB / {total_vram:.1f} GB")
print(f"Model memory       : {mem_after - mem_before:.2f} GB")
print()

# --- Prepare model for k-bit training ---
# This function:
#   1. Enables gradient checkpointing
#   2. Upcasts LayerNorm weights to FP32 (for numerical stability)
#   3. Sets model in training mode
model = prepare_model_for_kbit_training(
    model,
    use_gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)

total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters  : {total_params:,} ({total_params/1e9:.2f}B)")
print("Model ready for LoRA adapter attachment.")

5. QLoRA vs LoRA: Decision GuideΒΆ

Both QLoRA and LoRA train only a small fraction of parameters via low-rank adapter matrices. The key difference is the base model precision:

| | LoRA | QLoRA |
|---|------|-------|
| Base model | BF16/FP16 (full precision) | NF4 4-bit (quantized) |
| Base model memory | 2 bytes/param | 0.5 bytes/param |
| Training quality | Slightly higher | Very close to LoRA |
| Minimum VRAM (7B) | ~16 GB | ~6 GB |
| Speed | Faster matmul | Slower (dequant overhead; see Section 8) |
| Quantization noise | None | Small |

When to Use LoRA (not QLoRA)ΒΆ

  • You have 16+ GB VRAM and are fine-tuning a 7B model

  • You need absolute maximum quality (competition/research)

  • Speed is more critical than memory efficiency

When to Use QLoRAΒΆ

  • Consumer GPU with 8–24 GB VRAM

  • Fine-tuning 7B+ models on limited hardware

  • Cost-sensitive (renting cloud GPUs by the hour)

  • The model is large enough that BF16 loading is impossible

Practical Rule of ThumbΒΆ

VRAM available (GB) > BF16 footprint (β‰ˆ 2 GB per billion params)?  β†’ Use LoRA
Otherwise?                                                          β†’ Use QLoRA

Examples:
  24 GB GPU, 7B model (14 GB in BF16)  β†’ LoRA (24 > 14)
  16 GB GPU, 13B model (26 GB in BF16) β†’ QLoRA (16 < 26)
   8 GB GPU, 7B model (14 GB in BF16)  β†’ QLoRA (8 < 14)
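The rule of thumb above as a tiny helper (`choose_peft` is illustrative, not a library function):

```python
def choose_peft(vram_gb: float, params_billions: float) -> str:
    """Decide LoRA vs QLoRA from the BF16 footprint (~2 GB per billion params)."""
    bf16_footprint_gb = 2.0 * params_billions
    return "LoRA" if vram_gb > bf16_footprint_gb else "QLoRA"

print(choose_peft(24, 7))    # LoRA   (24 GB > 14 GB)
print(choose_peft(16, 13))   # QLoRA  (16 GB < 26 GB)
print(choose_peft(8, 7))     # QLoRA  ( 8 GB < 14 GB)
```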
from peft import LoraConfig, get_peft_model, TaskType

# Find which linear layer names exist in this model
# This is needed to set target_modules correctly
def find_linear_layer_names(model) -> list:
    """Find all linear layer module names suitable for LoRA."""
    linear_names = set()
    for name, module in model.named_modules():
        if isinstance(module, (torch.nn.Linear, bnb.nn.Linear4bit, bnb.nn.Linear8bitLt)):
            # Get the last part of the name (e.g., "q_proj" from "model.layers.0.self_attn.q_proj")
            parts = name.split(".")
            last_name = parts[-1]
            # Skip output embedding layers
            if last_name not in ["lm_head", "embed_tokens"]:
                linear_names.add(last_name)
    return sorted(list(linear_names))


all_linear_layers = find_linear_layer_names(model)
print("All trainable linear layers in the model:")
for name in all_linear_layers:
    print(f"  {name}")

# Standard target modules for transformer models:
# q_proj, k_proj, v_proj, o_proj = attention projections
# gate_proj, up_proj, down_proj  = MLP/FFN projections
# Targeting all of them (vs just q+v) gives better quality at slightly more memory

ATTENTION_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj"]
MLP_MODULES       = ["gate_proj", "up_proj", "down_proj"]

# Use modules that actually exist in this model
target_attention = [m for m in ATTENTION_MODULES if m in all_linear_layers]
target_mlp       = [m for m in MLP_MODULES if m in all_linear_layers]
target_modules   = target_attention + target_mlp

print(f"\nSelected target modules: {target_modules}")
print(f"  Attention: {target_attention}")
print(f"  MLP/FFN:   {target_mlp}")
# -----------------------------------------------------------------------
# LoRA Configuration for QLoRA
# -----------------------------------------------------------------------
lora_config = LoraConfig(
    # --- Core LoRA parameters ---
    r=16,                          # Rank: higher = more capacity, more memory
                                   # Typical values: 8, 16, 32, 64
                                   # r=8:  minimal; r=64: near full fine-tuning quality

    lora_alpha=32,                 # Scaling: adapter output is multiplied by alpha/r
                                   # Convention: set to 2Γ—r (so scaling = 2.0)
                                   # Higher alpha = stronger adapter influence

    lora_dropout=0.05,             # Dropout applied to LoRA layers
                                   # 0.05–0.1 is typical; 0.0 for small datasets

    bias="none",                   # Whether to train bias terms
                                   # "none" = no bias (most common)
                                   # "all" = train all biases
                                   # "lora_only" = train only LoRA biases

    # --- Target modules ---
    target_modules=target_modules, # Which linear layers to add LoRA to
                                   # More modules = better quality, more params

    # --- Task type ---
    task_type=TaskType.CAUSAL_LM,  # Autoregressive language modeling

    # --- Advanced options (will explore below) ---
    use_dora=False,                # DoRA: Weight-Decomposed LoRA (see section 7)
    use_rslora=False,              # RSLoRA: Rank-Stabilized (see section 8)
    # init_lora_weights="gaussian", # Weight initialization (default: kaiming)
)

# Apply LoRA to the quantized model
model = get_peft_model(model, lora_config)

# Count parameters
def print_param_summary(model):
    total     = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    frozen    = total - trainable
    print(f"  Total params    : {total:>15,} ({total/1e9:.3f}B)")
    print(f"  Trainable (LoRA): {trainable:>15,} ({trainable/total*100:.3f}%)")
    print(f"  Frozen (base)   : {frozen:>15,} ({frozen/total*100:.3f}%)")

print("LoRA applied to quantized model:")
print_param_summary(model)

print()
mem_with_lora, _ = get_gpu_memory_gb()
print(f"VRAM with base + LoRA: {mem_with_lora:.2f} GB")
print()
model.print_trainable_parameters()

6. DoRA β€” Weight-Decomposed Low-Rank AdaptationΒΆ

Paper: Liu et al. (2024), DoRA: Weight-Decomposed Low-Rank Adaptation

DoRA is a drop-in improvement over LoRA that consistently achieves better performance with the same parameter budget.

The InsightΒΆ

DoRA decomposes each weight matrix into magnitude and direction components (inspired by weight normalization):

\[W = \underbrace{m}_{\text{magnitude}} \cdot \underbrace{\frac{W_0 + BA}{\|W_0 + BA\|_c}}_{\text{direction}}\]
  • \(W_0\): frozen pre-trained weight

  • \(B\), \(A\): trainable LoRA matrices (low-rank, as in standard LoRA)

  • \(m\): trainable magnitude vector (one scalar per output feature)
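The decomposition is easy to verify numerically: with \(B\) initialized to zero (as in standard LoRA) and \(m\) initialized to the column norms of \(W_0\), the reconstruction reproduces the frozen weight exactly. A NumPy sketch with toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 2
W0 = rng.standard_normal((d_out, d_in))   # frozen pre-trained weight
B  = np.zeros((d_out, r))                 # LoRA B starts at zero
A  = rng.standard_normal((r, d_in))       # LoRA A
m  = np.linalg.norm(W0, axis=0)           # magnitude: one scalar per column (||.||_c)

V = W0 + B @ A                            # direction component before normalization
W = m * (V / np.linalg.norm(V, axis=0))   # DoRA reconstruction: magnitude x unit direction

print("matches W0 at init:", np.allclose(W, W0))
```

During training, \(B\), \(A\), and \(m\) are updated while \(W_0\) stays frozen, so magnitude and direction can move independently.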

Why It Works BetterΒΆ

Standard LoRA is constrained to make updates in a low-rank subspace. This limits how independently it can adjust the magnitude vs the direction of each weight vector.

DoRA separates these, allowing:

  • Magnitude adjustments without changing direction (and vice versa)

  • Learning patterns more similar to full fine-tuning

  • Better performance at the same rank

Benchmark Results (from the DoRA paper)ΒΆ

| Method | Commonsense (avg) | MMLU | MT-Bench | Extra params |
|--------|-------------------|------|----------|--------------|
| Full FT | 78.6 | 56.1 | 6.8 | 100% |
| LoRA (r=32) | 74.0 | 53.2 | 6.4 | 0.8% |
| DoRA (r=32) | 76.8 | 55.0 | 6.6 | 0.8% |

DoRA consistently outperforms LoRA by roughly 2–3 points across tasks with identical parameter counts.

When to Use DoRAΒΆ

  • When LoRA performance is close but not quite good enough

  • When you want closer-to-full-FT quality without increasing rank

  • When training on complex reasoning or instruction following tasks

Minor drawback: ~5% slower than LoRA due to extra normalization.

# DoRA configuration β€” only requires use_dora=True in LoraConfig

dora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=target_modules,
    task_type=TaskType.CAUSAL_LM,
    use_dora=True,          # <── the only change from standard LoRA
    use_rslora=False,
)

print("DoRA Config (use_dora=True)")
print(f"  rank          : {dora_config.r}")
print(f"  lora_alpha    : {dora_config.lora_alpha}")
print(f"  use_dora      : {dora_config.use_dora}")
print()

# Compare parameter counts: DoRA vs LoRA
print("Parameter comparison for a 7B model (hypothetical, r=16):")
print("-" * 60)

# Approximate LoRA param count for 7B model
hidden_size    = 4096
num_layers     = 32
rank           = 16
n_target_mods  = 7   # q, k, v, o, gate, up, down

lora_params    = 2 * rank * hidden_size * num_layers * n_target_mods
# DoRA adds a magnitude vector per output feature per targeted module
dora_extra     = hidden_size * num_layers * n_target_mods  # magnitude vectors
dora_params    = lora_params + dora_extra

print(f"  LoRA params  : {lora_params:>12,}  ({lora_params/7e9*100:.4f}% of 7B)")
print(f"  DoRA params  : {dora_params:>12,}  ({dora_params/7e9*100:.4f}% of 7B)")
print(f"  DoRA overhead: {dora_extra:>12,}  (+{dora_extra/lora_params*100:.1f}% vs LoRA)")
print()
print("DoRA adds only minimal extra parameters (~3% more than LoRA)")
print("but consistently achieves better fine-tuning quality.")

7. RSLoRA β€” Rank-Stabilized LoRAΒΆ

Paper: Kalajdzievski (2023), A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA

The Problem with Standard LoRA ScalingΒΆ

In standard LoRA, the adapter output is scaled by alpha / r:

\[\Delta W = \frac{\alpha}{r} \cdot B \cdot A\]

This scaling factor decreases as rank increases. The problem:

  • At low rank (r=8): scaling = alpha/8 (relatively large)

  • At high rank (r=64): scaling = alpha/64 (small, learning is slow)

In practice this means standard LoRA becomes harder to train at high ranks, limiting its effectiveness even when you add more parameters.

RSLoRA Fix: Scale by \(\sqrt{r}\) InsteadΒΆ

RSLoRA replaces the alpha / r scaling with alpha / sqrt(r):

\[\Delta W_{RSLoRA} = \frac{\alpha}{\sqrt{r}} \cdot B \cdot A\]

This ensures the gradient norm remains approximately constant regardless of rank, making optimization stable across all rank values.

ImpactΒΆ

| Rank | LoRA scaling | RSLoRA scaling | RSLoRA advantage |
|------|--------------|----------------|------------------|
| r=8 | alpha/8 = 4.0 | alpha/√8 β‰ˆ 11.3 | ~2.8x |
| r=16 | alpha/16 = 2.0 | alpha/√16 = 8.0 | ~4.0x |
| r=32 | alpha/32 = 1.0 | alpha/√32 β‰ˆ 5.7 | ~5.7x |
| r=64 | alpha/64 = 0.5 | alpha/√64 = 4.0 | ~8.0x |

(With alpha=32 as example)

RSLoRA is especially beneficial when using high ranks (r=32+). For low ranks (r=8), the difference is minimal.

When to Use RSLoRAΒΆ

  • When using rank β‰₯ 32

  • When standard LoRA seems to converge slowly

  • When you want more stable training dynamics

# RSLoRA configuration

rslora_config = LoraConfig(
    r=32,                   # Higher rank benefits more from RSLoRA
    lora_alpha=32,          # With RSLoRA, alpha/sqrt(r) = 32/sqrt(32) β‰ˆ 5.66
    lora_dropout=0.05,
    bias="none",
    target_modules=target_modules,
    task_type=TaskType.CAUSAL_LM,
    use_rslora=True,        # <── enable rank-stabilized scaling
    use_dora=False,
)

# DoRA + RSLoRA combined (they are complementary)
dora_rslora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=target_modules,
    task_type=TaskType.CAUSAL_LM,
    use_rslora=True,        # rank-stabilized scaling
    use_dora=True,          # weight decomposition
)

print("Scaling factor comparison (alpha=32):")
print("-" * 50)
print(f"{'Rank':<6} {'LoRA (a/r)':>12} {'RSLoRA (a/√r)':>14} {'Ratio':>8}")
print("-" * 50)
alpha = 32
for r in [4, 8, 16, 32, 64, 128]:
    lora_scale   = alpha / r
    rslora_scale = alpha / math.sqrt(r)
    ratio        = rslora_scale / lora_scale
    print(f"{r:<6} {lora_scale:>12.4f} {rslora_scale:>14.4f} {ratio:>8.2f}x")

print()
print("Summary of LoRA variants:")
print("-" * 50)
variants = [
    ("Standard LoRA",  False, False, "Baseline. Good for r≀16"),
    ("RSLoRA",         True,  False, "Better at rβ‰₯32; stable gradients"),
    ("DoRA",           False, True,  "Better quality; +~5% slower"),
    ("DoRA + RSLoRA",  True,  True,  "Best of both worlds for rβ‰₯32"),
]
print(f"{'Variant':<18} {'RSLoRA':>7} {'DoRA':>7}  Description")
print("-" * 65)
for name, rslora, dora, desc in variants:
    print(f"{name:<18} {str(rslora):>7} {str(dora):>7}  {desc}")

8. PEFT Methods ComparisonΒΆ

Here is a comprehensive benchmark comparison of all major PEFT methods. Numbers are from published papers and community reproductions on LLaMA-2-7B.

Quality Benchmarks (LLaMA-2-7B, instruction tuning)ΒΆ

| Method | Trainable % | MMLU | MT-Bench | Commonsense | Memory (7B) | Notes |
|--------|-------------|------|----------|-------------|-------------|-------|
| Full SFT | 100% | 58.1 | 7.1 | 79.2 | ~86 GB | Requires A100Γ—4 |
| LoRA r=64 | 1.6% | 57.2 | 6.9 | 78.3 | ~20 GB | Near full quality |
| LoRA r=16 | 0.4% | 56.5 | 6.7 | 76.8 | ~18 GB | Best value |
| DoRA r=16 | 0.4% | 57.0 | 6.9 | 77.8 | ~18 GB | Recommended |
| QLoRA r=16 | 0.4% | 55.8 | 6.5 | 75.9 | ~6 GB | Consumer GPUs |
| QLoRA+DoRA r=16 | 0.4% | 56.4 | 6.7 | 76.7 | ~6.1 GB | QLoRA + quality |
| Adapters (Houlsby) | 0.9% | 54.1 | 6.1 | 73.2 | ~16 GB | Sequential bottleneck |
| Prefix Tuning | 0.1% | 51.2 | 5.8 | 70.1 | ~14 GB | Soft prompts only |
| Prompt Tuning | <0.01% | 48.5 | 5.3 | 67.8 | ~14 GB | Simplest PEFT |
| IA3 | 0.01% | 52.1 | 5.9 | 71.3 | ~14 GB | Few-shot focused |

Memory vs Quality Trade-offΒΆ

Quality
  β–²
  β”‚  Full SFT ●
  β”‚               LoRA r=64 ●
  β”‚          DoRA r=16 ●
  β”‚        LoRA r=16 ●    QDoRA ●
  β”‚                       QLoRA r=16 ●
  β”‚  IA3 ●  Adapters ●
  β”‚  Prefix ●
  β”‚
  └────────────────────────────────► Memory
    6GB   16GB   20GB   86GB

Sweet spots: LoRA r=16 (16GB), QLoRA r=16 (6GB)

Speed Comparison (tokens/second on A100 80GB, 7B model, seq_len=2048)ΒΆ

| Method | Tokens/sec | Relative Speed |
|--------|------------|----------------|
| Full SFT | 1,850 | 1.00Γ— (baseline) |
| LoRA + FA2 | 2,100 | 1.14Γ— |
| DoRA | 1,980 | 1.07Γ— |
| QLoRA | 1,200 | 0.65Γ— |

QLoRA is slower due to dequantization overhead on every forward pass.

# Visualize PEFT methods: quality vs memory trade-off

methods = [
    # (name, trainable_pct, memory_7b_gb, mmlu_score)
    ("Full SFT",       100.0, 86.0, 58.1),
    ("LoRA r=64",      1.6,   20.0, 57.2),
    ("DoRA r=16",      0.4,   18.5, 57.0),
    ("LoRA r=16",      0.4,   18.0, 56.5),
    ("QLoRA+DoRA r=16",0.4,    6.1, 56.4),
    ("QLoRA r=16",     0.4,    6.0, 55.8),
    ("Adapters",       0.9,   16.0, 54.1),
    ("IA3",            0.01,  14.0, 52.1),
    ("Prefix Tuning",  0.1,   14.0, 51.2),
    ("Prompt Tuning",  0.005, 14.0, 48.5),
]

names    = [m[0] for m in methods]
memory   = [m[2] for m in methods]
quality  = [m[3] for m in methods]
trainable = [m[1] for m in methods]

# Color by category
colors = [
    "#e74c3c",  # Full SFT β€” red
    "#3498db",  # LoRA r=64 β€” blue
    "#9b59b6",  # DoRA r=16 β€” purple
    "#2980b9",  # LoRA r=16 β€” blue (darker)
    "#1abc9c",  # QLoRA+DoRA β€” teal
    "#27ae60",  # QLoRA r=16 β€” green
    "#f39c12",  # Adapters β€” orange
    "#95a5a6",  # IA3 β€” gray
    "#bdc3c7",  # Prefix β€” light gray
    "#ecf0f1",  # Prompt β€” very light
]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Plot 1: Memory vs Quality
scatter = ax1.scatter(memory, quality, s=200, c=colors, edgecolors="#2c3e50",
                       linewidths=1.5, zorder=5)
for i, name in enumerate(names):
    offset_x, offset_y = 5, 3   # label offsets in points, not data units
    # Nudge alternating labels apart for tightly clustered points
    if name in ["LoRA r=16", "DoRA r=16", "QLoRA+DoRA r=16"]:
        offset_y = -10 if i % 2 == 0 else 6
    ax1.annotate(name, (memory[i], quality[i]),
                 xytext=(offset_x, offset_y), textcoords="offset points",
                 fontsize=8, ha="left")

ax1.set_xlabel("GPU Memory Required β€” 7B Model (GB)", fontsize=11)
ax1.set_ylabel("MMLU Score", fontsize=11)
ax1.set_title("PEFT Methods: Memory vs Quality Trade-off", fontsize=12)
ax1.grid(True, alpha=0.3)
ax1.set_xlim(-2, 95)
ax1.set_ylim(46, 60)

# Highlight the recommended zone
ax1.axvspan(0, 10, alpha=0.08, color="green", label="Consumer GPU zone")
ax1.axvspan(10, 30, alpha=0.05, color="blue", label="Workstation GPU zone")
ax1.legend(fontsize=9, loc="lower right")

# Plot 2: Trainable parameters vs Quality (bar chart)
sorted_idx = sorted(range(len(quality)), key=lambda i: quality[i], reverse=True)
sorted_names   = [names[i] for i in sorted_idx]
sorted_quality = [quality[i] for i in sorted_idx]
sorted_colors  = [colors[i] for i in sorted_idx]

bars = ax2.barh(range(len(sorted_names)), sorted_quality,
                color=sorted_colors, edgecolor="#2c3e50", linewidth=0.8)
ax2.set_yticks(range(len(sorted_names)))
ax2.set_yticklabels(sorted_names, fontsize=9)
ax2.set_xlabel("MMLU Score", fontsize=11)
ax2.set_title("PEFT Methods: Ranked by Quality", fontsize=12)
ax2.set_xlim(45, 61)
ax2.axvline(58.1, color="#e74c3c", linestyle="--", alpha=0.5, label="Full SFT")
ax2.legend(fontsize=9)
ax2.grid(True, alpha=0.3, axis="x")

# Add score labels on bars
for i, (bar, score) in enumerate(zip(bars, sorted_quality)):
    ax2.text(score + 0.05, bar.get_y() + bar.get_height()/2,
             f"{score}", va="center", fontsize=8)

plt.tight_layout()
plt.savefig("peft_comparison.png", dpi=120, bbox_inches="tight")
plt.show()
print("Saved: peft_comparison.png")
from datasets import load_dataset

# Load dataset (same alpaca format as notebook 02)
print("Loading dataset for QLoRA training...")
raw_dataset = load_dataset("tatsu-lab/alpaca", split="train")

# Format function
ALPACA_PROMPT_WITH_INPUT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""

ALPACA_PROMPT_WITHOUT_INPUT = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
{output}"""


def format_alpaca(example: Dict) -> Dict:
    if example.get("input", "").strip():
        text = ALPACA_PROMPT_WITH_INPUT.format(
            instruction=example["instruction"],
            input=example["input"],
            output=example["output"],
        )
    else:
        text = ALPACA_PROMPT_WITHOUT_INPUT.format(
            instruction=example["instruction"],
            output=example["output"],
        )
    return {"text": text}


formatted = raw_dataset.map(
    format_alpaca,
    remove_columns=raw_dataset.column_names,
    desc="Formatting dataset",
)

split = formatted.train_test_split(test_size=0.02, seed=42)
train_dataset = split["train"]
eval_dataset  = split["test"]

print(f"Train: {len(train_dataset):,} examples")
print(f"Eval : {len(eval_dataset):,} examples")
print()
print("Sample:")
print(train_dataset[0]["text"][:300])
print("...")
from trl import SFTTrainer, SFTConfig

OUTPUT_DIR = "./qlora-qwen2.5-1.5b-alpaca"
MAX_SEQ_LENGTH = 1024

# -----------------------------------------------------------------------
# QLoRA-specific training arguments
# -----------------------------------------------------------------------
qlora_training_args = SFTConfig(
    output_dir=OUTPUT_DIR,

    # --- Dataset ---
    max_seq_length=MAX_SEQ_LENGTH,
    dataset_text_field="text",
    packing=True,
    dataset_num_proc=4,

    # --- Batch / accumulation ---
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,         # effective batch = 16

    # --- Optimizer: paged AdamW for QLoRA ---
    # paged_adamw_8bit: 8-bit optimizer states stored in pageable memory
    # This is the key optimizer choice for memory-constrained QLoRA
    optim="paged_adamw_8bit",
    learning_rate=2e-4,                    # Higher LR for LoRA vs full SFT
    weight_decay=0.001,
    max_grad_norm=0.3,                     # More conservative clipping for 4-bit
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,

    # --- LR schedule ---
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,

    # --- Mixed precision ---
    # For QLoRA: do NOT use fp16=True if using bfloat16 compute dtype
    # The base model is stored as NF4 and computation is in bfloat16
    bf16=torch.cuda.is_available() and torch.cuda.is_bf16_supported(),
    fp16=torch.cuda.is_available() and not torch.cuda.is_bf16_supported(),

    # --- Memory ---
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},

    # --- Checkpointing ---
    save_strategy="steps",
    save_steps=200,
    save_total_limit=3,
    load_best_model_at_end=True,

    # --- Evaluation ---
    eval_strategy="steps",
    eval_steps=200,

    # --- Logging ---
    logging_steps=10,
    logging_first_step=True,
    report_to="none",

    seed=42,
)

print("QLoRA Training Configuration")
print("=" * 55)
print(f"  Optimizer           : {qlora_training_args.optim}")
print(f"  Learning rate       : {qlora_training_args.learning_rate}")
print(f"  Effective batch     : {qlora_training_args.per_device_train_batch_size * qlora_training_args.gradient_accumulation_steps}")
print(f"  BF16                : {qlora_training_args.bf16}")
print(f"  Max seq length      : {qlora_training_args.max_seq_length}")
print(f"  Packing             : {qlora_training_args.packing}")
print()
print("Key QLoRA training choices:")
print("  1. paged_adamw_8bit  β€” optimizer states in pageable GPU memory")
print("  2. LR = 2e-4         β€” higher than full SFT (LoRA-specific)")
print("  3. max_grad_norm=0.3 β€” conservative clipping for 4-bit training")
print("  4. packing=True      β€” max throughput with short sequences")
# Memory tracking throughout the QLoRA training pipeline

def memory_snapshot(label: str) -> Dict[str, float]:
    """Capture a labeled GPU memory snapshot."""
    snapshot = {"label": label}
    if torch.cuda.is_available():
        torch.cuda.synchronize()
        snapshot["allocated_gb"] = torch.cuda.memory_allocated() / 1e9
        snapshot["reserved_gb"]  = torch.cuda.memory_reserved() / 1e9
        snapshot["total_gb"]     = torch.cuda.get_device_properties(0).total_memory / 1e9
        snapshot["free_gb"]      = snapshot["total_gb"] - snapshot["reserved_gb"]
    else:
        snapshot["allocated_gb"] = 0.0
        snapshot["reserved_gb"]  = 0.0
        snapshot["total_gb"]     = 0.0
        snapshot["free_gb"]      = 0.0
    return snapshot


snapshots = []

# Snapshot: after model load (captured earlier)
snap = memory_snapshot("After 4-bit model load")
snap["allocated_gb"] = mem_with_lora  # use our earlier measurement
snapshots.append(snap)

# Initialize trainer
qlora_trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=qlora_training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

snapshots.append(memory_snapshot("After SFTTrainer init"))

print("Memory Tracking β€” QLoRA Pipeline")
print("=" * 60)
print(f"{'Stage':<35} {'Allocated':>10} {'Reserved':>10}")
print("-" * 60)
for s in snapshots:
    print(f"{s['label']:<35} {s['allocated_gb']:>9.2f}G {s['reserved_gb']:>9.2f}G")

if torch.cuda.is_available():
    total = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"\nTotal VRAM available: {total:.1f} GB")
    current = torch.cuda.memory_reserved() / 1e9
    print(f"Current usage       : {current:.1f} GB ({current/total*100:.1f}%)")
    print(f"Remaining free      : {total-current:.1f} GB")

print()
print("QLoRA Trainer initialized.")
print(f"Trainable params: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
import time

print("Starting QLoRA training...")
print("=" * 55)

start_time = time.time()

# Run training
train_result = qlora_trainer.train()

elapsed = time.time() - start_time
mem_peak = torch.cuda.max_memory_allocated() / 1e9 if torch.cuda.is_available() else 0.0

print()
print("QLoRA Training Complete!")
print("=" * 55)
print(f"  Training time    : {elapsed/60:.1f} minutes")
print(f"  Final train loss : {train_result.training_loss:.4f}")
print(f"  Peak GPU memory  : {mem_peak:.2f} GB")
# train_result.metrics is a plain dict, so .get() is always available
samples_per_sec = train_result.metrics.get("train_samples_per_second", "N/A")
print(f"  Samples/sec      : {samples_per_sec}")
# Save the LoRA adapter (NOT the full model β€” just the small adapter weights)

ADAPTER_DIR = f"{OUTPUT_DIR}/lora-adapter"

print("Saving LoRA adapter...")
model.save_pretrained(ADAPTER_DIR)          # saves only adapter weights (~MB, not GB)
tokenizer.save_pretrained(ADAPTER_DIR)

print(f"\nAdapter saved to: {ADAPTER_DIR}/")

if os.path.exists(ADAPTER_DIR):
    files = sorted(os.listdir(ADAPTER_DIR))
    total_mb = 0
    for f in files:
        fp = os.path.join(ADAPTER_DIR, f)
        if os.path.isfile(fp):
            size_mb = os.path.getsize(fp) / 1e6
            total_mb += size_mb
            print(f"  {f:<50} {size_mb:>7.1f} MB")
    print(f"\n  Total adapter size: {total_mb:.1f} MB")
    print(f"  (Compare to full model: {sum(p.numel() for p in model.parameters()) * 2 / 1e9 * 1000:.0f} MB in BF16)")

print()
print("The adapter contains only the LoRA matrices (A and B).")
print("At inference time, load base model + adapter β€” no quality loss!")
# Inference with the QLoRA-trained adapter

from peft import PeftModel

print("Loading base model + LoRA adapter for inference...")

# Option A: Use the already-loaded model (adapter is already attached)
model.eval()

def generate_qlora(
    instruction: str,
    input_text: str = "",
    max_new_tokens: int = 200,
    temperature: float = 0.7,
) -> str:
    """Generate a response using the QLoRA fine-tuned model."""
    if input_text.strip():
        prompt = ALPACA_PROMPT_WITH_INPUT.format(
            instruction=instruction,
            input=input_text,
            output="",
        ).rstrip()
    else:
        prompt = ALPACA_PROMPT_WITHOUT_INPUT.format(
            instruction=instruction,
            output="",
        ).rstrip()

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    input_len = inputs["input_ids"].shape[1]

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            do_sample=temperature > 0,
            top_p=0.9,
            repetition_penalty=1.1,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
        )

    new_tokens = outputs[0][input_len:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()


print("\n" + "=" * 60)
print("QLoRA Model Inference")
print("=" * 60)

test_cases = [
    {"instruction": "Explain the difference between supervised and unsupervised learning.", "input": ""},
    {"instruction": "Write a Python function to check if a number is prime.", "input": ""},
    {"instruction": "Summarize the key points of this text.",
     "input": "Transformers are deep learning models that use self-attention mechanisms to process sequential data. They were introduced in the paper 'Attention Is All You Need' in 2017 and have since become the dominant architecture for NLP tasks."},
]

for i, tc in enumerate(test_cases):
    print(f"\n[Test {i+1}]")
    print(f"Instruction: {tc['instruction']}")
    if tc['input']:
        print(f"Input: {tc['input'][:80]}...")
    response = generate_qlora(tc["instruction"], tc.get("input", ""))
    print(f"Response: {response}")
    print("-" * 60)

9. Merging LoRA Adapters for DeploymentΒΆ

For production deployment, you have two options:

Option A: Serve Base + Adapter (no merge)ΒΆ

  • Keep the 4-bit base model and load the adapter at startup

  • Pros: Small adapter files, easy to swap adapters

  • Cons: Slightly slower inference (adapter overhead), requires PEFT library

Option B: Merge Adapter into Base ModelΒΆ

Calling merge_and_unload() mathematically merges W + B·A back into the base model weights and removes the adapter overhead:

merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
  • Pros: Standard model format, fastest inference, no PEFT dependency

  • Cons: Larger saved model, cannot swap adapters

Important note: After merging, the model is no longer quantized. The merged weights are in BF16 (the compute dtype). To re-quantize for efficient inference, run the merged model through bitsandbytes or llama.cpp.
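As a sketch of the bitsandbytes route, the merged BF16 checkpoint can be reloaded with 4-bit quantization applied at load time (a config fragment, not executed here; assumes the `./merged-model` directory from the merge step below and a CUDA GPU):

```python
# Sketch: re-quantize the merged BF16 checkpoint at load time.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

requantized = AutoModelForCausalLM.from_pretrained(
    "./merged-model",                 # hypothetical path from the merge step
    quantization_config=bnb_config,
    device_map="auto",
)
```

Note that requantizing a fine-tuned model can cost a little quality, since the merged weights no longer match the original quantization grid exactly.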

Option C: Merge + Re-quantize with GGUF (for llama.cpp)ΒΆ

# 1. Merge adapter
python merge_adapter.py --base MODEL_ID --adapter ADAPTER_DIR --output merged/

# 2. Convert to GGUF
python llama.cpp/convert.py merged/ --outtype f16 --outfile model.gguf

# 3. Quantize GGUF
llama.cpp/quantize model.gguf model-q4_k_m.gguf Q4_K_M
# Demonstrate how to merge the LoRA adapter into the base model
# (This creates a standard model without adapter overhead)

MERGED_DIR = f"{OUTPUT_DIR}/merged-model"

print("Merging LoRA adapter into base model...")
print("Note: This upscales weights from NF4 to BF16.")
print()

# merge_and_unload() merges W + BΒ·A and removes LoRA layers
merged_model = model.merge_and_unload()

mem_merged, _ = get_gpu_memory_gb()
print(f"GPU memory after merge: {mem_merged:.2f} GB")
print("(Larger than 4-bit model β€” merged model is in BF16)")
print()

# Save merged model
print(f"Saving merged model to: {MERGED_DIR}/")
merged_model.save_pretrained(MERGED_DIR, safe_serialization=True)
tokenizer.save_pretrained(MERGED_DIR)

if os.path.exists(MERGED_DIR):
    total_size = sum(
        os.path.getsize(os.path.join(MERGED_DIR, f))
        for f in os.listdir(MERGED_DIR)
        if os.path.isfile(os.path.join(MERGED_DIR, f))
    )
    print(f"Merged model size: {total_size / 1e9:.2f} GB")

print()
print("Merged model can be loaded like any standard Hugging Face model:")
print(f"""
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "{MERGED_DIR}",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# No PEFT library needed! Standard inference.
""")

# Clean up
del merged_model
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

10. Complete QLoRA Pipeline β€” Reference CodeΒΆ

The following cell contains the complete, minimal QLoRA fine-tuning pipeline in one place. Use this as a starting template for your own projects.

"""
Complete QLoRA Fine-tuning Pipeline β€” Reference Template
=========================================================
Copy this cell as a standalone script for your own projects.

Requirements:
    pip install transformers trl peft bitsandbytes datasets accelerate
"""

# --- [0] Configuration ---
CONFIG = {
    "model_id":            "Qwen/Qwen2.5-1.5B-Instruct",
    "dataset_id":          "tatsu-lab/alpaca",
    "output_dir":          "./qlora-output",

    # LoRA
    "lora_r":              16,
    "lora_alpha":          32,
    "lora_dropout":        0.05,
    "use_dora":            False,      # set True for better quality
    "use_rslora":          False,      # set True for r >= 32

    # Quantization
    "quant_type":          "nf4",      # "nf4" or "fp4"
    "double_quant":        True,

    # Training
    "max_seq_length":      1024,
    "batch_size":          2,
    "grad_accum":          8,
    "learning_rate":       2e-4,
    "num_epochs":          1,
    "warmup_ratio":        0.03,
    "optimizer":           "paged_adamw_8bit",
}

print("Complete QLoRA Pipeline β€” Configuration")
print("=" * 55)
for k, v in CONFIG.items():
    print(f"  {k:<22}: {v}")

print()
print("""
Pipeline steps (reference β€” already executed above):
─────────────────────────────────────────────────────

1. Configure BitsAndBytesConfig
   bnb_config = BitsAndBytesConfig(
       load_in_4bit=True,
       bnb_4bit_quant_type=CONFIG["quant_type"],
       bnb_4bit_compute_dtype=torch.bfloat16,
       bnb_4bit_use_double_quant=CONFIG["double_quant"],
   )

2. Load tokenizer + 4-bit quantized model
   tokenizer = AutoTokenizer.from_pretrained(CONFIG["model_id"])
   model = AutoModelForCausalLM.from_pretrained(
       CONFIG["model_id"], quantization_config=bnb_config, device_map="auto"
   )

3. Prepare for k-bit training
   model = prepare_model_for_kbit_training(model)

4. Apply LoRA adapters
   lora_cfg = LoraConfig(r=CONFIG["lora_r"], ...)
   model = get_peft_model(model, lora_cfg)

5. Load and format dataset
   dataset = load_dataset(CONFIG["dataset_id"])
   dataset = dataset.map(format_alpaca)

6. Configure SFTTrainer and train
   trainer = SFTTrainer(model, tokenizer, args, train_dataset, eval_dataset)
   trainer.train()

7. Save adapter
   model.save_pretrained("./adapter")

8. (Optional) Merge + save full model
   merged = model.merge_and_unload()
   merged.save_pretrained("./merged")
""")

Summary β€” Key TakeawaysΒΆ

QLoRA Architecture RecapΒΆ

QLoRA = NF4 quantized base (frozen) + LoRA adapters (trainable in BF16)

Memory savings vs full SFT:
  7B model : ~86 GB → ~6 GB   (14× reduction)
  70B model: ~860 GB → ~40 GB (21× reduction)
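These savings can be approximated with a back-of-envelope formula (a rough sketch: the per-parameter overheads are approximations, and activations/KV cache are excluded, which is why real peak usage runs a couple of GB higher):

```python
# Back-of-envelope QLoRA memory estimate in GB for an n-billion-parameter
# model (weights + adapter + optimizer state; activations excluded).
def qlora_memory_gb(n_params_b: float, lora_frac: float = 0.004) -> float:
    base      = n_params_b * 0.5               # NF4: ~0.5 bytes per parameter
    quant_ovh = n_params_b * 0.02              # quant constants after double quant
    lora      = n_params_b * lora_frac * 2     # adapter weights in BF16
    optim     = n_params_b * lora_frac * 8     # AdamW states, adapter only
    grads     = n_params_b * lora_frac * 2     # gradients, adapter only
    return base + quant_ovh + lora + optim + grads

for size in (7, 13, 70):
    print(f"{size}B model: ~{qlora_memory_gb(size):.1f} GB (weights + optimizer)")
```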

When to Use Each PEFT MethodΒΆ

| Scenario | Method | Why |
|----------------------------------|---------------------|--------------------------------------|
| VRAM > 2× model BF16 size        | LoRA                | Faster, slightly better quality      |
| VRAM < 2× model BF16 size        | QLoRA               | Only option for large models         |
| Need best quality at same params | DoRA                | +2-4% benchmark improvement          |
| Using high ranks (r≥32)          | RSLoRA              | Stable gradients, better convergence |
| Best quality, enough VRAM        | DoRA+RSLoRA         | Combines both improvements           |
| Tiny GPU, minimal params         | IA3 / Prompt tuning | <0.1% parameters                     |
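The RSLoRA recommendation reflects a simple scaling change: standard LoRA multiplies the update B·A by alpha/r, which shrinks as rank grows, while rank-stabilized LoRA uses alpha/sqrt(r). A few lines of arithmetic make the difference concrete:

```python
import math

# Effective LoRA scaling factor as rank grows (alpha fixed at 32):
# standard LoRA's alpha/r collapses at high rank; RSLoRA's alpha/sqrt(r)
# decays far more gently, keeping gradient magnitudes stable.
alpha = 32
for r in (8, 16, 32, 64, 128):
    standard = alpha / r
    rslora   = alpha / math.sqrt(r)
    print(f"r={r:<4} standard={standard:<6.3f} rslora={rslora:.3f}")
```

At r=64 the standard scale has fallen to 0.5 while the rank-stabilized scale is still 4.0, which is why RSLoRA converges better at high ranks.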

QLoRA Best Practices ChecklistΒΆ

[x] Use NF4 quantization (not FP4) for transformer models
[x] Enable double quantization (saves ~3GB extra on 70B models)
[x] Set compute_dtype to bfloat16 (Ampere+) or float16 (older)
[x] Use prepare_model_for_kbit_training() before applying LoRA
[x] Use paged_adamw_8bit optimizer for minimal memory spikes
[x] Set LR = 2e-4 to 3e-4 (higher than full SFT)
[x] Set max_grad_norm = 0.3 (conservative for 4-bit)
[x] Enable gradient checkpointing (reduces activation memory)
[x] Enable sequence packing (2-3x throughput improvement)
[x] Target both attention AND MLP layers for best quality
[x] Consider DoRA=True for +2-4% quality with no extra memory
[x] Save only the adapter (~MB), not the full model (~GB)
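The "~3 GB on 70B" figure for double quantization can be sanity-checked: the QLoRA paper reports that quantizing the per-block constants shrinks their overhead from roughly 0.5 to 0.127 bits per parameter.

```python
# Sanity check: memory saved by double quantization on a 70B model,
# using the ~0.5 -> ~0.127 bits/parameter figures from the QLoRA paper.
params = 70e9
saved_bits_per_param = 0.5 - 0.127
saved_gb = params * saved_bits_per_param / 8 / 1e9
print(f"Double quantization saves ~{saved_gb:.1f} GB on a 70B model")
```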

Memory Quick ReferenceΒΆ

GPU VRAM    →  Maximum model size with QLoRA
4 GB        →  1.5B–3B parameters
6 GB        →  3B–7B parameters
8 GB        →  7B parameters comfortably
12 GB       →  7B–13B parameters
16 GB       →  13B parameters
24 GB       →  13B–30B parameters
40 GB       →  30B–70B parameters
80 GB       →  70B parameters (A100/H100 class)

Next StepsΒΆ

  • Notebook 05 β€” Instruction Tuning: curate better training datasets

  • Notebook 06 β€” Evaluation: measure model quality rigorously

  • Notebook 07 β€” RLHF/DPO: align fine-tuned model behavior

  • Notebook 08 β€” Deployment: serve your fine-tuned model in production

References: