QLoRA — Memory-Efficient Fine-Tuning on Consumer GPUs
Phase 12 — LLM Fine-tuning | Notebook 04
What You Will Learn
QLoRA architecture: 4-bit quantized base model + LoRA adapters
BitsAndBytes quantization: NF4 vs FP4 — when each is appropriate
How QLoRA makes fine-tuning a 33B model possible on a 24 GB GPU (and a 70B model on a single 48 GB GPU)
Configuring BitsAndBytesConfig correctly
Paged AdamW optimizer — why it matters for large models
Nested (double) quantization — extra memory savings
QLoRA vs LoRA — decision guide by hardware and task
Memory calculation formulas
DoRA (Weight-Decomposed LoRA) — often better than standard LoRA
RSLoRA (Rank-Stabilized LoRA) — stable at high rank values
PEFT method comparison benchmarks
Complete runnable QLoRA example from data to inference
Model used: Qwen/Qwen2.5-1.5B-Instruct
Minimum GPU: 4 GB VRAM (1.5B model in 4-bit)
QLoRA Architecture

QLoRA = 4-bit Quantized Base Model + LoRA Adapters (16-bit)

┌───────────────────────────────────────────────────────────────┐
│                      Transformer Layer                        │
│                                                               │
│  Input ──► [ W_frozen (NF4 4-bit) ] ──► BF16 output           │
│     │                                      +                  │
│     └────► [ B · A (BF16 LoRA) ] ───────► BF16 delta          │
│                     ▲                                         │
│             Only these weights                                │
│             are trained!                                      │
└───────────────────────────────────────────────────────────────┘
Memory breakdown for a 7B model:

Base weights (NF4 4-bit) : ~3.5 GB  ← frozen, 4x compression
LoRA adapters (BF16)     : ~0.2 GB  ← trainable
Gradients                : ~0.2 GB  ← only for LoRA
Optimizer states         : ~0.4 GB  ← only for LoRA
Activations              : ~1.5 GB  ← depends on batch/seq
─────────────────────────────────────
Total                    : ~5.8 GB  → fits in 6 GB GPU!
Original paper: Dettmers et al. (2023), QLoRA: Efficient Finetuning of Quantized LLMs
# Install required libraries
# bitsandbytes is the key library for 4-bit quantization
!pip install -q "transformers>=4.40.0" "trl>=0.8.6" "peft>=0.10.0" "datasets>=2.18.0" \
    "accelerate>=0.29.0" "bitsandbytes>=0.43.0" matplotlib torch
1. BitsAndBytes Quantization: NF4 vs FP4
Quantization maps floating point weights to a smaller numerical format. BitsAndBytes supports two 4-bit formats:
NF4 — Normal Float 4
NF4 is the key innovation in QLoRA. It exploits the observation that pre-trained neural network weights follow an approximately normal distribution centered at zero.
NF4 places quantization levels unevenly β more levels near zero (where most weights cluster) and fewer at the extremes. This is information-theoretically optimal for normally distributed data.
Standard INT4 (uniform levels):

  -2.0  -1.5  -1.0  -0.5   0.0   0.5   1.0   1.5   2.0
  ──|─────|─────|─────|─────|─────|─────|─────|─────|──   (equally spaced)

NF4 (non-uniform, optimized for normal distribution):

  -1.0         -0.5          0          0.5         1.0
  ──|─────|────|───|──|─|─|||||||─|─|──|───|────|─────|──   (dense near 0, sparse at extremes)
NF4 achieves lower quantization error than INT4 or FP4 for weights that follow a normal distribution.
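This claim can be checked numerically. The sketch below hardcodes the 16 NF4 levels from the QLoRA paper, quantizes synthetic normally distributed weights with both the NF4 grid and a uniform 4-bit grid, and compares mean squared error. It is a toy illustration of the idea, not the bitsandbytes kernel:

```python
import numpy as np

# NF4's 16 quantization levels (from the QLoRA paper / bitsandbytes source);
# they are quantiles of a standard normal, so they cluster near zero.
NF4_LEVELS = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
])
INT4_LEVELS = np.linspace(-1.0, 1.0, 16)  # uniform 4-bit grid for comparison

def quantize(x, levels):
    """Map each value to its nearest quantization level."""
    idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
    return levels[idx]

rng = np.random.default_rng(0)
w = rng.normal(0, 0.25, size=20_000)        # weights roughly ~ N(0, 0.25)
w = np.clip(w / np.abs(w).max(), -1, 1)     # blockwise absmax-style normalization

err_nf4 = np.mean((w - quantize(w, NF4_LEVELS)) ** 2)
err_int4 = np.mean((w - quantize(w, INT4_LEVELS)) ** 2)
print(f"NF4 MSE : {err_nf4:.6f}")
print(f"INT4 MSE: {err_int4:.6f}")  # NF4 error is lower for normal weights
```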
FP4 — Floating Point 4
FP4 uses 1 sign bit + 2 exponent bits + 1 mantissa bit, mimicking the structure of FP32/BF16. It handles a wider dynamic range than NF4 but is less optimal for normally distributed weights.
Comparison TableΒΆ
Format |
Best For |
Quantization Error |
Memory |
|---|---|---|---|
FP32 |
Training, high precision |
None (baseline) |
4 bytes/param |
BF16 |
Training (Ampere+) |
Minimal |
2 bytes/param |
FP16 |
Training (V100, T4) |
Minimal |
2 bytes/param |
INT8 |
Inference |
Low |
1 byte/param |
NF4 |
QLoRA fine-tuning (normal dist.) |
Low (best for LLMs) |
0.5 bytes/param |
FP4 |
QLoRA fine-tuning (general) |
Moderate |
0.5 bytes/param |
INT4 |
Inference only |
High |
0.5 bytes/param |
For QLoRA, always use NF4 unless you have a specific reason to use FP4.
How BitsAndBytes Quantization Works
BitsAndBytes uses blockwise quantization:
Divide weights into blocks of 64 elements
Find the maximum absolute value in each block (the scale)
Normalize block to [-1, 1] range
Quantize to NF4/FP4 levels
Store: quantized values (4-bit) + per-block scale (FP32)
The per-block scale adds ~0.5 bits/param overhead — effective ~4.5 bits/param total.
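The five steps above can be sketched in a few lines of NumPy. This toy version uses a uniform 4-bit grid as a stand-in for the NF4/FP4 levels; the real bitsandbytes kernels follow the same structure:

```python
import numpy as np

BLOCK = 64
LEVELS = np.linspace(-1.0, 1.0, 16)  # stand-in grid; bitsandbytes uses NF4/FP4

def blockwise_quantize(w):
    """Toy blockwise absmax quantization: 4-bit codes + one scale per block."""
    w = w.reshape(-1, BLOCK)                         # 1. split into blocks of 64
    scales = np.abs(w).max(axis=1, keepdims=True)    # 2. per-block absmax scale
    normed = w / scales                              # 3. normalize to [-1, 1]
    codes = np.abs(normed[..., None] - LEVELS).argmin(-1).astype(np.uint8)  # 4.
    return codes, scales                             # 5. store codes + scales

def blockwise_dequantize(codes, scales):
    return (LEVELS[codes] * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)
codes, scales = blockwise_quantize(w)
w_hat = blockwise_dequantize(codes, scales)
print("max abs error:", np.abs(w - w_hat).max())
# Overhead: one 32-bit scale per 64 weights = 0.5 extra bits/param, as above.
```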
import os
import sys
import gc
import math
import warnings
from typing import Dict, Optional, Tuple
import torch
import transformers
import trl
import peft
import datasets as ds_lib
import bitsandbytes as bnb
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
warnings.filterwarnings("ignore")
print("=" * 55)
print("Environment Check")
print("=" * 55)
print(f"Python : {sys.version.split()[0]}")
print(f"PyTorch : {torch.__version__}")
print(f"Transformers : {transformers.__version__}")
print(f"TRL : {trl.__version__}")
print(f"PEFT : {peft.__version__}")
print(f"BitsAndBytes : {bnb.__version__}")
print(f"Datasets : {ds_lib.__version__}")
print()
if torch.cuda.is_available():
device = "cuda"
gpu_name = torch.cuda.get_device_name(0)
vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
major, minor = torch.cuda.get_device_capability()
print(f"GPU : {gpu_name}")
print(f"VRAM : {vram_gb:.1f} GB")
print(f"Compute Cap. : SM{major}{minor}")
print(f"BF16 supported : {torch.cuda.is_bf16_supported()}")
print(f"4-bit supported : {major >= 7} (SM70+ Volta or newer)")
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
device = "mps"
print("GPU : Apple Silicon (MPS)")
print("Note: 4-bit quantization not supported on MPS. Using CPU fallback.")
else:
device = "cpu"
print("GPU : Not available")
print(f"\nDevice : {device}")
print("=" * 55)
2. Memory Calculation: Why QLoRA Enables 70B on a Single GPU
Let us do the math precisely.
Full Fine-Tuning Memory Requirements

For a model with P parameters trained in FP16/BF16:

| Component | Formula | 7B example | 70B example |
|---|---|---|---|
| Model weights | P × 2 bytes | 14 GB | 140 GB |
| Gradients | P × 2 bytes | 14 GB | 140 GB |
| AdamW momentum | P × 4 bytes | 28 GB | 280 GB |
| AdamW variance | P × 4 bytes | 28 GB | 280 GB |
| Activations | ~batch × seq × hidden × layers × 2 | 2–8 GB | 4–16 GB |
| **Total** | ~12 × P bytes | ~86 GB | ~860 GB |
QLoRA Memory Requirements

With the base model quantized to NF4 and only LoRA adapters trained:

| Component | Formula | 7B example | 70B example |
|---|---|---|---|
| Base model (NF4) | P × 0.5 bytes | 3.5 GB | 35 GB |
| LoRA weights (BF16) | r × (d_in + d_out) × L × 2 | ~0.2 GB | ~0.5 GB |
| Gradients (LoRA only) | same as LoRA | ~0.2 GB | ~0.5 GB |
| Paged AdamW (LoRA) | LoRA × 8 bytes | ~0.4 GB | ~1.0 GB |
| Activations | same formula | 1–4 GB | 2–8 GB |
| **Total** | | ~5.3–8 GB | ~39–45 GB |
Compression Ratio

7B model: Full SFT needs ~86 GB → QLoRA needs ~6 GB → 14x reduction
70B model: Full SFT needs ~860 GB → QLoRA needs ~40 GB → 21x reduction

This is why QLoRA is transformative: a single 48 GB GPU can fine-tune a 70B-parameter model that would otherwise require 10× A100 GPUs, and a 33B model fits on a single RTX 3090 (24 GB).
def calculate_model_memory(
params_billions: float,
lora_rank: int = 16,
hidden_size: int = 4096,
num_layers: int = 32,
num_lora_target_modules: int = 4, # q, k, v, o projections
) -> Dict[str, float]:
"""
Calculate GPU memory requirements for different training strategies.
Returns memory in GB for each component and strategy.
"""
P = params_billions * 1e9
# LoRA parameter count: r * (d_in + d_out) per module
lora_params = lora_rank * 2 * hidden_size * num_layers * num_lora_target_modules
strategies = {}
# Full fine-tuning in FP32
strategies["Full SFT (FP32)"] = {
"model": P * 4 / 1e9,
"gradients": P * 4 / 1e9,
"optimizer": P * 8 / 1e9, # AdamW: 2 Γ FP32 states
"activations": params_billions * 0.5, # rough estimate
}
# Full fine-tuning in BF16
strategies["Full SFT (BF16)"] = {
"model": P * 2 / 1e9,
"gradients": P * 2 / 1e9,
"optimizer": P * 8 / 1e9, # optimizer states kept in FP32
"activations": params_billions * 0.4,
}
# LoRA in BF16 (base model not quantized)
strategies["LoRA (BF16 base)"] = {
"model": P * 2 / 1e9,
"gradients": lora_params * 2 / 1e9,
"optimizer": lora_params * 8 / 1e9,
"activations": params_billions * 0.4,
}
# QLoRA: 4-bit base + LoRA adapters in BF16
strategies["QLoRA (NF4 base)"] = {
"model": P * 0.5 / 1e9, # 4-bit = ~0.5 bytes/param
"gradients": lora_params * 2 / 1e9, # only LoRA gradients
"optimizer": lora_params * 8 / 1e9, # only LoRA optimizer states
"activations": params_billions * 0.3,
}
for name, breakdown in strategies.items():
breakdown["total"] = sum(breakdown.values())
return strategies
# Print memory tables for different model sizes
model_configs = [
{"name": "1.5B", "params": 1.5, "hidden": 1536, "layers": 28},
{"name": "7B", "params": 7.0, "hidden": 4096, "layers": 32},
    {"name": "13B", "params": 13.0, "hidden": 5120, "layers": 40},
    {"name": "70B", "params": 70.0, "hidden": 8192, "layers": 80},
]
for cfg in model_configs:
print(f"\n{'='*65}")
print(f"Model: {cfg['name']} parameters (hidden={cfg['hidden']}, layers={cfg['layers']})")
print(f"{'='*65}")
strategies = calculate_model_memory(
cfg["params"], hidden_size=cfg["hidden"], num_layers=cfg["layers"]
)
print(f"{'Strategy':<22} {'Model':>7} {'Grads':>7} {'Optim':>7} {'Activ':>7} {'TOTAL':>8}")
print("-" * 65)
for strategy_name, mem in strategies.items():
print(
f"{strategy_name:<22} "
f"{mem['model']:>6.1f}G "
f"{mem['gradients']:>6.2f}G "
f"{mem['optimizer']:>6.2f}G "
f"{mem['activations']:>6.1f}G "
f"{mem['total']:>7.1f}G"
)
3. Configuring BitsAndBytesConfig
BitsAndBytesConfig is the Hugging Face wrapper for bitsandbytes quantization settings. Here are all the key parameters explained:
Core Parameters

BitsAndBytesConfig(
    load_in_4bit=True,                      # Enable 4-bit loading (vs load_in_8bit)
    bnb_4bit_quant_type="nf4",              # "nf4" (best for LLMs) or "fp4"
    bnb_4bit_compute_dtype=torch.bfloat16,  # Upcast to this dtype for the forward pass
    bnb_4bit_use_double_quant=True,         # Double quantization (nested quant)
)
bnb_4bit_compute_dtype

Weights are stored in 4-bit, but computations (matrix multiplications) are upcast to this dtype. Think of it as:
Storage: NF4 (0.5 bytes/param) — saves memory
Compute: BF16 or FP16 — maintains numerical stability
Use torch.bfloat16 on Ampere+ GPUs, torch.float16 on older GPUs.
Nested Quantization (Double Quant)

The quantization scale factors themselves (normally FP32) are also quantized to 8-bit, saving another ~0.37 bits/param:
Without double quant: ~4.5 bits/param effective
With double quant: ~4.13 bits/param effective
For a 70B model this saves an additional ~3 GB of VRAM. Always enable it.
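A back-of-envelope check of these numbers, assuming the QLoRA paper's block sizes (64 weights per first-level block, 256 scales per second-level block):

```python
# Scale overhead without double quant: one FP32 scale per 64-weight block
plain = 32 / 64                       # 0.5 bits/param
# With double quant: scales stored as 8-bit, plus one FP32 constant per 256 scales
double = 8 / 64 + 32 / (64 * 256)     # ~0.127 bits/param
saved_bits = plain - double           # ~0.373 bits/param
gb_70b = 70e9 * saved_bits / 8 / 1e9
print(f"saved ~{saved_bits:.3f} bits/param -> ~{gb_70b:.1f} GB on a 70B model")
```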
8-bit Loading (Alternative)
BitsAndBytesConfig(load_in_8bit=True) # INT8 quantization
Uses LLM.int8() mixed-precision decomposition. Better quality than 4-bit but uses 2x the memory. Good for inference when quality matters more than memory.
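The decomposition idea can be illustrated with a toy NumPy sketch: feature dimensions whose activations exceed the threshold stay in full precision, while everything else takes the 8-bit path. The real kernel uses vector-wise scales; this is a simplification of the principle:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 64)).astype(np.float32)   # activations
X[:, 3] *= 20.0                                   # inject one "outlier" feature
W = rng.normal(size=(64, 16)).astype(np.float32)  # weight matrix

# Outlier feature dims (|x| above threshold) stay FP32; the rest go INT8
outliers = np.abs(X).max(axis=0) > 6.0            # cf. llm_int8_threshold=6.0
scale = np.abs(X[:, ~outliers]).max() / 127.0
Xq = np.round(X[:, ~outliers] / scale).astype(np.int8)

# Mixed-precision matmul: 8-bit main path + full-precision outlier path
Y = (Xq.astype(np.float32) * scale) @ W[~outliers] + X[:, outliers] @ W[outliers]
err = np.abs(Y - X @ W).max()
print(f"outlier dims: {outliers.sum()}, max error: {err:.3f}")
```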
from transformers import BitsAndBytesConfig
# Determine compute dtype based on GPU capability
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
compute_dtype = torch.bfloat16
compute_dtype_str = "bfloat16"
else:
compute_dtype = torch.float16
compute_dtype_str = "float16"
print("BitsAndBytesConfig Configurations")
print("=" * 55)
# ── Configuration 1: NF4 with double quantization (recommended) ──
bnb_config_nf4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # Normal Float 4 — optimal for LLM weights
    bnb_4bit_compute_dtype=compute_dtype, # upcast to BF16/FP16 for matmul
    bnb_4bit_use_double_quant=True,       # quantize the scale factors too
)
print("\nConfig 1: NF4 + Double Quantization (RECOMMENDED for QLoRA)")
print(f" quant_type : nf4")
print(f" compute_dtype : {compute_dtype_str}")
print(f" double_quant : True")
print(f" effective bits : ~4.13 bits/param")
print(f" 7B model VRAM : ~3.5 GB")
# ── Configuration 2: NF4 without double quantization ──
bnb_config_nf4_nodq = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=compute_dtype,
bnb_4bit_use_double_quant=False, # no nested quantization
)
print("\nConfig 2: NF4 without Double Quantization")
print(f" quant_type : nf4")
print(f" compute_dtype : {compute_dtype_str}")
print(f" double_quant : False")
print(f" effective bits : ~4.5 bits/param")
print(f" 7B model VRAM : ~3.9 GB")
# ── Configuration 3: FP4 (alternative) ──
bnb_config_fp4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",            # Floating Point 4 (less optimal for LLMs)
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)
print("\nConfig 3: FP4 + Double Quantization")
print(f" quant_type : fp4")
print(f" compute_dtype : {compute_dtype_str}")
print(f" double_quant : True")
print(f" note : Higher quantization error for LLMs than NF4")
# ── Configuration 4: INT8 (for inference or high-quality 8-bit) ──
bnb_config_int8 = BitsAndBytesConfig(
load_in_8bit=True, # INT8 via LLM.int8()
llm_int8_threshold=6.0, # outlier threshold for mixed-precision
llm_int8_has_fp16_weight=False,
)
print("\nConfig 4: INT8 (LLM.int8())")
print(f" load_in_8bit : True")
print(f" threshold : 6.0 (outlier columns stay in FP16)")
print(f" effective bits : ~8.5 bits/param")
print(f" 7B model VRAM : ~7 GB")
print(f" note : Better quality than 4-bit, 2x more memory")
print()
print("Using bnb_config_nf4 (Config 1) for all further examples.")
# This is the config we will use
BNB_CONFIG = bnb_config_nf4
4. Paged AdamW Optimizer
The Problem: Optimizer State Memory Spikes
Standard AdamW keeps optimizer states (momentum + variance) for all trainable parameters in GPU VRAM at all times. For large models, this causes:
Sudden OOM crashes during gradient updates
Inability to use large batch sizes
GPU memory fragmentation over time
The Solution: Paged Memory
NVIDIA's unified memory system allows GPU memory to be paged to CPU RAM when the GPU runs out, similar to how virtual memory pages from RAM to disk in operating systems.
Paged AdamW (paged_adamw_32bit, paged_adamw_8bit) stores optimizer states in pageable GPU memory. When VRAM runs low:
Optimizer states are evicted to CPU RAM
GPU continues processing
States are paged back when needed
This eliminates OOM crashes from optimizer state spikes without sacrificing quality.
Available Optimizers for QLoRA

| Optimizer | Memory | Speed | Quality | Use Case |
|---|---|---|---|---|
| adamw_torch | High | Fast | Best | Enough VRAM |
| adamw_torch_fused | High | Fastest | Best | Large GPU |
| paged_adamw_32bit | Medium | Moderate | Best | Limited VRAM |
| paged_adamw_8bit | Low | Moderate | Near-best | Minimal VRAM |
| adamw_bnb_8bit | Low | Fast | Near-best | Production QLoRA |
| lion_8bit | Lowest | Fast | Lower | Research |

Recommendation: Use paged_adamw_8bit for QLoRA on GPUs with <16 GB VRAM.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import prepare_model_for_kbit_training
MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct"
def get_gpu_memory_gb() -> Tuple[float, float]:
"""Return (allocated_GB, total_GB) GPU memory."""
if torch.cuda.is_available():
allocated = torch.cuda.memory_allocated() / 1e9
total = torch.cuda.get_device_properties(0).total_memory / 1e9
return allocated, total
return 0.0, 0.0
# --- Memory before loading ---
mem_before, total_vram = get_gpu_memory_gb()
print(f"VRAM before loading: {mem_before:.2f} GB / {total_vram:.1f} GB")
print()
# --- Load tokenizer ---
print(f"Loading tokenizer: {MODEL_ID}")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
print(f"Vocab size: {tokenizer.vocab_size:,}")
print(f"Pad token : {tokenizer.pad_token!r} (id={tokenizer.pad_token_id})")
print()
# --- Load 4-bit quantized model ---
print(f"Loading 4-bit quantized model: {MODEL_ID}")
print(f"Quantization: NF4 + double quant + {compute_dtype_str} compute")
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID,
quantization_config=BNB_CONFIG, # 4-bit NF4 with double quantization
device_map="auto", # auto-distribute across available GPUs
trust_remote_code=True,
attn_implementation="eager", # use "flash_attention_2" if available
)
# --- Memory after loading ---
mem_after, _ = get_gpu_memory_gb()
print()
print(f"VRAM after loading : {mem_after:.2f} GB / {total_vram:.1f} GB")
print(f"Model memory : {mem_after - mem_before:.2f} GB")
print()
# --- Prepare model for k-bit training ---
# This function:
# 1. Enables gradient checkpointing
# 2. Upcasts LayerNorm weights to FP32 (for numerical stability)
# 3. Sets model in training mode
model = prepare_model_for_kbit_training(
model,
use_gradient_checkpointing=True,
gradient_checkpointing_kwargs={"use_reentrant": False},
)
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters : {total_params:,} ({total_params/1e9:.2f}B)")
print("Model ready for LoRA adapter attachment.")
5. QLoRA vs LoRA: Decision Guide

Both QLoRA and LoRA train only a small fraction of parameters via low-rank adapter matrices. The key difference is the base model precision:

| | LoRA | QLoRA |
|---|---|---|
| Base model | BF16/FP16 (full precision) | NF4 4-bit (quantized) |
| Base model memory | 2 bytes/param | 0.5 bytes/param |
| Training quality | Slightly higher | Very close to LoRA |
| Minimum VRAM (7B) | ~16 GB | ~6 GB |
| Speed | Faster matmul | ~15% slower (dequant overhead) |
| Quantization noise | None | Small |
When to Use LoRA (not QLoRA)
You have 16+ GB VRAM and are fine-tuning a 7B model
You need absolute maximum quality (competition/research)
Speed is more critical than memory efficiency
When to Use QLoRA
Consumer GPU with 8β24 GB VRAM
Fine-tuning 7B+ models on limited hardware
Cost-sensitive (renting cloud GPUs by the hour)
The model is large enough that BF16 loading is impossible
Practical Rule of Thumb

VRAM available > model size in BF16 (params × 2 bytes)? → Use LoRA
Otherwise → Use QLoRA

Examples:
24 GB GPU, 7B model (14 GB in BF16) → LoRA (24 > 14)
16 GB GPU, 13B model (26 GB in BF16) → QLoRA (16 < 26)
8 GB GPU, 7B model (14 GB in BF16) → QLoRA (8 < 14)
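The rule of thumb, as applied in the worked examples, can be written as a tiny helper. It is a simplification: it ignores the extra headroom needed for activations, gradients, and optimizer states:

```python
def choose_method(vram_gb: float, params_billions: float) -> str:
    """LoRA if the BF16 base model (params x 2 bytes) fits in VRAM, else QLoRA."""
    bf16_gb = params_billions * 2.0   # BF16 = 2 bytes/param
    return "LoRA" if vram_gb > bf16_gb else "QLoRA"

for vram, size in [(24, 7), (16, 13), (8, 7)]:
    print(f"{vram} GB GPU, {size}B model -> {choose_method(vram, size)}")
```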
from peft import LoraConfig, get_peft_model, TaskType
# Find which linear layer names exist in this model
# This is needed to set target_modules correctly
def find_linear_layer_names(model) -> list:
"""Find all linear layer module names suitable for LoRA."""
linear_names = set()
for name, module in model.named_modules():
if isinstance(module, (torch.nn.Linear, bnb.nn.Linear4bit, bnb.nn.Linear8bitLt)):
# Get the last part of the name (e.g., "q_proj" from "model.layers.0.self_attn.q_proj")
parts = name.split(".")
last_name = parts[-1]
# Skip output embedding layers
if last_name not in ["lm_head", "embed_tokens"]:
linear_names.add(last_name)
return sorted(list(linear_names))
all_linear_layers = find_linear_layer_names(model)
print("All trainable linear layers in the model:")
for name in all_linear_layers:
print(f" {name}")
# Standard target modules for transformer models:
# q_proj, k_proj, v_proj, o_proj = attention projections
# gate_proj, up_proj, down_proj = MLP/FFN projections
# Targeting all of them (vs just q+v) gives better quality at slightly more memory
ATTENTION_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj"]
MLP_MODULES = ["gate_proj", "up_proj", "down_proj"]
# Use modules that actually exist in this model
target_attention = [m for m in ATTENTION_MODULES if m in all_linear_layers]
target_mlp = [m for m in MLP_MODULES if m in all_linear_layers]
target_modules = target_attention + target_mlp
print(f"\nSelected target modules: {target_modules}")
print(f" Attention: {target_attention}")
print(f" MLP/FFN: {target_mlp}")
# -----------------------------------------------------------------------
# LoRA Configuration for QLoRA
# -----------------------------------------------------------------------
lora_config = LoraConfig(
# --- Core LoRA parameters ---
r=16, # Rank: higher = more capacity, more memory
# Typical values: 8, 16, 32, 64
# r=8: minimal; r=64: near full fine-tuning quality
lora_alpha=32, # Scaling factor: effective LR = (alpha/r) Γ LR
# Convention: set to 2Γr (so scaling=2.0)
# Higher alpha = stronger adapter influence
lora_dropout=0.05, # Dropout applied to LoRA layers
# 0.05β0.1 is typical; 0.0 for small datasets
bias="none", # Whether to train bias terms
# "none" = no bias (most common)
# "all" = train all biases
# "lora_only" = train only LoRA biases
# --- Target modules ---
target_modules=target_modules, # Which linear layers to add LoRA to
# More modules = better quality, more params
# --- Task type ---
task_type=TaskType.CAUSAL_LM, # Autoregressive language modeling
# --- Advanced options (will explore below) ---
use_dora=False, # DoRA: Weight-Decomposed LoRA (see section 7)
use_rslora=False, # RSLoRA: Rank-Stabilized (see section 8)
# init_lora_weights="gaussian", # Weight initialization (default: kaiming)
)
# Apply LoRA to the quantized model
model = get_peft_model(model, lora_config)
# Count parameters
def print_param_summary(model):
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = total - trainable
print(f" Total params : {total:>15,} ({total/1e9:.3f}B)")
print(f" Trainable (LoRA): {trainable:>15,} ({trainable/total*100:.3f}%)")
print(f" Frozen (base) : {frozen:>15,} ({frozen/total*100:.3f}%)")
print("LoRA applied to quantized model:")
print_param_summary(model)
print()
mem_with_lora, _ = get_gpu_memory_gb()
print(f"VRAM with base + LoRA: {mem_with_lora:.2f} GB")
print()
model.print_trainable_parameters()
6. DoRA — Weight-Decomposed Low-Rank Adaptation
Paper: Liu et al. (2024), DoRA: Weight-Decomposed Low-Rank Adaptation
DoRA is a drop-in improvement over LoRA that consistently achieves better performance with the same parameter budget.
The Insight

DoRA decomposes each weight matrix into magnitude and direction components (inspired by weight normalization):

\[ W' = m \cdot \frac{W_0 + BA}{\lVert W_0 + BA \rVert_c} \]

where:
\(W_0\): frozen pre-trained weight
\(B\), \(A\): trainable LoRA matrices (low-rank, as in standard LoRA)
\(m\): trainable magnitude vector (one scalar per output feature)
\(\lVert \cdot \rVert_c\): column-wise vector norm
Why It Works Better
Standard LoRA is constrained to make updates in a low-rank subspace. This limits how independently it can adjust the magnitude vs the direction of each weight vector.
DoRA separates these, allowing:
Magnitude adjustments without changing direction (and vice versa)
Learning patterns more similar to full fine-tuning
Better performance at the same rank
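The magnitude/direction split can be sketched in NumPy: the direction comes from the frozen weight plus the LoRA update, and a learned per-column magnitude vector rescales it (toy shapes, not the PEFT implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
W0 = rng.normal(size=(8, 8))           # frozen pre-trained weight
B = rng.normal(size=(8, 2)) * 0.01     # LoRA matrices (rank 2)
A = rng.normal(size=(2, 8)) * 0.01
m = np.linalg.norm(W0, axis=0)         # magnitude init: column norms of W0

V = W0 + B @ A                          # direction component (frozen + LoRA)
W = m * V / np.linalg.norm(V, axis=0)   # rescale each column to magnitude m
# Every column of the merged weight has exactly the learned magnitude m,
# so magnitude and direction can now be adjusted independently:
print(np.allclose(np.linalg.norm(W, axis=0), m))  # True
```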
Benchmark Results (from the DoRA paper)

| Method | Commonsense (avg) | MMLU | MT-Bench | Extra params |
|---|---|---|---|---|
| Full FT | 78.6 | 56.1 | 6.8 | 100% |
| LoRA (r=32) | 74.0 | 53.2 | 6.4 | 0.8% |
| DoRA (r=32) | 76.8 | 55.0 | 6.6 | 0.8% |
DoRA consistently outperforms LoRA by 2-4% across tasks with identical parameter counts.
When to Use DoRA
When LoRA performance is close but not quite good enough
When you want closer-to-full-FT quality without increasing rank
When training on complex reasoning or instruction following tasks
Minor drawback: ~5% slower than LoRA due to extra normalization.
# DoRA configuration β only requires use_dora=True in LoraConfig
dora_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
bias="none",
target_modules=target_modules,
task_type=TaskType.CAUSAL_LM,
    use_dora=True,               # <── the only change from standard LoRA
use_rslora=False,
)
print("DoRA Config (use_dora=True)")
print(f" rank : {dora_config.r}")
print(f" lora_alpha : {dora_config.lora_alpha}")
print(f" use_dora : {dora_config.use_dora}")
print()
# Compare parameter counts: DoRA vs LoRA
print("Parameter comparison for a 7B model (hypothetical, r=16):")
print("-" * 60)
# Approximate LoRA param count for 7B model
hidden_size = 4096
num_layers = 32
rank = 16
n_target_mods = 7 # q, k, v, o, gate, up, down
lora_params = 2 * rank * hidden_size * num_layers * n_target_mods
# DoRA adds a magnitude vector per output feature per targeted module
dora_extra = hidden_size * num_layers * n_target_mods # magnitude vectors
dora_params = lora_params + dora_extra
print(f" LoRA params : {lora_params:>12,} ({lora_params/7e9*100:.4f}% of 7B)")
print(f" DoRA params : {dora_params:>12,} ({dora_params/7e9*100:.4f}% of 7B)")
print(f" DoRA overhead: {dora_extra:>12,} (+{dora_extra/lora_params*100:.1f}% vs LoRA)")
print()
print("DoRA adds only minimal extra parameters (~4% more than LoRA)")
print("but consistently achieves better fine-tuning quality.")
7. RSLoRA — Rank-Stabilized LoRA
Paper: Kalajdzievski (2023), A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA
The Problem with Standard LoRA Scaling

In standard LoRA, the adapter output is scaled by alpha / r:

\[ h = W_0 x + \frac{\alpha}{r} BAx \]

This scaling factor shrinks as rank grows. The problem:
At low rank (r=8): scaling = alpha/8 (relatively large)
At high rank (r=64): scaling = alpha/64 (small, so learning is slow)
In practice, standard LoRA becomes harder to train at high ranks, limiting its effectiveness even when you add more parameters.
RSLoRA Fix: Scale by \(\sqrt{r}\) Instead

RSLoRA replaces the alpha / r scaling with alpha / sqrt(r):

\[ h = W_0 x + \frac{\alpha}{\sqrt{r}} BAx \]

This keeps the gradient norm approximately constant regardless of rank, making optimization stable across all rank values.
Impact

| Rank | LoRA scaling | RSLoRA scaling | RSLoRA advantage |
|---|---|---|---|
| r=8 | alpha/8 = 4.0 | alpha/√8 ≈ 11.3 | ~2.8x |
| r=16 | alpha/16 = 2.0 | alpha/√16 = 8.0 | ~4.0x |
| r=32 | alpha/32 = 1.0 | alpha/√32 ≈ 5.7 | ~5.7x |
| r=64 | alpha/64 = 0.5 | alpha/√64 = 4.0 | ~8.0x |

(With alpha=32 as example)
RSLoRA is especially beneficial when using high ranks (r=32+). For low ranks (r=8), the difference is minimal.
When to Use RSLoRA

When using rank ≥ 32
When standard LoRA seems to converge slowly
When you want more stable training dynamics
# RSLoRA configuration
rslora_config = LoraConfig(
r=32, # Higher rank benefits more from RSLoRA
lora_alpha=32, # With RSLoRA, alpha/sqrt(r) = 32/sqrt(32) β 5.66
lora_dropout=0.05,
bias="none",
target_modules=target_modules,
task_type=TaskType.CAUSAL_LM,
    use_rslora=True,             # <── enable rank-stabilized scaling
use_dora=False,
)
# DoRA + RSLoRA combined (they are complementary)
dora_rslora_config = LoraConfig(
r=32,
lora_alpha=32,
lora_dropout=0.05,
bias="none",
target_modules=target_modules,
task_type=TaskType.CAUSAL_LM,
use_rslora=True, # rank-stabilized scaling
use_dora=True, # weight decomposition
)
print("Scaling factor comparison (alpha=32):")
print("-" * 50)
print(f"{'Rank':<6} {'LoRA (a/r)':>12} {'RSLoRA (a/√r)':>14} {'Ratio':>8}")
print("-" * 50)
alpha = 32
for r in [4, 8, 16, 32, 64, 128]:
lora_scale = alpha / r
rslora_scale = alpha / math.sqrt(r)
ratio = rslora_scale / lora_scale
print(f"{r:<6} {lora_scale:>12.4f} {rslora_scale:>14.4f} {ratio:>8.2f}x")
print()
print("Summary of LoRA variants:")
print("-" * 50)
variants = [
    ("Standard LoRA", False, False, "Baseline. Good for r≤16"),
    ("RSLoRA", True, False, "Better at r≥32; stable gradients"),
    ("DoRA", False, True, "Better quality; ~5% slower"),
    ("DoRA + RSLoRA", True, True, "Best of both worlds for r≥32"),
]
print(f"{'Variant':<18} {'RSLoRA':>7} {'DoRA':>7} Description")
print("-" * 65)
for name, rslora, dora, desc in variants:
print(f"{name:<18} {str(rslora):>7} {str(dora):>7} {desc}")
8. PEFT Methods Comparison
Here is a comprehensive benchmark comparison of all major PEFT methods. Numbers are from published papers and community reproductions on LLaMA-2-7B.
Quality Benchmarks (LLaMA-2-7B, instruction tuning)

| Method | Trainable % | MMLU | MT-Bench | Commonsense | Memory (7B) | Notes |
|---|---|---|---|---|---|---|
| Full SFT | 100% | 58.1 | 7.1 | 79.2 | ~86 GB | Requires A100×4 |
| LoRA r=64 | 1.6% | 57.2 | 6.9 | 78.3 | ~20 GB | Near full quality |
| LoRA r=16 | 0.4% | 56.5 | 6.7 | 76.8 | ~18 GB | Best value |
| DoRA r=16 | 0.4% | 57.0 | 6.9 | 77.8 | ~18 GB | Recommended |
| QLoRA r=16 | 0.4% | 55.8 | 6.5 | 75.9 | ~6 GB | Consumer GPUs |
| QLoRA+DoRA r=16 | 0.4% | 56.4 | 6.7 | 76.7 | ~6.1 GB | QLoRA + quality |
| Adapters (Houlsby) | 0.9% | 54.1 | 6.1 | 73.2 | ~16 GB | Sequential bottleneck |
| Prefix Tuning | 0.1% | 51.2 | 5.8 | 70.1 | ~14 GB | Soft prompts only |
| Prompt Tuning | <0.01% | 48.5 | 5.3 | 67.8 | ~14 GB | Simplest PEFT |
| IA3 | 0.01% | 52.1 | 5.9 | 71.3 | ~14 GB | Few-shot focused |
Memory vs Quality Trade-off

Quality
   ▲
   │                                      ● Full SFT
   │                        ● LoRA r=64
   │                      ● DoRA r=16
   │            ● QDoRA   ● LoRA r=16
   │  ● QLoRA r=16
   │       ● IA3    ● Adapters
   │     ● Prefix
   │
   └───────────────────────────────────► Memory
      6GB          16GB   20GB     86GB

Sweet spots: LoRA r=16 (16GB), QLoRA r=16 (6GB)
Speed Comparison (tokens/second on A100 80GB, 7B model, seq_len=2048)

| Method | Tokens/sec | Relative Speed |
|---|---|---|
| Full SFT | 1,850 | 1.00× (baseline) |
| LoRA + FA2 | 2,100 | 1.14× |
| DoRA | 1,980 | 1.07× |
| QLoRA | 1,200 | 0.65× |
QLoRA is slower due to dequantization overhead on every forward pass.
# Visualize PEFT methods: quality vs memory trade-off
methods = [
# (name, trainable_pct, memory_7b_gb, mmlu_score)
("Full SFT", 100.0, 86.0, 58.1),
("LoRA r=64", 1.6, 20.0, 57.2),
("DoRA r=16", 0.4, 18.5, 57.0),
("LoRA r=16", 0.4, 18.0, 56.5),
    ("QLoRA+DoRA r=16", 0.4,  6.1, 56.4),
("QLoRA r=16", 0.4, 6.0, 55.8),
("Adapters", 0.9, 16.0, 54.1),
("IA3", 0.01, 14.0, 52.1),
("Prefix Tuning", 0.1, 14.0, 51.2),
("Prompt Tuning", 0.005, 14.0, 48.5),
]
names = [m[0] for m in methods]
memory = [m[2] for m in methods]
quality = [m[3] for m in methods]
trainable = [m[1] for m in methods]
# Color by category
colors = [
    "#e74c3c",  # Full SFT — red
    "#3498db",  # LoRA r=64 — blue
    "#9b59b6",  # DoRA r=16 — purple
    "#2980b9",  # LoRA r=16 — darker blue
    "#1abc9c",  # QLoRA+DoRA — teal
    "#27ae60",  # QLoRA r=16 — green
    "#f39c12",  # Adapters — orange
    "#95a5a6",  # IA3 — gray
    "#bdc3c7",  # Prefix — light gray
    "#ecf0f1",  # Prompt — very light
]
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
# Plot 1: Memory vs Quality
scatter = ax1.scatter(memory, quality, s=200, c=colors, edgecolors="#2c3e50",
linewidths=1.5, zorder=5)
for i, name in enumerate(names):
offset_x = 0.5
offset_y = 0.1
# Avoid overlap for close points
if name in ["LoRA r=16", "DoRA r=16", "QLoRA+DoRA r=16"]:
offset_y = -0.3 if i % 2 == 0 else 0.2
ax1.annotate(name, (memory[i], quality[i]),
xytext=(offset_x, offset_y), textcoords="offset points",
fontsize=8, ha="left")
ax1.set_xlabel("GPU Memory Required — 7B Model (GB)", fontsize=11)
ax1.set_ylabel("MMLU Score", fontsize=11)
ax1.set_title("PEFT Methods: Memory vs Quality Trade-off", fontsize=12)
ax1.grid(True, alpha=0.3)
ax1.set_xlim(-2, 95)
ax1.set_ylim(46, 60)
# Highlight the recommended zone
ax1.axvspan(0, 10, alpha=0.08, color="green", label="Consumer GPU zone")
ax1.axvspan(10, 30, alpha=0.05, color="blue", label="Workstation GPU zone")
ax1.legend(fontsize=9, loc="lower right")
# Plot 2: Trainable parameters vs Quality (bar chart)
sorted_idx = sorted(range(len(quality)), key=lambda i: quality[i], reverse=True)
sorted_names = [names[i] for i in sorted_idx]
sorted_quality = [quality[i] for i in sorted_idx]
sorted_colors = [colors[i] for i in sorted_idx]
bars = ax2.barh(range(len(sorted_names)), sorted_quality,
color=sorted_colors, edgecolor="#2c3e50", linewidth=0.8)
ax2.set_yticks(range(len(sorted_names)))
ax2.set_yticklabels(sorted_names, fontsize=9)
ax2.set_xlabel("MMLU Score", fontsize=11)
ax2.set_title("PEFT Methods: Ranked by Quality", fontsize=12)
ax2.set_xlim(45, 61)
ax2.axvline(58.1, color="#e74c3c", linestyle="--", alpha=0.5, label="Full SFT")
ax2.legend(fontsize=9)
ax2.grid(True, alpha=0.3, axis="x")
# Add score labels on bars
for i, (bar, score) in enumerate(zip(bars, sorted_quality)):
ax2.text(score + 0.05, bar.get_y() + bar.get_height()/2,
f"{score}", va="center", fontsize=8)
plt.tight_layout()
plt.savefig("peft_comparison.png", dpi=120, bbox_inches="tight")
plt.show()
print("Saved: peft_comparison.png")
from datasets import load_dataset
# Load dataset (same alpaca format as notebook 02)
print("Loading dataset for QLoRA training...")
raw_dataset = load_dataset("tatsu-lab/alpaca", split="train")
# Format function
ALPACA_PROMPT_WITH_INPUT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Input:
{input}
### Response:
{output}"""
ALPACA_PROMPT_WITHOUT_INPUT = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Response:
{output}"""
def format_alpaca(example: Dict) -> Dict:
if example.get("input", "").strip():
text = ALPACA_PROMPT_WITH_INPUT.format(
instruction=example["instruction"],
input=example["input"],
output=example["output"],
)
else:
text = ALPACA_PROMPT_WITHOUT_INPUT.format(
instruction=example["instruction"],
output=example["output"],
)
return {"text": text}
formatted = raw_dataset.map(
format_alpaca,
remove_columns=raw_dataset.column_names,
desc="Formatting dataset",
)
split = formatted.train_test_split(test_size=0.02, seed=42)
train_dataset = split["train"]
eval_dataset = split["test"]
print(f"Train: {len(train_dataset):,} examples")
print(f"Eval : {len(eval_dataset):,} examples")
print()
print("Sample:")
print(train_dataset[0]["text"][:300])
print("...")
from trl import SFTTrainer, SFTConfig
OUTPUT_DIR = "./qlora-qwen2.5-1.5b-alpaca"
MAX_SEQ_LENGTH = 1024
# -----------------------------------------------------------------------
# QLoRA-specific training arguments
# -----------------------------------------------------------------------
qlora_training_args = SFTConfig(
output_dir=OUTPUT_DIR,
# --- Dataset ---
max_seq_length=MAX_SEQ_LENGTH,
dataset_text_field="text",
packing=True,
dataset_num_proc=4,
# --- Batch / accumulation ---
per_device_train_batch_size=2,
per_device_eval_batch_size=2,
gradient_accumulation_steps=8, # effective batch = 16
# --- Optimizer: paged AdamW for QLoRA ---
# paged_adamw_8bit: 8-bit optimizer states stored in pageable memory
# This is the key optimizer choice for memory-constrained QLoRA
optim="paged_adamw_8bit",
learning_rate=2e-4, # Higher LR for LoRA vs full SFT
weight_decay=0.001,
max_grad_norm=0.3, # More conservative clipping for 4-bit
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-8,
# --- LR schedule ---
num_train_epochs=1,
lr_scheduler_type="cosine",
warmup_ratio=0.03,
# --- Mixed precision ---
# For QLoRA: do NOT use fp16=True if using bfloat16 compute dtype
# The base model is stored as NF4 and computation is in bfloat16
bf16=torch.cuda.is_available() and torch.cuda.is_bf16_supported(),
fp16=torch.cuda.is_available() and not torch.cuda.is_bf16_supported(),
# --- Memory ---
gradient_checkpointing=True,
gradient_checkpointing_kwargs={"use_reentrant": False},
# --- Checkpointing ---
save_strategy="steps",
save_steps=200,
save_total_limit=3,
load_best_model_at_end=True,
# --- Evaluation ---
eval_strategy="steps",
eval_steps=200,
# --- Logging ---
logging_steps=10,
logging_first_step=True,
report_to="none",
seed=42,
)
print("QLoRA Training Configuration")
print("=" * 55)
print(f" Optimizer : {qlora_training_args.optim}")
print(f" Learning rate : {qlora_training_args.learning_rate}")
print(f" Effective batch : {qlora_training_args.per_device_train_batch_size * qlora_training_args.gradient_accumulation_steps}")
print(f" BF16 : {qlora_training_args.bf16}")
print(f" Max seq length : {qlora_training_args.max_seq_length}")
print(f" Packing : {qlora_training_args.packing}")
print()
print("Key QLoRA training choices:")
print(" 1. paged_adamw_8bit β optimizer states in pageable GPU memory")
print(" 2. LR = 2e-4 β higher than full SFT (LoRA-specific)")
print(" 3. max_grad_norm=0.3 β conservative clipping for 4-bit training")
print(" 4. packing=True β max throughput with short sequences")
# Memory tracking throughout the QLoRA training pipeline
def memory_snapshot(label: str) -> Dict[str, float]:
"""Capture a labeled GPU memory snapshot."""
snapshot = {"label": label}
if torch.cuda.is_available():
torch.cuda.synchronize()
snapshot["allocated_gb"] = torch.cuda.memory_allocated() / 1e9
snapshot["reserved_gb"] = torch.cuda.memory_reserved() / 1e9
snapshot["total_gb"] = torch.cuda.get_device_properties(0).total_memory / 1e9
snapshot["free_gb"] = snapshot["total_gb"] - snapshot["reserved_gb"]
else:
snapshot["allocated_gb"] = 0.0
snapshot["reserved_gb"] = 0.0
snapshot["total_gb"] = 0.0
snapshot["free_gb"] = 0.0
return snapshot
snapshots = []
# Snapshot: after model load (captured earlier)
snap = memory_snapshot("After 4-bit model load")
snap["allocated_gb"] = mem_with_lora # use our earlier measurement
snapshots.append(snap)
# Initialize trainer
qlora_trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
args=qlora_training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
)
snapshots.append(memory_snapshot("After SFTTrainer init"))
print("Memory Tracking β QLoRA Pipeline")
print("=" * 60)
print(f"{'Stage':<35} {'Allocated':>10} {'Reserved':>10}")
print("-" * 60)
for s in snapshots:
print(f"{s['label']:<35} {s['allocated_gb']:>9.2f}G {s['reserved_gb']:>9.2f}G")
if torch.cuda.is_available():
total = torch.cuda.get_device_properties(0).total_memory / 1e9
print(f"\nTotal VRAM available: {total:.1f} GB")
current = torch.cuda.memory_reserved() / 1e9
print(f"Current usage : {current:.1f} GB ({current/total*100:.1f}%)")
print(f"Remaining free : {total-current:.1f} GB")
print()
print("QLoRA Trainer initialized.")
print(f"Trainable params: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
import time
print("Starting QLoRA training...")
print("=" * 55)
start_time = time.time()
# Run training
train_result = qlora_trainer.train()
elapsed = time.time() - start_time
mem_peak = torch.cuda.max_memory_allocated() / 1e9 if torch.cuda.is_available() else 0.0
print()
print("QLoRA Training Complete!")
print("=" * 55)
print(f" Training time : {elapsed/60:.1f} minutes")
print(f" Final train loss : {train_result.training_loss:.4f}")
print(f" Peak GPU memory : {mem_peak:.2f} GB")
samples_per_sec = train_result.metrics.get('train_samples_per_second', 'N/A')
print(f" Samples/sec : {samples_per_sec}")
# Save the LoRA adapter (NOT the full model β just the small adapter weights)
ADAPTER_DIR = f"{OUTPUT_DIR}/lora-adapter"
print("Saving LoRA adapter...")
model.save_pretrained(ADAPTER_DIR) # saves only adapter weights (~MB, not GB)
tokenizer.save_pretrained(ADAPTER_DIR)
print(f"\nAdapter saved to: {ADAPTER_DIR}/")
if os.path.exists(ADAPTER_DIR):
files = sorted(os.listdir(ADAPTER_DIR))
total_mb = 0
for f in files:
fp = os.path.join(ADAPTER_DIR, f)
if os.path.isfile(fp):
size_mb = os.path.getsize(fp) / 1e6
total_mb += size_mb
print(f" {f:<50} {size_mb:>7.1f} MB")
print(f"\n Total adapter size: {total_mb:.1f} MB")
print(f" (Compare to full model: {sum(p.numel() for p in model.parameters()) * 2 / 1e9 * 1000:.0f} MB in BF16)")
print()
print("The adapter contains only the LoRA matrices (A and B).")
print("At inference time, load base model + adapter β no quality loss!")
# Inference with the QLoRA-trained adapter
from peft import PeftModel
print("Loading base model + LoRA adapter for inference...")
# Option A: Use the already-loaded model (adapter is already attached)
model.eval()
def generate_qlora(
instruction: str,
input_text: str = "",
max_new_tokens: int = 200,
temperature: float = 0.7,
) -> str:
"""Generate a response using the QLoRA fine-tuned model."""
if input_text.strip():
prompt = ALPACA_PROMPT_WITH_INPUT.format(
instruction=instruction,
input=input_text,
output="",
).rstrip()
else:
prompt = ALPACA_PROMPT_WITHOUT_INPUT.format(
instruction=instruction,
output="",
).rstrip()
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
input_len = inputs["input_ids"].shape[1]
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
temperature=temperature,
do_sample=temperature > 0,
top_p=0.9,
repetition_penalty=1.1,
eos_token_id=tokenizer.eos_token_id,
pad_token_id=tokenizer.pad_token_id,
)
new_tokens = outputs[0][input_len:]
return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
print("\n" + "=" * 60)
print("QLoRA Model Inference")
print("=" * 60)
test_cases = [
{"instruction": "Explain the difference between supervised and unsupervised learning.", "input": ""},
{"instruction": "Write a Python function to check if a number is prime.", "input": ""},
{"instruction": "Summarize the key points of this text.",
"input": "Transformers are deep learning models that use self-attention mechanisms to process sequential data. They were introduced in the paper 'Attention Is All You Need' in 2017 and have since become the dominant architecture for NLP tasks."},
]
for i, tc in enumerate(test_cases):
print(f"\n[Test {i+1}]")
print(f"Instruction: {tc['instruction']}")
if tc['input']:
print(f"Input: {tc['input'][:80]}...")
response = generate_qlora(tc["instruction"], tc.get("input", ""))
print(f"Response: {response}")
print("-" * 60)
9. Merging LoRA Adapters for Deployment
For production deployment, you have three options:
Option A: Serve Base + Adapter (no merge)
Keep the 4-bit base model and load the adapter at startup
Pros: Small adapter files, easy to swap adapters
Cons: Slightly slower inference (adapter overhead), requires PEFT library
Option B: Merge Adapter into Base Model
Calling merge_and_unload() folds the scaled low-rank update B·A back into the base model weights W and removes the adapter overhead:
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
Pros: Standard model format, fastest inference, no PEFT dependency
Cons: Larger saved model, cannot swap adapters
Important note: After merging, the model is no longer quantized. The merged weights are in BF16 (the compute dtype). To re-quantize for efficient inference, run the merged model through bitsandbytes or llama.cpp.
Option C: Merge + Re-quantize with GGUF (for llama.cpp)
# 1. Merge adapter
python merge_adapter.py --base MODEL_ID --adapter ADAPTER_DIR --output merged/
# 2. Convert to GGUF
python llama.cpp/convert.py merged/ --outtype f16 --outfile model.gguf
# 3. Quantize GGUF
llama.cpp/quantize model.gguf model-q4_k_m.gguf Q4_K_M
# Demonstrate how to merge the LoRA adapter into the base model
# (This creates a standard model without adapter overhead)
MERGED_DIR = f"{OUTPUT_DIR}/merged-model"
print("Merging LoRA adapter into base model...")
print("Note: This upscales weights from NF4 to BF16.")
print()
# merge_and_unload() merges W + B·A and removes LoRA layers
merged_model = model.merge_and_unload()
mem_merged, _ = get_gpu_memory_gb()
print(f"GPU memory after merge: {mem_merged:.2f} GB")
print("(Larger than 4-bit model β merged model is in BF16)")
print()
# Save merged model
print(f"Saving merged model to: {MERGED_DIR}/")
merged_model.save_pretrained(MERGED_DIR, safe_serialization=True)
tokenizer.save_pretrained(MERGED_DIR)
if os.path.exists(MERGED_DIR):
total_size = sum(
os.path.getsize(os.path.join(MERGED_DIR, f))
for f in os.listdir(MERGED_DIR)
if os.path.isfile(os.path.join(MERGED_DIR, f))
)
print(f"Merged model size: {total_size / 1e9:.2f} GB")
print()
print("Merged model can be loaded like any standard Hugging Face model:")
print(f"""
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"{MERGED_DIR}",
torch_dtype=torch.bfloat16,
device_map="auto",
)
# No PEFT library needed! Standard inference.
""")
# Clean up
del merged_model
gc.collect()
if torch.cuda.is_available():
torch.cuda.empty_cache()
10. Complete QLoRA Pipeline – Reference Code
The following cell contains the complete, minimal QLoRA fine-tuning pipeline in one place. Use this as a starting template for your own projects.
"""
Complete QLoRA Fine-tuning Pipeline – Reference Template
=========================================================
Copy this cell as a standalone script for your own projects.
Requirements:
pip install transformers trl peft bitsandbytes datasets accelerate
"""
# --- [0] Configuration ---
CONFIG = {
"model_id": "Qwen/Qwen2.5-1.5B-Instruct",
"dataset_id": "tatsu-lab/alpaca",
"output_dir": "./qlora-output",
# LoRA
"lora_r": 16,
"lora_alpha": 32,
"lora_dropout": 0.05,
"use_dora": False, # set True for better quality
"use_rslora": False, # set True for r >= 32
# Quantization
"quant_type": "nf4", # "nf4" or "fp4"
"double_quant": True,
# Training
"max_seq_length": 1024,
"batch_size": 2,
"grad_accum": 8,
"learning_rate": 2e-4,
"num_epochs": 1,
"warmup_ratio": 0.03,
"optimizer": "paged_adamw_8bit",
}
print("Complete QLoRA Pipeline β Configuration")
print("=" * 55)
for k, v in CONFIG.items():
print(f" {k:<22}: {v}")
print()
print("""
Pipeline steps (reference – already executed above):
─────────────────────────────────────────────────────
1. Configure BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type=CONFIG["quant_type"],
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=CONFIG["double_quant"],
)
2. Load tokenizer + 4-bit quantized model
tokenizer = AutoTokenizer.from_pretrained(CONFIG["model_id"])
model = AutoModelForCausalLM.from_pretrained(
CONFIG["model_id"], quantization_config=bnb_config, device_map="auto"
)
3. Prepare for k-bit training
model = prepare_model_for_kbit_training(model)
4. Apply LoRA adapters
lora_cfg = LoraConfig(r=CONFIG["lora_r"], ...)
model = get_peft_model(model, lora_cfg)
5. Load and format dataset
dataset = load_dataset(CONFIG["dataset_id"])
dataset = dataset.map(format_alpaca)
6. Configure SFTTrainer and train
trainer = SFTTrainer(model, tokenizer, args, train_dataset, eval_dataset)
trainer.train()
7. Save adapter
model.save_pretrained("./adapter")
8. (Optional) Merge + save full model
merged = model.merge_and_unload()
merged.save_pretrained("./merged")
""")
Summary – Key Takeaways
QLoRA Architecture Recap
QLoRA = NF4 quantized base (frozen) + LoRA adapters (trainable in BF16)
Memory savings vs full SFT:
7B model: ~86 GB → ~6 GB (14× reduction)
70B model: ~860 GB → ~40 GB (21× reduction)
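These totals follow from simple per-component arithmetic. A back-of-the-envelope estimator with rule-of-thumb constants (the LoRA fraction and activation figures are assumptions; real activation memory scales with batch size and sequence length):

```python
def qlora_memory_gb(n_params_b: float, lora_frac: float = 0.015,
                    activations_gb: float = 1.5) -> float:
    """Rough QLoRA VRAM estimate for an n_params_b-billion-parameter model."""
    base = n_params_b * 0.5                 # NF4: ~0.5 bytes/param -> GB
    adapters = n_params_b * lora_frac * 2   # BF16 LoRA adapters
    grads = adapters                        # gradients only for adapters
    optim = adapters * 2                    # optimizer states (~2x adapters)
    return base + adapters + grads + optim + activations_gb

print(f"7B : ~{qlora_memory_gb(7):.1f} GB")
print(f"70B: ~{qlora_memory_gb(70, activations_gb=4.0):.1f} GB")
```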
When to Use Each PEFT Method

| Scenario | Method | Why |
|---|---|---|
| VRAM > 2× model BF16 size | LoRA | Faster, slightly better quality |
| VRAM < 2× model BF16 size | QLoRA | Only option for large models |
| Need best quality at same params | DoRA | +2-4% benchmark improvement |
| Using high ranks (r ≥ 32) | RSLoRA | Stable gradients, better convergence |
| Best quality, enough VRAM | DoRA + RSLoRA | Combines both improvements |
| Tiny GPU, minimal params | IA3 / prompt tuning | <0.1% parameters |
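The RSLoRA row reflects a single scaling change: standard LoRA multiplies the low-rank update by α/r, which shrinks as rank grows, while RSLoRA uses α/√r, which decays much more slowly. A quick numeric illustration:

```python
import math

alpha = 32
for r in (8, 16, 64, 256):
    lora_scale = alpha / r               # standard LoRA: vanishes at high rank
    rslora_scale = alpha / math.sqrt(r)  # RSLoRA: keeps updates usable
    print(f"r={r:>3}  LoRA α/r={lora_scale:6.3f}  RSLoRA α/√r={rslora_scale:6.3f}")
```

At r=256 the standard scale has collapsed to 0.125 while the rank-stabilized scale is still 2.0, which is why RSLoRA is the recommended switch for high ranks.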
QLoRA Best Practices Checklist
[x] Use NF4 quantization (not FP4) for transformer models
[x] Enable double quantization (saves ~3 GB extra on 70B models)
[x] Set compute_dtype to bfloat16 (Ampere+) or float16 (older GPUs)
[x] Use prepare_model_for_kbit_training() before applying LoRA
[x] Use the paged_adamw_8bit optimizer to absorb memory spikes
[x] Set LR = 2e-4 to 3e-4 (higher than full SFT)
[x] Set max_grad_norm = 0.3 (conservative for 4-bit training)
[x] Enable gradient checkpointing (reduces activation memory)
[x] Enable sequence packing (2-3x throughput improvement)
[x] Target both attention AND MLP layers for best quality
[x] Consider use_dora=True for +2-4% quality with no extra memory
[x] Save only the adapter (~MB), not the full model (~GB)
Memory Quick Reference
GPU VRAM → Maximum model size with QLoRA
4 GB → 1.5B–3B parameters
6 GB → 3B–7B parameters
8 GB → 7B parameters comfortably
12 GB → 7B–13B parameters
16 GB → 13B parameters
24 GB → 13B–30B parameters
40 GB → 30B–70B parameters
80 GB → 70B parameters (A100/H100 class)
Next Steps
Notebook 05 – Instruction Tuning: curate better training datasets
Notebook 06 – Evaluation: measure model quality rigorously
Notebook 07 – RLHF/DPO: align fine-tuned model behavior
Notebook 08 – Deployment: serve your fine-tuned model in production
References:
QLoRA: Efficient Finetuning of Quantized LLMs – Dettmers et al., 2023
DoRA: Weight-Decomposed Low-Rank Adaptation – Liu et al., 2024
A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA (RSLoRA) – Kalajdzievski, 2023
LoRA: Low-Rank Adaptation of Large Language Models – Hu et al., 2021