LoRA Fine-tuning Basics (December 2025)
Complete guide to fine-tuning LLMs with LoRA/QLoRA using the latest models and techniques.
# Install required packages (December 2025 versions)
# !pip install -U "transformers>=4.47.0" "datasets>=3.2.0" "peft>=0.13.0" "accelerate>=1.2.0" "bitsandbytes>=0.45.0" "trl>=0.12.0"
# (quotes keep the shell from treating ">=" as output redirection)
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
from trl import SFTTrainer
print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {__import__('transformers').__version__}")
print(f"PEFT version: {__import__('peft').__version__}")
print(f"TRL version: {__import__('trl').__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
1. Load Model with 4-bit Quantization
QLoRA uses 4-bit quantization (NF4) to fit large models in limited memory.
Memory Requirements (December 2025):
1.5B model: ~2GB (Qwen 2.5 1.5B)
3B model: ~4GB (Llama 3.2 3B)
7B model: ~6-8GB (Qwen 2.5 7B, Mistral)
14B model: ~12-16GB (Phi-4, Qwen 2.5 14B)
70B model: ~40-48GB (Llama 3.3, Qwen 2.5 72B)
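As a rough rule of thumb behind the table above, 4-bit NF4 weights take about half a byte per parameter, plus overhead for quantization constants, the adapter, and CUDA buffers. A sketch of that arithmetic (the 15% overhead figure is an assumption, not a measured constant):

```python
def estimate_4bit_weight_gb(n_params_billion: float, overhead: float = 0.15) -> float:
    """Rough memory for NF4-quantized weights: ~0.5 bytes/param plus overhead.

    Training needs extra headroom on top of this for activations, gradients,
    and optimizer state, which is why the table's numbers run higher than
    the raw weight footprint.
    """
    bytes_per_param = 0.5  # 4 bits per weight
    return n_params_billion * 1e9 * bytes_per_param * (1 + overhead) / 1e9

print(f"7B weights in NF4: ~{estimate_4bit_weight_gb(7):.1f} GB")  # ~4.0 GB
```

The gap between this estimate and the table's 6-8 GB for a 7B model is the training-time overhead.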
# Model configuration (December 2025 - Latest Options)
model_name = "Qwen/Qwen2.5-1.5B-Instruct" # Fast, high-quality for demo
# Production options (December 2025):
# - "Qwen/Qwen2.5-7B-Instruct" - Best balanced model
# - "meta-llama/Llama-3.2-3B-Instruct" - Meta's latest small model
# - "microsoft/phi-4" - Excellent for coding & reasoning
# - "meta-llama/Llama-3.3-70B-Instruct" - Best quality (needs 48GB)
# - "google/gemma-2-9b-it" - Google's instruction-tuned
# 4-bit quantization config (NF4 is best as of 2025)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 optimal for LLMs
    bnb_4bit_compute_dtype=torch.bfloat16,  # BF16 better than FP16
    bnb_4bit_use_double_quant=True,         # Nested quantization
)
# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # BF16 is standard now
)
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
print(f"Model loaded: {model_name}")
print(f"Model size: {model.get_memory_footprint() / 1e9:.2f} GB")
2. Configure LoRA
Only these parameters will be trained!
# LoRA configuration (2025 best practices)
lora_config = LoraConfig(
    r=64,             # Rank (2025: 32-64 is standard, up from 8-16)
    lora_alpha=128,   # Scaling factor (2x rank)
    target_modules=[  # Target all attention + MLP projections for best results
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",  # Also adapt the MLP layers
        "up_proj",
        "down_proj",
    ],
    lora_dropout=0.05,  # Dropout for regularization
    bias="none",        # Don't train bias terms
    task_type="CAUSAL_LM",
    use_rslora=True,    # Rank-Stabilized LoRA (better at high ranks)
)
# Prepare model for training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
# Check trainable parameters
model.print_trainable_parameters()
# 2025: Typical ~2-5% trainable (up from <1% in 2023)
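Where that percentage comes from: each adapted matrix W of shape (d_out, d_in) gains two small factors, A (r × d_in) and B (d_out × r), so r·(d_in + d_out) new parameters. A back-of-envelope count with illustrative projection shapes (not read from the real Qwen2.5 config):

```python
def lora_params(shapes, r):
    """Count LoRA parameters: r * (d_in + d_out) per adapted matrix."""
    return sum(r * (d_in + d_out) for (d_out, d_in) in shapes)

# Illustrative per-layer projection shapes (d_out, d_in) for a ~1.5B model
layer_shapes = [
    (1536, 1536),  # q_proj
    (256, 1536),   # k_proj (grouped-query attention: fewer KV heads)
    (256, 1536),   # v_proj
    (1536, 1536),  # o_proj
    (8960, 1536),  # gate_proj
    (8960, 1536),  # up_proj
    (1536, 8960),  # down_proj
]
n_layers = 28
trainable = lora_params(layer_shapes, r=64) * n_layers
print(f"~{trainable/1e6:.0f}M trainable LoRA params")  # ~74M, a few % of ~1.5B
```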
3. Prepare Dataset
Let's fine-tune for a specific task: code explanation.
# Sample dataset - replace with your own!
data = [
    {
        "input": "Explain this code: def factorial(n): return 1 if n <= 1 else n * factorial(n-1)",
        "output": "This is a recursive function that calculates factorial. It returns 1 for base cases (n <= 1), otherwise multiplies n by the factorial of (n-1)."
    },
    {
        "input": "Explain this code: lambda x: x**2",
        "output": "This is a lambda function (anonymous function) that takes a parameter x and returns its square (x**2)."
    },
    # Add more examples...
]
# Or load from Hugging Face (popular 2025 datasets)
# dataset = load_dataset("HuggingFaceH4/ultrachat_200k") # Conversational
# dataset = load_dataset("teknium/OpenHermes-2.5") # Instruction following
# dataset = load_dataset("mlabonne/FineTome-100k") # High quality curated
# Format as instruction-following (ChatML format - 2025 standard)
def format_instruction(example):
    """Format data for instruction tuning using ChatML."""
    return {
        "text": f"""<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
{example['input']}<|im_end|>
<|im_start|>assistant
{example['output']}<|im_end|>"""
    }
# For demo, use a high-quality dataset
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:1000]")
print(f"Dataset size: {len(dataset)} examples")
print(f"\nSample:\n{str(dataset[0]['messages'])[:200] if 'messages' in dataset[0] else dataset[0]}")
# Convert messages to text format if needed
if 'messages' in dataset[0]:
    def convert_to_text(example):
        messages = example['messages']
        text = ""
        for msg in messages:
            role = msg['role']
            content = msg['content']
            text += f"<|im_start|>{role}\n{content}<|im_end|>\n"
        return {"text": text}
    dataset = dataset.map(convert_to_text)
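To sanity-check the conversion, here is the same ChatML logic applied to a tiny hand-written example (a standalone copy of convert_to_text so the snippet runs on its own):

```python
def convert_to_text(example):
    """Render a chat messages list as ChatML text."""
    text = ""
    for msg in example["messages"]:
        text += f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n"
    return {"text": text}

sample = {"messages": [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]}
print(convert_to_text(sample)["text"])
```

Each turn becomes a `<|im_start|>role ... <|im_end|>` block, which is what the trainer sees in the "text" field.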
4. Training Configuration
# Training arguments (2025 best practices)
# trl>=0.12 moves the SFT-specific options (dataset_text_field,
# max_seq_length, packing) out of SFTTrainer and into SFTConfig
from trl import SFTConfig

training_args = SFTConfig(
    output_dir="./lora-finetuned",
    num_train_epochs=3,              # 2-3 epochs standard for 2025
    per_device_train_batch_size=4,   # Adjust based on GPU memory
    gradient_accumulation_steps=4,   # Effective batch = 4 * 4 = 16
    learning_rate=2e-4,              # 2e-4 to 5e-4 for LoRA
    bf16=True,                       # BF16 is standard now (better than FP16)
    logging_steps=10,
    save_steps=100,
    save_total_limit=2,
    warmup_ratio=0.1,                # 10% warmup (better than fixed steps)
    lr_scheduler_type="cosine",      # Cosine decay
    optim="paged_adamw_8bit",        # Memory-efficient optimizer
    gradient_checkpointing=True,     # Save memory
    report_to="none",                # or "wandb" for tracking
    max_grad_norm=0.3,               # Gradient clipping for stability
    dataset_text_field="text",
    max_seq_length=1024,             # 2025: 1024-2048 typical (up from 512)
    packing=True,                    # Pack multiple examples per sequence
    dataset_kwargs={
        "add_special_tokens": False, # Prevent duplicate special tokens
    },
)
# Initialize trainer with 2025 features
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    tokenizer=tokenizer,  # recent trl/transformers also accept processing_class=
)
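The scheduler arithmetic these settings imply can be checked by hand; a hypothetical helper, assuming the 1,000-example demo split and no packing:

```python
import math

def training_steps(n_examples, batch_size, grad_accum, epochs, warmup_ratio):
    """Optimizer steps per run, plus the warmup steps implied by warmup_ratio."""
    effective_batch = batch_size * grad_accum
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    total = steps_per_epoch * epochs
    return total, math.ceil(total * warmup_ratio)

total, warmup = training_steps(1000, batch_size=4, grad_accum=4, epochs=3, warmup_ratio=0.1)
print(f"{total} optimizer steps, {warmup} warmup")  # 189 optimizer steps, 19 warmup
```

With packing enabled the real count is lower, since several short examples share one 1024-token sequence.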
print("Trainer initialized with 2025 best practices")
5. Train the Model
Training Time Estimates (December 2025):
1.5B model: 5-15 minutes on RTX 4090
7B model: 15-45 minutes on RTX 4090
14B model: 30-90 minutes on A100 40GB
70B model: 2-6 hours on A100 80GB
Adapter Size:
Typical: 20-100 MB (vs 3-140 GB for full model)
Much faster to share and deploy!
# Start training
print("Starting LoRA fine-tuning...")
print(f"Training {len(dataset)} examples for {training_args.num_train_epochs} epochs")
print(f"Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
trainer.train()
# Save LoRA adapter (only 20-100 MB!)
output_dir = "./lora-adapter"
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
print("\n✅ Training complete!")
print(f"📁 LoRA adapter saved to {output_dir}")
print(f"💾 Adapter size: ~{sum(f.stat().st_size for f in __import__('pathlib').Path(output_dir).rglob('*') if f.is_file()) / 1e6:.1f} MB")
6. Test the Fine-tuned Model
# Load the fine-tuned model with adapter
from peft import PeftModel
print("Loading fine-tuned model...")
# Load base model with quantization
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
# Load LoRA adapter on top
model = PeftModel.from_pretrained(base_model, "./lora-adapter")
model.eval() # Set to evaluation mode
print("✅ Model loaded with LoRA adapter")
def generate_response(prompt, max_new_tokens=256):
    """Generate a response from the fine-tuned model."""
    # Wrap bare prompts in ChatML
    if not prompt.startswith("<|im_start|>"):
        formatted_prompt = f"""<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
"""
    else:
        formatted_prompt = prompt
    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            repetition_penalty=1.1,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens; splitting the full decode on
    # "<|im_start|>assistant" fails when skip_special_tokens strips those markers
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
# Test the fine-tuned model
print("\n" + "="*60)
print("Testing fine-tuned model:")
print("="*60 + "\n")
test_prompt = "Explain what machine learning is in simple terms."
response = generate_response(test_prompt)
print(f"Prompt: {test_prompt}")
print(f"\nResponse:\n{response}")
7. Merge LoRA Adapter (Optional)
Merge adapter weights into base model for:
Faster inference (no adapter overhead)
Easier deployment (single model file)
Compatibility with standard inference engines
Trade-off: Merged model is full size (3-140 GB) vs adapter (20-100 MB)
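Numerically, merging folds each adapter into its base weight as W' = W + (alpha/r)·B·A, after which A and B are discarded. A minimal sketch with random tensors (shapes are illustrative, not taken from any real model):

```python
import torch

d_out, d_in, r, alpha = 8, 6, 2, 4
W = torch.randn(d_out, d_in)     # frozen base weight
A = torch.randn(r, d_in) * 0.01  # LoRA A (initialized small)
B = torch.zeros(d_out, r)        # LoRA B (initialized to zero)

# Before training B is zero, so merging is a no-op
W_merged = W + (alpha / r) * (B @ A)
assert torch.allclose(W_merged, W)

# After training, the low-rank update folds into one dense matrix
B = torch.randn(d_out, r)
W_merged = W + (alpha / r) * (B @ A)
print(W_merged.shape)
```

This is why the merged model is full size: the low-rank structure disappears into the dense weights, but inference no longer pays the adapter's extra matmuls.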
# Merge adapter into base model
print("Merging LoRA adapter with base model...")
# First, load without quantization for merging
base_model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Load adapter
model_with_adapter = PeftModel.from_pretrained(base_model_fp16, "./lora-adapter")
# Merge and unload adapter
merged_model = model_with_adapter.merge_and_unload()
# Save merged model
output_dir = "./merged-model"
merged_model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"✅ Merged model saved to {output_dir}")
print(f"💾 Size: ~{sum(f.stat().st_size for f in __import__('pathlib').Path(output_dir).rglob('*') if f.is_file()) / 1e9:.2f} GB")
print("\n🚀 You can now load it like a standard model (no PEFT needed):")
print(f" model = AutoModelForCausalLM.from_pretrained('{output_dir}')")
Key Parameters Explained (December 2025)
LoRA Parameters
r (rank):
2025 Standard: 32-64 (up from 8-16 in 2023)
Higher = more capacity but more parameters
Typical: 32, 64, 128
Start with 64 for best results
lora_alpha:
Scaling factor (usually 2x rank)
r=64 β alpha=128
target_modules:
2025: Target attention + MLP layers for best quality
All 7 modules: q/k/v/o_proj + gate/up/down_proj
More modules = better quality, slightly slower
use_rslora (NEW in 2025):
Rank-Stabilized LoRA
Better performance, especially with higher ranks
Always set to True
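The only thing rsLoRA changes is the scaling factor applied to the update: standard LoRA uses alpha/r, which shrinks as rank grows, while rsLoRA uses alpha/sqrt(r), keeping update magnitudes stable at the higher ranks recommended above:

```python
import math

def lora_scale(alpha, r, rslora=False):
    """Update scaling: alpha/r for standard LoRA, alpha/sqrt(r) for rsLoRA."""
    return alpha / math.sqrt(r) if rslora else alpha / r

for r in (8, 64, 256):
    print(r, lora_scale(16, r), round(lora_scale(16, r, rslora=True), 2))
# Standard scaling collapses from 2.0 to 0.0625 as r grows;
# rsLoRA decays much more slowly (5.66 -> 1.0)
```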
Training Parameters
learning_rate:
LoRA: 2e-4 to 5e-4 (higher than full fine-tuning)
Start with 2e-4
QLoRA can use up to 1e-3
batch_size:
As large as GPU allows
Use gradient accumulation for effective larger batches
Effective batch 16-32 is good
epochs:
2-3 for large datasets (>10k examples)
5-10 for small datasets (<1k examples)
Watch for overfitting!
bf16 vs fp16:
2025: BF16 is standard (better stability)
Use bf16=True instead of fp16=True for better numerical stability
packing:
2025: Always use packing=True
2-3x faster training
Better GPU utilization
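The intuition behind packing can be sketched with a greedy bin-filler; TRL's real implementation also handles EOS separators and attention masking, so this shows only the counting argument:

```python
def pack_greedy(lengths, max_len):
    """Greedily pack example token counts into sequences of at most max_len."""
    sequences, current = [], 0
    for n in lengths:
        if current + n > max_len:  # sequence full: start a new one
            sequences.append(current)
            current = 0
        current += n
    if current:
        sequences.append(current)
    return sequences

lengths = [300, 200, 900, 150, 400, 100]  # hypothetical example lengths
packed = pack_greedy(lengths, max_len=1024)
print(len(lengths), "padded sequences ->", len(packed), "packed")  # 6 -> 3
```

Without packing, each of the six examples would occupy its own 1024-token sequence, mostly padding; here the same data fits in three, which is where the 2-3x speedup comes from.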
Common Issues & Solutions (2025 Updated)
Out of Memory?
# Reduce batch size
per_device_train_batch_size=2
# Increase gradient accumulation
gradient_accumulation_steps=8
# Enable gradient checkpointing
gradient_checkpointing=True
# Reduce sequence length
max_seq_length=512
# Use smaller rank
r=32
# Use QLoRA instead of LoRA (even more efficient)
Model not improving?
# Try higher learning rate
learning_rate=5e-4
# More epochs
num_train_epochs=5
# Higher rank
r=128
# Add more target modules
target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
# Enable RSLoRA
use_rslora=True
# Check your data formatting and quality!
Training too slow?
# Enable packing (2025: essential!)
packing=True
# Use Flash Attention 2 (if supported)
attn_implementation="flash_attention_2"
# Reduce target modules (trade-off: less quality)
target_modules=["q_proj", "v_proj"]
# Use lower precision
bf16=True # or fp16=True
# Smaller max_seq_length
max_seq_length=512
Latest 2025 Optimizations
# DoRA (Weight-Decomposed LoRA) - often better than standard LoRA
from peft import LoraConfig

lora_config = LoraConfig(
    use_dora=True,  # Weight-decomposed adaptation
    r=32,
    # ... remaining options as in the LoRA config above
)
# Grouped Query Attention support
# Works automatically with Llama 3.x, Qwen 2.5, Phi-4
# Simplified module targeting: adapt every linear layer
lora_config = LoraConfig(
    r=64,
    target_modules="all-linear",
    # ... remaining options as above
)
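For intuition, DoRA reparameterizes each weight as a magnitude times a unit direction and lets LoRA adapt only the direction. A sketch of that decomposition (normalizing per row here for simplicity; PEFT's internals differ):

```python
import torch

torch.manual_seed(0)
d_out, d_in, r = 6, 4, 2
W = torch.randn(d_out, d_in)     # frozen base weight
A = torch.randn(r, d_in) * 0.01  # LoRA A
B = torch.zeros(d_out, r)        # LoRA B starts at zero

m = W.norm(dim=1, keepdim=True)  # magnitude vector (trainable in DoRA)
V = W + B @ A                    # direction, adapted by LoRA
W_dora = m * V / V.norm(dim=1, keepdim=True)

# With B = 0 the decomposition reproduces the original weight exactly
assert torch.allclose(W_dora, W, atol=1e-6)
print(W_dora.shape)
```

Training the magnitude separately from the direction is what gives DoRA its edge over plain LoRA at the same rank.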
Exercise: Fine-tune for Your Task
Prepare your own dataset (minimum 100 examples)
Format as instruction-response pairs
Fine-tune with LoRA
Compare before/after performance
# Your dataset here
my_data = [
    # {"input": "...", "output": "..."},
    # ...
]
# Convert to Hugging Face dataset format
from datasets import Dataset
my_dataset = Dataset.from_list(my_data)
# Apply formatting
my_dataset = my_dataset.map(format_instruction)
# Train!
# ...
Key Takeaways (December 2025)
LoRA is still efficient: Train ~2-5% of parameters (up from <1% in 2023)
Higher ranks are standard: r=64 is typical now (was r=8-16 in 2023)
BF16 is the standard: Better than FP16 for stability
RSLoRA improves quality: Use use_rslora=True
DoRA is even better: Use use_dora=True for best results
Packing is essential: 2-3x faster training
Target all modules: Attention + MLP for best quality
ChatML format: Standard for instruction tuning
Quality data matters: 100 good examples > 1000 bad ones
Latest models: Qwen 2.5, Llama 3.3, Phi-4, Gemma 2
2025 Model Recommendations
Best for Fine-tuning (December 2025):
Qwen/Qwen2.5-7B-Instruct - Best balanced (8GB)
meta-llama/Llama-3.2-3B-Instruct - Fast, quality (4GB)
microsoft/phi-4 - Excellent coding (16GB)
google/gemma-2-9b-it - Google's best (12GB)
For Production:
Qwen/Qwen2.5-72B-Instruct - Best multilingual
meta-llama/Llama-3.3-70B-Instruct - Meta's flagship
deepseek-ai/DeepSeek-R1-Distill-Llama-70B - Best reasoning
Next Steps
04_qlora_efficient.ipynb - Even more efficient with QLoRA
05_dora_advanced.ipynb - DoRA for better quality (2025)
06_instruction_tuning.ipynb - Advanced prompting formats
07_evaluation.ipynb - Measure your fine-tuned model
08_deployment.ipynb - Serve with vLLM or TGI
09_unsloth.ipynb - 2x faster training with Unsloth (2025)