LoRA Fine-tuning Basics (December 2025)
Complete guide to fine-tuning LLMs with LoRA/QLoRA using the latest models and techniques.
# Install required packages (December 2025 versions)
# !pip install -U "transformers>=4.47.0" "datasets>=3.2.0" "peft>=0.13.0" "accelerate>=1.2.0" "bitsandbytes>=0.45.0" "trl>=0.12.0"
# (quotes keep the shell from treating ">=" as output redirection)
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
from trl import SFTTrainer
print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {__import__('transformers').__version__}")
print(f"PEFT version: {__import__('peft').__version__}")
print(f"TRL version: {__import__('trl').__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
1. Load Model with 4-bit Quantization
QLoRA uses 4-bit quantization (NF4) to fit large models in limited memory.
Memory Requirements (December 2025):
1.5B model: ~2GB (Qwen 2.5 1.5B)
3B model: ~4GB (Llama 3.2 3B)
7B model: ~6-8GB (Qwen 2.5 7B, Mistral)
14B model: ~12-16GB (Phi-4, Qwen 2.5 14B)
70B model: ~40-48GB (Llama 3.3, Qwen 2.5 72B)
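As a rough rule of thumb behind the table above, 4-bit NF4 weights take about half a byte per parameter, plus overhead for quantization constants, the adapter, and CUDA buffers. A sketch of that arithmetic (the 15% overhead figure is an assumption, not a measured constant):

```python
def estimate_4bit_weight_gb(n_params_billion: float, overhead: float = 0.15) -> float:
    """Rough memory for NF4-quantized weights: ~0.5 bytes/param plus overhead.

    Training needs extra headroom on top of this for activations, gradients,
    and optimizer state, which is why the table's numbers run higher than
    the raw weight footprint.
    """
    bytes_per_param = 0.5  # 4 bits per weight
    return n_params_billion * 1e9 * bytes_per_param * (1 + overhead) / 1e9

print(f"7B weights in NF4: ~{estimate_4bit_weight_gb(7):.1f} GB")  # ~4.0 GB
```

The gap between this estimate and the table's 6-8 GB for a 7B model is the training-time overhead.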
# Model configuration (December 2025 - Latest Options)
model_name = "Qwen/Qwen2.5-1.5B-Instruct" # Fast, high-quality for demo
# Production options (December 2025):
# - "Qwen/Qwen2.5-7B-Instruct" - Best balanced model
# - "meta-llama/Llama-3.2-3B-Instruct" - Meta's latest small model
# - "microsoft/phi-4" - Excellent for coding & reasoning
# - "meta-llama/Llama-3.3-70B-Instruct" - Best quality (needs 48GB)
# - "google/gemma-2-9b-it" - Google's instruction-tuned
# 4-bit quantization config (NF4 is best as of 2025)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 optimal for LLMs
    bnb_4bit_compute_dtype=torch.bfloat16,  # BF16 better than FP16
    bnb_4bit_use_double_quant=True,         # Nested quantization
)
# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # BF16 is standard now
)
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
print(f"Model loaded: {model_name}")
print(f"Model size: {model.get_memory_footprint() / 1e9:.2f} GB")
2. Configure LoRA
Only these parameters will be trained!
# LoRA configuration (2025 best practices)
lora_config = LoraConfig(
    r=64,             # Rank (2025: 32-64 is standard, up from 8-16)
    lora_alpha=128,   # Scaling factor (2x rank)
    target_modules=[  # Target all attention + MLP projections for best results
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",  # Also adapt the MLP layers
        "up_proj",
        "down_proj",
    ],
    lora_dropout=0.05,  # Dropout for regularization
    bias="none",        # Don't train bias terms
    task_type="CAUSAL_LM",
    use_rslora=True,    # Rank-Stabilized LoRA (better at high ranks)
)
# Prepare model for training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
# Check trainable parameters
model.print_trainable_parameters()
# 2025: Typical ~2-5% trainable (up from <1% in 2023)
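Where that percentage comes from: each adapted matrix W of shape (d_out, d_in) gains two small factors, A (r × d_in) and B (d_out × r), so r·(d_in + d_out) new parameters. A back-of-envelope count with illustrative projection shapes (not read from the real Qwen2.5 config):

```python
def lora_params(shapes, r):
    """Count LoRA parameters: r * (d_in + d_out) per adapted matrix."""
    return sum(r * (d_in + d_out) for (d_out, d_in) in shapes)

# Illustrative per-layer projection shapes (d_out, d_in) for a ~1.5B model
layer_shapes = [
    (1536, 1536),  # q_proj
    (256, 1536),   # k_proj (grouped-query attention: fewer KV heads)
    (256, 1536),   # v_proj
    (1536, 1536),  # o_proj
    (8960, 1536),  # gate_proj
    (8960, 1536),  # up_proj
    (1536, 8960),  # down_proj
]
n_layers = 28
trainable = lora_params(layer_shapes, r=64) * n_layers
print(f"~{trainable/1e6:.0f}M trainable LoRA params")  # ~74M, a few % of ~1.5B
```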
3. Prepare Dataset
Let's fine-tune for a specific task: code explanation.
# Sample dataset - replace with your own!
data = [
    {
        "input": "Explain this code: def factorial(n): return 1 if n <= 1 else n * factorial(n-1)",
        "output": "This is a recursive function that calculates factorial. It returns 1 for base cases (n <= 1), otherwise multiplies n by the factorial of (n-1)."
    },
    {
        "input": "Explain this code: lambda x: x**2",
        "output": "This is a lambda function (anonymous function) that takes a parameter x and returns its square (x**2)."
    },
    # Add more examples...
]
# Or load from Hugging Face (popular 2025 datasets)
# dataset = load_dataset("HuggingFaceH4/ultrachat_200k") # Conversational
# dataset = load_dataset("teknium/OpenHermes-2.5") # Instruction following
# dataset = load_dataset("mlabonne/FineTome-100k") # High quality curated
# Format as instruction-following (ChatML format - 2025 standard)
def format_instruction(example):
    """Format data for instruction tuning using ChatML."""
    return {
        "text": f"""<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
{example['input']}<|im_end|>
<|im_start|>assistant
{example['output']}<|im_end|>"""
    }
# For demo, use a high-quality dataset
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:1000]")
print(f"Dataset size: {len(dataset)} examples")
print(f"\nSample:\n{str(dataset[0]['messages'])[:200] if 'messages' in dataset[0] else dataset[0]}")
# Convert messages to text format if needed
if 'messages' in dataset[0]:
    def convert_to_text(example):
        messages = example['messages']
        text = ""
        for msg in messages:
            role = msg['role']
            content = msg['content']
            text += f"<|im_start|>{role}\n{content}<|im_end|>\n"
        return {"text": text}
    dataset = dataset.map(convert_to_text)
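To sanity-check the conversion, here is the same ChatML logic applied to a tiny hand-written example (a standalone copy of convert_to_text so the snippet runs on its own):

```python
def convert_to_text(example):
    """Render a chat messages list as ChatML text."""
    text = ""
    for msg in example["messages"]:
        text += f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n"
    return {"text": text}

sample = {"messages": [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]}
print(convert_to_text(sample)["text"])
```

Each turn becomes a `<|im_start|>role ... <|im_end|>` block, which is what the trainer sees in the "text" field.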
4. Training Configuration
# Training arguments (2025 best practices)
# trl>=0.12 moves the SFT-specific options (dataset_text_field,
# max_seq_length, packing) out of SFTTrainer and into SFTConfig
from trl import SFTConfig

training_args = SFTConfig(
    output_dir="./lora-finetuned",
    num_train_epochs=3,              # 2-3 epochs standard for 2025
    per_device_train_batch_size=4,   # Adjust based on GPU memory
    gradient_accumulation_steps=4,   # Effective batch = 4 * 4 = 16
    learning_rate=2e-4,              # 2e-4 to 5e-4 for LoRA
    bf16=True,                       # BF16 is standard now (better than FP16)
    logging_steps=10,
    save_steps=100,
    save_total_limit=2,
    warmup_ratio=0.1,                # 10% warmup (better than fixed steps)
    lr_scheduler_type="cosine",      # Cosine decay
    optim="paged_adamw_8bit",        # Memory-efficient optimizer
    gradient_checkpointing=True,     # Save memory
    report_to="none",                # or "wandb" for tracking
    max_grad_norm=0.3,               # Gradient clipping for stability
    dataset_text_field="text",
    max_seq_length=1024,             # 2025: 1024-2048 typical (up from 512)
    packing=True,                    # Pack multiple examples per sequence
    dataset_kwargs={
        "add_special_tokens": False, # Prevent duplicate special tokens
    },
)
# Initialize trainer with 2025 features
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    tokenizer=tokenizer,  # recent trl/transformers also accept processing_class=
)
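The scheduler arithmetic these settings imply can be checked by hand; a hypothetical helper, assuming the 1,000-example demo split and no packing:

```python
import math

def training_steps(n_examples, batch_size, grad_accum, epochs, warmup_ratio):
    """Optimizer steps per run, plus the warmup steps implied by warmup_ratio."""
    effective_batch = batch_size * grad_accum
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    total = steps_per_epoch * epochs
    return total, math.ceil(total * warmup_ratio)

total, warmup = training_steps(1000, batch_size=4, grad_accum=4, epochs=3, warmup_ratio=0.1)
print(f"{total} optimizer steps, {warmup} warmup")  # 189 optimizer steps, 19 warmup
```

With packing enabled the real count is lower, since several short examples share one 1024-token sequence.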
print("Trainer initialized with 2025 best practices")
5. Train the Model
Training Time Estimates (December 2025):
1.5B model: 5-15 minutes on RTX 4090
7B model: 15-45 minutes on RTX 4090
14B model: 30-90 minutes on A100 40GB
70B model: 2-6 hours on A100 80GB
Adapter Size:
Typical: 20-100 MB (vs 3-140 GB for full model)
Much faster to share and deploy!
# Start training
print("Starting LoRA fine-tuning...")
print(f"Training {len(dataset)} examples for {training_args.num_train_epochs} epochs")
print(f"Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
trainer.train()
# Save LoRA adapter (only 20-100 MB!)
output_dir = "./lora-adapter"
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
print("\n✅ Training complete!")
print(f"📁 LoRA adapter saved to {output_dir}")
print(f"💾 Adapter size: ~{sum(f.stat().st_size for f in __import__('pathlib').Path(output_dir).rglob('*') if f.is_file()) / 1e6:.1f} MB")
6. Test the Fine-tuned Model
# Load the fine-tuned model with adapter
from peft import PeftModel
print("Loading fine-tuned model...")
# Load base model with quantization
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
# Load LoRA adapter on top
model = PeftModel.from_pretrained(base_model, "./lora-adapter")
model.eval() # Set to evaluation mode
print("✅ Model loaded with LoRA adapter")
def generate_response(prompt, max_new_tokens=256):
    """Generate a response from the fine-tuned model."""
    # Wrap bare prompts in ChatML
    if not prompt.startswith("<|im_start|>"):
        formatted_prompt = f"""<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
"""
    else:
        formatted_prompt = prompt
    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            repetition_penalty=1.1,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens; splitting the full decode on
    # "<|im_start|>assistant" fails when skip_special_tokens strips those markers
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
# Test the fine-tuned model
print("\n" + "="*60)
print("Testing fine-tuned model:")
print("="*60 + "\n")
test_prompt = "Explain what machine learning is in simple terms."
response = generate_response(test_prompt)
print(f"Prompt: {test_prompt}")
print(f"\nResponse:\n{response}")
7. Merge LoRA Adapter (Optional)
Merge adapter weights into base model for:
Faster inference (no adapter overhead)
Easier deployment (single model file)
Compatibility with standard inference engines
Trade-off: Merged model is full size (3-140 GB) vs adapter (20-100 MB)
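Numerically, merging folds each adapter into its base weight as W' = W + (alpha/r)·B·A, after which A and B are discarded. A minimal sketch with random tensors (shapes are illustrative, not taken from any real model):

```python
import torch

d_out, d_in, r, alpha = 8, 6, 2, 4
W = torch.randn(d_out, d_in)     # frozen base weight
A = torch.randn(r, d_in) * 0.01  # LoRA A (initialized small)
B = torch.zeros(d_out, r)        # LoRA B (initialized to zero)

# Before training B is zero, so merging is a no-op
W_merged = W + (alpha / r) * (B @ A)
assert torch.allclose(W_merged, W)

# After training, the low-rank update folds into one dense matrix
B = torch.randn(d_out, r)
W_merged = W + (alpha / r) * (B @ A)
print(W_merged.shape)
```

This is why the merged model is full size: the low-rank structure disappears into the dense weights, but inference no longer pays the adapter's extra matmuls.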
# Merge adapter into base model
print("Merging LoRA adapter with base model...")
# First, load without quantization for merging
base_model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Load adapter
model_with_adapter = PeftModel.from_pretrained(base_model_fp16, "./lora-adapter")
# Merge and unload adapter
merged_model = model_with_adapter.merge_and_unload()
# Save merged model
output_dir = "./merged-model"
merged_model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"✅ Merged model saved to {output_dir}")
print(f"💾 Size: ~{sum(f.stat().st_size for f in __import__('pathlib').Path(output_dir).rglob('*') if f.is_file()) / 1e9:.2f} GB")
print("\n🚀 You can now load it like a standard model (no PEFT needed):")
print(f" model = AutoModelForCausalLM.from_pretrained('{output_dir}')")
Key Parameters Explained (December 2025)
LoRA Parameters
r (rank):
2025 Standard: 32-64 (up from 8-16 in 2023)
Higher = more capacity but more parameters
Typical: 32, 64, 128
Start with 64 for best results
lora_alpha:
Scaling factor (usually 2x rank)
r=64 β alpha=128
target_modules:
2025: Target attention + MLP layers for best quality
All 7 modules: q/k/v/o_proj + gate/up/down_proj
More modules = better quality, slightly slower
use_rslora (NEW in 2025):
Rank-Stabilized LoRA
Better performance, especially with higher ranks
Always set to True
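The only thing rsLoRA changes is the scaling factor applied to the update: standard LoRA uses alpha/r, which shrinks as rank grows, while rsLoRA uses alpha/sqrt(r), keeping update magnitudes stable at the higher ranks recommended above:

```python
import math

def lora_scale(alpha, r, rslora=False):
    """Update scaling: alpha/r for standard LoRA, alpha/sqrt(r) for rsLoRA."""
    return alpha / math.sqrt(r) if rslora else alpha / r

for r in (8, 64, 256):
    print(r, lora_scale(16, r), round(lora_scale(16, r, rslora=True), 2))
# Standard scaling collapses from 2.0 to 0.0625 as r grows;
# rsLoRA decays much more slowly (5.66 -> 1.0)
```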
Training Parameters
learning_rate:
LoRA: 2e-4 to 5e-4 (higher than full fine-tuning)
Start with 2e-4
QLoRA can use up to 1e-3
batch_size:
As large as GPU allows
Use gradient accumulation for effective larger batches
Effective batch 16-32 is good
epochs:
2-3 for large datasets (>10k examples)
5-10 for small datasets (<1k examples)
Watch for overfitting!
bf16 vs fp16:
2025: BF16 is standard (better stability)
Use bf16=True instead of fp16=True for better numerical stability
packing:
2025: Always use packing=True
2-3x faster training
Better GPU utilization
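The intuition behind packing can be sketched with a greedy bin-filler; TRL's real implementation also handles EOS separators and attention masking, so this shows only the counting argument:

```python
def pack_greedy(lengths, max_len):
    """Greedily pack example token counts into sequences of at most max_len."""
    sequences, current = [], 0
    for n in lengths:
        if current + n > max_len:  # sequence full: start a new one
            sequences.append(current)
            current = 0
        current += n
    if current:
        sequences.append(current)
    return sequences

lengths = [300, 200, 900, 150, 400, 100]  # hypothetical example lengths
packed = pack_greedy(lengths, max_len=1024)
print(len(lengths), "padded sequences ->", len(packed), "packed")  # 6 -> 3
```

Without packing, each of the six examples would occupy its own 1024-token sequence, mostly padding; here the same data fits in three, which is where the 2-3x speedup comes from.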
Common Issues & Solutions (2025 Updated)
Out of Memory?
# Reduce batch size
per_device_train_batch_size=2
# Increase gradient accumulation
gradient_accumulation_steps=8
# Enable gradient checkpointing
gradient_checkpointing=True
# Reduce sequence length
max_seq_length=512
# Use smaller rank
r=32
# Use QLoRA instead of LoRA (even more efficient)
Model not improving?
# Try higher learning rate
learning_rate=5e-4
# More epochs
num_train_epochs=5
# Higher rank
r=128
# Add more target modules
target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
# Enable RSLoRA
use_rslora=True
# Check your data formatting and quality!
Training too slow?
# Enable packing (2025: essential!)
packing=True
# Use Flash Attention 2 (if supported)
attn_implementation="flash_attention_2"
# Reduce target modules (trade-off: less quality)
target_modules=["q_proj", "v_proj"]
# Use lower precision
bf16=True # or fp16=True
# Smaller max_seq_length
max_seq_length=512
Latest 2025 Optimizations
# DoRA (Weight-Decomposed LoRA) - often better than standard LoRA
from peft import LoraConfig

lora_config = LoraConfig(
    use_dora=True,  # Weight-decomposed adaptation
    r=32,
    # ... remaining options as in the LoRA config above
)
# Grouped Query Attention support
# Works automatically with Llama 3.x, Qwen 2.5, Phi-4
# Simplified module targeting: adapt every linear layer
lora_config = LoraConfig(
    r=64,
    target_modules="all-linear",
    # ... remaining options as above
)
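For intuition, DoRA reparameterizes each weight as a magnitude times a unit direction and lets LoRA adapt only the direction. A sketch of that decomposition (normalizing per row here for simplicity; PEFT's internals differ):

```python
import torch

torch.manual_seed(0)
d_out, d_in, r = 6, 4, 2
W = torch.randn(d_out, d_in)     # frozen base weight
A = torch.randn(r, d_in) * 0.01  # LoRA A
B = torch.zeros(d_out, r)        # LoRA B starts at zero

m = W.norm(dim=1, keepdim=True)  # magnitude vector (trainable in DoRA)
V = W + B @ A                    # direction, adapted by LoRA
W_dora = m * V / V.norm(dim=1, keepdim=True)

# With B = 0 the decomposition reproduces the original weight exactly
assert torch.allclose(W_dora, W, atol=1e-6)
print(W_dora.shape)
```

Training the magnitude separately from the direction is what gives DoRA its edge over plain LoRA at the same rank.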
Exercise: Fine-tune for Your Task
Prepare your own dataset (minimum 100 examples)
Format as instruction-response pairs
Fine-tune with LoRA
Compare before/after performance
# Your dataset here
my_data = [
    # {"input": "...", "output": "..."},
    # ...
]
# Convert to Hugging Face dataset format
from datasets import Dataset
my_dataset = Dataset.from_list(my_data)
# Apply formatting
my_dataset = my_dataset.map(format_instruction)
# Train!
# ...
Key Takeaways (December 2025)
LoRA is still efficient: Train ~2-5% of parameters (up from <1% in 2023)
Higher ranks are standard: r=64 is typical now (was r=8-16 in 2023)
BF16 is the standard: Better than FP16 for stability
RSLoRA improves quality: Use use_rslora=True
DoRA is even better: Use use_dora=True for best results
Packing is essential: 2-3x faster training
Target all modules: Attention + MLP for best quality
ChatML format: Standard for instruction tuning
Quality data matters: 100 good examples > 1000 bad ones
Latest models: Qwen 2.5, Llama 3.3, Phi-4, Gemma 2
2025 Model Recommendations
Best for Fine-tuning (December 2025):
Qwen/Qwen2.5-7B-Instruct - Best balanced (8GB)
meta-llama/Llama-3.2-3B-Instruct - Fast, quality (4GB)
microsoft/phi-4 - Excellent coding (16GB)
google/gemma-2-9b-it - Google's best (12GB)
For Production:
Qwen/Qwen2.5-72B-Instruct - Best multilingual
meta-llama/Llama-3.3-70B-Instruct - Meta's flagship
deepseek-ai/DeepSeek-R1-Distill-Llama-70B - Best reasoning
Next Steps
04_qlora_efficient.ipynb - Even more efficient with QLoRA
05_dora_advanced.ipynb - DoRA for better quality (2025)
06_instruction_tuning.ipynb - Advanced prompting formats
07_evaluation.ipynb - Measure your fine-tuned model
08_deployment.ipynb - Serve with vLLM or TGI
09_unsloth.ipynb - 2x faster training with Unsloth (2025)