LoRA Fine-tuning Basics (December 2025)¶

Complete guide to fine-tuning LLMs with LoRA/QLoRA using the latest models and techniques.

# Install required packages (December 2025 versions)
# Quote the version specs so the shell doesn't treat ">" as redirection:
# !pip install -U "transformers>=4.47.0" "datasets>=3.2.0" "peft>=0.13.0" "accelerate>=1.2.0" "bitsandbytes>=0.45.0" "trl>=0.12.0"
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
from trl import SFTTrainer

import peft
import transformers
import trl

print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"PEFT version: {peft.__version__}")
print(f"TRL version: {trl.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

1. Load Model with 4-bit Quantization¶

QLoRA uses 4-bit quantization (NF4) to fit large models in limited memory.

Memory Requirements (December 2025):

  • 1.5B model: ~2GB (Qwen 2.5 1.5B)

  • 3B model: ~4GB (Llama 3.2 3B)

  • 7B model: ~6-8GB (Qwen 2.5 7B, Mistral)

  • 14B model: ~12-16GB (Phi-4, Qwen 2.5 14B)

  • 70B model: ~40-48GB (Llama 3.3, Qwen 2.5 72B)
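The bullet figures above can be sanity-checked with simple arithmetic: 4-bit weights cost half a byte per parameter, plus overhead for quantization state and runtime buffers. The sketch below uses an assumed overhead factor of 1.2, which is an illustrative guess, not a measured constant — weight-only estimates land below the table, which also budgets for activations during training.

```python
def qlora_weight_gb(n_params_billions: float, bits: int = 4, overhead: float = 1.2) -> float:
    """Rough VRAM estimate for k-bit quantized weights.

    n_params_billions * 1e9 params * (bits / 8) bytes = weight GB;
    overhead is an assumed fudge factor, not a measured value.
    """
    weight_gb = n_params_billions * bits / 8
    return weight_gb * overhead

for size in (1.5, 3, 7, 14, 70):
    print(f"{size}B @ 4-bit: ~{qlora_weight_gb(size):.1f} GB (weights + assumed overhead)")
```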

# Model configuration (December 2025 - Latest Options)
model_name = "Qwen/Qwen2.5-1.5B-Instruct"  # Fast, high-quality for demo
# Production options (December 2025):
# - "Qwen/Qwen2.5-7B-Instruct" - Best balanced model
# - "meta-llama/Llama-3.2-3B-Instruct" - Meta's latest small model
# - "microsoft/phi-4" - Excellent for coding & reasoning
# - "meta-llama/Llama-3.3-70B-Instruct" - Best quality (needs 48GB)
# - "google/gemma-2-9b-it" - Google's instruction-tuned

# 4-bit quantization config (NF4 is best as of 2025)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NF4 optimal for LLMs
    bnb_4bit_compute_dtype=torch.bfloat16,  # BF16 better than FP16
    bnb_4bit_use_double_quant=True,      # Nested quantization
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,          # BF16 is standard now
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

print(f"Model loaded: {model_name}")
print(f"Model size: {model.get_memory_footprint() / 1e9:.2f} GB")

2. Configure LoRA¶

Only these parameters will be trained!

# LoRA configuration (2025 best practices)
lora_config = LoraConfig(
    r=64,                          # Rank (2025: 32-64 is standard, up from 8-16)
    lora_alpha=128,                # Scaling factor (2x rank)
    target_modules=[               # Target all attention + MLP for best results
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",               # NEW: Also adapt MLP layers
        "up_proj",
        "down_proj",
    ],
    lora_dropout=0.05,             # Dropout for regularization
    bias="none",                   # Don't train bias
    task_type="CAUSAL_LM",
    use_rslora=True,               # Rank-Stabilized LoRA: alpha/sqrt(r) scaling
)

# Prepare model for training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)

# Check trainable parameters
model.print_trainable_parameters()
# 2025: Typical ~2-5% trainable (up from <1% in 2023)

3. Prepare Dataset¶

Let’s fine-tune for a specific task: code explanation.

# Sample dataset - replace with your own!
data = [
    {
        "input": "Explain this code: def factorial(n): return 1 if n <= 1 else n * factorial(n-1)",
        "output": "This is a recursive function that calculates factorial. It returns 1 for base cases (n <= 1), otherwise multiplies n by the factorial of (n-1)."
    },
    {
        "input": "Explain this code: lambda x: x**2",
        "output": "This is a lambda function (anonymous function) that takes a parameter x and returns its square (x**2)."
    },
    # Add more examples...
]

# Or load from Hugging Face (popular 2025 datasets)
# dataset = load_dataset("HuggingFaceH4/ultrachat_200k")  # Conversational
# dataset = load_dataset("teknium/OpenHermes-2.5")        # Instruction following
# dataset = load_dataset("mlabonne/FineTome-100k")        # High quality curated

# Format as instruction-following (ChatML format - 2025 standard)
def format_instruction(example):
    """Format data for instruction tuning using ChatML."""
    return {
        "text": f"""<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
{example['input']}<|im_end|>
<|im_start|>assistant
{example['output']}<|im_end|>"""
    }

# For demo, use a high-quality dataset
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:1000]")
print(f"Dataset size: {len(dataset)} examples")
print(f"\nSample:\n{str(dataset[0])[:200]}")  # str() first, so [:200] truncates characters, not messages

# Convert messages to text format if needed
if 'messages' in dataset[0]:
    def convert_to_text(example):
        messages = example['messages']
        text = ""
        for msg in messages:
            role = msg['role']
            content = msg['content']
            text += f"<|im_start|>{role}\n{content}<|im_end|>\n"
        return {"text": text}
    
    dataset = dataset.map(convert_to_text)

4. Training Configuration¶

# Training arguments (2025 best practices)
# TRL >= 0.12 moved the SFT-specific options (dataset_text_field, max_seq_length,
# packing, dataset_kwargs) from SFTTrainer into SFTConfig.
from trl import SFTConfig

training_args = SFTConfig(
    output_dir="./lora-finetuned",
    num_train_epochs=3,                   # 2-3 epochs standard
    per_device_train_batch_size=4,        # Adjust based on GPU memory
    gradient_accumulation_steps=4,        # Effective batch = 4 * 4 = 16
    learning_rate=2e-4,                   # 2e-4 to 5e-4 for LoRA
    bf16=True,                            # BF16 preferred over FP16 for stability
    logging_steps=10,
    save_steps=100,
    save_total_limit=2,
    warmup_ratio=0.1,                     # 10% warmup (scales with dataset size)
    lr_scheduler_type="cosine",           # Cosine decay
    optim="paged_adamw_8bit",             # Memory-efficient optimizer
    gradient_checkpointing=True,          # Save memory
    report_to="none",                     # or "wandb" for tracking
    max_grad_norm=0.3,                    # Gradient clipping for stability
    # SFT-specific options (SFTConfig fields in TRL >= 0.12)
    dataset_text_field="text",
    max_seq_length=1024,                  # 1024-2048 typical (up from 512)
    packing=True,                         # Pack multiple examples per sequence
    dataset_kwargs={
        "add_special_tokens": False,      # Prevent duplicate special tokens
    },
)

# Initialize trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    processing_class=tokenizer,           # "tokenizer" arg is deprecated in TRL >= 0.12
)

print("Trainer initialized with 2025 best practices")

5. Train the Model¶

Training Time Estimates (December 2025):

  • 1.5B model: 5-15 minutes on RTX 4090

  • 7B model: 15-45 minutes on RTX 4090

  • 14B model: 30-90 minutes on A100 40GB

  • 70B model: 2-6 hours on A100 80GB

Adapter Size:

  • Typical: 20-100 MB (vs 3-140 GB for full model)

  • Much faster to share and deploy!
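The adapter-size bullets follow directly from the LoRA parameter count: each adapted (d_out × d_in) linear layer gains an A matrix (r × d_in) and a B matrix (d_out × r). A sketch with illustrative layer shapes — hidden size 2048, intermediate size 8192, 24 layers are assumptions for the example, not the exact dimensions of any particular model; doubling r roughly doubles the adapter.

```python
def lora_params(layer_shapes, r):
    """LoRA parameters added: r * (d_in + d_out) per adapted (d_out, d_in) linear layer."""
    return sum(r * (d_in + d_out) for d_out, d_in in layer_shapes)

# Illustrative shapes for one transformer block (hidden=2048, intermediate=8192)
block = [
    (2048, 2048),  # q_proj
    (2048, 2048),  # k_proj (ignoring grouped-query shrinkage for simplicity)
    (2048, 2048),  # v_proj
    (2048, 2048),  # o_proj
    (8192, 2048),  # gate_proj
    (8192, 2048),  # up_proj
    (2048, 8192),  # down_proj
]
n_layers = 24
total = lora_params(block, r=32) * n_layers
print(f"{total / 1e6:.1f}M LoRA params -> ~{total * 2 / 1e6:.0f} MB in bf16")
```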

# Start training
print("Starting LoRA fine-tuning...")
print(f"Training {len(dataset)} examples for {training_args.num_train_epochs} epochs")
print(f"Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")

trainer.train()

# Save LoRA adapter (only 20-100 MB!)
output_dir = "./lora-adapter"
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

from pathlib import Path

print("\n✅ Training complete!")
print(f"📁 LoRA adapter saved to {output_dir}")
adapter_mb = sum(f.stat().st_size for f in Path(output_dir).rglob("*") if f.is_file()) / 1e6
print(f"💾 Adapter size: ~{adapter_mb:.1f} MB")

6. Test the Fine-tuned Model¶

# Load the fine-tuned model with adapter
from peft import PeftModel

print("Loading fine-tuned model...")

# Load base model with quantization
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Load LoRA adapter on top
model = PeftModel.from_pretrained(base_model, "./lora-adapter")
model.eval()  # Set to evaluation mode

print("✅ Model loaded with LoRA adapter")

def generate_response(prompt, max_new_tokens=256):
    """Generate response from fine-tuned model (2025 optimized)."""
    # Format with ChatML if needed
    if not prompt.startswith("<|im_start|>"):
        formatted_prompt = f"""<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
"""
    else:
        formatted_prompt = prompt
    
    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)
    
    with torch.inference_mode():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            repetition_penalty=1.1,
            pad_token_id=tokenizer.eos_token_id,
        )
    
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Extract just the assistant's response
    if "<|im_start|>assistant" in response:
        response = response.split("<|im_start|>assistant")[-1].split("<|im_end|>")[0].strip()
    
    return response

# Test the fine-tuned model
print("\n" + "="*60)
print("Testing fine-tuned model:")
print("="*60 + "\n")

test_prompt = "Explain what machine learning is in simple terms."
response = generate_response(test_prompt)

print(f"Prompt: {test_prompt}")
print(f"\nResponse:\n{response}")

7. Merge LoRA Adapter (Optional)¶

Merge adapter weights into base model for:

  • Faster inference (no adapter overhead)

  • Easier deployment (single model file)

  • Compatibility with standard inference engines

Trade-off: Merged model is full size (3-140 GB) vs adapter (20-100 MB)

# Merge adapter into base model
print("Merging LoRA adapter with base model...")

# First, load the base model without quantization for merging (bf16 weights)
# Note: merging a QLoRA-trained adapter into unquantized weights can shift
# outputs slightly, since training saw the quantized base.
base_model_bf16 = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load adapter
model_with_adapter = PeftModel.from_pretrained(base_model_bf16, "./lora-adapter")

# Merge and unload adapter
merged_model = model_with_adapter.merge_and_unload()

# Save merged model
output_dir = "./merged-model"
merged_model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

from pathlib import Path

print(f"✅ Merged model saved to {output_dir}")
merged_gb = sum(f.stat().st_size for f in Path(output_dir).rglob("*") if f.is_file()) / 1e9
print(f"💾 Size: ~{merged_gb:.2f} GB")
print("\n📌 You can now load it like a standard model (no PEFT needed):")
print(f"   model = AutoModelForCausalLM.from_pretrained('{output_dir}')")

Key Parameters Explained (December 2025)¶

LoRA Parameters¶

r (rank):

  • 2025 Standard: 32-64 (up from 8-16 in 2023)

  • Higher = more capacity but more parameters

  • Typical: 32, 64, 128

  • Start with 64 for best results

lora_alpha:

  • Scaling factor (usually 2x rank)

  • r=64 β†’ alpha=128

target_modules:

  • 2025: Target attention + MLP layers for best quality

  • All 7 modules: q/k/v/o_proj + gate/up/down_proj

  • More modules = better quality, slightly slower

use_rslora:

  • Rank-Stabilized LoRA (rsLoRA): scales the update by alpha/√r instead of alpha/r

  • Mainly helps at higher ranks (r ≥ 32)

  • Safe to enable as a default
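The difference between standard LoRA and rsLoRA is just the scaling factor applied to the low-rank update BAx: α/r versus α/√r. With the α = 2r convention used in this guide, standard scaling stays constant while rsLoRA scaling grows with rank, which is what stabilizes high-rank training:

```python
import math

def lora_scale(alpha: float, r: int, rslora: bool = False) -> float:
    """Scaling factor applied to the BA update: alpha/r (LoRA) or alpha/sqrt(r) (rsLoRA)."""
    return alpha / math.sqrt(r) if rslora else alpha / r

for r in (8, 32, 64, 128):
    alpha = 2 * r  # the 2x-rank convention used above
    print(f"r={r:4d}  LoRA scale={lora_scale(alpha, r):.1f}  rsLoRA scale={lora_scale(alpha, r, rslora=True):.2f}")
```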

Training Parameters¶

learning_rate:

  • LoRA: 2e-4 to 5e-4 (higher than full fine-tuning)

  • Start with 2e-4

  • QLoRA can use up to 1e-3
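These values interact with the schedule set in the training arguments (warmup_ratio=0.1, lr_scheduler_type="cosine"): the learning rate ramps up linearly for the first 10% of steps, then decays along a cosine curve. A minimal sketch of that schedule (a simplified model of the scheduler's behavior, not the library implementation):

```python
import math

def lr_at(step: int, total_steps: int, base_lr: float = 2e-4, warmup_ratio: float = 0.1) -> float:
    """Linear warmup to base_lr over warmup_ratio of training, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

for step in (0, 50, 100, 500, 1000):
    print(f"step {step:5d}: lr = {lr_at(step, total_steps=1000):.2e}")
```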

batch_size:

  • As large as GPU allows

  • Use gradient accumulation for effective larger batches

  • Effective batch 16-32 is good

epochs:

  • 2-3 for large datasets (>10k examples)

  • 5-10 for small datasets (<1k examples)

  • Watch for overfitting!

bf16 vs fp16:

  • 2025: BF16 is standard (better stability)

  • Use bf16=True instead of fp16=True

  • Better for numerical stability

packing:

  • Use packing=True for most SFT runs (skip it when examples must not share a sequence)

  • 2-3x faster training

  • Better GPU utilization
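Conceptually, packing concatenates tokenized examples (EOS-separated) into one stream and slices it into fixed-length blocks, so no training step wastes compute on padding. A simplified sketch of the idea — real implementations such as TRL's also handle cross-example attention masking and leftover tokens:

```python
def pack_examples(tokenized, block_size, eos_id=0):
    """Concatenate EOS-terminated examples and cut into full blocks; drops the tail."""
    stream = []
    for ids in tokenized:
        stream.extend(ids)
        stream.append(eos_id)  # separator so examples stay delimited
    return [stream[i:i + block_size] for i in range(0, len(stream) - block_size + 1, block_size)]

blocks = pack_examples([[1, 2, 3], [4, 5], [6, 7, 8, 9]], block_size=4)
print(blocks)  # three full blocks, no padding tokens anywhere
```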

Common Issues & Solutions (2025 Updated)¶

Out of Memory?¶

# Reduce batch size
per_device_train_batch_size=2

# Increase gradient accumulation
gradient_accumulation_steps=8

# Enable gradient checkpointing
gradient_checkpointing=True

# Reduce sequence length
max_seq_length=512

# Use smaller rank
r=32

# Use QLoRA instead of LoRA (even more efficient)

Model not improving?¶

# Try higher learning rate
learning_rate=5e-4

# More epochs
num_train_epochs=5

# Higher rank
r=128

# Add more target modules
target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

# Enable RSLoRA
use_rslora=True

# Check your data formatting and quality!

Training too slow?¶

# Enable packing (2025: essential!)
packing=True

# Use Flash Attention 2 (if supported)
attn_implementation="flash_attention_2"

# Reduce target modules (trade-off: less quality)
target_modules=["q_proj", "v_proj"]

# Use lower precision
bf16=True  # or fp16=True

# Smaller max_seq_length
max_seq_length=512

Latest 2025 Optimizations¶

# DoRA (Weight-Decomposed Low-Rank Adaptation) - often outperforms plain LoRA
from peft import LoraConfig
lora_config = LoraConfig(
    use_dora=True,  # Decomposes updates into magnitude and direction components
    r=32,
    ...
)

# Grouped Query Attention support
# Works automatically with Llama 3.x, Qwen 2.5, Phi-4

# "all-linear" shortcut: adapt every linear layer without listing module names
lora_config = LoraConfig(
    r=64,
    target_modules="all-linear",  # Simplified targeting (PEFT shortcut)
    ...
)

Exercise: Fine-tune for Your Task¶

  1. Prepare your own dataset (minimum 100 examples)

  2. Format as instruction-response pairs

  3. Fine-tune with LoRA

  4. Compare before/after performance

# Your dataset here
my_data = [
    # {"input": "...", "output": "..."},
    # ...
]

# Convert to Hugging Face dataset format
from datasets import Dataset
my_dataset = Dataset.from_list(my_data)

# Apply formatting
my_dataset = my_dataset.map(format_instruction)

# Train!
# ...

Key Takeaways (December 2025)¶

  1. LoRA is still efficient: Train ~2-5% of parameters (up from <1% in 2023)

  2. Higher ranks are standard: r=64 is typical now (was r=8-16 in 2023)

  3. BF16 is the standard: Better than FP16 for stability

  4. RSLoRA helps at higher ranks: Set use_rslora=True

  5. DoRA often does better still: Try use_dora=True

  6. Packing is essential: 2-3x faster training

  7. Target all modules: Attention + MLP for best quality

  8. ChatML format: Standard for instruction tuning

  9. Quality data matters: 100 good examples > 1000 bad ones

  10. Latest models: Qwen 2.5, Llama 3.3, Phi-4, Gemma 2

2025 Model Recommendations¶

Best for Fine-tuning (December 2025):

  • Qwen/Qwen2.5-7B-Instruct - Best balanced (8GB)

  • meta-llama/Llama-3.2-3B-Instruct - Fast, quality (4GB)

  • microsoft/phi-4 - Excellent coding (16GB)

  • google/gemma-2-9b-it - Google’s best (12GB)

For Production:

  • Qwen/Qwen2.5-72B-Instruct - Best multilingual

  • meta-llama/Llama-3.3-70B-Instruct - Meta’s flagship

  • deepseek-ai/DeepSeek-R1-Distill-Llama-70B - Strong reasoning (R1 distillation)

Next Steps¶

  • 04_qlora_efficient.ipynb - Even more efficient with QLoRA

  • 05_dora_advanced.ipynb - DoRA for better quality (2025)

  • 06_instruction_tuning.ipynb - Advanced prompting formats

  • 07_evaluation.ipynb - Measure your fine-tuned model

  • 08_deployment.ipynb - Serve with vLLM or TGI

  • 09_unsloth.ipynb - 2x faster training with Unsloth (2025)