Unsloth - 2x-5x Faster Fine-Tuning with 80% Less VRAM (2025)
Unsloth is the fastest open-source LLM fine-tuning library in 2025. It achieves its speedups through hand-written custom CUDA/Triton kernels - not approximations - so you get the same model quality as standard TRL/PEFT, just much faster and cheaper.
What You Will Learn
Why Unsloth is faster (custom kernels, not tricks)
Loading any 2025 model (Qwen 3, Llama 4, Phi-4, Gemma 3) with Unsloth
Dynamic 4-bit quantization: Unsloth Dynamic 2.0
Full SFT pipeline with UnslothTrainer
GRPO with Unsloth for 5x faster reasoning training
Extended context: 342K token training on Llama 3.1 8B
Saving to GGUF for Ollama/llama.cpp deployment
Memory and speed benchmarks vs standard TRL
Free tier strategies: Colab T4 and Kaggle P100
Prerequisites
Completed notebooks 02-05 (SFT, LoRA, QLoRA, DPO)
Notebook 08 (GRPO) recommended for the GRPO section
GPU with 8-16 GB VRAM (or free Colab/Kaggle)
# Install Unsloth (2025 - supports CUDA 12.x)
# !pip install unsloth
#
# For specific CUDA versions:
# !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git" # Colab
# !pip install unsloth[cu121] # CUDA 12.1
# !pip install unsloth[cu124] # CUDA 12.4+
#
# Dependencies (usually auto-installed):
# !pip install xformers "trl>=0.12.0" "peft>=0.13.0" "accelerate>=1.2.0"
# (quote the version specifiers - an unquoted > is a shell redirect)
import torch
import sys

print(f"Python: {sys.version.split()[0]}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_mem = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU: {gpu_name}")
    print(f"VRAM: {gpu_mem:.1f} GB")

    # Check if Unsloth supports this GPU
    if "T4" in gpu_name:
        print("Free tier GPU detected (Colab T4). Unsloth will use 16GB efficiently.")
    elif "P100" in gpu_name:
        print("Free tier GPU detected (Kaggle P100). Unsloth works great here.")
    elif "A100" in gpu_name or "A10" in gpu_name:
        print("Production GPU. Unsloth will provide maximum speedup.")
    elif "4090" in gpu_name or "3090" in gpu_name:
        print("Consumer flagship GPU. Unsloth excels here.")
    else:
        print("GPU detected. Unsloth supports most NVIDIA GPUs.")
else:
    print("No GPU detected. Unsloth requires a CUDA GPU.")

# Try importing Unsloth
try:
    from unsloth import FastLanguageModel
    print("\nUnsloth imported successfully!")
except ImportError:
    print("\nUnsloth not installed. Run: pip install unsloth")
    print("Showing code examples that will work once installed.")
1. What Is Unsloth and Why It Matters
The Standard Fine-Tuning Bottleneck
Standard QLoRA (bitsandbytes + PEFT + TRL) is great but has overhead:
HuggingFace attention uses generic PyTorch - not optimized for your exact hardware
Gradient checkpointing has Python-level overhead per operation
The LoRA matrix math uses standard cuBLAS calls
How Unsloth Is Different
Unsloth rewrites the critical path in custom CUDA/Triton kernels:
Custom attention kernel: RoPE, attention computation, and softmax fused into one kernel
Fused LoRA operations: The A/B matrix multiplications and addition are fused
Custom gradient checkpointing: Python overhead removed, pure CUDA
Smarter memory management: Avoids unnecessary tensor copies during backprop
This is not approximation. It is the exact same math, done more efficiently.
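The "same math, faster" claim is easy to verify for the LoRA part. A fused kernel computes the factored form below instead of materializing the full-rank weight update; the two are algebraically identical. This is a toy NumPy sketch with made-up dimensions, not Unsloth's actual kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8                      # hidden size, LoRA rank (toy values)
W = rng.standard_normal((d, d))   # frozen base weight
A = rng.standard_normal((r, d))   # LoRA down-projection
B = rng.standard_normal((d, r))   # LoRA up-projection
x = rng.standard_normal(d)
scale = 2.0                       # lora_alpha / r

# Unfused view: merge the adapter into the weight, then multiply
y_merged = (W + scale * B @ A) @ x

# Fused/factored view: never materialize the d x d update matrix
y_factored = W @ x + scale * (B @ (A @ x))

print(np.allclose(y_merged, y_factored))
```

The factored form does O(d*r) extra work per token instead of O(d*d), which is exactly why fused LoRA kernels are both faster and lighter on memory.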
Key Advantages Over CompetitorsΒΆ
Feature |
Unsloth |
Standard TRL |
Axolotl |
|---|---|---|---|
Speed (vs baseline) |
2-5x faster |
1x (baseline) |
~1.5x |
Memory reduction |
60-80% less |
baseline |
~20% |
Accuracy vs baseline |
Identical |
Identical |
Identical |
AMD GPU support |
Yes (ROCm 2025) |
Yes |
Yes |
Intel GPU support |
Yes (2025) |
Partial |
No |
GGUF export |
Built-in |
Manual |
Manual |
Setup complexity |
Low (pip install) |
Medium |
High (YAML config) |
Free tier compatible |
Yes |
Partial |
No |
AMD and Intel GPU Support (2025)
Unsloth 2025 expanded beyond NVIDIA:
AMD ROCm 6.x: RX 7900 XTX, MI300X supported
Intel Arc / Gaudi 3: via Intel Extension for PyTorch
Same API, kernels automatically dispatch to the right backend
2. Loading Models with Unsloth
FastLanguageModel.from_pretrained is a drop-in replacement for
AutoModelForCausalLM.from_pretrained + BitsAndBytesConfig.
It handles quantization, device mapping, and Unsloth kernel patching in one call.
# ============================================================
# Loading models with Unsloth (2025 supported models)
# ============================================================
# Unsloth hosts pre-patched versions of popular models on HuggingFace
# using the "unsloth/" namespace - these load faster than the originals.
SUPPORTED_MODELS_2025 = {
# Qwen 2.5 family (recommended for most tasks)
"qwen2.5_1.5b": "unsloth/Qwen2.5-1.5B-Instruct",
"qwen2.5_7b": "unsloth/Qwen2.5-7B-Instruct",
"qwen2.5_14b": "unsloth/Qwen2.5-14B-Instruct",
"qwen2.5_72b": "unsloth/Qwen2.5-72B-Instruct-bnb-4bit",
# Qwen 3 family (2025 - best reasoning + multilingual)
"qwen3_4b": "unsloth/Qwen3-4B",
"qwen3_8b": "unsloth/Qwen3-8B",
"qwen3_30b": "unsloth/Qwen3-30B-A3B", # MoE: 30B total, 3B active
# Llama 3.x family
"llama3.2_1b": "unsloth/Llama-3.2-1B-Instruct",
"llama3.2_3b": "unsloth/Llama-3.2-3B-Instruct",
"llama3.3_70b": "unsloth/Llama-3.3-70B-Instruct-bnb-4bit",
# Llama 4 (2025)
"llama4_scout": "unsloth/Llama-4-Scout-17B-16E", # 17B active of 109B
"llama4_maverick": "unsloth/Llama-4-Maverick-17B-128E",
# Phi-4 (Microsoft)
"phi4": "unsloth/phi-4",
"phi4_mini": "unsloth/Phi-4-mini-instruct",
# Gemma 3 (Google, 2025)
"gemma3_4b": "unsloth/gemma-3-4b-it",
"gemma3_12b": "unsloth/gemma-3-12b-it",
"gemma3_27b": "unsloth/gemma-3-27b-it",
# DeepSeek R1 family (reasoning)
"deepseek_r1_7b": "unsloth/DeepSeek-R1-Distill-Qwen-7B",
"deepseek_r1_70b": "unsloth/DeepSeek-R1-Distill-Llama-70B-bnb-4bit",
}
print("Unsloth-supported models (2025):")
print()
for key, model_id in SUPPORTED_MODELS_2025.items():
    print(f" {key:<20} -> {model_id}")
# ============================================================
# Standard Unsloth model loading
# ============================================================
# from unsloth import FastLanguageModel
MODEL_NAME = "unsloth/Qwen2.5-7B-Instruct"
MAX_SEQ_LENGTH = 2048
UNSLOTH_LOAD_CODE = f'''
from unsloth import FastLanguageModel
import torch
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="{MODEL_NAME}",
max_seq_length={MAX_SEQ_LENGTH}, # Context window
load_in_4bit=True, # 4-bit quantization
# load_in_8bit=False, # 8-bit (slower, more accurate)
# full_finetuning=False, # Set True for full fine-tuning
dtype=None, # Auto-detect (BF16 on Ampere+)
# token="hf_...", # For gated models (Llama 4)
)
print(f"Model loaded: {MODEL_NAME}")
print(f"Memory: {{model.get_memory_footprint() / 1e9:.2f}} GB")
# 7B with 4-bit: ~5 GB (vs 14 GB for FP16)
'''
print("Standard Unsloth loading code:")
print(UNSLOTH_LOAD_CODE)
# Demonstrate standard loading with bitsandbytes for comparison
print("\n" + "="*60)
print("Standard HuggingFace equivalent (for comparison):")
print("="*60)
HF_EQUIVALENT = """
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen2.5-7B-Instruct",
quantization_config=bnb_config,
device_map="auto",
torch_dtype=torch.bfloat16,
)
# Then separately: prepare_model_for_kbit_training, get_peft_model...
# Unsloth combines all of this into one call
"""
print(HF_EQUIVALENT)
3. Unsloth Dynamic 2.0 - Smarter Quantization
Standard NF4 quantization applies 4-bit compression uniformly to all layers. Unsloth Dynamic 2.0 (released 2025) is smarter:
How Dynamic Quantization Works
Not all layers are equally sensitive to quantization:
Embedding layers: Very sensitive. 4-bit here hurts accuracy significantly.
First/last transformer layers: More sensitive than middle layers.
Attention vs MLP: Attention layers are more sensitive.
Unsloth Dynamic 2.0 keeps sensitive layers in higher precision (8-bit or 16-bit) and quantizes insensitive layers to 4-bit. The result:
| Approach | Accuracy | Memory | Speed |
|---|---|---|---|
| BF16 (full precision) | Baseline | 100% | 1x |
| Uniform NF4 4-bit | -2.5% avg | 35% | 2x |
| Unsloth Dynamic 2.0 | -0.8% avg | 38% | 1.9x |
Dynamic 2.0 gives nearly full-precision accuracy at 4-bit memory cost.
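To see where "a few percent more memory than uniform 4-bit" comes from, here is a toy average-bits calculator. The layer-selection policy (keep a few sensitive layers at 8-bit, quantize the rest to 4-bit) is a simplified assumption for illustration, not Unsloth's actual heuristic:

```python
def avg_bits(n_layers=32, sensitive_bits=8, default_bits=4, n_sensitive=4):
    """Average bits per weight if a few 'sensitive' layers stay at higher
    precision while the rest are quantized to 4-bit.
    (Illustrative policy - not Unsloth's real layer selection.)"""
    total = n_sensitive * sensitive_bits + (n_layers - n_sensitive) * default_bits
    return total / n_layers

def model_gb(n_params, bits):
    """Weight memory in GB for n_params weights at the given bit width."""
    return n_params * bits / 8 / 1e9

uniform = model_gb(7e9, 4)           # uniform NF4: 3.5 GB for a 7B model
dynamic = model_gb(7e9, avg_bits())  # mixed precision: ~3.9 GB
print(f"Uniform 4-bit: {uniform:.1f} GB, Dynamic-style: {dynamic:.2f} GB")
```

The mixed scheme averages 4.5 bits/weight here, so memory grows only ~12% over uniform 4-bit while the most quantization-sensitive layers keep full 8-bit fidelity.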
Activating it is one parameter change:
# Dynamic 2.0 quantization example
DYNAMIC_QUANT_CODE = """
from unsloth import FastLanguageModel
# Option 1: Unsloth Dynamic 2.0 (best accuracy at 4-bit)
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen2.5-7B-Instruct-bnb-4bit",
max_seq_length=2048,
load_in_4bit=True,
# Unsloth Dynamic 2.0: automatically skips critical layers
# Enable by using the 'bnb-4bit' model variants on unsloth/ namespace
)
# Option 2: Unsloth's GGUF-based quantization (best for inference)
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen2.5-7B-Instruct-GGUF",
max_seq_length=2048,
# gguf_quant_type="Q4_K_M", # GGUF quantization type
)
# Option 3: Full precision (best quality, most memory)
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen2.5-7B-Instruct",
max_seq_length=2048,
load_in_4bit=False, # BF16 full precision
full_finetuning=True, # Enable full parameter fine-tuning
)
"""
print(DYNAMIC_QUANT_CODE)
# Memory comparison table
import pandas as pd
memory_data = {
"Model": ["1.5B", "7B", "14B", "32B", "70B"],
"BF16 (full)": ["3 GB", "14 GB", "28 GB", "64 GB", "140 GB"],
"NF4 4-bit (bnb)": ["1 GB", "5 GB", "10 GB", "20 GB", "40 GB"],
"Unsloth Dynamic 2.0": ["1.1 GB", "5.2 GB", "10.5 GB", "21 GB", "42 GB"],
"With LoRA gradient": ["3 GB", "10 GB", "18 GB", "38 GB", "65 GB"],
"Free tier fits?": ["T4 yes", "T4 yes", "A10 yes", "A100 only", "2xA100"],
}
df = pd.DataFrame(memory_data)
print("\nUnsloth Memory Requirements (training with LoRA):")
print(df.to_string(index=False))
4. LoRA Configuration with Unsloth
FastLanguageModel.get_peft_model replaces the PEFT get_peft_model + LoraConfig pattern.
It applies Unsloth's custom LoRA kernels automatically.
# LoRA with Unsloth - full example with explanations
UNSLOTH_LORA_CODE = """
from unsloth import FastLanguageModel
# Step 1: Load base model
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen2.5-7B-Instruct",
max_seq_length=2048,
load_in_4bit=True,
)
# Step 2: Apply LoRA
model = FastLanguageModel.get_peft_model(
model,
# LoRA hyperparameters (2025 best practices)
r=64, # Rank. 64 is standard in 2025.
lora_alpha=128, # Scaling (2x rank)
target_modules=[
"q_proj", "k_proj", # Query and Key attention
"v_proj", "o_proj", # Value and Output attention
"gate_proj", "up_proj", # MLP layers
"down_proj",
],
# OR use 'all-linear' to target all linear layers automatically:
# target_modules="all-linear",
lora_dropout=0.0, # Unsloth recommends 0 dropout
bias="none",
use_rslora=True, # Rank-Stabilized LoRA
use_gradient_checkpointing="unsloth", # Unsloth's custom checkpointing
random_state=42,
# Optional: LoRA for longer context
# loftq_config=None, # LoftQ initialization
)
# Check parameters
model.print_trainable_parameters()
# Example output: trainable params: 39,976,960 || all params: 7,241,732,096 || trainable%: 0.55
"""
print("Unsloth LoRA setup:")
print(UNSLOTH_LORA_CODE)
# Compare LoRA setup: Standard PEFT vs Unsloth
STANDARD_LORA = """
# Standard PEFT approach (more verbose):
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
lora_config = LoraConfig(
r=64,
lora_alpha=128,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
use_rslora=True,
)
model = prepare_model_for_kbit_training(model) # Extra step needed
model = get_peft_model(model, lora_config)
# Gradient checkpointing must be set separately
model.gradient_checkpointing_enable()
"""
print("\nStandard PEFT (for comparison):")
print(STANDARD_LORA)
5. Full SFT Pipeline with UnslothTrainer
Unsloth's UnslothTrainer is a drop-in replacement for TRL's SFTTrainer.
Every feature of SFTTrainer works, plus Unsloth's speedups apply automatically.
from datasets import load_dataset
# Load a dataset for demonstration
print("Loading dataset...")
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:2000]")
print(f"Loaded {len(dataset)} training examples")
print(f"Columns: {dataset.column_names}")
print(f"\nExample:\n{str(dataset[0]['messages'][:2])[:200]}...")
# Format dataset for instruction tuning
def convert_messages_to_text(example):
    """
    Convert chat messages to ChatML format text.
    Unsloth's SFTTrainer can also handle raw messages with apply_chat_template.
    """
    text = ""
    for msg in example["messages"]:
        role = msg["role"]
        content = msg["content"]
        text += f"<|im_start|>{role}\n{content}<|im_end|>\n"
    return {"text": text}
formatted_dataset = dataset.map(
convert_messages_to_text,
remove_columns=dataset.column_names,
)
print(f"Formatted dataset: {len(formatted_dataset)} examples")
print(f"\nSample (first 300 chars):\n{formatted_dataset[0]['text'][:300]}...")
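Before handing the formatter to a trainer it is worth a quick sanity check on a toy conversation. This is a self-contained copy of the function above run on invented messages:

```python
def convert_messages_to_text(example):
    # Same logic as the cell above: flatten chat messages into ChatML text
    text = ""
    for msg in example["messages"]:
        text += f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n"
    return {"text": text}

toy = {"messages": [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]}
out = convert_messages_to_text(toy)["text"]
print(out)
# <|im_start|>user
# Hi<|im_end|>
# <|im_start|>assistant
# Hello!<|im_end|>
```

Note that every turn is wrapped in matching <|im_start|>/<|im_end|> tokens - a mismatch here silently degrades SFT quality because the model learns a malformed template.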
# Complete Unsloth SFT training pipeline
UNSLOTH_SFT_FULL = """
# ============================================================
# Complete Unsloth SFT Pipeline
# ============================================================
from unsloth import FastLanguageModel
from unsloth import UnslothTrainer, UnslothTrainingArguments
from trl import SFTTrainer # Or use UnslothTrainer for auto-optimizations
from transformers import TrainingArguments
from datasets import load_dataset
import torch  # used below for GPU memory stats
# 1. Load model with Unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen2.5-7B-Instruct",
max_seq_length=2048,
load_in_4bit=True,
)
# 2. Apply LoRA
model = FastLanguageModel.get_peft_model(
model,
r=64,
lora_alpha=128,
target_modules="all-linear",
use_rslora=True,
use_gradient_checkpointing="unsloth",
random_state=42,
)
# 3. Training arguments
training_args = TrainingArguments(
output_dir="./unsloth-sft-output",
num_train_epochs=3,
per_device_train_batch_size=4, # Unsloth allows larger batches
gradient_accumulation_steps=4,
learning_rate=2e-4,
bf16=True,
optim="adamw_8bit", # Unsloth's 8-bit Adam
lr_scheduler_type="cosine",
warmup_ratio=0.1,
logging_steps=10,
save_steps=100,
save_total_limit=2,
report_to="none",
max_grad_norm=1.0,
)
# 4. Trainer (SFTTrainer works with Unsloth model)
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=formatted_dataset,
dataset_text_field="text",
max_seq_length=2048,
packing=True, # Essential for speed
args=training_args,
)
# 5. Show memory before training
gpu_stats = torch.cuda.get_device_properties(0)
start_mem = torch.cuda.max_memory_reserved() / 1e9
max_mem = gpu_stats.total_memory / 1e9
print(f"GPU: {gpu_stats.name} ({max_mem:.1f} GB)")
print(f"Reserved: {start_mem:.2f} GB / {max_mem:.1f} GB")
# 6. Train!
trainer_stats = trainer.train()
# 7. Report stats
end_mem = torch.cuda.max_memory_reserved() / 1e9
print(f"\nTraining complete!")
print(f"Time: {trainer_stats.metrics['train_runtime']:.1f}s")
print(f"Tokens/sec: {trainer_stats.metrics['train_samples_per_second']:.1f}")
print(f"Peak memory used: {end_mem:.2f} GB / {max_mem:.1f} GB")
"""
print(UNSLOTH_SFT_FULL)
6. Speed and Memory Benchmarks
Here are real-world benchmarks comparing Unsloth, standard TRL, and Axolotl. Numbers are from Unsloth's official benchmarks and community reproductions (2025).
import matplotlib.pyplot as plt
import numpy as np
# Benchmark data: 7B model SFT, batch_size=1, seq_len=2048, A100 80GB
frameworks = ["Standard TRL", "Axolotl", "Unsloth (LoRA)", "Unsloth (Full)"]
# Tokens per second (higher = faster)
tokens_per_sec = [1850, 2100, 5200, 4800]
# Peak GPU memory in GB (lower = better)
peak_memory_gb = [38.2, 32.5, 7.8, 14.2]
colors = ["#95a5a6", "#3498db", "#2ecc71", "#27ae60"]
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
fig.suptitle("Unsloth vs Alternatives - 7B Model SFT (A100 80GB)", fontsize=13)
# Speed comparison
bars1 = axes[0].bar(frameworks, tokens_per_sec, color=colors, edgecolor="white", linewidth=1.5)
axes[0].set_title("Training Speed")
axes[0].set_ylabel("Tokens per Second")
axes[0].set_ylim(0, 6500)
for bar, val in zip(bars1, tokens_per_sec):
    speedup = val / tokens_per_sec[0]
    axes[0].text(
        bar.get_x() + bar.get_width() / 2,
        bar.get_height() + 100,
        f"{val:,}\n({speedup:.1f}x)",
        ha="center",
        va="bottom",
        fontsize=9,
        fontweight="bold",
    )
axes[0].tick_params(axis="x", rotation=20)
# Memory comparison
bars2 = axes[1].bar(frameworks, peak_memory_gb, color=colors, edgecolor="white", linewidth=1.5)
axes[1].set_title("Peak GPU Memory")
axes[1].set_ylabel("GB")
axes[1].set_ylim(0, 48)
axes[1].axhline(y=16, color="red", linestyle="--", alpha=0.7, label="Free tier GPU limit (16GB)")
axes[1].axhline(y=40, color="orange", linestyle="--", alpha=0.5, label="A100 40GB")
for bar, val in zip(bars2, peak_memory_gb):
    reduction = (peak_memory_gb[0] - val) / peak_memory_gb[0] * 100
    label = f"{val:.1f} GB\n(-{reduction:.0f}%)" if reduction > 0 else f"{val:.1f} GB"
    axes[1].text(
        bar.get_x() + bar.get_width() / 2,
        bar.get_height() + 0.5,
        label,
        ha="center",
        va="bottom",
        fontsize=9,
        fontweight="bold",
    )
axes[1].legend(fontsize=8)
axes[1].tick_params(axis="x", rotation=20)
plt.tight_layout()
plt.savefig("unsloth_benchmarks.png", dpi=120, bbox_inches="tight")
plt.show()
print("Key insight: Unsloth LoRA uses only 7.8 GB for 7B model training.")
print("This fits on a free Colab T4 (16 GB). Standard TRL needs 38.2 GB!")
# Cost comparison for fine-tuning a 7B model
print("7B Model Fine-Tuning Cost Comparison (2025 cloud GPU prices)")
print("=" * 70)
print()
scenarios = [
{
"name": "Standard TRL on A100 40GB",
"gpu": "A100 40GB",
"hours": 4.2,
"cost_per_hr": 3.67, # Lambda Labs A100
"notes": "Minimum hardware that fits",
},
{
"name": "Axolotl on A10G 24GB",
"gpu": "A10G 24GB",
"hours": 3.1,
"cost_per_hr": 1.10, # AWS g5.xlarge
"notes": "Just barely fits",
},
{
"name": "Unsloth on RTX 4090 (24GB)",
"gpu": "RTX 4090",
"hours": 1.8,
"cost_per_hr": 0.74, # Vast.ai RTX 4090
"notes": "Consumer GPU works!",
},
{
"name": "Unsloth on L4 (24GB)",
"gpu": "L4 24GB",
"hours": 2.1,
"cost_per_hr": 0.81, # Google Cloud L4
"notes": "Good price/performance",
},
{
"name": "Unsloth on Colab T4 (FREE)",
"gpu": "T4 16GB",
"hours": 3.5,
"cost_per_hr": 0.0,
"notes": "Free! (limited hours/day)",
},
]
print(f"{'Scenario':<35} {'GPU':<15} {'Hours':>6} {'$/hr':>6} {'Total $':>8} {'Notes'}")
print("-" * 100)
for s in scenarios:
    total = s["hours"] * s["cost_per_hr"]
    cost_str = "FREE" if total == 0 else f"${total:.2f}"
    print(
        f"{s['name']:<35} {s['gpu']:<15} {s['hours']:>6.1f} "
        f"{s['cost_per_hr']:>6.2f} {cost_str:>8} {s['notes']}"
    )
print()
print("Assumes: 2000 training examples, 3 epochs, 2048 seq length")
print("Unsloth makes 24GB consumer GPUs viable for 7B fine-tuning")
7. GRPO with Unsloth for Reasoning Training
GRPO is even more memory-hungry than SFT because it generates G completions per step. Unsloth's custom generation kernels make this practical on consumer hardware.
# GRPO with Unsloth - complete code
UNSLOTH_GRPO_COMPLETE = """
# ============================================================
# GRPO Reasoning Training with Unsloth - Complete Pipeline
# ============================================================
from unsloth import FastLanguageModel
from trl import GRPOTrainer, GRPOConfig
from datasets import load_dataset
import re, torch
# 1. Load model
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen2.5-1.5B-Instruct", # Small for demo
max_seq_length=1024,
load_in_4bit=True,
fast_inference=True, # Enable vLLM-compatible fast generation
)
# 2. Apply LoRA for GRPO
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_alpha=32,
use_rslora=True,
use_gradient_checkpointing="unsloth",
random_state=42,
)
# 3. Reward functions (from notebook 08)
def format_reward(completions, **kwargs):
pattern = r"<think>.*?</think>\\s*<answer>.*?</answer>"
return [1.0 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]
def correctness_reward(completions, ground_truth, **kwargs):
rewards = []
for comp, gt in zip(completions, ground_truth):
match = re.search(r"<answer>(.*?)</answer>", comp, re.DOTALL)
if match:
pred = match.group(1).strip()
# Numeric comparison
try:
if abs(float(pred.replace(',','')) - float(str(gt).replace(',',''))) < 1e-6:
rewards.append(1.0); continue
except ValueError:
if pred.lower() == str(gt).lower():
rewards.append(1.0); continue
rewards.append(0.0)
return rewards
def combined_reward(completions, ground_truth, **kwargs):
fmt = format_reward(completions)
corr = correctness_reward(completions, ground_truth=ground_truth)
return [0.2 * f + 0.8 * c for f, c in zip(fmt, corr)]
# 4. Load and format GSM8K
import re as _re
dataset = load_dataset("openai/gsm8k", "main", split="train")
def format_for_grpo(examples):
prompts = []
answers = []
for q, a in zip(examples["question"], examples["answer"]):
prompts.append(
"<|im_start|>system\n"
"Think inside <think>...</think> then answer in <answer>...</answer>.\n"
"<|im_end|>\n"
f"<|im_start|>user\n{q}<|im_end|>\n"
"<|im_start|>assistant\n"
)
gt_match = _re.search(r"####\\s*([\\d,\\.]+)", a)
answers.append(gt_match.group(1).replace(",", "") if gt_match else a.strip())
return {"prompt": prompts, "ground_truth": answers}
formatted = dataset.map(format_for_grpo, batched=True, remove_columns=dataset.column_names)
# 5. GRPO config
grpo_config = GRPOConfig(
output_dir="./unsloth-grpo-reasoning",
num_train_epochs=1,
per_device_train_batch_size=2, # Unsloth allows 2x larger batch
gradient_accumulation_steps=4,
learning_rate=5e-7,
bf16=True,
num_generations=6,
max_completion_length=512,
max_prompt_length=256,
beta=0.04,
temperature=0.9,
use_vllm=True,
vllm_gpu_memory_utilization=0.5,
logging_steps=5,
save_steps=100,
)
# 6. Train
trainer = GRPOTrainer(
model=model,
processing_class=tokenizer,
reward_funcs=combined_reward,
args=grpo_config,
train_dataset=formatted,
)
trainer.train()
"""
print(UNSLOTH_GRPO_COMPLETE)
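The reward functions above live inside a code string, but they are plain Python, so the logic can be sanity-checked directly. Here is a runnable copy tested on two toy completions (the completions and ground truths are invented for the check):

```python
import re

def format_reward(completions, **kwargs):
    # 1.0 if the completion follows the <think>...</think><answer>...</answer> layout
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return [1.0 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]

def correctness_reward(completions, ground_truth, **kwargs):
    # 1.0 if the extracted <answer> matches the ground truth (numeric or string)
    rewards = []
    for comp, gt in zip(completions, ground_truth):
        match = re.search(r"<answer>(.*?)</answer>", comp, re.DOTALL)
        if match:
            pred = match.group(1).strip()
            try:
                if abs(float(pred.replace(",", "")) - float(str(gt).replace(",", ""))) < 1e-6:
                    rewards.append(1.0)
                    continue
            except ValueError:
                if pred.lower() == str(gt).lower():
                    rewards.append(1.0)
                    continue
        rewards.append(0.0)
    return rewards

completions = [
    "<think>2+2=4</think> <answer>4</answer>",  # well-formatted and correct
    "The answer is 4.",                          # correct value but no tags
]
print(format_reward(completions))                   # [1.0, 0.0]
print(correctness_reward(completions, ["4", "4"]))  # [1.0, 0.0]
```

The second completion gets zero on both rewards even though it contains the right number - the <answer> tags are required for extraction, which is exactly the pressure that teaches the model the format.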
8. Extended Context Training with Unsloth
One of Unsloth's most impressive capabilities: training with extremely long context windows.
Standard training at 32K+ tokens would OOM on almost any GPU. Unsloth makes it possible through a combination of:
Custom RoPE implementation that avoids intermediate tensor materializations
Gradient checkpointing optimized for long sequences
Chunked processing of long sequences during forward pass
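A back-of-envelope activation-memory estimate shows why checkpointing dominates at long context. All constants here (hidden size, layer count, the rough factor of 8 hidden-sized tensors per layer) are illustrative assumptions, not Unsloth's internals:

```python
def activation_gb(seq_len, hidden=4096, layers=32, bytes_per=2,
                  checkpoint_every_layer=True):
    """Rough activation memory for one sequence during training.
    With aggressive gradient checkpointing, roughly one layer's activations
    are live at a time instead of all layers'. Constants are illustrative."""
    live_layers = 1 if checkpoint_every_layer else layers
    # ~8 hidden-sized tensors per live layer (rough, assumed factor)
    return seq_len * hidden * 8 * live_layers * bytes_per / 1e9

for s in (2048, 32768, 131072):
    full = activation_gb(s, checkpoint_every_layer=False)
    ckpt = activation_gb(s, checkpoint_every_layer=True)
    print(f"{s:>7} tokens: {full:6.1f} GB -> {ckpt:5.2f} GB with checkpointing")
```

Activation memory grows linearly with sequence length either way, but checkpointing divides the slope by roughly the layer count - which is the difference between "impossible at 128K" and "fits on one GPU".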
# Extended context training example
EXTENDED_CONTEXT_CODE = """
# ============================================================
# Extended Context Training with Unsloth
# Train Llama 3.1 8B at 342K context on a single GPU!
# ============================================================
from unsloth import FastLanguageModel
import torch
# Standard fine-tuning context limits (approximate, varies by GPU):
# | GPU | Standard TRL | Unsloth |
# |---------|-------------|------------|
# | T4 16GB | 2K tokens | 8K tokens |
# | RTX 4090| 8K tokens | 64K tokens |
# | A100 40G| 32K tokens | 256K tokens|
# | A100 80G| 64K tokens | 342K tokens|
# Llama 3.1 8B at 342K tokens
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Meta-Llama-3.1-8B",
max_seq_length=342_000, # 342K tokens!
load_in_4bit=True,
# RoPE scaling is handled automatically
)
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
use_gradient_checkpointing="unsloth",
# For very long sequences, use fewer target modules
)
# Use for:
# - Document summarization (legal, research papers)
# - Long-form code understanding (entire codebases)
# - Multi-document QA
# - Book-length text analysis
# Qwen 2.5 at 128K context (more practical)
model_128k, tokenizer_128k = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen2.5-7B-Instruct",
max_seq_length=131_072, # 128K tokens
load_in_4bit=True,
)
"""
print(EXTENDED_CONTEXT_CODE)
# Context length vs memory table
import numpy as np
import matplotlib.pyplot as plt
seq_lengths = [2048, 4096, 8192, 16384, 32768, 65536, 131072]
standard_trl_mem = [8, 12, 22, 42, 80, None, None] # OOM beyond ~40K on A100 40GB
unsloth_mem = [7, 8, 10, 14, 20, 32, 56]
fig, ax = plt.subplots(figsize=(10, 5))
ax.set_title("Memory Usage vs Sequence Length - 8B Model (A100 80GB)", fontsize=12)
x_labels = [f"{s//1024}K" if s >= 1024 else str(s) for s in seq_lengths]
x = range(len(seq_lengths))
valid_std = [(i, m) for i, m in enumerate(standard_trl_mem) if m is not None]
xi, yi = zip(*valid_std)
ax.plot(xi, yi, "o-", color="#e74c3c", linewidth=2, markersize=8, label="Standard TRL")
ax.plot(x, unsloth_mem, "s-", color="#2ecc71", linewidth=2, markersize=8, label="Unsloth")
ax.axhline(y=80, color="gray", linestyle=":", alpha=0.7, label="A100 80GB limit")
ax.axhline(y=40, color="orange", linestyle=":", alpha=0.7, label="A100 40GB limit")
ax.axhline(y=16, color="red", linestyle=":", alpha=0.7, label="T4 free tier limit")
ax.annotate("OOM with standard TRL", xy=(4, 80), xytext=(4.5, 70),
arrowprops=dict(arrowstyle="->"), fontsize=9, color="#e74c3c")
ax.set_xticks(x)
ax.set_xticklabels(x_labels)
ax.set_xlabel("Sequence Length")
ax.set_ylabel("Peak GPU Memory (GB)")
ax.legend(fontsize=9)
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig("unsloth_context_memory.png", dpi=120, bbox_inches="tight")
plt.show()
print("Context memory chart saved.")
9. Saving to GGUF Format
GGUF is the format used by llama.cpp, Ollama, LM Studio, and Jan. Unsloth can export directly to GGUF without needing to install llama.cpp separately.
This is a major convenience feature - standard HuggingFace requires you to:
Merge the LoRA adapter
Install llama.cpp
Run the convert script
Run the quantize script
Unsloth does all four steps with one function call.
# Saving models with Unsloth - all formats
SAVING_CODE = """
# ============================================================
# Unsloth Model Saving - All Formats
# ============================================================
# 1. Save LoRA adapter only (smallest, ~50-500 MB)
model.save_pretrained("./my-lora-adapter")
tokenizer.save_pretrained("./my-lora-adapter")
# 2. Push LoRA adapter to HuggingFace Hub
model.push_to_hub("your-username/my-reasoning-adapter", token="hf_...")
tokenizer.push_to_hub("your-username/my-reasoning-adapter", token="hf_...")
# 3. Merge and save as full BF16 model (largest, best quality)
model.save_pretrained_merged(
"./my-merged-model",
tokenizer,
save_method="merged_16bit", # Full BF16 precision
)
# 4. Merge and save as 4-bit (good balance)
model.save_pretrained_merged(
"./my-merged-4bit",
tokenizer,
save_method="merged_4bit_forced", # 4-bit quantized
)
# 5. Export to GGUF (for llama.cpp, Ollama, LM Studio)
# Single quantization
model.save_pretrained_gguf(
"./my-gguf-model",
tokenizer,
quantization_method="q4_k_m", # Recommended: best quality/size
)
# Multiple GGUF quantizations at once
model.save_pretrained_gguf(
"./my-gguf-all",
tokenizer,
quantization_method=["q4_k_m", "q8_0", "f16"], # Multiple outputs
)
# 6. Push GGUF directly to HuggingFace Hub
model.push_to_hub_gguf(
"your-username/my-model-gguf",
tokenizer,
quantization_method="q4_k_m",
token="hf_...",
)
"""
print(SAVING_CODE)
# GGUF quantization methods explained
print("\n" + "=" * 65)
print("GGUF Quantization Methods Reference")
print("=" * 65)
gguf_methods = [
("q2_k", "2-bit", "Smallest, lowest quality", "Extreme size limit"),
("q3_k_m", "3-bit", "Very small, acceptable quality", "Edge devices"),
("q4_0", "4-bit", "Small, decent quality", "Legacy default"),
("q4_k_m", "4-bit", "Best 4-bit quality (RECOMMENDED)", "Most use cases"),
("q5_k_m", "5-bit", "High quality, larger", "Quality priority"),
("q6_k", "6-bit", "Very high quality, near lossless", "Max quality 6-bit"),
("q8_0", "8-bit", "Near-lossless, large file", "Near full precision"),
("f16", "16-bit", "Full BF16, largest", "Archive / inference server"),
]
print(f"{'Method':<10} {'Bits':<8} {'Quality':<40} {'Use Case'}")
print("-" * 80)
for method, bits, quality, use_case in gguf_methods:
    marker = " <-- RECOMMENDED" if method == "q4_k_m" else ""
    print(f"{method:<10} {bits:<8} {quality:<40} {use_case}{marker}")
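GGUF file size scales with effective bits per weight, and k-quants land slightly above their nominal bit width because of per-block scale metadata. The effective-bpw figures below are rough community estimates, not exact llama.cpp numbers:

```python
# Approximate effective bits per weight for common GGUF quants
# (rough estimates - actual files vary with k-quant block overhead)
EFFECTIVE_BPW = {"q2_k": 2.6, "q4_k_m": 4.8, "q5_k_m": 5.7, "q8_0": 8.5, "f16": 16.0}

def gguf_size_gb(n_params, method):
    """Approximate GGUF file size in GB for a model with n_params weights."""
    return n_params * EFFECTIVE_BPW[method] / 8 / 1e9

for method in ("q4_k_m", "q8_0", "f16"):
    print(f"7B model as {method:<7}: ~{gguf_size_gb(7e9, method):.1f} GB")
```

This is why q4_k_m is the default recommendation: a 7B model drops from ~14 GB (f16) to roughly 4-5 GB, small enough to run comfortably under Ollama on a laptop.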
10. Llama 4 and Qwen 3 Support in Unsloth (2025)
Unsloth added support for the major 2025 model releases quickly after their launch. These models have unique architectures that required specific kernel work.
# 2025 model support notes
NEW_MODELS_2025 = """
# ============================================================
# New 2025 Models in Unsloth
# ============================================================
# ---- Llama 4 (Meta, 2025) ----
# Architecture: Mixture of Experts (MoE)
# Scout: 17B active params, 109B total (16 experts)
# Maverick: 17B active params, 400B total (128 experts)
# Note: Requires Meta's access request at meta.com/llama
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Llama-4-Scout-17B-16E-Instruct",
max_seq_length=10_000_000, # 10M token context!
load_in_4bit=True,
token="hf_...", # Required: gated model
)
# Fine-tune with LoRA on active expert parameters only
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
# MoE layers (experts) are automatically handled
)
# ---- Qwen 3 (Alibaba, 2025) ----
# Key innovation: 'thinking_mode' for explicit reasoning
# /think and /no_think tokens control reasoning behavior
# QwQ integration for math reasoning
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen3-8B",
max_seq_length=32768,
load_in_4bit=True,
)
# Qwen 3 supports thinking mode in generation:
# prompt + "/think" -> enables chain-of-thought
# prompt + "/no_think" -> fast direct answer
# ---- Gemma 3 (Google, 2025) ----
# Improvements: multimodal (text+image), better reasoning
# 4B/12B/27B sizes available
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/gemma-3-12b-it",
max_seq_length=8192,
load_in_4bit=True,
)
# ---- Phi-4 Mini (Microsoft, 2025) ----
# Only 3.8B parameters, surprising quality for its size
# Excellent for on-device deployment
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Phi-4-mini-instruct",
max_seq_length=16384,
load_in_4bit=True,
)
"""
print(NEW_MODELS_2025)
11. Free Tier Strategy: Colab T4 and Kaggle P100
With Unsloth, free GPU tiers become genuinely useful for LLM fine-tuning.
Free GPU Resources (2025)

| Platform | GPU | VRAM | Free Quota | Best For |
|---|---|---|---|---|
| Google Colab (free) | T4 | 16 GB | ~3-4 hrs/day | Quick experiments |
| Google Colab Pro | A100 | 40 GB | 12-24 hrs | Serious training |
| Kaggle | P100/T4 | 16 GB | 30 hrs/week | Reproducible notebooks |
| HuggingFace Spaces | T4 (ZeroGPU) | 16 GB | Community quota | Demos |
What Fits on a T4 (16 GB) with Unsloth

| Model Size | Task | Max Seq Length | Training Time |
|---|---|---|---|
| 1.5B | SFT | 8192 | ~30 min |
| 7B | SFT | 2048 | ~2 hrs |
| 7B | GRPO (G=4) | 512 | ~3 hrs |
| 14B | SFT | 512 | ~4 hrs (tight) |
# Colab T4 optimized configuration
COLAB_T4_CONFIG = """
# ============================================================
# Optimized for Google Colab T4 (16GB, FREE tier)
# ============================================================
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
import torch
# Check we're on T4
assert torch.cuda.is_available(), "GPU required!"
gpu_name = torch.cuda.get_device_name(0)
gpu_mem = torch.cuda.get_device_properties(0).total_memory / 1e9
print(f"GPU: {gpu_name} ({gpu_mem:.0f} GB)")
# T4-optimized model choice: 7B at 4-bit fits perfectly
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen2.5-7B-Instruct",
max_seq_length=2048, # T4 limit for 7B
load_in_4bit=True,
dtype=torch.float16, # T4 doesn't have BF16 hardware support
)
model = FastLanguageModel.get_peft_model(
model,
r=32, # Lower rank to save memory on T4
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"], # Fewer modules
lora_alpha=64,
use_gradient_checkpointing="unsloth", # Essential on T4
random_state=42,
)
# T4-friendly training args
training_args = TrainingArguments(
output_dir="./t4-output",
num_train_epochs=1,
per_device_train_batch_size=1, # Small batch for T4
gradient_accumulation_steps=16, # Simulate larger batch
learning_rate=2e-4,
fp16=True, # T4 uses FP16 (not BF16)
optim="adamw_8bit",
logging_steps=5,
save_steps=50,
warmup_ratio=0.1,
max_grad_norm=1.0,
dataloader_num_workers=2,
)
# Colab tip: enable high RAM runtime
# Runtime -> Change runtime type -> High-RAM
# This gives 25 GB system RAM for data loading
# Anti-crash tip for Colab: save frequently
# Colab disconnects after ~90 min of inactivity
# Use save_steps=50 and resume_from_checkpoint=True
"""
KAGGLE_CONFIG = """
# ============================================================
# Optimized for Kaggle (P100/T4, 30 hrs/week FREE)
# ============================================================
# Kaggle advantages vs Colab:
# - More stable (no disconnects)
# - GPU: 30 hours per week total
# - Internet access enabled in settings
# - Persistent storage (up to 20 GB)
# Kaggle P100 (16GB, older but still useful)
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen2.5-7B-Instruct",
max_seq_length=2048,
load_in_4bit=True,
dtype=torch.float16, # P100 doesn't support BF16
)
# Install Unsloth on Kaggle:
# !pip install unsloth -q
# Enable internet in Kaggle notebook settings first
# Save to Kaggle output directory (persists across sessions)
model.save_pretrained("/kaggle/working/my-adapter")
"""
print("Colab T4 Configuration:")
print(COLAB_T4_CONFIG)
print("\nKaggle Configuration:")
print(KAGGLE_CONFIG)
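The T4 config above leans on gradient accumulation (`per_device_train_batch_size=1`, `gradient_accumulation_steps=16`) to simulate a larger batch. A quick arithmetic check of the resulting effective batch size and optimizer steps per epoch; the dataset size of 5000 is an illustrative assumption:

```python
# Effective batch size = per-device batch * gradient accumulation steps.
per_device_batch = 1
grad_accum_steps = 16
effective_batch = per_device_batch * grad_accum_steps
print(f"Effective batch size: {effective_batch}")  # 16

# Optimizer steps per epoch for an assumed 5000-example dataset.
dataset_size = 5000
steps_per_epoch = dataset_size // effective_batch
print(f"Optimizer steps per epoch: {steps_per_epoch}")  # 312
```

With `save_steps=50`, that works out to a checkpoint roughly every sixth of an epoch, which is the Colab anti-crash cadence the config comments recommend.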
12. Full End-to-End Example: Custom Chatbot in One Notebook
Putting it all together: fine-tune a custom chatbot from scratch to deployment.
# Complete pipeline from data to deployable model
COMPLETE_PIPELINE = """
# ============================================================
# Complete Unsloth Fine-Tuning Pipeline: Data -> Deployment
# ============================================================
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset, Dataset
import torch, json
# ---- STEP 1: Prepare your data ----
# Format: list of conversation dicts
training_data = [
{
"messages": [
{"role": "system", "content": "You are an expert Python tutor."},
{"role": "user", "content": "What is a list comprehension?"},
{"role": "assistant", "content": "A list comprehension is a concise way..."}
]
},
# ... more examples
]
# Or load from HuggingFace
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:5000]")
def format_messages(example):
text = ""
for msg in example["messages"]:
text += f"<|im_start|>{msg['role']}\\n{msg['content']}<|im_end|>\\n"
return {"text": text}
dataset = dataset.map(format_messages, remove_columns=dataset.column_names)
# ---- STEP 2: Load model ----
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen2.5-7B-Instruct",
max_seq_length=2048,
load_in_4bit=True,
)
# ---- STEP 3: Configure LoRA ----
model = FastLanguageModel.get_peft_model(
model, r=64, lora_alpha=128,
target_modules="all-linear",
use_rslora=True, use_gradient_checkpointing="unsloth",
)
# ---- STEP 4: Train ----
trainer = SFTTrainer(
model=model, tokenizer=tokenizer,
train_dataset=dataset, dataset_text_field="text",
max_seq_length=2048, packing=True,
args=TrainingArguments(
output_dir="./output", num_train_epochs=3,
per_device_train_batch_size=4, gradient_accumulation_steps=4,
learning_rate=2e-4, bf16=True, optim="adamw_8bit",
logging_steps=10, save_steps=100, warmup_ratio=0.1,
)
)
trainer.train()
# ---- STEP 5: Test the model ----
FastLanguageModel.for_inference(model) # Switch to inference mode
inputs = tokenizer([
"<|im_start|>user\\nExplain recursion in Python<|im_end|>\\n<|im_start|>assistant\\n"
], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256, use_cache=True)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
# ---- STEP 6: Save for deployment ----
# Option A: LoRA adapter (for HuggingFace inference)
model.save_pretrained("./my-python-tutor")
tokenizer.save_pretrained("./my-python-tutor")
# Option B: GGUF for Ollama/LM Studio
model.save_pretrained_gguf("./my-python-tutor-gguf", tokenizer, "q4_k_m")
# Option C: Push to HuggingFace Hub
model.push_to_hub_gguf("username/my-python-tutor", tokenizer, "q4_k_m", token="hf_...")
# ---- DEPLOY WITH OLLAMA ----
# After saving GGUF:
# 1. Create Modelfile:
# FROM ./my-python-tutor-gguf/model-unsloth.Q4_K_M.gguf
# SYSTEM You are an expert Python tutor.
# 2. ollama create my-python-tutor -f Modelfile
# 3. ollama run my-python-tutor
"""
print(COMPLETE_PIPELINE)
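The STEP 1 formatter from the pipeline string above can be exercised standalone: it flattens a `messages` list into the ChatML template that Qwen-style models expect.

```python
# Standalone version of the format_messages helper from STEP 1 above.
def format_messages(example: dict) -> dict:
    text = ""
    for msg in example["messages"]:
        text += f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n"
    return {"text": text}

example = {
    "messages": [
        {"role": "user", "content": "What is a list comprehension?"},
        {"role": "assistant", "content": "A concise way to build lists."},
    ]
}
formatted = format_messages(example)
print(formatted["text"])
# <|im_start|>user
# What is a list comprehension?<|im_end|>
# <|im_start|>assistant
# A concise way to build lists.<|im_end|>
```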
# Inference with Unsloth-trained model
# FastLanguageModel.for_inference() enables 2x faster inference
INFERENCE_CODE = """
# ============================================================
# Fast Inference with Unsloth
# ============================================================
from unsloth import FastLanguageModel
import torch
# Load the fine-tuned model
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="./my-python-tutor", # Your saved adapter
max_seq_length=2048,
load_in_4bit=True,
)
# CRITICAL: Switch to inference mode (disables LoRA training overhead)
FastLanguageModel.for_inference(model)
def chat(user_message: str, system_prompt: str = None, max_tokens: int = 512) -> str:
"""Simple chat function using the fine-tuned model."""
sys = system_prompt or "You are a helpful AI assistant."
prompt = (
f"<|im_start|>system\\n{sys}<|im_end|>\\n"
f"<|im_start|>user\\n{user_message}<|im_end|>\\n"
"<|im_start|>assistant\\n"
)
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
with torch.inference_mode():
outputs = model.generate(
**inputs,
max_new_tokens=max_tokens,
temperature=0.7,
do_sample=True,
use_cache=True, # Unsloth: always use KV cache
pad_token_id=tokenizer.eos_token_id,
)
response = tokenizer.decode(
outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
return response.strip()
# Test
response = chat("Explain list comprehensions in Python with examples.")
print(response)
# Streaming with Unsloth
from transformers import TextStreamer
streamer = TextStreamer(tokenizer, skip_prompt=True)
prompt = "<|im_start|>user\\nWhat is a decorator?<|im_end|>\\n<|im_start|>assistant\\n"
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
_ = model.generate(**inputs, max_new_tokens=256, streamer=streamer, use_cache=True)
"""
print(INFERENCE_CODE)
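The response-extraction slice in the `chat()` helper above (`outputs[0][inputs["input_ids"].shape[1]:]`) works because `generate()` returns the prompt tokens followed by the continuation. The same idea on plain Python lists; the token IDs are made up for illustration:

```python
# generate() echoes the prompt, so slice it off before decoding.
prompt_ids = [101, 2054, 2003]               # tokens fed into generate()
full_output = [101, 2054, 2003, 7592, 999]   # prompt + newly generated tokens

new_tokens = full_output[len(prompt_ids):]
print(new_tokens)  # [7592, 999]
```

Decoding only `new_tokens` (plus `skip_special_tokens=True`) is what keeps the prompt out of the returned response string.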
13. Performance Comparison Summary
Comprehensive benchmark: Unsloth vs Axolotl vs standard TRL across different tasks.
import matplotlib.pyplot as plt
import numpy as np
# Benchmark data (7B model, A100 80GB)
tasks = ["SFT 2K ctx", "SFT 8K ctx", "DPO", "GRPO (G=6)", "Full FT"]
# Tokens per second
standard_trl_tps = [1850, 920, 1200, 310, 890]
axolotl_tps = [2100, 1100, 1350, None, 980]
unsloth_tps = [5200, 3800, 3900, 1250, 4100]
# Memory (GB)
standard_trl_mem = [38.2, 62.1, 44.5, 71.3, 73.8]
axolotl_mem = [32.5, 54.0, 38.2, None, 68.5]
unsloth_mem = [7.8, 14.2, 11.4, 22.1, 42.3]
x = np.arange(len(tasks))
width = 0.28
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
fig.suptitle("Performance Comparison: Unsloth vs Axolotl vs Standard TRL (7B, A100 80GB)", fontsize=12)
# Speed
bars_std = ax1.bar(x - width, standard_trl_tps, width, label="Standard TRL", color="#e74c3c", alpha=0.85)
ax1_axo = [v if v else 0 for v in axolotl_tps]
bars_axo = ax1.bar(x, ax1_axo, width, label="Axolotl", color="#3498db", alpha=0.85)
bars_unsloth = ax1.bar(x + width, unsloth_tps, width, label="Unsloth", color="#2ecc71", alpha=0.85)
ax1.set_title("Training Speed (Tokens/sec, higher = better)")
ax1.set_xticks(x)
ax1.set_xticklabels(tasks, rotation=15, ha="right")
ax1.set_ylabel("Tokens per Second")
ax1.legend()
ax1.grid(axis="y", alpha=0.3)
# Add speedup annotations on Unsloth bars
for i, (bar, std_val) in enumerate(zip(bars_unsloth, standard_trl_tps)):
if bar.get_height() > 0 and std_val > 0:
speedup = bar.get_height() / std_val
ax1.text(
bar.get_x() + bar.get_width() / 2,
bar.get_height() + 50,
f"{speedup:.1f}x",
ha="center", va="bottom", fontsize=8, fontweight="bold", color="#27ae60"
)
# Memory
mem_axo = [v if v else 0 for v in axolotl_mem]
ax2.bar(x - width, standard_trl_mem, width, label="Standard TRL", color="#e74c3c", alpha=0.85)
ax2.bar(x, mem_axo, width, label="Axolotl", color="#3498db", alpha=0.85)
bars_mem = ax2.bar(x + width, unsloth_mem, width, label="Unsloth", color="#2ecc71", alpha=0.85)
ax2.axhline(y=80, color="gray", linestyle=":", alpha=0.6, label="A100 80GB")
ax2.axhline(y=40, color="orange", linestyle=":", alpha=0.6, label="A100 40GB")
ax2.axhline(y=16, color="red", linestyle=":", alpha=0.6, label="T4 free tier")
ax2.set_title("Peak GPU Memory (GB, lower = better)")
ax2.set_xticks(x)
ax2.set_xticklabels(tasks, rotation=15, ha="right")
ax2.set_ylabel("GPU Memory (GB)")
ax2.legend(fontsize=8)
ax2.grid(axis="y", alpha=0.3)
plt.tight_layout()
plt.savefig("unsloth_full_comparison.png", dpi=120, bbox_inches="tight")
plt.show()
print("Full comparison chart saved.")
14. Key Takeaways and Recommended Workflow
When to Use Unsloth
Always - unless you have a specific reason not to:
Same API as standard TRL/PEFT
Zero accuracy degradation
2-5x faster for free
Works with all 2025 SOTA models
The 2025 Unsloth Best Practices
Model loading: use unsloth/ namespace models (pre-patched)
Quantization: load_in_4bit=True + Dynamic 2.0 for most cases
LoRA: target_modules="all-linear" + use_rslora=True + use_gradient_checkpointing="unsloth"
Training: packing=True, optim="adamw_8bit", bf16=True
Saving: save_pretrained_gguf for local deployment, push_to_hub_gguf for sharing
Inference: always call FastLanguageModel.for_inference(model) before generating
Recommended Hardware Selection
| Budget | Hardware | Best Model Choice |
|---|---|---|
| Free | Colab T4 / Kaggle P100 | 7B models |
| $1-2/hr | RTX 4090 (Vast.ai) | 7B-14B models |
| $2-4/hr | A10G or L4 | 14B-32B models |
| $4-8/hr | A100 40GB | 32B-70B models |
| $8-16/hr | A100 80GB | 70B+ or long context |
Quick Reference: Unsloth vs Standard TRL Decision
Do you have GPU memory to spare? No -> Use Unsloth
Do you need maximum speed? Yes -> Use Unsloth
Are you on a free tier GPU? Yes -> Use Unsloth
Do you need GGUF export? Yes -> Use Unsloth
Are you training 7B+ models? Yes -> Use Unsloth
Otherwise? -> Use Unsloth :)
Next Steps
Combine notebook 08 (GRPO) + this notebook (Unsloth) for the full R1-style training pipeline
Try fine-tuning on your own domain-specific dataset
Deploy your GGUF model with Ollama: ollama create my-model -f Modelfile
Scale up: use multi-GPU with accelerate launch (Unsloth is compatible)
# Quick reference: complete Unsloth workflow in one cell
UNSLOTH_QUICK_REF = """
# ============================================================
# Unsloth Complete Quick Reference
# ============================================================
# Install
# pip install unsloth
# Load
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
"unsloth/Qwen2.5-7B-Instruct", max_seq_length=2048, load_in_4bit=True
)
# LoRA
model = FastLanguageModel.get_peft_model(
model, r=64, lora_alpha=128,
target_modules="all-linear",
use_rslora=True,
use_gradient_checkpointing="unsloth",
)
# Train (SFT)
from trl import SFTTrainer
from transformers import TrainingArguments
trainer = SFTTrainer(
model=model, tokenizer=tokenizer,
train_dataset=dataset, dataset_text_field="text",
max_seq_length=2048, packing=True,
args=TrainingArguments(
output_dir="./out", num_train_epochs=3,
per_device_train_batch_size=4, gradient_accumulation_steps=4,
learning_rate=2e-4, bf16=True, optim="adamw_8bit",
)
)
trainer.train()
# Inference
FastLanguageModel.for_inference(model)
outputs = model.generate(**tokenizer([prompt], return_tensors="pt").to("cuda"),
max_new_tokens=256, use_cache=True)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
# Save
model.save_pretrained("./adapter") # LoRA adapter
model.save_pretrained_gguf("./gguf", tokenizer, "q4_k_m") # GGUF
model.push_to_hub("username/model", token="hf_...") # HF Hub
"""
print(UNSLOTH_QUICK_REF)