Unsloth - 2x-5x Faster Fine-Tuning with 80% Less VRAM (2025)
Unsloth is the fastest open-source LLM fine-tuning library in 2025. It achieves its speedups through hand-written custom CUDA/Triton kernels - not approximations - so you get the same model quality as standard TRL/PEFT, just much faster and cheaper.
What You Will Learn
Why Unsloth is faster (custom kernels, not tricks)
Loading any 2025 model (Qwen 3, Llama 4, Phi-4, Gemma 3) with Unsloth
Dynamic 4-bit quantization: Unsloth Dynamic 2.0
Full SFT pipeline with UnslothTrainer
GRPO with Unsloth for 5x faster reasoning training
Extended context: 342K token training on Llama 3.1 8B
Saving to GGUF for Ollama/llama.cpp deployment
Memory and speed benchmarks vs standard TRL
Free tier strategies: Colab T4 and Kaggle P100
Prerequisites
Completed notebooks 02-05 (SFT, LoRA, QLoRA, DPO)
Notebook 08 (GRPO) recommended for the GRPO section
GPU with 8-16 GB VRAM (or free Colab/Kaggle)
# Install Unsloth (2025 - supports CUDA 12.x)
# !pip install unsloth
#
# For specific CUDA versions:
# !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git" # Colab
# !pip install unsloth[cu121] # CUDA 12.1
# !pip install unsloth[cu124] # CUDA 12.4+
#
# Dependencies (usually auto-installed):
# !pip install xformers "trl>=0.12.0" "peft>=0.13.0" "accelerate>=1.2.0"
# (quote the version specifiers - an unquoted > is a shell redirect)
import torch
import sys

print(f"Python: {sys.version.split()[0]}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_mem = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU: {gpu_name}")
    print(f"VRAM: {gpu_mem:.1f} GB")

    # Check if Unsloth supports this GPU
    if "T4" in gpu_name:
        print("Free tier GPU detected (Colab T4). Unsloth will use 16GB efficiently.")
    elif "P100" in gpu_name:
        print("Free tier GPU detected (Kaggle P100). Unsloth works great here.")
    elif "A100" in gpu_name or "A10" in gpu_name:
        print("Production GPU. Unsloth will provide maximum speedup.")
    elif "4090" in gpu_name or "3090" in gpu_name:
        print("Consumer flagship GPU. Unsloth excels here.")
    else:
        print("GPU detected. Unsloth supports most NVIDIA GPUs.")
else:
    print("No GPU detected. Unsloth requires a CUDA GPU.")

# Try importing Unsloth
try:
    from unsloth import FastLanguageModel
    print("\nUnsloth imported successfully!")
except ImportError:
    print("\nUnsloth not installed. Run: pip install unsloth")
    print("Showing code examples that will work once installed.")
1. What Is Unsloth and Why It Matters
The Standard Fine-Tuning Bottleneck
Standard QLoRA (bitsandbytes + PEFT + TRL) is great but has overhead:
HuggingFace attention uses generic PyTorch - not optimized for your exact hardware
Gradient checkpointing has Python-level overhead per operation
The LoRA matrix math uses standard cuBLAS calls
How Unsloth Is Different
Unsloth rewrites the critical path in custom CUDA/Triton kernels:
Custom attention kernel: RoPE, attention computation, and softmax fused into one kernel
Fused LoRA operations: The A/B matrix multiplications and addition are fused
Custom gradient checkpointing: Python overhead removed, pure CUDA
Smarter memory management: Avoids unnecessary tensor copies during backprop
This is not approximation. It is the exact same math, done more efficiently.
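The "same math, faster" claim is easy to verify for the LoRA part. A fused kernel computes the factored form below instead of materializing the full-rank weight update; the two are algebraically identical. This is a toy NumPy sketch with made-up dimensions, not Unsloth's actual kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8                      # hidden size, LoRA rank (toy values)
W = rng.standard_normal((d, d))   # frozen base weight
A = rng.standard_normal((r, d))   # LoRA down-projection
B = rng.standard_normal((d, r))   # LoRA up-projection
x = rng.standard_normal(d)
scale = 2.0                       # lora_alpha / r

# Unfused view: merge the adapter into the weight, then multiply
y_merged = (W + scale * B @ A) @ x

# Fused/factored view: never materialize the d x d update matrix
y_factored = W @ x + scale * (B @ (A @ x))

print(np.allclose(y_merged, y_factored))
```

The factored form does O(d*r) extra work per token instead of O(d*d), which is exactly why fused LoRA kernels are both faster and lighter on memory.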
Key Advantages Over CompetitorsΒΆ
Feature |
Unsloth |
Standard TRL |
Axolotl |
|---|---|---|---|
Speed (vs baseline) |
2-5x faster |
1x (baseline) |
~1.5x |
Memory reduction |
60-80% less |
baseline |
~20% |
Accuracy vs baseline |
Identical |
Identical |
Identical |
AMD GPU support |
Yes (ROCm 2025) |
Yes |
Yes |
Intel GPU support |
Yes (2025) |
Partial |
No |
GGUF export |
Built-in |
Manual |
Manual |
Setup complexity |
Low (pip install) |
Medium |
High (YAML config) |
Free tier compatible |
Yes |
Partial |
No |
AMD and Intel GPU Support (2025)
Unsloth 2025 expanded beyond NVIDIA:
AMD ROCm 6.x: RX 7900 XTX, MI300X supported
Intel Arc / Gaudi 3: via Intel Extension for PyTorch
Same API, kernels automatically dispatch to the right backend
2. Loading Models with Unsloth
FastLanguageModel.from_pretrained is a drop-in replacement for
AutoModelForCausalLM.from_pretrained + BitsAndBytesConfig.
It handles quantization, device mapping, and Unsloth kernel patching in one call.
# ============================================================
# Loading models with Unsloth (2025 supported models)
# ============================================================
# Unsloth hosts pre-patched versions of popular models on HuggingFace
# using the "unsloth/" namespace - these load faster than the originals.
SUPPORTED_MODELS_2025 = {
# Qwen 2.5 family (recommended for most tasks)
"qwen2.5_1.5b": "unsloth/Qwen2.5-1.5B-Instruct",
"qwen2.5_7b": "unsloth/Qwen2.5-7B-Instruct",
"qwen2.5_14b": "unsloth/Qwen2.5-14B-Instruct",
"qwen2.5_72b": "unsloth/Qwen2.5-72B-Instruct-bnb-4bit",
# Qwen 3 family (2025 - best reasoning + multilingual)
"qwen3_4b": "unsloth/Qwen3-4B",
"qwen3_8b": "unsloth/Qwen3-8B",
"qwen3_30b": "unsloth/Qwen3-30B-A3B", # MoE: 30B total, 3B active
# Llama 3.x family
"llama3.2_1b": "unsloth/Llama-3.2-1B-Instruct",
"llama3.2_3b": "unsloth/Llama-3.2-3B-Instruct",
"llama3.3_70b": "unsloth/Llama-3.3-70B-Instruct-bnb-4bit",
# Llama 4 (2025)
"llama4_scout": "unsloth/Llama-4-Scout-17B-16E", # 17B active of 109B
"llama4_maverick": "unsloth/Llama-4-Maverick-17B-128E",
# Phi-4 (Microsoft)
"phi4": "unsloth/phi-4",
"phi4_mini": "unsloth/Phi-4-mini-instruct",
# Gemma 3 (Google, 2025)
"gemma3_4b": "unsloth/gemma-3-4b-it",
"gemma3_12b": "unsloth/gemma-3-12b-it",
"gemma3_27b": "unsloth/gemma-3-27b-it",
# DeepSeek R1 family (reasoning)
"deepseek_r1_7b": "unsloth/DeepSeek-R1-Distill-Qwen-7B",
"deepseek_r1_70b": "unsloth/DeepSeek-R1-Distill-Llama-70B-bnb-4bit",
}
print("Unsloth-supported models (2025):")
print()
for key, model_id in SUPPORTED_MODELS_2025.items():
    print(f" {key:<20} -> {model_id}")
# ============================================================
# Standard Unsloth model loading
# ============================================================
# from unsloth import FastLanguageModel
MODEL_NAME = "unsloth/Qwen2.5-7B-Instruct"
MAX_SEQ_LENGTH = 2048
UNSLOTH_LOAD_CODE = f'''
from unsloth import FastLanguageModel
import torch
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="{MODEL_NAME}",
max_seq_length={MAX_SEQ_LENGTH}, # Context window
load_in_4bit=True, # 4-bit quantization
# load_in_8bit=False, # 8-bit (slower, more accurate)
# full_finetuning=False, # Set True for full fine-tuning
dtype=None, # Auto-detect (BF16 on Ampere+)
# token="hf_...", # For gated models (Llama 4)
)
print(f"Model loaded: {MODEL_NAME}")
print(f"Memory: {{model.get_memory_footprint() / 1e9:.2f}} GB")
# 7B with 4-bit: ~5 GB (vs 14 GB for FP16)
'''
print("Standard Unsloth loading code:")
print(UNSLOTH_LOAD_CODE)
# Demonstrate standard loading with bitsandbytes for comparison
print("\n" + "="*60)
print("Standard HuggingFace equivalent (for comparison):")
print("="*60)
HF_EQUIVALENT = """
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen2.5-7B-Instruct",
quantization_config=bnb_config,
device_map="auto",
torch_dtype=torch.bfloat16,
)
# Then separately: prepare_model_for_kbit_training, get_peft_model...
# Unsloth combines all of this into one call
"""
print(HF_EQUIVALENT)
3. Unsloth Dynamic 2.0 - Smarter Quantization
Standard NF4 quantization applies 4-bit compression uniformly to all layers. Unsloth Dynamic 2.0 (released 2025) is smarter:
How Dynamic Quantization Works
Not all layers are equally sensitive to quantization:
Embedding layers: Very sensitive. 4-bit here hurts accuracy significantly.
First/last transformer layers: More sensitive than middle layers.
Attention vs MLP: Attention layers are more sensitive.
Unsloth Dynamic 2.0 keeps sensitive layers in higher precision (8-bit or 16-bit) and quantizes insensitive layers to 4-bit. The result:
| Approach | Accuracy | Memory | Speed |
|---|---|---|---|
| BF16 (full precision) | Baseline | 100% | 1x |
| Uniform NF4 4-bit | -2.5% avg | 35% | 2x |
| Unsloth Dynamic 2.0 | -0.8% avg | 38% | 1.9x |
Dynamic 2.0 gives nearly full-precision accuracy at 4-bit memory cost.
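To see where "a few percent more memory than uniform 4-bit" comes from, here is a toy average-bits calculator. The layer-selection policy (keep a few sensitive layers at 8-bit, quantize the rest to 4-bit) is a simplified assumption for illustration, not Unsloth's actual heuristic:

```python
def avg_bits(n_layers=32, sensitive_bits=8, default_bits=4, n_sensitive=4):
    """Average bits per weight if a few 'sensitive' layers stay at higher
    precision while the rest are quantized to 4-bit.
    (Illustrative policy - not Unsloth's real layer selection.)"""
    total = n_sensitive * sensitive_bits + (n_layers - n_sensitive) * default_bits
    return total / n_layers

def model_gb(n_params, bits):
    """Weight memory in GB for n_params weights at the given bit width."""
    return n_params * bits / 8 / 1e9

uniform = model_gb(7e9, 4)           # uniform NF4: 3.5 GB for a 7B model
dynamic = model_gb(7e9, avg_bits())  # mixed precision: ~3.9 GB
print(f"Uniform 4-bit: {uniform:.1f} GB, Dynamic-style: {dynamic:.2f} GB")
```

The mixed scheme averages 4.5 bits/weight here, so memory grows only ~12% over uniform 4-bit while the most quantization-sensitive layers keep full 8-bit fidelity.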
Activating it is one parameter change:
# Dynamic 2.0 quantization example
DYNAMIC_QUANT_CODE = """
from unsloth import FastLanguageModel
# Option 1: Unsloth Dynamic 2.0 (best accuracy at 4-bit)
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen2.5-7B-Instruct-bnb-4bit",
max_seq_length=2048,
load_in_4bit=True,
# Unsloth Dynamic 2.0: automatically skips critical layers
# Enable by using the 'bnb-4bit' model variants on unsloth/ namespace
)
# Option 2: Unsloth's GGUF-based quantization (best for inference)
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen2.5-7B-Instruct-GGUF",
max_seq_length=2048,
# gguf_quant_type="Q4_K_M", # GGUF quantization type
)
# Option 3: Full precision (best quality, most memory)
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen2.5-7B-Instruct",
max_seq_length=2048,
load_in_4bit=False, # BF16 full precision
full_finetuning=True, # Enable full parameter fine-tuning
)
"""
print(DYNAMIC_QUANT_CODE)
# Memory comparison table
import pandas as pd
memory_data = {
"Model": ["1.5B", "7B", "14B", "32B", "70B"],
"BF16 (full)": ["3 GB", "14 GB", "28 GB", "64 GB", "140 GB"],
"NF4 4-bit (bnb)": ["1 GB", "5 GB", "10 GB", "20 GB", "40 GB"],
"Unsloth Dynamic 2.0": ["1.1 GB", "5.2 GB", "10.5 GB", "21 GB", "42 GB"],
"With LoRA gradient": ["3 GB", "10 GB", "18 GB", "38 GB", "65 GB"],
"Free tier fits?": ["T4 yes", "T4 yes", "A10 yes", "A100 only", "2xA100"],
}
df = pd.DataFrame(memory_data)
print("\nUnsloth Memory Requirements (training with LoRA):")
print(df.to_string(index=False))
4. LoRA Configuration with Unsloth
FastLanguageModel.get_peft_model replaces the PEFT get_peft_model + LoraConfig pattern.
It applies Unsloth's custom LoRA kernels automatically.
# LoRA with Unsloth - full example with explanations
UNSLOTH_LORA_CODE = """
from unsloth import FastLanguageModel
# Step 1: Load base model
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen2.5-7B-Instruct",
max_seq_length=2048,
load_in_4bit=True,
)
# Step 2: Apply LoRA
model = FastLanguageModel.get_peft_model(
model,
# LoRA hyperparameters (2025 best practices)
r=64, # Rank. 64 is standard in 2025.
lora_alpha=128, # Scaling (2x rank)
target_modules=[
"q_proj", "k_proj", # Query and Key attention
"v_proj", "o_proj", # Value and Output attention
"gate_proj", "up_proj", # MLP layers
"down_proj",
],
# OR use 'all-linear' to target all linear layers automatically:
# target_modules="all-linear",
lora_dropout=0.0, # Unsloth recommends 0 dropout
bias="none",
use_rslora=True, # Rank-Stabilized LoRA
use_gradient_checkpointing="unsloth", # Unsloth's custom checkpointing
random_state=42,
# Optional: LoRA for longer context
# loftq_config=None, # LoftQ initialization
)
# Check parameters
model.print_trainable_parameters()
# Example output: trainable params: 39,976,960 || all params: 7,241,732,096 || trainable%: 0.55
"""
print("Unsloth LoRA setup:")
print(UNSLOTH_LORA_CODE)
# Compare LoRA setup: Standard PEFT vs Unsloth
STANDARD_LORA = """
# Standard PEFT approach (more verbose):
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
lora_config = LoraConfig(
r=64,
lora_alpha=128,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
use_rslora=True,
)
model = prepare_model_for_kbit_training(model) # Extra step needed
model = get_peft_model(model, lora_config)
# Gradient checkpointing must be set separately
model.gradient_checkpointing_enable()
"""
print("\nStandard PEFT (for comparison):")
print(STANDARD_LORA)
5. Full SFT Pipeline with UnslothTrainer
Unsloth's UnslothTrainer is a drop-in replacement for TRL's SFTTrainer.
Every feature of SFTTrainer works, plus Unsloth's speedups apply automatically.
from datasets import load_dataset
# Load a dataset for demonstration
print("Loading dataset...")
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:2000]")
print(f"Loaded {len(dataset)} training examples")
print(f"Columns: {dataset.column_names}")
print(f"\nExample:\n{str(dataset[0]['messages'][:2])[:200]}...")
# Format dataset for instruction tuning
def convert_messages_to_text(example):
    """
    Convert chat messages to ChatML format text.
    Unsloth's SFTTrainer can also handle raw messages with apply_chat_template.
    """
    text = ""
    for msg in example["messages"]:
        role = msg["role"]
        content = msg["content"]
        text += f"<|im_start|>{role}\n{content}<|im_end|>\n"
    return {"text": text}
formatted_dataset = dataset.map(
convert_messages_to_text,
remove_columns=dataset.column_names,
)
print(f"Formatted dataset: {len(formatted_dataset)} examples")
print(f"\nSample (first 300 chars):\n{formatted_dataset[0]['text'][:300]}...")
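Before handing the formatter to a trainer it is worth a quick sanity check on a toy conversation. This is a self-contained copy of the function above run on invented messages:

```python
def convert_messages_to_text(example):
    # Same logic as the cell above: flatten chat messages into ChatML text
    text = ""
    for msg in example["messages"]:
        text += f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n"
    return {"text": text}

toy = {"messages": [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]}
out = convert_messages_to_text(toy)["text"]
print(out)
# <|im_start|>user
# Hi<|im_end|>
# <|im_start|>assistant
# Hello!<|im_end|>
```

Note that every turn is wrapped in matching <|im_start|>/<|im_end|> tokens - a mismatch here silently degrades SFT quality because the model learns a malformed template.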
# Complete Unsloth SFT training pipeline
UNSLOTH_SFT_FULL = """
# ============================================================
# Complete Unsloth SFT Pipeline
# ============================================================
from unsloth import FastLanguageModel
from unsloth import UnslothTrainer, UnslothTrainingArguments
from trl import SFTTrainer # Or use UnslothTrainer for auto-optimizations
from transformers import TrainingArguments
from datasets import load_dataset
import torch  # used below for GPU memory stats
# 1. Load model with Unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen2.5-7B-Instruct",
max_seq_length=2048,
load_in_4bit=True,
)
# 2. Apply LoRA
model = FastLanguageModel.get_peft_model(
model,
r=64,
lora_alpha=128,
target_modules="all-linear",
use_rslora=True,
use_gradient_checkpointing="unsloth",
random_state=42,
)
# 3. Training arguments
training_args = TrainingArguments(
output_dir="./unsloth-sft-output",
num_train_epochs=3,
per_device_train_batch_size=4, # Unsloth allows larger batches
gradient_accumulation_steps=4,
learning_rate=2e-4,
bf16=True,
optim="adamw_8bit", # Unsloth's 8-bit Adam
lr_scheduler_type="cosine",
warmup_ratio=0.1,
logging_steps=10,
save_steps=100,
save_total_limit=2,
report_to="none",
max_grad_norm=1.0,
)
# 4. Trainer (SFTTrainer works with Unsloth model)
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=formatted_dataset,
dataset_text_field="text",
max_seq_length=2048,
packing=True, # Essential for speed
args=training_args,
)
# 5. Show memory before training
gpu_stats = torch.cuda.get_device_properties(0)
start_mem = torch.cuda.max_memory_reserved() / 1e9
max_mem = gpu_stats.total_memory / 1e9
print(f"GPU: {gpu_stats.name} ({max_mem:.1f} GB)")
print(f"Reserved: {start_mem:.2f} GB / {max_mem:.1f} GB")
# 6. Train!
trainer_stats = trainer.train()
# 7. Report stats
end_mem = torch.cuda.max_memory_reserved() / 1e9
print(f"\nTraining complete!")
print(f"Time: {trainer_stats.metrics['train_runtime']:.1f}s")
print(f"Tokens/sec: {trainer_stats.metrics['train_samples_per_second']:.1f}")
print(f"Peak memory used: {end_mem:.2f} GB / {max_mem:.1f} GB")
"""
print(UNSLOTH_SFT_FULL)
6. Speed and Memory Benchmarks
Here are real-world benchmarks comparing Unsloth, standard TRL, and Axolotl. Numbers are from Unsloth's official benchmarks and community reproductions (2025).
import matplotlib.pyplot as plt
import numpy as np
# Benchmark data: 7B model SFT, batch_size=1, seq_len=2048, A100 80GB
frameworks = ["Standard TRL", "Axolotl", "Unsloth (LoRA)", "Unsloth (Full)"]
# Tokens per second (higher = faster)
tokens_per_sec = [1850, 2100, 5200, 4800]
# Peak GPU memory in GB (lower = better)
peak_memory_gb = [38.2, 32.5, 7.8, 14.2]
colors = ["#95a5a6", "#3498db", "#2ecc71", "#27ae60"]
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
fig.suptitle("Unsloth vs Alternatives - 7B Model SFT (A100 80GB)", fontsize=13)
# Speed comparison
bars1 = axes[0].bar(frameworks, tokens_per_sec, color=colors, edgecolor="white", linewidth=1.5)
axes[0].set_title("Training Speed")
axes[0].set_ylabel("Tokens per Second")
axes[0].set_ylim(0, 6500)
for bar, val in zip(bars1, tokens_per_sec):
    speedup = val / tokens_per_sec[0]
    axes[0].text(
        bar.get_x() + bar.get_width() / 2,
        bar.get_height() + 100,
        f"{val:,}\n({speedup:.1f}x)",
        ha="center",
        va="bottom",
        fontsize=9,
        fontweight="bold",
    )
axes[0].tick_params(axis="x", rotation=20)
# Memory comparison
bars2 = axes[1].bar(frameworks, peak_memory_gb, color=colors, edgecolor="white", linewidth=1.5)
axes[1].set_title("Peak GPU Memory")
axes[1].set_ylabel("GB")
axes[1].set_ylim(0, 48)
axes[1].axhline(y=16, color="red", linestyle="--", alpha=0.7, label="Free tier GPU limit (16GB)")
axes[1].axhline(y=40, color="orange", linestyle="--", alpha=0.5, label="A100 40GB")
for bar, val in zip(bars2, peak_memory_gb):
    reduction = (peak_memory_gb[0] - val) / peak_memory_gb[0] * 100
    label = f"{val:.1f} GB\n(-{reduction:.0f}%)" if reduction > 0 else f"{val:.1f} GB"
    axes[1].text(
        bar.get_x() + bar.get_width() / 2,
        bar.get_height() + 0.5,
        label,
        ha="center",
        va="bottom",
        fontsize=9,
        fontweight="bold",
    )
axes[1].legend(fontsize=8)
axes[1].tick_params(axis="x", rotation=20)
plt.tight_layout()
plt.savefig("unsloth_benchmarks.png", dpi=120, bbox_inches="tight")
plt.show()
print("Key insight: Unsloth LoRA uses only 7.8 GB for 7B model training.")
print("This fits on a free Colab T4 (16 GB). Standard TRL needs 38.2 GB!")
# Cost comparison for fine-tuning a 7B model
print("7B Model Fine-Tuning Cost Comparison (2025 cloud GPU prices)")
print("=" * 70)
print()
scenarios = [
{
"name": "Standard TRL on A100 40GB",
"gpu": "A100 40GB",
"hours": 4.2,
"cost_per_hr": 3.67, # Lambda Labs A100
"notes": "Minimum hardware that fits",
},
{
"name": "Axolotl on A10G 24GB",
"gpu": "A10G 24GB",
"hours": 3.1,
"cost_per_hr": 1.10, # AWS g5.xlarge
"notes": "Just barely fits",
},
{
"name": "Unsloth on RTX 4090 (24GB)",
"gpu": "RTX 4090",
"hours": 1.8,
"cost_per_hr": 0.74, # Vast.ai RTX 4090
"notes": "Consumer GPU works!",
},
{
"name": "Unsloth on L4 (24GB)",
"gpu": "L4 24GB",
"hours": 2.1,
"cost_per_hr": 0.81, # Google Cloud L4
"notes": "Good price/performance",
},
{
"name": "Unsloth on Colab T4 (FREE)",
"gpu": "T4 16GB",
"hours": 3.5,
"cost_per_hr": 0.0,
"notes": "Free! (limited hours/day)",
},
]
print(f"{'Scenario':<35} {'GPU':<15} {'Hours':>6} {'$/hr':>6} {'Total $':>8} {'Notes'}")
print("-" * 100)
for s in scenarios:
    total = s["hours"] * s["cost_per_hr"]
    cost_str = "FREE" if total == 0 else f"${total:.2f}"
    print(
        f"{s['name']:<35} {s['gpu']:<15} {s['hours']:>6.1f} "
        f"{s['cost_per_hr']:>6.2f} {cost_str:>8} {s['notes']}"
    )
print()
print("Assumes: 2000 training examples, 3 epochs, 2048 seq length")
print("Unsloth makes 24GB consumer GPUs viable for 7B fine-tuning")
7. GRPO with Unsloth for Reasoning Training
GRPO is even more memory-hungry than SFT because it generates G completions per step. Unsloth's custom generation kernels make this practical on consumer hardware.
# GRPO with Unsloth - complete code
UNSLOTH_GRPO_COMPLETE = """
# ============================================================
# GRPO Reasoning Training with Unsloth - Complete Pipeline
# ============================================================
from unsloth import FastLanguageModel
from trl import GRPOTrainer, GRPOConfig
from datasets import load_dataset
import re, torch
# 1. Load model
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen2.5-1.5B-Instruct", # Small for demo
max_seq_length=1024,
load_in_4bit=True,
fast_inference=True, # Enable vLLM-compatible fast generation
)
# 2. Apply LoRA for GRPO
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_alpha=32,
use_rslora=True,
use_gradient_checkpointing="unsloth",
random_state=42,
)
# 3. Reward functions (from notebook 08)
def format_reward(completions, **kwargs):
pattern = r"<think>.*?</think>\\s*<answer>.*?</answer>"
return [1.0 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]
def correctness_reward(completions, ground_truth, **kwargs):
rewards = []
for comp, gt in zip(completions, ground_truth):
match = re.search(r"<answer>(.*?)</answer>", comp, re.DOTALL)
if match:
pred = match.group(1).strip()
# Numeric comparison
try:
if abs(float(pred.replace(',','')) - float(str(gt).replace(',',''))) < 1e-6:
rewards.append(1.0); continue
except ValueError:
if pred.lower() == str(gt).lower():
rewards.append(1.0); continue
rewards.append(0.0)
return rewards
def combined_reward(completions, ground_truth, **kwargs):
fmt = format_reward(completions)
corr = correctness_reward(completions, ground_truth=ground_truth)
return [0.2 * f + 0.8 * c for f, c in zip(fmt, corr)]
# 4. Load and format GSM8K
import re as _re
dataset = load_dataset("openai/gsm8k", "main", split="train")
def format_for_grpo(examples):
prompts = []
answers = []
for q, a in zip(examples["question"], examples["answer"]):
prompts.append(
"<|im_start|>system\n"
"Think inside <think>...</think> then answer in <answer>...</answer>.\n"
"<|im_end|>\n"
f"<|im_start|>user\n{q}<|im_end|>\n"
"<|im_start|>assistant\n"
)
gt_match = _re.search(r"####\\s*([\\d,\\.]+)", a)
answers.append(gt_match.group(1).replace(",", "") if gt_match else a.strip())
return {"prompt": prompts, "ground_truth": answers}
formatted = dataset.map(format_for_grpo, batched=True, remove_columns=dataset.column_names)
# 5. GRPO config
grpo_config = GRPOConfig(
output_dir="./unsloth-grpo-reasoning",
num_train_epochs=1,
per_device_train_batch_size=2, # Unsloth allows 2x larger batch
gradient_accumulation_steps=4,
learning_rate=5e-7,
bf16=True,
num_generations=6,
max_completion_length=512,
max_prompt_length=256,
beta=0.04,
temperature=0.9,
use_vllm=True,
vllm_gpu_memory_utilization=0.5,
logging_steps=5,
save_steps=100,
)
# 6. Train
trainer = GRPOTrainer(
model=model,
processing_class=tokenizer,
reward_funcs=combined_reward,
args=grpo_config,
train_dataset=formatted,
)
trainer.train()
"""
print(UNSLOTH_GRPO_COMPLETE)
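The reward functions above live inside a code string, but they are plain Python, so the logic can be sanity-checked directly. Here is a runnable copy tested on two toy completions (the completions and ground truths are invented for the check):

```python
import re

def format_reward(completions, **kwargs):
    # 1.0 if the completion follows the <think>...</think><answer>...</answer> layout
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return [1.0 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]

def correctness_reward(completions, ground_truth, **kwargs):
    # 1.0 if the extracted <answer> matches the ground truth (numeric or string)
    rewards = []
    for comp, gt in zip(completions, ground_truth):
        match = re.search(r"<answer>(.*?)</answer>", comp, re.DOTALL)
        if match:
            pred = match.group(1).strip()
            try:
                if abs(float(pred.replace(",", "")) - float(str(gt).replace(",", ""))) < 1e-6:
                    rewards.append(1.0)
                    continue
            except ValueError:
                if pred.lower() == str(gt).lower():
                    rewards.append(1.0)
                    continue
        rewards.append(0.0)
    return rewards

completions = [
    "<think>2+2=4</think> <answer>4</answer>",  # well-formatted and correct
    "The answer is 4.",                          # correct value but no tags
]
print(format_reward(completions))                   # [1.0, 0.0]
print(correctness_reward(completions, ["4", "4"]))  # [1.0, 0.0]
```

The second completion gets zero on both rewards even though it contains the right number - the <answer> tags are required for extraction, which is exactly the pressure that teaches the model the format.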
8. Extended Context Training with Unsloth
One of Unsloth's most impressive capabilities: training with extremely long context windows.
Standard training at 32K+ tokens would OOM on almost any GPU. Unsloth makes it possible through a combination of:
Custom RoPE implementation that avoids intermediate tensor materializations
Gradient checkpointing optimized for long sequences
Chunked processing of long sequences during forward pass
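A back-of-envelope activation-memory estimate shows why checkpointing dominates at long context. All constants here (hidden size, layer count, the rough factor of 8 hidden-sized tensors per layer) are illustrative assumptions, not Unsloth's internals:

```python
def activation_gb(seq_len, hidden=4096, layers=32, bytes_per=2,
                  checkpoint_every_layer=True):
    """Rough activation memory for one sequence during training.
    With aggressive gradient checkpointing, roughly one layer's activations
    are live at a time instead of all layers'. Constants are illustrative."""
    live_layers = 1 if checkpoint_every_layer else layers
    # ~8 hidden-sized tensors per live layer (rough, assumed factor)
    return seq_len * hidden * 8 * live_layers * bytes_per / 1e9

for s in (2048, 32768, 131072):
    full = activation_gb(s, checkpoint_every_layer=False)
    ckpt = activation_gb(s, checkpoint_every_layer=True)
    print(f"{s:>7} tokens: {full:6.1f} GB -> {ckpt:5.2f} GB with checkpointing")
```

Activation memory grows linearly with sequence length either way, but checkpointing divides the slope by roughly the layer count - which is the difference between "impossible at 128K" and "fits on one GPU".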
# Extended context training example
EXTENDED_CONTEXT_CODE = """
# ============================================================
# Extended Context Training with Unsloth
# Train Llama 3.1 8B at 342K context on a single GPU!
# ============================================================
from unsloth import FastLanguageModel
import torch
# Standard fine-tuning context limits (approximate, varies by GPU):
# | GPU | Standard TRL | Unsloth |
# |---------|-------------|------------|
# | T4 16GB | 2K tokens | 8K tokens |
# | RTX 4090| 8K tokens | 64K tokens |
# | A100 40G| 32K tokens | 256K tokens|
# | A100 80G| 64K tokens | 342K tokens|
# Llama 3.1 8B at 342K tokens
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Meta-Llama-3.1-8B",
max_seq_length=342_000, # 342K tokens!
load_in_4bit=True,
# RoPE scaling is handled automatically
)
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
use_gradient_checkpointing="unsloth",
# For very long sequences, use fewer target modules
)
# Use for:
# - Document summarization (legal, research papers)
# - Long-form code understanding (entire codebases)
# - Multi-document QA
# - Book-length text analysis
# Qwen 2.5 at 128K context (more practical)
model_128k, tokenizer_128k = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen2.5-7B-Instruct",
max_seq_length=131_072, # 128K tokens
load_in_4bit=True,
)
"""
print(EXTENDED_CONTEXT_CODE)
# Context length vs memory table
import numpy as np
import matplotlib.pyplot as plt
seq_lengths = [2048, 4096, 8192, 16384, 32768, 65536, 131072]
standard_trl_mem = [8, 12, 22, 42, 80, None, None] # OOM beyond ~40K on A100 40GB
unsloth_mem = [7, 8, 10, 14, 20, 32, 56]
fig, ax = plt.subplots(figsize=(10, 5))
ax.set_title("Memory Usage vs Sequence Length - 8B Model (A100 80GB)", fontsize=12)
x_labels = [f"{s//1024}K" if s >= 1024 else str(s) for s in seq_lengths]
x = range(len(seq_lengths))
valid_std = [(i, m) for i, m in enumerate(standard_trl_mem) if m is not None]
xi, yi = zip(*valid_std)
ax.plot(xi, yi, "o-", color="#e74c3c", linewidth=2, markersize=8, label="Standard TRL")
ax.plot(x, unsloth_mem, "s-", color="#2ecc71", linewidth=2, markersize=8, label="Unsloth")
ax.axhline(y=80, color="gray", linestyle=":", alpha=0.7, label="A100 80GB limit")
ax.axhline(y=40, color="orange", linestyle=":", alpha=0.7, label="A100 40GB limit")
ax.axhline(y=16, color="red", linestyle=":", alpha=0.7, label="T4 free tier limit")
ax.annotate("OOM with standard TRL", xy=(4, 80), xytext=(4.5, 70),
arrowprops=dict(arrowstyle="->"), fontsize=9, color="#e74c3c")
ax.set_xticks(x)
ax.set_xticklabels(x_labels)
ax.set_xlabel("Sequence Length")
ax.set_ylabel("Peak GPU Memory (GB)")
ax.legend(fontsize=9)
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig("unsloth_context_memory.png", dpi=120, bbox_inches="tight")
plt.show()
print("Context memory chart saved.")
9. Saving to GGUF Format
GGUF is the format used by llama.cpp, Ollama, LM Studio, and Jan. Unsloth can export directly to GGUF without needing to install llama.cpp separately.
This is a major convenience feature - standard HuggingFace requires you to:
Merge the LoRA adapter
Install llama.cpp
Run the convert script
Run the quantize script
Unsloth does all four steps with one function call.
# Saving models with Unsloth - all formats
SAVING_CODE = """
# ============================================================
# Unsloth Model Saving - All Formats
# ============================================================
# 1. Save LoRA adapter only (smallest, ~50-500 MB)
model.save_pretrained("./my-lora-adapter")
tokenizer.save_pretrained("./my-lora-adapter")
# 2. Push LoRA adapter to HuggingFace Hub
model.push_to_hub("your-username/my-reasoning-adapter", token="hf_...")
tokenizer.push_to_hub("your-username/my-reasoning-adapter", token="hf_...")
# 3. Merge and save as full BF16 model (largest, best quality)
model.save_pretrained_merged(
"./my-merged-model",
tokenizer,
save_method="merged_16bit", # Full BF16 precision
)
# 4. Merge and save as 4-bit (good balance)
model.save_pretrained_merged(
"./my-merged-4bit",
tokenizer,
save_method="merged_4bit_forced", # 4-bit quantized
)
# 5. Export to GGUF (for llama.cpp, Ollama, LM Studio)
# Single quantization
model.save_pretrained_gguf(
"./my-gguf-model",
tokenizer,
quantization_method="q4_k_m", # Recommended: best quality/size
)
# Multiple GGUF quantizations at once
model.save_pretrained_gguf(
"./my-gguf-all",
tokenizer,
quantization_method=["q4_k_m", "q8_0", "f16"], # Multiple outputs
)
# 6. Push GGUF directly to HuggingFace Hub
model.push_to_hub_gguf(
"your-username/my-model-gguf",
tokenizer,
quantization_method="q4_k_m",
token="hf_...",
)
"""
print(SAVING_CODE)
# GGUF quantization methods explained
print("\n" + "=" * 65)
print("GGUF Quantization Methods Reference")
print("=" * 65)
gguf_methods = [
("q2_k", "2-bit", "Smallest, lowest quality", "Extreme size limit"),
("q3_k_m", "3-bit", "Very small, acceptable quality", "Edge devices"),
("q4_0", "4-bit", "Small, decent quality", "Legacy default"),
("q4_k_m", "4-bit", "Best 4-bit quality (RECOMMENDED)", "Most use cases"),
("q5_k_m", "5-bit", "High quality, larger", "Quality priority"),
("q6_k", "6-bit", "Very high quality, near lossless", "Max quality 6-bit"),
("q8_0", "8-bit", "Near-lossless, large file", "Near full precision"),
("f16", "16-bit", "Full BF16, largest", "Archive / inference server"),
]
print(f"{'Method':<10} {'Bits':<8} {'Quality':<40} {'Use Case'}")
print("-" * 80)
for method, bits, quality, use_case in gguf_methods:
    marker = " <-- RECOMMENDED" if method == "q4_k_m" else ""
    print(f"{method:<10} {bits:<8} {quality:<40} {use_case}{marker}")
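GGUF file size scales with effective bits per weight, and k-quants land slightly above their nominal bit width because of per-block scale metadata. The effective-bpw figures below are rough community estimates, not exact llama.cpp numbers:

```python
# Approximate effective bits per weight for common GGUF quants
# (rough estimates - actual files vary with k-quant block overhead)
EFFECTIVE_BPW = {"q2_k": 2.6, "q4_k_m": 4.8, "q5_k_m": 5.7, "q8_0": 8.5, "f16": 16.0}

def gguf_size_gb(n_params, method):
    """Approximate GGUF file size in GB for a model with n_params weights."""
    return n_params * EFFECTIVE_BPW[method] / 8 / 1e9

for method in ("q4_k_m", "q8_0", "f16"):
    print(f"7B model as {method:<7}: ~{gguf_size_gb(7e9, method):.1f} GB")
```

This is why q4_k_m is the default recommendation: a 7B model drops from ~14 GB (f16) to roughly 4-5 GB, small enough to run comfortably under Ollama on a laptop.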
10. Llama 4 and Qwen 3 Support in Unsloth (2025)
Unsloth added support for the major 2025 model releases quickly after their launch. These models have unique architectures that required specific kernel work.
# 2025 model support notes
NEW_MODELS_2025 = """
# ============================================================
# New 2025 Models in Unsloth
# ============================================================
# ---- Llama 4 (Meta, 2025) ----
# Architecture: Mixture of Experts (MoE)
# Scout: 17B active params, 109B total (16 experts)
# Maverick: 17B active params, 400B total (128 experts)
# Note: Requires Meta's access request at meta.com/llama
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Llama-4-Scout-17B-16E-Instruct",
max_seq_length=10_000_000, # 10M token context!
load_in_4bit=True,
token="hf_...", # Required: gated model
)
# Fine-tune with LoRA on active expert parameters only
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
# MoE layers (experts) are automatically handled
)
# ---- Qwen 3 (Alibaba, 2025) ----
# Key innovation: 'thinking_mode' for explicit reasoning
# /think and /no_think tokens control reasoning behavior
# QwQ integration for math reasoning
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen3-8B",
max_seq_length=32768,
load_in_4bit=True,
)
# Qwen 3 supports thinking mode in generation:
# prompt + "/think" -> enables chain-of-thought
# prompt + "/no_think" -> fast direct answer
# ---- Gemma 3 (Google, 2025) ----
# Improvements: multimodal (text+image), better reasoning
# 4B/12B/27B sizes available
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/gemma-3-12b-it",
max_seq_length=8192,
load_in_4bit=True,
)
# ---- Phi-4 Mini (Microsoft, 2025) ----
# Only 3.8B parameters, surprising quality for its size
# Excellent for on-device deployment
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Phi-4-mini-instruct",
max_seq_length=16384,
load_in_4bit=True,
)
"""
print(NEW_MODELS_2025)
11. Free Tier Strategy: Colab T4 and Kaggle P100
With Unsloth, free GPU tiers become genuinely useful for LLM fine-tuning.
Free GPU Resources (2025)

| Platform | GPU | VRAM | Free Quota | Best For |
|---|---|---|---|---|
| Google Colab (free) | T4 | 16 GB | ~3-4 hrs/day | Quick experiments |
| Google Colab Pro | A100 | 40 GB | 12-24 hrs | Serious training |
| Kaggle | P100/T4 | 16 GB | 30 hrs/week | Reproducible notebooks |
| HuggingFace Spaces | T4 (ZeroGPU) | 16 GB | Community quota | Demos |
What Fits on a T4 (16 GB) with Unsloth

| Model Size | Task | Max Seq Length | Training Time |
|---|---|---|---|
| 1.5B | SFT | 8192 | ~30 min |
| 7B | SFT | 2048 | ~2 hrs |
| 7B | GRPO (G=4) | 512 | ~3 hrs |
| 14B | SFT | 512 | ~4 hrs (tight) |
# Colab T4 optimized configuration
COLAB_T4_CONFIG = """
# ============================================================
# Optimized for Google Colab T4 (16GB, FREE tier)
# ============================================================
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
import torch
# Check we're on T4
assert torch.cuda.is_available(), "GPU required!"
gpu_name = torch.cuda.get_device_name(0)
gpu_mem = torch.cuda.get_device_properties(0).total_memory / 1e9
print(f"GPU: {gpu_name} ({gpu_mem:.0f} GB)")
# T4-optimized model choice: 7B at 4-bit fits perfectly
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen2.5-7B-Instruct",
max_seq_length=2048, # T4 limit for 7B
load_in_4bit=True,
dtype=torch.float16, # T4 doesn't have BF16 hardware support
)
model = FastLanguageModel.get_peft_model(
model,
r=32, # Lower rank to save memory on T4
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"], # Fewer modules
lora_alpha=64,
use_gradient_checkpointing="unsloth", # Essential on T4
random_state=42,
)
# T4-friendly training args
training_args = TrainingArguments(
output_dir="./t4-output",
num_train_epochs=1,
per_device_train_batch_size=1, # Small batch for T4
gradient_accumulation_steps=16, # Simulate larger batch
learning_rate=2e-4,
fp16=True, # T4 uses FP16 (not BF16)
optim="adamw_8bit",
logging_steps=5,
save_steps=50,
warmup_ratio=0.1,
max_grad_norm=1.0,
dataloader_num_workers=2,
)
# Colab tip: enable high RAM runtime
# Runtime -> Change runtime type -> High-RAM
# This gives 25 GB system RAM for data loading
# Anti-crash tip for Colab: save frequently
# Colab disconnects after ~90 min of inactivity
# Use save_steps=50 and resume_from_checkpoint=True
"""
KAGGLE_CONFIG = """
# ============================================================
# Optimized for Kaggle (P100/T4, 30 hrs/week FREE)
# ============================================================
# Kaggle advantages vs Colab:
# - More stable (no disconnects)
# - GPU: 30 hours per week total
# - Internet access enabled in settings
# - Persistent storage (up to 20 GB)
# Kaggle P100 (16GB, older but still useful)
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen2.5-7B-Instruct",
max_seq_length=2048,
load_in_4bit=True,
dtype=torch.float16, # P100 doesn't support BF16
)
# Install Unsloth on Kaggle:
# !pip install unsloth -q
# Enable internet in Kaggle notebook settings first
# Save to Kaggle output directory (persists across sessions)
model.save_pretrained("/kaggle/working/my-adapter")
"""
print("Colab T4 Configuration:")
print(COLAB_T4_CONFIG)
print("\nKaggle Configuration:")
print(KAGGLE_CONFIG)
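The T4 config above leans on gradient accumulation (`per_device_train_batch_size=1`, `gradient_accumulation_steps=16`) to simulate a larger batch. A quick arithmetic check of the resulting effective batch size and optimizer steps per epoch; the dataset size of 5000 is an illustrative assumption:

```python
# Effective batch size = per-device batch * gradient accumulation steps.
per_device_batch = 1
grad_accum_steps = 16
effective_batch = per_device_batch * grad_accum_steps
print(f"Effective batch size: {effective_batch}")  # 16

# Optimizer steps per epoch for an assumed 5000-example dataset.
dataset_size = 5000
steps_per_epoch = dataset_size // effective_batch
print(f"Optimizer steps per epoch: {steps_per_epoch}")  # 312
```

With `save_steps=50`, that works out to a checkpoint roughly every sixth of an epoch, which is the Colab anti-crash cadence the config comments recommend.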
12. Full End-to-End Example: Custom Chatbot in One Notebook
Putting it all together: fine-tune a custom chatbot from scratch to deployment.
# Complete pipeline from data to deployable model
COMPLETE_PIPELINE = """
# ============================================================
# Complete Unsloth Fine-Tuning Pipeline: Data -> Deployment
# ============================================================
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset, Dataset
import torch, json
# ---- STEP 1: Prepare your data ----
# Format: list of conversation dicts
training_data = [
{
"messages": [
{"role": "system", "content": "You are an expert Python tutor."},
{"role": "user", "content": "What is a list comprehension?"},
{"role": "assistant", "content": "A list comprehension is a concise way..."}
]
},
# ... more examples
]
# Or load from HuggingFace
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:5000]")
def format_messages(example):
text = ""
for msg in example["messages"]:
text += f"<|im_start|>{msg['role']}\\n{msg['content']}<|im_end|>\\n"
return {"text": text}
dataset = dataset.map(format_messages, remove_columns=dataset.column_names)
# ---- STEP 2: Load model ----
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen2.5-7B-Instruct",
max_seq_length=2048,
load_in_4bit=True,
)
# ---- STEP 3: Configure LoRA ----
model = FastLanguageModel.get_peft_model(
model, r=64, lora_alpha=128,
target_modules="all-linear",
use_rslora=True, use_gradient_checkpointing="unsloth",
)
# ---- STEP 4: Train ----
trainer = SFTTrainer(
model=model, tokenizer=tokenizer,
train_dataset=dataset, dataset_text_field="text",
max_seq_length=2048, packing=True,
args=TrainingArguments(
output_dir="./output", num_train_epochs=3,
per_device_train_batch_size=4, gradient_accumulation_steps=4,
learning_rate=2e-4, bf16=True, optim="adamw_8bit",
logging_steps=10, save_steps=100, warmup_ratio=0.1,
)
)
trainer.train()
# ---- STEP 5: Test the model ----
FastLanguageModel.for_inference(model) # Switch to inference mode
inputs = tokenizer([
"<|im_start|>user\\nExplain recursion in Python<|im_end|>\\n<|im_start|>assistant\\n"
], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256, use_cache=True)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
# ---- STEP 6: Save for deployment ----
# Option A: LoRA adapter (for HuggingFace inference)
model.save_pretrained("./my-python-tutor")
tokenizer.save_pretrained("./my-python-tutor")
# Option B: GGUF for Ollama/LM Studio
model.save_pretrained_gguf("./my-python-tutor-gguf", tokenizer, "q4_k_m")
# Option C: Push to HuggingFace Hub
model.push_to_hub_gguf("username/my-python-tutor", tokenizer, "q4_k_m", token="hf_...")
# ---- DEPLOY WITH OLLAMA ----
# After saving GGUF:
# 1. Create Modelfile:
# FROM ./my-python-tutor-gguf/model-unsloth.Q4_K_M.gguf
# SYSTEM You are an expert Python tutor.
# 2. ollama create my-python-tutor -f Modelfile
# 3. ollama run my-python-tutor
"""
print(COMPLETE_PIPELINE)
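The STEP 1 formatter from the pipeline string above can be exercised standalone: it flattens a `messages` list into the ChatML template that Qwen-style models expect.

```python
# Standalone version of the format_messages helper from STEP 1 above.
def format_messages(example: dict) -> dict:
    text = ""
    for msg in example["messages"]:
        text += f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n"
    return {"text": text}

example = {
    "messages": [
        {"role": "user", "content": "What is a list comprehension?"},
        {"role": "assistant", "content": "A concise way to build lists."},
    ]
}
formatted = format_messages(example)
print(formatted["text"])
# <|im_start|>user
# What is a list comprehension?<|im_end|>
# <|im_start|>assistant
# A concise way to build lists.<|im_end|>
```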
# Inference with Unsloth-trained model
# FastLanguageModel.for_inference() enables 2x faster inference
INFERENCE_CODE = """
# ============================================================
# Fast Inference with Unsloth
# ============================================================
from unsloth import FastLanguageModel
import torch
# Load the fine-tuned model
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="./my-python-tutor", # Your saved adapter
max_seq_length=2048,
load_in_4bit=True,
)
# CRITICAL: Switch to inference mode (disables LoRA training overhead)
FastLanguageModel.for_inference(model)
def chat(user_message: str, system_prompt: str = None, max_tokens: int = 512) -> str:
"""Simple chat function using the fine-tuned model."""
sys = system_prompt or "You are a helpful AI assistant."
prompt = (
f"<|im_start|>system\\n{sys}<|im_end|>\\n"
f"<|im_start|>user\\n{user_message}<|im_end|>\\n"
"<|im_start|>assistant\\n"
)
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
with torch.inference_mode():
outputs = model.generate(
**inputs,
max_new_tokens=max_tokens,
temperature=0.7,
do_sample=True,
use_cache=True, # Unsloth: always use KV cache
pad_token_id=tokenizer.eos_token_id,
)
response = tokenizer.decode(
outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
return response.strip()
# Test
response = chat("Explain list comprehensions in Python with examples.")
print(response)
# Streaming with Unsloth
from transformers import TextStreamer
streamer = TextStreamer(tokenizer, skip_prompt=True)
prompt = "<|im_start|>user\\nWhat is a decorator?<|im_end|>\\n<|im_start|>assistant\\n"
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
_ = model.generate(**inputs, max_new_tokens=256, streamer=streamer, use_cache=True)
"""
print(INFERENCE_CODE)
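The response-extraction slice in the `chat()` helper above (`outputs[0][inputs["input_ids"].shape[1]:]`) works because `generate()` returns the prompt tokens followed by the continuation. The same idea on plain Python lists; the token IDs are made up for illustration:

```python
# generate() echoes the prompt, so slice it off before decoding.
prompt_ids = [101, 2054, 2003]               # tokens fed into generate()
full_output = [101, 2054, 2003, 7592, 999]   # prompt + newly generated tokens

new_tokens = full_output[len(prompt_ids):]
print(new_tokens)  # [7592, 999]
```

Decoding only `new_tokens` (plus `skip_special_tokens=True`) is what keeps the prompt out of the returned response string.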
13. Performance Comparison Summary
Comprehensive benchmark: Unsloth vs Axolotl vs standard TRL across different tasks.
import matplotlib.pyplot as plt
import numpy as np
# Benchmark data (7B model, A100 80GB)
tasks = ["SFT 2K ctx", "SFT 8K ctx", "DPO", "GRPO (G=6)", "Full FT"]
# Tokens per second
standard_trl_tps = [1850, 920, 1200, 310, 890]
axolotl_tps = [2100, 1100, 1350, None, 980]
unsloth_tps = [5200, 3800, 3900, 1250, 4100]
# Memory (GB)
standard_trl_mem = [38.2, 62.1, 44.5, 71.3, 73.8]
axolotl_mem = [32.5, 54.0, 38.2, None, 68.5]
unsloth_mem = [7.8, 14.2, 11.4, 22.1, 42.3]
x = np.arange(len(tasks))
width = 0.28
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
fig.suptitle("Performance Comparison: Unsloth vs Axolotl vs Standard TRL (7B, A100 80GB)", fontsize=12)
# Speed
bars_std = ax1.bar(x - width, standard_trl_tps, width, label="Standard TRL", color="#e74c3c", alpha=0.85)
ax1_axo = [v if v else 0 for v in axolotl_tps]
bars_axo = ax1.bar(x, ax1_axo, width, label="Axolotl", color="#3498db", alpha=0.85)
bars_unsloth = ax1.bar(x + width, unsloth_tps, width, label="Unsloth", color="#2ecc71", alpha=0.85)
ax1.set_title("Training Speed (Tokens/sec, higher = better)")
ax1.set_xticks(x)
ax1.set_xticklabels(tasks, rotation=15, ha="right")
ax1.set_ylabel("Tokens per Second")
ax1.legend()
ax1.grid(axis="y", alpha=0.3)
# Add speedup annotations on Unsloth bars
for i, (bar, std_val) in enumerate(zip(bars_unsloth, standard_trl_tps)):
if bar.get_height() > 0 and std_val > 0:
speedup = bar.get_height() / std_val
ax1.text(
bar.get_x() + bar.get_width() / 2,
bar.get_height() + 50,
f"{speedup:.1f}x",
ha="center", va="bottom", fontsize=8, fontweight="bold", color="#27ae60"
)
# Memory
mem_axo = [v if v else 0 for v in axolotl_mem]
ax2.bar(x - width, standard_trl_mem, width, label="Standard TRL", color="#e74c3c", alpha=0.85)
ax2.bar(x, mem_axo, width, label="Axolotl", color="#3498db", alpha=0.85)
bars_mem = ax2.bar(x + width, unsloth_mem, width, label="Unsloth", color="#2ecc71", alpha=0.85)
ax2.axhline(y=80, color="gray", linestyle=":", alpha=0.6, label="A100 80GB")
ax2.axhline(y=40, color="orange", linestyle=":", alpha=0.6, label="A100 40GB")
ax2.axhline(y=16, color="red", linestyle=":", alpha=0.6, label="T4 free tier")
ax2.set_title("Peak GPU Memory (GB, lower = better)")
ax2.set_xticks(x)
ax2.set_xticklabels(tasks, rotation=15, ha="right")
ax2.set_ylabel("GPU Memory (GB)")
ax2.legend(fontsize=8)
ax2.grid(axis="y", alpha=0.3)
plt.tight_layout()
plt.savefig("unsloth_full_comparison.png", dpi=120, bbox_inches="tight")
plt.show()
print("Full comparison chart saved.")
14. Key Takeaways and Recommended Workflow
When to Use Unsloth
Always - unless you have a specific reason not to:
Same API as standard TRL/PEFT
Zero accuracy degradation
2-5x faster for free
Works with all 2025 SOTA models
The 2025 Unsloth Best Practices
Model loading: use unsloth/ namespace models (pre-patched)
Quantization: load_in_4bit=True + Dynamic 2.0 for most cases
LoRA: target_modules="all-linear" + use_rslora=True + use_gradient_checkpointing="unsloth"
Training: packing=True, optim="adamw_8bit", bf16=True
Saving: save_pretrained_gguf for local deployment, push_to_hub_gguf for sharing
Inference: always call FastLanguageModel.for_inference(model) before generating
Recommended Hardware Selection
| Budget | Hardware | Best Model Choice |
|---|---|---|
| Free | Colab T4 / Kaggle P100 | 7B models |
| $1-2/hr | RTX 4090 (Vast.ai) | 7B-14B models |
| $2-4/hr | A10G or L4 | 14B-32B models |
| $4-8/hr | A100 40GB | 32B-70B models |
| $8-16/hr | A100 80GB | 70B+ or long context |
Quick Reference: Unsloth vs Standard TRL Decision
Do you have GPU memory to spare? No -> Use Unsloth
Do you need maximum speed? Yes -> Use Unsloth
Are you on a free tier GPU? Yes -> Use Unsloth
Do you need GGUF export? Yes -> Use Unsloth
Are you training 7B+ models? Yes -> Use Unsloth
Otherwise? -> Use Unsloth :)
Next Steps
Combine notebook 08 (GRPO) + this notebook (Unsloth) for the full R1-style training pipeline
Try fine-tuning on your own domain-specific dataset
Deploy your GGUF model with Ollama: ollama create my-model -f Modelfile
Scale up: use multi-GPU with accelerate launch (Unsloth is compatible)
# Quick reference: complete Unsloth workflow in one cell
UNSLOTH_QUICK_REF = """
# ============================================================
# Unsloth Complete Quick Reference
# ============================================================
# Install
# pip install unsloth
# Load
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
"unsloth/Qwen2.5-7B-Instruct", max_seq_length=2048, load_in_4bit=True
)
# LoRA
model = FastLanguageModel.get_peft_model(
model, r=64, lora_alpha=128,
target_modules="all-linear",
use_rslora=True,
use_gradient_checkpointing="unsloth",
)
# Train (SFT)
from trl import SFTTrainer
from transformers import TrainingArguments
trainer = SFTTrainer(
model=model, tokenizer=tokenizer,
train_dataset=dataset, dataset_text_field="text",
max_seq_length=2048, packing=True,
args=TrainingArguments(
output_dir="./out", num_train_epochs=3,
per_device_train_batch_size=4, gradient_accumulation_steps=4,
learning_rate=2e-4, bf16=True, optim="adamw_8bit",
)
)
trainer.train()
# Inference
FastLanguageModel.for_inference(model)
outputs = model.generate(**tokenizer([prompt], return_tensors="pt").to("cuda"),
max_new_tokens=256, use_cache=True)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
# Save
model.save_pretrained("./adapter") # LoRA adapter
model.save_pretrained_gguf("./gguf", tokenizer, "q4_k_m") # GGUF
model.push_to_hub("username/model", token="hf_...") # HF Hub
"""
print(UNSLOTH_QUICK_REF)