Unsloth - 2x-5x Faster Fine-Tuning with 80% Less VRAM (2025)

Unsloth is one of the fastest open-source LLM fine-tuning libraries available in 2025. It achieves its speedups through hand-written Triton and CUDA kernels - not approximations - so you get the same model quality as standard TRL/PEFT, just much faster and cheaper.

What You Will Learn

  • Why Unsloth is faster (custom kernels, not tricks)

  • Loading any 2025 model (Qwen 3, Llama 4, Phi-4, Gemma 3) with Unsloth

  • Dynamic 4-bit quantization: Unsloth Dynamic 2.0

  • Full SFT pipeline with UnslothTrainer

  • GRPO with Unsloth for 5x faster reasoning training

  • Extended context: 342K token training on Llama 3.1 8B

  • Saving to GGUF for Ollama/llama.cpp deployment

  • Memory and speed benchmarks vs standard TRL

  • Free tier strategies: Colab T4 and Kaggle P100

Prerequisites

  • Completed notebooks 02-05 (SFT, LoRA, QLoRA, DPO)

  • Notebook 08 (GRPO) recommended for the GRPO section

  • GPU with 8-16 GB VRAM (or free Colab/Kaggle)

# Install Unsloth (2025 - supports CUDA 12.x)
# !pip install unsloth
#
# For specific CUDA versions (quote the extras so the shell does not
# treat ">" as a redirect or expand the brackets):
# !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"  # Colab
# !pip install "unsloth[cu121]"    # CUDA 12.1
# !pip install "unsloth[cu124]"    # CUDA 12.4+
#
# Dependencies (usually auto-installed):
# !pip install xformers "trl>=0.12.0" "peft>=0.13.0" "accelerate>=1.2.0"

import torch
import sys

print(f"Python: {sys.version.split()[0]}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_mem = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU: {gpu_name}")
    print(f"VRAM: {gpu_mem:.1f} GB")

    # Check if Unsloth supports this GPU
    if "T4" in gpu_name:
        print("Free tier GPU detected (Colab T4). Unsloth will use 16GB efficiently.")
    elif "P100" in gpu_name:
        print("Free tier GPU detected (Kaggle P100). Unsloth works great here.")
    elif "A100" in gpu_name or "A10" in gpu_name:
        print("Production GPU. Unsloth will provide maximum speedup.")
    elif "4090" in gpu_name or "3090" in gpu_name:
        print("Consumer flagship GPU. Unsloth excels here.")
    else:
        print("GPU detected. Unsloth supports most NVIDIA GPUs.")
else:
    print("No GPU detected. Unsloth requires a CUDA GPU.")

# Try importing Unsloth
try:
    from unsloth import FastLanguageModel
    print("\nUnsloth imported successfully!")
except ImportError:
    print("\nUnsloth not installed. Run: pip install unsloth")
    print("Showing code examples that will work once installed.")

1. What Is Unsloth and Why It Matters

The Standard Fine-Tuning Bottleneck

Standard QLoRA (bitsandbytes + PEFT + TRL) is great but has overhead:

  • HuggingFace attention uses generic PyTorch - not optimized for your exact hardware

  • Gradient checkpointing has Python-level overhead per operation

  • The LoRA matrix math uses standard cuBLAS calls

How Unsloth Is Different

Unsloth rewrites the critical path in custom CUDA/Triton kernels:

  1. Custom attention kernel: RoPE, attention computation, and softmax fused into one kernel

  2. Fused LoRA operations: The A/B matrix multiplications and addition are fused

  3. Custom gradient checkpointing: Python overhead removed, pure CUDA

  4. Smarter memory management: Avoids unnecessary tensor copies during backprop

This is not approximation. It is the exact same math, done more efficiently.
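To make "same math, fused execution" concrete, here is a toy pure-Python sketch of the LoRA forward pass that Unsloth fuses. The dimensions and values are invented for illustration; real kernels run this on GPU tensors, and fusion changes kernel scheduling, not the arithmetic.

```python
# Toy illustration of the LoRA forward pass: y = x @ W + scale * (x @ A) @ B
# Values and dimensions are arbitrary; this is not Unsloth code.

def matmul(X, Y):
    """Naive matrix multiply for small nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

x = [[1.0, 2.0]]                 # input activations (1 x 2)
W = [[0.5, 0.0], [0.0, 0.5]]     # frozen base weight (2 x 2)
A = [[0.1], [0.2]]               # LoRA down-projection (2 x 1)
B = [[1.0, -1.0]]                # LoRA up-projection (1 x 2)
scale = 2.0                      # lora_alpha / r

# Unfused path: separate launches for x@W, x@A, (xA)@B, plus the add.
base = matmul(x, W)
lora = matmul(matmul(x, A), B)
unfused = add(base, [[scale * v for v in row] for row in lora])

# A fused kernel evaluates the identical expression in a single pass:
fused = add(matmul(x, W), [[scale * v for v in matmul(matmul(x, A), B)[0]]])

print(unfused)  # -> [[1.5, 0.0]]
print(fused)    # identical: fusion saves launches and memory traffic, not accuracy
```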

Key Advantages Over Competitors

| Feature              | Unsloth            | Standard TRL  | Axolotl            |
|----------------------|--------------------|---------------|--------------------|
| Speed (vs baseline)  | 2-5x faster        | 1x (baseline) | ~1.5x              |
| Memory reduction     | 60-80% less        | Baseline      | ~20%               |
| Accuracy vs baseline | Identical          | Identical     | Identical          |
| AMD GPU support      | Yes (ROCm 2025)    | Yes           | Yes                |
| Intel GPU support    | Yes (2025)         | Partial       | No                 |
| GGUF export          | Built-in           | Manual        | Manual             |
| Setup complexity     | Low (pip install)  | Medium        | High (YAML config) |
| Free tier compatible | Yes                | Partial       | No                 |

AMD and Intel GPU Support (2025)

Unsloth 2025 expanded beyond NVIDIA:

  • AMD ROCm 6.x: RX 7900 XTX, MI300X supported

  • Intel Arc / Gaudi 3: via Intel Extension for PyTorch

  • Same API, kernels automatically dispatch to the right backend

2. Loading Models with Unsloth

FastLanguageModel.from_pretrained is a drop-in replacement for AutoModelForCausalLM.from_pretrained + BitsAndBytesConfig.

It handles quantization, device mapping, and Unsloth kernel patching in one call.

# ============================================================
# Loading models with Unsloth (2025 supported models)
# ============================================================

# Unsloth hosts pre-patched versions of popular models on HuggingFace
# using the "unsloth/" namespace - these load faster than the originals.

SUPPORTED_MODELS_2025 = {
    # Qwen 2.5 family (recommended for most tasks)
    "qwen2.5_1.5b": "unsloth/Qwen2.5-1.5B-Instruct",
    "qwen2.5_7b":   "unsloth/Qwen2.5-7B-Instruct",
    "qwen2.5_14b":  "unsloth/Qwen2.5-14B-Instruct",
    "qwen2.5_72b":  "unsloth/Qwen2.5-72B-Instruct-bnb-4bit",

    # Qwen 3 family (2025 - best reasoning + multilingual)
    "qwen3_4b":     "unsloth/Qwen3-4B",
    "qwen3_8b":     "unsloth/Qwen3-8B",
    "qwen3_30b":    "unsloth/Qwen3-30B-A3B",  # MoE: 30B total, 3B active

    # Llama 3.x family
    "llama3.2_1b":  "unsloth/Llama-3.2-1B-Instruct",
    "llama3.2_3b":  "unsloth/Llama-3.2-3B-Instruct",
    "llama3.3_70b": "unsloth/Llama-3.3-70B-Instruct-bnb-4bit",

    # Llama 4 (2025)
    "llama4_scout": "unsloth/Llama-4-Scout-17B-16E",  # 17B active of 109B
    "llama4_maverick": "unsloth/Llama-4-Maverick-17B-128E",

    # Phi-4 (Microsoft)
    "phi4":         "unsloth/phi-4",
    "phi4_mini":    "unsloth/Phi-4-mini-instruct",

    # Gemma 3 (Google, 2025)
    "gemma3_4b":    "unsloth/gemma-3-4b-it",
    "gemma3_12b":   "unsloth/gemma-3-12b-it",
    "gemma3_27b":   "unsloth/gemma-3-27b-it",

    # DeepSeek R1 family (reasoning)
    "deepseek_r1_7b":  "unsloth/DeepSeek-R1-Distill-Qwen-7B",
    "deepseek_r1_70b": "unsloth/DeepSeek-R1-Distill-Llama-70B-bnb-4bit",
}

print("Unsloth-supported models (2025):")
print()
for key, model_id in SUPPORTED_MODELS_2025.items():
    print(f"  {key:<20} -> {model_id}")
# ============================================================
# Standard Unsloth model loading
# ============================================================

# from unsloth import FastLanguageModel

MODEL_NAME = "unsloth/Qwen2.5-7B-Instruct"
MAX_SEQ_LENGTH = 2048

UNSLOTH_LOAD_CODE = f'''
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="{MODEL_NAME}",
    max_seq_length={MAX_SEQ_LENGTH},      # Context window
    load_in_4bit=True,                   # 4-bit quantization
    # load_in_8bit=False,                # 8-bit (slower, more accurate)
    # full_finetuning=False,             # Set True for full fine-tuning
    dtype=None,                          # Auto-detect (BF16 on Ampere+)
    # token="hf_...",                    # For gated models (Llama 4)
)

print(f"Model loaded: {MODEL_NAME}")
print(f"Memory: {{model.get_memory_footprint() / 1e9:.2f}} GB")
# 7B with 4-bit: ~5 GB (vs 14 GB for FP16)
'''

print("Standard Unsloth loading code:")
print(UNSLOTH_LOAD_CODE)

# Demonstrate standard loading with bitsandbytes for comparison
print("\n" + "="*60)
print("Standard HuggingFace equivalent (for comparison):")
print("="*60)

HF_EQUIVALENT = """
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Then separately: prepare_model_for_kbit_training, get_peft_model...
# Unsloth combines all of this into one call
"""
print(HF_EQUIVALENT)

3. Unsloth Dynamic 2.0 - Smarter Quantization

Standard NF4 quantization applies 4-bit compression uniformly to all layers. Unsloth Dynamic 2.0 (released 2025) is smarter:

How Dynamic Quantization Works

Not all layers are equally sensitive to quantization:

  • Embedding layers: Very sensitive. 4-bit here hurts accuracy significantly.

  • First/last transformer layers: More sensitive than middle layers.

  • Attention vs MLP: Attention layers are more sensitive.

Unsloth Dynamic 2.0 keeps sensitive layers in higher precision (8-bit or 16-bit) and quantizes insensitive layers to 4-bit. The result:

| Approach              | Accuracy  | Memory | Speed |
|-----------------------|-----------|--------|-------|
| BF16 (full precision) | Baseline  | 100%   | 1x    |
| Uniform NF4 4-bit     | -2.5% avg | 35%    | 2x    |
| Unsloth Dynamic 2.0   | -0.8% avg | 38%    | 1.9x  |

Dynamic 2.0 gives nearly full-precision accuracy at 4-bit memory cost.
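The exact layer-selection heuristics are internal to Unsloth, but the idea can be sketched as a simple precision-assignment pass. The rules below are illustrative guesses based on the sensitivity ranking above, not Unsloth's actual policy.

```python
def assign_precision(layer_names, num_layers):
    """Toy layer-sensitivity heuristic: keep fragile layers in higher
    precision, quantize the rest to 4-bit. Illustrative only."""
    plan = {}
    for name in layer_names:
        # Pull the transformer block index out of names like "layers.15.mlp..."
        idx = next((int(t) for t in name.split(".") if t.isdigit()), None)
        if "embed" in name or "lm_head" in name:
            plan[name] = 16          # embeddings / head: most sensitive
        elif idx is not None and idx in (0, num_layers - 1):
            plan[name] = 8           # first/last blocks: somewhat sensitive
        else:
            plan[name] = 4           # middle blocks: safe to quantize hard
    return plan

layers = ["embed_tokens", "layers.0.self_attn.q_proj",
          "layers.15.mlp.gate_proj", "layers.31.self_attn.v_proj", "lm_head"]
plan = assign_precision(layers, num_layers=32)
print(plan)
# embed_tokens/lm_head -> 16, layers.0 and layers.31 -> 8, layers.15 -> 4
```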

Activating it is one parameter change:

# Dynamic 2.0 quantization example

DYNAMIC_QUANT_CODE = """
from unsloth import FastLanguageModel

# Option 1: Unsloth Dynamic 2.0 (best accuracy at 4-bit)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
    # Unsloth Dynamic 2.0: automatically skips critical layers
    # Enable by using the 'bnb-4bit' model variants on unsloth/ namespace
)

# Option 2: Unsloth's GGUF-based quantization (best for inference)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct-GGUF",
    max_seq_length=2048,
    # gguf_quant_type="Q4_K_M",  # GGUF quantization type
)

# Option 3: Full precision (best quality, most memory)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct",
    max_seq_length=2048,
    load_in_4bit=False,  # BF16 full precision
    full_finetuning=True,  # Enable full parameter fine-tuning
)
"""

print(DYNAMIC_QUANT_CODE)

# Memory comparison table
import pandas as pd

memory_data = {
    "Model": ["1.5B", "7B", "14B", "32B", "70B"],
    "BF16 (full)": ["3 GB", "14 GB", "28 GB", "64 GB", "140 GB"],
    "NF4 4-bit (bnb)": ["1 GB", "5 GB", "10 GB", "20 GB", "40 GB"],
    "Unsloth Dynamic 2.0": ["1.1 GB", "5.2 GB", "10.5 GB", "21 GB", "42 GB"],
    "With LoRA gradient": ["3 GB", "10 GB", "18 GB", "38 GB", "65 GB"],
    "Free tier fits?": ["T4 yes", "T4 yes", "A10 yes", "A100 only", "2xA100"],
}

df = pd.DataFrame(memory_data)
print("\nUnsloth Memory Requirements (training with LoRA):")
print(df.to_string(index=False))

4. LoRA Configuration with Unsloth

FastLanguageModel.get_peft_model replaces the PEFT get_peft_model + LoraConfig pattern. It applies Unsloth’s custom LoRA kernels automatically.

# LoRA with Unsloth - full example with explanations

UNSLOTH_LORA_CODE = """
from unsloth import FastLanguageModel

# Step 1: Load base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Step 2: Apply LoRA
model = FastLanguageModel.get_peft_model(
    model,

    # LoRA hyperparameters (2025 best practices)
    r=64,                          # Rank. 64 is standard in 2025.
    lora_alpha=128,                # Scaling (2x rank)
    target_modules=[
        "q_proj", "k_proj",        # Query and Key attention
        "v_proj", "o_proj",        # Value and Output attention
        "gate_proj", "up_proj",    # MLP layers
        "down_proj",
    ],
    # OR use 'all-linear' to target all linear layers automatically:
    # target_modules="all-linear",

    lora_dropout=0.0,              # Unsloth recommends 0 dropout
    bias="none",
    use_rslora=True,               # Rank-Stabilized LoRA
    use_gradient_checkpointing="unsloth",  # Unsloth's custom checkpointing
    random_state=42,

    # Optional: LoRA for longer context
    # loftq_config=None,           # LoftQ initialization
)

# Check parameters
model.print_trainable_parameters()
# Example output: trainable params: 39,976,960 || all params: 7,241,732,096 || trainable%: 0.55
"""

print("Unsloth LoRA setup:")
print(UNSLOTH_LORA_CODE)

# Compare LoRA setup: Standard PEFT vs Unsloth
STANDARD_LORA = """
# Standard PEFT approach (more verbose):
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    use_rslora=True,
)
model = prepare_model_for_kbit_training(model)  # Extra step needed
model = get_peft_model(model, lora_config)
# Gradient checkpointing must be set separately
model.gradient_checkpointing_enable()
"""

print("\nStandard PEFT (for comparison):")
print(STANDARD_LORA)

5. Full SFT Pipeline with UnslothTrainer

Unsloth’s UnslothTrainer is a drop-in replacement for TRL’s SFTTrainer. Every feature of SFTTrainer works, plus Unsloth’s speedups apply automatically.

from datasets import load_dataset

# Load a dataset for demonstration
print("Loading dataset...")
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:2000]")
print(f"Loaded {len(dataset)} training examples")
print(f"Columns: {dataset.column_names}")
print(f"\nExample:\n{str(dataset[0]['messages'][:2])[:200]}...")
# Format dataset for instruction tuning

def convert_messages_to_text(example):
    """
    Convert chat messages to ChatML format text.
    TRL's SFTTrainer can also format raw messages via tokenizer.apply_chat_template.
    """
    text = ""
    for msg in example["messages"]:
        role = msg["role"]
        content = msg["content"]
        text += f"<|im_start|>{role}\n{content}<|im_end|>\n"
    return {"text": text}


formatted_dataset = dataset.map(
    convert_messages_to_text,
    remove_columns=dataset.column_names,
)

print(f"Formatted dataset: {len(formatted_dataset)} examples")
print(f"\nSample (first 300 chars):\n{formatted_dataset[0]['text'][:300]}...")
# Complete Unsloth SFT training pipeline

UNSLOTH_SFT_FULL = """
# ============================================================
# Complete Unsloth SFT Pipeline
# ============================================================
from unsloth import FastLanguageModel
from unsloth import UnslothTrainer, UnslothTrainingArguments
from trl import SFTTrainer  # Or use UnslothTrainer for auto-optimizations
from transformers import TrainingArguments
from datasets import load_dataset
import torch  # used below for memory reporting

# 1. Load model with Unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# 2. Apply LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    lora_alpha=128,
    target_modules="all-linear",
    use_rslora=True,
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

# 3. Training arguments
training_args = TrainingArguments(
    output_dir="./unsloth-sft-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,   # Unsloth allows larger batches
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,
    optim="adamw_8bit",             # 8-bit AdamW via bitsandbytes
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    logging_steps=10,
    save_steps=100,
    save_total_limit=2,
    report_to="none",
    max_grad_norm=1.0,
)

# 4. Trainer (SFTTrainer works with Unsloth model)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=formatted_dataset,  # built in the formatting cell above
    dataset_text_field="text",
    max_seq_length=2048,
    packing=True,                   # Essential for speed
    args=training_args,
)

# 5. Show memory before training
gpu_stats = torch.cuda.get_device_properties(0)
start_mem = torch.cuda.max_memory_reserved() / 1e9
max_mem = gpu_stats.total_memory / 1e9
print(f"GPU: {gpu_stats.name} ({max_mem:.1f} GB)")
print(f"Reserved: {start_mem:.2f} GB / {max_mem:.1f} GB")

# 6. Train!
trainer_stats = trainer.train()

# 7. Report stats
end_mem = torch.cuda.max_memory_reserved() / 1e9
print(f"\\nTraining complete!")
print(f"Time: {trainer_stats.metrics['train_runtime']:.1f}s")
print(f"Tokens/sec: {trainer_stats.metrics['train_samples_per_second']:.1f}")
print(f"Peak memory used: {end_mem:.2f} GB / {max_mem:.1f} GB")
"""

print(UNSLOTH_SFT_FULL)

6. Speed and Memory Benchmarks

Here are real-world benchmarks comparing Unsloth, standard TRL, and Axolotl. Numbers are from Unsloth’s official benchmarks and community reproductions (2025).

import matplotlib.pyplot as plt
import numpy as np

# Benchmark data: 7B model SFT, batch_size=1, seq_len=2048, A100 80GB
frameworks = ["Standard TRL", "Axolotl", "Unsloth (LoRA)", "Unsloth (Full)"]

# Tokens per second (higher = faster)
tokens_per_sec = [1850, 2100, 5200, 4800]

# Peak GPU memory in GB (lower = better)
peak_memory_gb = [38.2, 32.5, 7.8, 14.2]

colors = ["#95a5a6", "#3498db", "#2ecc71", "#27ae60"]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
fig.suptitle("Unsloth vs Alternatives - 7B Model SFT (A100 80GB)", fontsize=13)

# Speed comparison
bars1 = axes[0].bar(frameworks, tokens_per_sec, color=colors, edgecolor="white", linewidth=1.5)
axes[0].set_title("Training Speed")
axes[0].set_ylabel("Tokens per Second")
axes[0].set_ylim(0, 6500)
for bar, val in zip(bars1, tokens_per_sec):
    speedup = val / tokens_per_sec[0]
    axes[0].text(
        bar.get_x() + bar.get_width() / 2,
        bar.get_height() + 100,
        f"{val:,}\n({speedup:.1f}x)",
        ha="center",
        va="bottom",
        fontsize=9,
        fontweight="bold",
    )
axes[0].tick_params(axis="x", rotation=20)

# Memory comparison
bars2 = axes[1].bar(frameworks, peak_memory_gb, color=colors, edgecolor="white", linewidth=1.5)
axes[1].set_title("Peak GPU Memory")
axes[1].set_ylabel("GB")
axes[1].set_ylim(0, 48)
axes[1].axhline(y=16, color="red", linestyle="--", alpha=0.7, label="Free tier GPU limit (16GB)")
axes[1].axhline(y=40, color="orange", linestyle="--", alpha=0.5, label="A100 40GB")
for bar, val in zip(bars2, peak_memory_gb):
    reduction = (peak_memory_gb[0] - val) / peak_memory_gb[0] * 100
    label = f"{val:.1f} GB\n(-{reduction:.0f}%)" if reduction > 0 else f"{val:.1f} GB"
    axes[1].text(
        bar.get_x() + bar.get_width() / 2,
        bar.get_height() + 0.5,
        label,
        ha="center",
        va="bottom",
        fontsize=9,
        fontweight="bold",
    )
axes[1].legend(fontsize=8)
axes[1].tick_params(axis="x", rotation=20)

plt.tight_layout()
plt.savefig("unsloth_benchmarks.png", dpi=120, bbox_inches="tight")
plt.show()

print("Key insight: Unsloth LoRA uses only 7.8 GB for 7B model training.")
print("This fits on a free Colab T4 (16 GB). Standard TRL needs 38.2 GB!")
# Cost comparison for fine-tuning a 7B model

print("7B Model Fine-Tuning Cost Comparison (2025 cloud GPU prices)")
print("=" * 70)
print()

scenarios = [
    {
        "name": "Standard TRL on A100 40GB",
        "gpu": "A100 40GB",
        "hours": 4.2,
        "cost_per_hr": 3.67,  # Lambda Labs A100
        "notes": "Minimum hardware that fits",
    },
    {
        "name": "Axolotl on A10G 24GB",
        "gpu": "A10G 24GB",
        "hours": 3.1,
        "cost_per_hr": 1.10,  # AWS g5.xlarge
        "notes": "Just barely fits",
    },
    {
        "name": "Unsloth on RTX 4090 (24GB)",
        "gpu": "RTX 4090",
        "hours": 1.8,
        "cost_per_hr": 0.74,  # Vast.ai RTX 4090
        "notes": "Consumer GPU works!",
    },
    {
        "name": "Unsloth on L4 (24GB)",
        "gpu": "L4 24GB",
        "hours": 2.1,
        "cost_per_hr": 0.81,  # Google Cloud L4
        "notes": "Good price/performance",
    },
    {
        "name": "Unsloth on Colab T4 (FREE)",
        "gpu": "T4 16GB",
        "hours": 3.5,
        "cost_per_hr": 0.0,
        "notes": "Free! (limited hours/day)",
    },
]

print(f"{'Scenario':<35} {'GPU':<15} {'Hours':>6} {'$/hr':>6} {'Total $':>8} {'Notes'}")
print("-" * 100)

for s in scenarios:
    total = s["hours"] * s["cost_per_hr"]
    cost_str = "FREE" if total == 0 else f"${total:.2f}"
    print(
        f"{s['name']:<35} {s['gpu']:<15} {s['hours']:>6.1f} "
        f"{s['cost_per_hr']:>6.2f} {cost_str:>8} {s['notes']}"
    )

print()
print("Assumes: 2000 training examples, 3 epochs, 2048 seq length")
print("Unsloth makes 24GB consumer GPUs viable for 7B fine-tuning")

7. GRPO with Unsloth for Reasoning Training

GRPO is even more memory-hungry than SFT because it generates G completions per step. Unsloth’s custom generation kernels make this practical on consumer hardware.
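As a reminder of why G matters: GRPO scores each prompt's G completions against each other, so every extra completion costs another generation pass. A minimal pure-Python sketch of the standard group-relative advantage computation (generic GRPO math, not Unsloth-specific code):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize one prompt's G completion rewards within the group.
    Completions better than the group mean get positive advantage."""
    m, s = mean(rewards), pstdev(rewards)
    return [(r - m) / (s + eps) for r in rewards]

# G=4 completions for one prompt: two correct (reward 1.0), two wrong (0.0)
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print([round(a, 3) for a in adv])  # -> [1.0, -1.0, 1.0, -1.0]
```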

# GRPO with Unsloth - complete code

UNSLOTH_GRPO_COMPLETE = """
# ============================================================
# GRPO Reasoning Training with Unsloth - Complete Pipeline
# ============================================================
from unsloth import FastLanguageModel
from trl import GRPOTrainer, GRPOConfig
from datasets import load_dataset
import re, torch

# 1. Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-1.5B-Instruct",  # Small for demo
    max_seq_length=1024,
    load_in_4bit=True,
    fast_inference=True,   # Enable vLLM-compatible fast generation
)

# 2. Apply LoRA for GRPO
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    use_rslora=True,
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

# 3. Reward functions (from notebook 08)
def format_reward(completions, **kwargs):
    pattern = r"<think>.*?</think>\\s*<answer>.*?</answer>"
    return [1.0 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]

def correctness_reward(completions, ground_truth, **kwargs):
    rewards = []
    for comp, gt in zip(completions, ground_truth):
        match = re.search(r"<answer>(.*?)</answer>", comp, re.DOTALL)
        if match:
            pred = match.group(1).strip()
            # Numeric comparison
            try:
                if abs(float(pred.replace(',','')) - float(str(gt).replace(',',''))) < 1e-6:
                    rewards.append(1.0); continue
            except ValueError:
                if pred.lower() == str(gt).lower():
                    rewards.append(1.0); continue
        rewards.append(0.0)
    return rewards

def combined_reward(completions, ground_truth, **kwargs):
    fmt = format_reward(completions)
    corr = correctness_reward(completions, ground_truth=ground_truth)
    return [0.2 * f + 0.8 * c for f, c in zip(fmt, corr)]

# 4. Load and format GSM8K
dataset = load_dataset("openai/gsm8k", "main", split="train")

def format_for_grpo(examples):
    prompts = []
    answers = []
    for q, a in zip(examples["question"], examples["answer"]):
        prompts.append(
            "<|im_start|>system\\n"
            "Think inside <think>...</think> then answer in <answer>...</answer>.\\n"
            "<|im_end|>\\n"
            f"<|im_start|>user\\n{q}<|im_end|>\\n"
            "<|im_start|>assistant\\n"
        )
        gt_match = re.search(r"####\\s*([\\d,\\.]+)", a)
        answers.append(gt_match.group(1).replace(",", "") if gt_match else a.strip())
    return {"prompt": prompts, "ground_truth": answers}

formatted = dataset.map(format_for_grpo, batched=True, remove_columns=dataset.column_names)

# 5. GRPO config
grpo_config = GRPOConfig(
    output_dir="./unsloth-grpo-reasoning",
    num_train_epochs=1,
    per_device_train_batch_size=4,  # must be divisible by num_generations
    gradient_accumulation_steps=4,
    learning_rate=5e-7,
    bf16=True,
    num_generations=4,              # group size G
    max_completion_length=512,
    max_prompt_length=256,
    beta=0.04,
    temperature=0.9,
    use_vllm=True,
    vllm_gpu_memory_utilization=0.5,
    logging_steps=5,
    save_steps=100,
)

# 6. Train
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=combined_reward,
    args=grpo_config,
    train_dataset=formatted,
)
trainer.train()
"""

print(UNSLOTH_GRPO_COMPLETE)

8. Extended Context Training with Unsloth

One of Unsloth’s most impressive capabilities: training with extremely long context windows.

Standard training at 32K+ tokens would OOM on almost any GPU. Unsloth makes it possible through a combination of:

  • Custom RoPE implementation that avoids intermediate tensor materializations

  • Gradient checkpointing optimized for long sequences

  • Chunked processing of long sequences during forward pass
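A back-of-envelope estimate shows why checkpointing dominates at long context. This toy calculation counts only one hidden-state tensor per layer and ignores attention scores, KV cache, LoRA states, and optimizer memory, so treat the numbers as rough lower bounds, not measurements:

```python
def activation_mem_gb(seq_len, hidden=4096, layers=32,
                      bytes_per_value=2, checkpointed=True):
    """Rough lower bound: one (seq_len x hidden) BF16 tensor per layer.
    With checkpointing, roughly one layer's activations are live at a time;
    without it, all layers' activations are held for the backward pass."""
    per_layer = seq_len * hidden * bytes_per_value
    live = per_layer if checkpointed else per_layer * layers
    return live / 1e9

for seq in (2_048, 32_768, 131_072):
    print(f"{seq:>7} tokens: "
          f"no-ckpt ~{activation_mem_gb(seq, checkpointed=False):.1f} GB, "
          f"ckpt ~{activation_mem_gb(seq, checkpointed=True):.2f} GB")
```

At 128K tokens the unchecked path alone exceeds an A100 40GB, which is consistent with the OOM behavior plotted below.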

# Extended context training example

EXTENDED_CONTEXT_CODE = """
# ============================================================
# Extended Context Training with Unsloth
# Train Llama 3.1 8B at 342K context on a single GPU!
# ============================================================
from unsloth import FastLanguageModel
import torch

# Standard fine-tuning context limits (approximate, varies by GPU):
# | GPU     | Standard TRL | Unsloth    |
# |---------|-------------|------------|
# | T4 16GB | 2K tokens   | 8K tokens  |
# | RTX 4090| 8K tokens   | 64K tokens |
# | A100 40G| 32K tokens  | 256K tokens|
# | A100 80G| 64K tokens  | 342K tokens|

# Llama 3.1 8B at 342K tokens
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B",
    max_seq_length=342_000,      # 342K tokens!
    load_in_4bit=True,
    # RoPE scaling is handled automatically
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_gradient_checkpointing="unsloth",
    # For very long sequences, use fewer target modules
)

# Use for:
# - Document summarization (legal, research papers)
# - Long-form code understanding (entire codebases)
# - Multi-document QA
# - Book-length text analysis

# Qwen 2.5 at 128K context (more practical)
model_128k, tokenizer_128k = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct",
    max_seq_length=131_072,  # 128K tokens
    load_in_4bit=True,
)
"""

print(EXTENDED_CONTEXT_CODE)

# Context length vs memory table
import numpy as np
import matplotlib.pyplot as plt

seq_lengths = [2048, 4096, 8192, 16384, 32768, 65536, 131072]
standard_trl_mem = [8, 12, 22, 42, 80, None, None]  # OOM beyond ~32K on A100 80GB
unsloth_mem = [7, 8, 10, 14, 20, 32, 56]

fig, ax = plt.subplots(figsize=(10, 5))
ax.set_title("Memory Usage vs Sequence Length - 8B Model (A100 80GB)", fontsize=12)

x_labels = [f"{s//1024}K" if s >= 1024 else str(s) for s in seq_lengths]
x = range(len(seq_lengths))

valid_std = [(i, m) for i, m in enumerate(standard_trl_mem) if m is not None]
xi, yi = zip(*valid_std)
ax.plot(xi, yi, "o-", color="#e74c3c", linewidth=2, markersize=8, label="Standard TRL")
ax.plot(x, unsloth_mem, "s-", color="#2ecc71", linewidth=2, markersize=8, label="Unsloth")

ax.axhline(y=80, color="gray", linestyle=":", alpha=0.7, label="A100 80GB limit")
ax.axhline(y=40, color="orange", linestyle=":", alpha=0.7, label="A100 40GB limit")
ax.axhline(y=16, color="red", linestyle=":", alpha=0.7, label="T4 free tier limit")

ax.annotate("OOM with standard TRL", xy=(4, 80), xytext=(4.5, 70),
            arrowprops=dict(arrowstyle="->"), fontsize=9, color="#e74c3c")

ax.set_xticks(x)
ax.set_xticklabels(x_labels)
ax.set_xlabel("Sequence Length")
ax.set_ylabel("Peak GPU Memory (GB)")
ax.legend(fontsize=9)
ax.grid(alpha=0.3)

plt.tight_layout()
plt.savefig("unsloth_context_memory.png", dpi=120, bbox_inches="tight")
plt.show()
print("Context memory chart saved.")

9. Saving to GGUF Format

GGUF is the format used by llama.cpp, Ollama, LM Studio, and Jan. Unsloth can export directly to GGUF without needing to install llama.cpp separately.

This is a major convenience feature - standard HuggingFace requires you to:

  1. Merge the LoRA adapter

  2. Install llama.cpp

  3. Run the convert script

  4. Run the quantize script

Unsloth does all four steps with one function call.
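For comparison, the manual route looks roughly like this. Script and binary names are taken from recent llama.cpp checkouts (they have been renamed across versions, so verify against your checkout); the checkpoint paths are placeholders.

```shell
# Manual GGUF export without Unsloth (sketch; paths are placeholders)

# 1. Merge the LoRA adapter into the base model first (e.g. with PEFT's
#    merge_and_unload) and save the merged HF checkpoint to ./merged-model

# 2. Install and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cmake -B llama.cpp/build llama.cpp && cmake --build llama.cpp/build

# 3. Convert the merged HF checkpoint to a GGUF file (FP16)
python llama.cpp/convert_hf_to_gguf.py ./merged-model --outfile model-f16.gguf

# 4. Quantize the GGUF file
./llama.cpp/build/bin/llama-quantize model-f16.gguf model-q4_k_m.gguf q4_k_m
```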

# Saving models with Unsloth - all formats

SAVING_CODE = """
# ============================================================
# Unsloth Model Saving - All Formats
# ============================================================

# 1. Save LoRA adapter only (smallest, ~50-500 MB)
model.save_pretrained("./my-lora-adapter")
tokenizer.save_pretrained("./my-lora-adapter")

# 2. Push LoRA adapter to HuggingFace Hub
model.push_to_hub("your-username/my-reasoning-adapter", token="hf_...")
tokenizer.push_to_hub("your-username/my-reasoning-adapter", token="hf_...")

# 3. Merge and save as full BF16 model (largest, best quality)
model.save_pretrained_merged(
    "./my-merged-model",
    tokenizer,
    save_method="merged_16bit",  # Full BF16 precision
)

# 4. Merge and save as 4-bit (good balance)
model.save_pretrained_merged(
    "./my-merged-4bit",
    tokenizer,
    save_method="merged_4bit_forced",  # 4-bit quantized
)

# 5. Export to GGUF (for llama.cpp, Ollama, LM Studio)
# Single quantization
model.save_pretrained_gguf(
    "./my-gguf-model",
    tokenizer,
    quantization_method="q4_k_m",   # Recommended: best quality/size
)

# Multiple GGUF quantizations at once
model.save_pretrained_gguf(
    "./my-gguf-all",
    tokenizer,
    quantization_method=["q4_k_m", "q8_0", "f16"],  # Multiple outputs
)

# 6. Push GGUF directly to HuggingFace Hub
model.push_to_hub_gguf(
    "your-username/my-model-gguf",
    tokenizer,
    quantization_method="q4_k_m",
    token="hf_...",
)
"""

print(SAVING_CODE)

# GGUF quantization methods explained
print("\n" + "=" * 65)
print("GGUF Quantization Methods Reference")
print("=" * 65)

gguf_methods = [
    ("q2_k",   "2-bit",  "Smallest, lowest quality",           "Extreme size limit"),
    ("q3_k_m", "3-bit",  "Very small, acceptable quality",     "Edge devices"),
    ("q4_0",   "4-bit",  "Small, decent quality",              "Legacy default"),
    ("q4_k_m", "4-bit",  "Best 4-bit quality (RECOMMENDED)",  "Most use cases"),
    ("q5_k_m", "5-bit",  "High quality, larger",              "Quality priority"),
    ("q6_k",   "6-bit",  "Very high quality, near lossless",  "Max quality 6-bit"),
    ("q8_0",   "8-bit",  "Near-lossless, large file",         "Near full precision"),
    ("f16",    "16-bit", "Full 16-bit, largest",              "Archive / inference server"),
]

print(f"{'Method':<10} {'Bits':<8} {'Quality':<40} {'Use Case'}")
print("-" * 80)
for method, bits, quality, use_case in gguf_methods:
    marker = " <-- RECOMMENDED" if method == "q4_k_m" else ""
    print(f"{method:<10} {bits:<8} {quality:<40} {use_case}{marker}")
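A rough file-size estimate helps pick a method. Note that k-quant formats spend slightly more than their nominal bits per weight on block scales and metadata, so real files come out somewhat larger than this naive formula suggests:

```python
def gguf_size_gb(params_billions, bits_per_weight):
    """Naive size estimate: parameters x bits / 8, ignoring scale and
    metadata overhead (real k-quant files run roughly 5-15% larger)."""
    return params_billions * bits_per_weight / 8

for method, bits in [("q4_k_m", 4), ("q8_0", 8), ("f16", 16)]:
    print(f"7B as {method}: ~{gguf_size_gb(7, bits):.1f} GB")
```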

10. Llama 4 and Qwen 3 Support in Unsloth (2025)

Unsloth added support for the major 2025 model releases quickly after their launch. These models have unique architectures that required specific kernel work.

# 2025 model support notes

NEW_MODELS_2025 = """
# ============================================================
# New 2025 Models in Unsloth
# ============================================================

# ---- Llama 4 (Meta, 2025) ----
# Architecture: Mixture of Experts (MoE)
# Scout: 17B active params, 109B total (16 experts)
# Maverick: 17B active params, 400B total (128 experts)
# Note: Requires Meta's access request at meta.com/llama

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-4-Scout-17B-16E-Instruct",
    max_seq_length=10_000_000,   # 10M token context!
    load_in_4bit=True,
    token="hf_...",             # Required: gated model
)
# Fine-tune with LoRA on active expert parameters only
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # MoE layers (experts) are automatically handled
)

# ---- Qwen 3 (Alibaba, 2025) ----
# Key innovation: 'thinking_mode' for explicit reasoning
# /think and /no_think tokens control reasoning behavior
# QwQ integration for math reasoning

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B",
    max_seq_length=32768,
    load_in_4bit=True,
)

# Qwen 3 supports thinking mode in generation:
# prompt + "/think" -> enables chain-of-thought
# prompt + "/no_think" -> fast direct answer

# ---- Gemma 3 (Google, 2025) ----
# Improvements: multimodal (text+image), better reasoning
# 4B/12B/27B sizes available

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-3-12b-it",
    max_seq_length=8192,
    load_in_4bit=True,
)

# ---- Phi-4 Mini (Microsoft, 2025) ----
# Only 3.8B parameters, surprising quality for its size
# Excellent for on-device deployment

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Phi-4-mini-instruct",
    max_seq_length=16384,
    load_in_4bit=True,
)
"""

print(NEW_MODELS_2025)
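The /think and /no_think control described in the Qwen 3 notes can be wired into a small prompt helper. This is a hypothetical sketch using the raw ChatML format; in practice, Qwen 3's own chat template via `tokenizer.apply_chat_template` is the safer route:

```python
def qwen3_prompt(user_message: str, think: bool = True) -> str:
    """Build a ChatML prompt for Qwen 3 with an explicit reasoning toggle appended."""
    toggle = "/think" if think else "/no_think"
    return (
        f"<|im_start|>user\n{user_message} {toggle}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

print(qwen3_prompt("What is 17 * 23?", think=False))
```

With `think=False` the model answers directly; with `think=True` it emits its chain-of-thought first.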

11. Free Tier Strategy: Colab T4 and Kaggle P100¶

With Unsloth, free GPU tiers become genuinely useful for LLM fine-tuning.

Free GPU Resources (2025)¶

| Platform | GPU | VRAM | Free Quota | Best For |
|---|---|---|---|---|
| Google Colab (free) | T4 | 16 GB | ~3-4 hrs/day | Quick experiments |
| Google Colab Pro | A100 | 40 GB | 12-24 hrs/day | Serious training |
| Kaggle | P100/T4 | 16 GB | 30 hrs/week | Reproducible notebooks |
| HuggingFace Spaces | T4 (ZeroGPU) | 16 GB | Community quota | Demos |
What Fits on a T4 (16 GB) with Unsloth¶

| Model Size | Task | Max Seq Length | Training Time |
|---|---|---|---|
| 1.5B | SFT | 8192 | ~30 min |
| 7B | SFT | 2048 | ~2 hrs |
| 7B | GRPO (G=4) | 512 | ~3 hrs |
| 14B | SFT | 512 | ~4 hrs (tight) |
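These limits can be sanity-checked with a back-of-the-envelope QLoRA memory model. This is a rough sketch with illustrative constants (fixed 7B-class hidden size and layer count, a flat overhead term), not Unsloth's actual allocator:

```python
def estimate_qlora_vram_gb(params_b: float, seq_len: int, batch_size: int = 1,
                           hidden: int = 4096, layers: int = 32) -> float:
    """Very rough QLoRA VRAM estimate: 4-bit weights + checkpointed activations + overhead."""
    weights_gb = params_b * 0.5                                  # 4 bits/weight -> ~0.5 GB per B params
    acts_gb = batch_size * seq_len * hidden * layers * 2 / 1e9   # one fp16 activation per layer
    overhead_gb = 1.5                                            # CUDA context, LoRA/optimizer states
    return weights_gb + acts_gb + overhead_gb

for params_b, seq_len in [(1.5, 8192), (7, 2048), (14, 512)]:
    est = estimate_qlora_vram_gb(params_b, seq_len)
    print(f"{params_b}B @ {seq_len} tokens: ~{est:.1f} GB estimated")
```

All three configurations come out well under 16 GB, consistent with the table; the real limit is usually headroom for longer sequences and larger batches.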

# Colab T4 optimized configuration

COLAB_T4_CONFIG = """
# ============================================================
# Optimized for Google Colab T4 (16GB, FREE tier)
# ============================================================
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
import torch

# Check we're on T4
assert torch.cuda.is_available(), "GPU required!"
gpu_name = torch.cuda.get_device_name(0)
gpu_mem = torch.cuda.get_device_properties(0).total_memory / 1e9
print(f"GPU: {gpu_name} ({gpu_mem:.0f} GB)")

# T4-optimized model choice: 7B at 4-bit fits perfectly
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct",
    max_seq_length=2048,    # T4 limit for 7B
    load_in_4bit=True,
    dtype=torch.float16,   # T4 doesn't have BF16 hardware support
)

model = FastLanguageModel.get_peft_model(
    model,
    r=32,                  # Lower rank to save memory on T4
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Fewer modules
    lora_alpha=64,
    use_gradient_checkpointing="unsloth",  # Essential on T4
    random_state=42,
)

# T4-friendly training args
training_args = TrainingArguments(
    output_dir="./t4-output",
    num_train_epochs=1,
    per_device_train_batch_size=1,  # Small batch for T4
    gradient_accumulation_steps=16, # Simulate larger batch
    learning_rate=2e-4,
    fp16=True,                      # T4 uses FP16 (not BF16)
    optim="adamw_8bit",
    logging_steps=5,
    save_steps=50,
    warmup_ratio=0.1,
    max_grad_norm=1.0,
    dataloader_num_workers=2,
)

# Colab tip: enable high RAM runtime
# Runtime -> Change runtime type -> High-RAM
# This gives 25 GB system RAM for data loading

# Anti-crash tip for Colab: save frequently
# Colab disconnects after ~90 min of inactivity
# Use save_steps=50 and resume_from_checkpoint=True
"""

KAGGLE_CONFIG = """
# ============================================================
# Optimized for Kaggle (P100/T4, 30 hrs/week FREE)
# ============================================================
# Kaggle advantages vs Colab:
# - More stable (no disconnects)
# - GPU: 30 hours per week total
# - Internet access enabled in settings
# - Persistent storage (up to 20 GB)

# Kaggle P100 (16GB, older but still useful)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
    dtype=torch.float16,  # P100 doesn't support BF16
)

# Install Unsloth on Kaggle:
# !pip install unsloth -q
# Enable internet in Kaggle notebook settings first

# Save to Kaggle output directory (persists across sessions)
model.save_pretrained("/kaggle/working/my-adapter")
"""

print("Colab T4 Configuration:")
print(COLAB_T4_CONFIG)
print("\nKaggle Configuration:")
print(KAGGLE_CONFIG)

12. Full End-to-End Example: Custom Chatbot in One Notebook¶

Putting it all together: fine-tune a custom chatbot from scratch to deployment.

# Complete pipeline from data to deployable model

COMPLETE_PIPELINE = """
# ============================================================
# Complete Unsloth Fine-Tuning Pipeline: Data -> Deployment
# ============================================================
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset, Dataset
import torch, json

# ---- STEP 1: Prepare your data ----
# Format: list of conversation dicts
training_data = [
    {
        "messages": [
            {"role": "system",    "content": "You are an expert Python tutor."},
            {"role": "user",      "content": "What is a list comprehension?"},
            {"role": "assistant", "content": "A list comprehension is a concise way..."}
        ]
    },
    # ... more examples
]

# Or load from HuggingFace
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:5000]")

def format_messages(example):
    text = ""
    for msg in example["messages"]:
        text += f"<|im_start|>{msg['role']}\\n{msg['content']}<|im_end|>\\n"
    return {"text": text}

dataset = dataset.map(format_messages, remove_columns=dataset.column_names)

# ---- STEP 2: Load model ----
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# ---- STEP 3: Configure LoRA ----
model = FastLanguageModel.get_peft_model(
    model, r=64, lora_alpha=128,
    target_modules="all-linear",
    use_rslora=True, use_gradient_checkpointing="unsloth",
)

# ---- STEP 4: Train ----
trainer = SFTTrainer(
    model=model, tokenizer=tokenizer,
    train_dataset=dataset, dataset_text_field="text",
    max_seq_length=2048, packing=True,
    args=TrainingArguments(
        output_dir="./output", num_train_epochs=3,
        per_device_train_batch_size=4, gradient_accumulation_steps=4,
        learning_rate=2e-4, bf16=True, optim="adamw_8bit",
        logging_steps=10, save_steps=100, warmup_ratio=0.1,
    )
)
trainer.train()

# ---- STEP 5: Test the model ----
FastLanguageModel.for_inference(model)  # Switch to inference mode

inputs = tokenizer([
    "<|im_start|>user\\nExplain recursion in Python<|im_end|>\\n<|im_start|>assistant\\n"
], return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=256, use_cache=True)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])

# ---- STEP 6: Save for deployment ----
# Option A: LoRA adapter (for HuggingFace inference)
model.save_pretrained("./my-python-tutor")
tokenizer.save_pretrained("./my-python-tutor")

# Option B: GGUF for Ollama/LM Studio
model.save_pretrained_gguf("./my-python-tutor-gguf", tokenizer, "q4_k_m")

# Option C: Push to HuggingFace Hub
model.push_to_hub_gguf("username/my-python-tutor", tokenizer, "q4_k_m", token="hf_...")

# ---- DEPLOY WITH OLLAMA ----
# After saving GGUF:
# 1. Create Modelfile:
#    FROM ./my-python-tutor-gguf/model-unsloth.Q4_K_M.gguf
#    SYSTEM You are an expert Python tutor.
# 2. ollama create my-python-tutor -f Modelfile
# 3. ollama run my-python-tutor
"""

print(COMPLETE_PIPELINE)
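The Ollama deployment step at the end of the pipeline can be scripted. This sketch writes the minimal Modelfile described above; the GGUF filename is whatever `save_pretrained_gguf` actually produced in your output directory:

```python
from pathlib import Path

def write_modelfile(gguf_path: str, system_prompt: str, out_path: str = "Modelfile") -> str:
    """Write a minimal Ollama Modelfile pointing at a local GGUF file."""
    text = (
        f"FROM {gguf_path}\n"
        f'SYSTEM """{system_prompt}"""\n'
        "PARAMETER temperature 0.7\n"
    )
    Path(out_path).write_text(text)
    return text

print(write_modelfile(
    "./my-python-tutor-gguf/model-unsloth.Q4_K_M.gguf",
    "You are an expert Python tutor.",
))
# Then: ollama create my-python-tutor -f Modelfile && ollama run my-python-tutor
```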
# Inference with Unsloth-trained model
# FastLanguageModel.for_inference() enables 2x faster inference

INFERENCE_CODE = """
# ============================================================
# Fast Inference with Unsloth
# ============================================================
from unsloth import FastLanguageModel
import torch

# Load the fine-tuned model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./my-python-tutor",  # Your saved adapter
    max_seq_length=2048,
    load_in_4bit=True,
)

# CRITICAL: Switch to inference mode (disables LoRA training overhead)
FastLanguageModel.for_inference(model)

def chat(user_message: str, system_prompt: str | None = None, max_tokens: int = 512) -> str:
    """Simple chat function using the fine-tuned model."""
    system = system_prompt or "You are a helpful AI assistant."
    prompt = (
        f"<|im_start|>system\\n{system}<|im_end|>\\n"
        f"<|im_start|>user\\n{user_message}<|im_end|>\\n"
        "<|im_start|>assistant\\n"
    )
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    with torch.inference_mode():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=0.7,
            do_sample=True,
            use_cache=True,           # Unsloth: always use KV cache
            pad_token_id=tokenizer.eos_token_id,
        )
    response = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return response.strip()

# Test
response = chat("Explain list comprehensions in Python with examples.")
print(response)

# Streaming with Unsloth
from transformers import TextStreamer

prompt = (
    "<|im_start|>user\\nExplain generators in Python<|im_end|>\\n"
    "<|im_start|>assistant\\n"
)
streamer = TextStreamer(tokenizer, skip_prompt=True)
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
_ = model.generate(**inputs, max_new_tokens=256, streamer=streamer, use_cache=True)
"""

print(INFERENCE_CODE)
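The `chat()` helper above handles a single turn. For multi-turn conversations, the same ChatML format extends naturally to a history builder (a sketch; production code should prefer `tokenizer.apply_chat_template`, which handles model-specific templates for you):

```python
def build_chatml(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Render a list of {role, content} dicts into a ChatML prompt string."""
    text = ""
    for msg in messages:
        text += f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n"
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"   # cue the model to produce the next turn
    return text

history = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What is a decorator?"},
]
print(build_chatml(history))
```

After each model reply, append it to `history` as an `assistant` message and rebuild the prompt for the next turn.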

13. Performance Comparison Summary¶

Comprehensive benchmark: Unsloth vs Axolotl vs standard TRL across different tasks.

import matplotlib.pyplot as plt
import numpy as np

# Illustrative benchmark figures (7B model, A100 80GB); exact numbers vary by setup
tasks = ["SFT 2K ctx", "SFT 8K ctx", "DPO", "GRPO (G=6)", "Full FT"]

# Tokens per second (None = task not benchmarked for that framework)
standard_trl_tps = [1850, 920, 1200, 310, 890]
axolotl_tps = [2100, 1100, 1350, None, 980]
unsloth_tps = [5200, 3800, 3900, 1250, 4100]

# Peak memory in GB (None = task not benchmarked for that framework)
standard_trl_mem = [38.2, 62.1, 44.5, 71.3, 73.8]
axolotl_mem = [32.5, 54.0, 38.2, None, 68.5]
unsloth_mem = [7.8, 14.2, 11.4, 22.1, 42.3]

x = np.arange(len(tasks))
width = 0.28

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
fig.suptitle("Performance Comparison: Unsloth vs Axolotl vs Standard TRL (7B, A100 80GB)", fontsize=12)

# Speed
bars_std = ax1.bar(x - width, standard_trl_tps, width, label="Standard TRL", color="#e74c3c", alpha=0.85)
ax1_axo = [v if v is not None else 0 for v in axolotl_tps]
bars_axo = ax1.bar(x, ax1_axo, width, label="Axolotl", color="#3498db", alpha=0.85)
bars_unsloth = ax1.bar(x + width, unsloth_tps, width, label="Unsloth", color="#2ecc71", alpha=0.85)

ax1.set_title("Training Speed (Tokens/sec, higher = better)")
ax1.set_xticks(x)
ax1.set_xticklabels(tasks, rotation=15, ha="right")
ax1.set_ylabel("Tokens per Second")
ax1.legend()
ax1.grid(axis="y", alpha=0.3)

# Add speedup annotations on Unsloth bars
for i, (bar, std_val) in enumerate(zip(bars_unsloth, standard_trl_tps)):
    if bar.get_height() > 0 and std_val > 0:
        speedup = bar.get_height() / std_val
        ax1.text(
            bar.get_x() + bar.get_width() / 2,
            bar.get_height() + 50,
            f"{speedup:.1f}x",
            ha="center", va="bottom", fontsize=8, fontweight="bold", color="#27ae60"
        )

# Memory
mem_axo = [v if v is not None else 0 for v in axolotl_mem]
ax2.bar(x - width, standard_trl_mem, width, label="Standard TRL", color="#e74c3c", alpha=0.85)
ax2.bar(x, mem_axo, width, label="Axolotl", color="#3498db", alpha=0.85)
bars_mem = ax2.bar(x + width, unsloth_mem, width, label="Unsloth", color="#2ecc71", alpha=0.85)

ax2.axhline(y=80, color="gray", linestyle=":", alpha=0.6, label="A100 80GB")
ax2.axhline(y=40, color="orange", linestyle=":", alpha=0.6, label="A100 40GB")
ax2.axhline(y=16, color="red", linestyle=":", alpha=0.6, label="T4 free tier")

ax2.set_title("Peak GPU Memory (GB, lower = better)")
ax2.set_xticks(x)
ax2.set_xticklabels(tasks, rotation=15, ha="right")
ax2.set_ylabel("GPU Memory (GB)")
ax2.legend(fontsize=8)
ax2.grid(axis="y", alpha=0.3)

plt.tight_layout()
plt.savefig("unsloth_full_comparison.png", dpi=120, bbox_inches="tight")
plt.show()
print("Full comparison chart saved.")
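The headline claim can be read directly off these arrays: computing per-task ratios (same illustrative figures as the chart) gives the speedup and memory-reduction ranges versus standard TRL:

```python
standard_trl_tps = [1850, 920, 1200, 310, 890]
unsloth_tps = [5200, 3800, 3900, 1250, 4100]
standard_trl_mem = [38.2, 62.1, 44.5, 71.3, 73.8]
unsloth_mem = [7.8, 14.2, 11.4, 22.1, 42.3]

# Ratio per task: higher is better for speed, lower memory share is better
speedups = [u / s for u, s in zip(unsloth_tps, standard_trl_tps)]
mem_savings = [1 - u / s for u, s in zip(unsloth_mem, standard_trl_mem)]

print(f"Speedup: {min(speedups):.1f}x - {max(speedups):.1f}x")
print(f"VRAM reduction: {min(mem_savings):.0%} - {max(mem_savings):.0%}")
```

On these figures that works out to roughly 2.8x-4.6x faster and up to ~80% less VRAM, consistent with the title's claim.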