# AI Model Landscape: March 6, 2026
A comprehensive reference for learners navigating the rapidly evolving AI ecosystem. Use this guide to understand which models, tools, and techniques are worth your time right now.
## Table of Contents
## 1. Frontier Closed Models (March 2026)
These are the state-of-the-art proprietary models available via API. You cannot download or fine-tune them directly, but they set the performance ceiling that open-weight models are converging toward.
### GPT-5.4 (OpenAI)

- Released: March 5, 2026
- Context window: 1,100,000 tokens (1.1M)
- Benchmark: matches or beats industry professionals on 83% of GDPval real-world tasks; 33% fewer errors than GPT-5.2
- Multimodal: text, images, code, structured data; native computer use
- API: `gpt-5.4` (standard), `gpt-5.4-pro` (max performance, Responses API only)
- Pricing: $2.50/1M input tokens, $15.00/1M output tokens; batch/flex at 50% off
- Key new feature: tool search. The model dynamically looks up tool definitions at inference time, reducing token use in tool-heavy workflows.
- Variants: GPT-5.4 (standard), GPT-5.4 Thinking (reasoning mode), GPT-5.4 Pro (max effort)
- Best for: complex reasoning, agentic workflows, long-document analysis, highest-stakes tasks
- Intelligence Index rank: #2, tied at score 57 with Gemini 3.1 Pro Preview (artificialanalysis.ai)
### o3 / o4-mini (OpenAI Reasoning Models)

- Architecture: inference-time compute scaling ("thinking" before answering)
- o3: flagship reasoning model; best at math, science, and coding competitions
- o4-mini: much cheaper than o3; retains most of its reasoning capability
- Context: 200K tokens (o3), 128K tokens (o4-mini)
- Pricing: o3 at $2.00/1M input, $8.00/1M output
- Use case: problems that benefit from step-by-step chain-of-thought reasoning
- Limitation: slower than standard models (generates reasoning traces internally)
- Best for: theorem proving, complex code debugging, multi-step planning
- Note: GPT-5.4 Thinking now offers a reasoning mode within the 5.4 family; o3 remains the dedicated reasoning specialist
### Claude 4.6 Family (Anthropic)

- Models: Opus 4.6 (most capable), Sonnet 4.6 (balanced), Haiku 4.5 (fastest)
- Key update: Sonnet 4.6 launched in February 2026; the Claude lineup is now centered on the 4.x family
- Strengths: coding, agent workflows, long-form analysis, enterprise safety controls
- Best for: teams needing strong instruction following and reliable coding/analysis across workloads
- API: see current model IDs in the Anthropic platform docs; names evolve quickly across preview/stable variants
- Intelligence Index rank: Opus 4.6 (Adaptive Reasoning, Max Effort) #4 (score 53); Sonnet 4.6 #5 (score 52) (artificialanalysis.ai)
### Gemini 3.1 Pro / Gemini 2.5 Family (Google)

- Context window: up to 1,000,000 tokens depending on model variant
- Core models: Gemini 3.1 Pro Preview, Gemini 2.5 Pro, Gemini 2.5 Flash, Gemini 2.5 Flash-Lite
- Important deprecation note: Gemini 3 Pro Preview is deprecated; migrate to Gemini 3.1 Pro Preview
- Pricing: Gemini 3.1 Pro Preview at $2.00/1M input, $12.00/1M output
- Strengths: massive context, strong multimodal and agentic tool use, broad media-generation stack
- API: model strings vary across stable/preview/latest aliases in Google AI Studio and Vertex AI
- Best for: large-context analysis, multimodal applications, real-time and voice-enabled agents
- Intelligence Index rank: #1 (score 57) (artificialanalysis.ai)
### Mistral Large 3 (Mistral AI)

- Architecture: 675B-parameter Mixture-of-Experts (MoE)
- Active parameters: ~45B activated per forward pass (sparse MoE)
- Cost: approximately 15% of GPT-5.4's cost at equivalent quality on most tasks
- Context: 128K tokens
- Strengths: European GDPR compliance, strong multilingual performance, cost efficiency
- API: `mistral-large-latest` via the Mistral AI platform (la Plateforme)
- Best for: cost-sensitive production workloads where GPT-5.4-level quality is not required
### Quick Comparison

| Model | Intelligence Index | Context | Pricing (input/output per 1M) | Strengths |
|---|---|---|---|---|
| Gemini 3.1 Pro Preview (Google) | 57 (#1) | 1M | $2.00 / $12.00 | Massive context, multimodal, agentic |
| GPT-5.4 (OpenAI) | 57 (#2) | 1.1M | $2.50 / $15.00 | Computer use, tool search, all-around |
| GPT-5.3 Codex (OpenAI) | 54 (#3) | – | – | Coding, agentic software tasks |
| Claude Opus 4.6 (Anthropic) | 53 (#4) | 200K+ | – | Coding, agents, instruction following |
| Claude Sonnet 4.6 (Anthropic) | 52 (#5) | 200K+ | – | Balanced quality & speed |
| o3 (OpenAI) | – | 200K | $2.00 / $8.00 | Math, science, reasoning |
| o4-mini (OpenAI) | – | 128K | Medium | Reasoning, cost-efficient |
| Mistral Large 3 (Mistral) | – | 128K | Low | Cost-efficient, multilingual, GDPR |

Intelligence Index scores from artificialanalysis.ai, a composite of 10 independent benchmarks.
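Per-request cost at these list prices is a simple linear function of token counts, and it is worth computing once by hand before committing to a model. A minimal sketch using the GPT-5.4 prices quoted above (assumed current; adjust the defaults for other models):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_price_per_m: float = 2.50,
                 out_price_per_m: float = 15.00) -> float:
    """Estimate one request's cost in USD from per-million-token prices."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000

# A 50K-token document summarized into a 1K-token answer:
print(f"${request_cost(50_000, 1_000):.4f}")  # $0.1400
```

The same arithmetic explains why batch/flex discounts and cheaper output-light workloads (classification, extraction) matter so much at scale.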
## 2. Best Open-Weight Models (March 2026)
Open-weight models can be downloaded, self-hosted, fine-tuned, and run privately. The gap with closed models has narrowed dramatically.
### GLM-5 (Zhipu AI): Top Open-Weight Model

- Intelligence Index rank: #1 open-weight (score 50) (artificialanalysis.ai)
- Key feature: reasoning mode; performance closing the gap with frontier closed models
- Best for: production deployments that need the best open-weight quality
### Kimi K2.5 (Moonshot AI)

- Intelligence Index rank: #2 open-weight (score 47)
- Key feature: reasoning mode; strong multilingual and long-context capabilities
- Best for: reasoning-heavy tasks, as an open-weight alternative to o3-class models
### Llama 4 Maverick / Scout (Meta)

- Maverick: 400B-parameter MoE flagship; competitive with GPT-5.4-class models on many benchmarks
- Scout: 17B active / 109B total MoE; fastest variant, designed for edge deployment
- Context: up to 10,000,000 tokens (10M, an industry record for open-weight models)
- Multimodal: native image and text understanding
- License: Llama 4 Community License (commercial use permitted, with restrictions)
- Download: `meta-llama/Llama-4-Maverick`, `meta-llama/Llama-4-Scout`
- Best for: top-tier open-weight quality; RAG over very long contexts
### Qwen 3 235B-A22B (Alibaba)

- Architecture: 235B total / 22B active parameters (MoE)
- Key feature: hybrid thinking/non-thinking mode; chain-of-thought can be toggled per request
- Benchmark: strong open-source contender; Qwen 3.5 397B-A17B now ranks #3 open-weight (Intelligence Index: 45)
- Context: 128K tokens
- License: Apache 2.0 (fully permissive commercial use)
- Download: `Qwen/Qwen3-235B-A22B`
- Best for: production OSS deployments where quality is the top priority; reasoning tasks
- Note: the 22B active parameter count means inference cost is far lower than the 235B total suggests
### Qwen 3.5 Family (Alibaba)

- Recent release: major refresh in early March 2026 with larger multimodal MoE variants
- Notable checkpoints: Qwen3.5-397B-A17B, Qwen3.5-122B-A10B, Qwen3.5-35B-A3B, and smaller tiers
- Strengths: strong multimodal quality, a broad size ladder for production optimization, an active quantized ecosystem
- License: Apache 2.0 lineage for the core Qwen open checkpoints
- Best for: teams that want one family spanning edge to high-end cluster deployments
### Qwen3-Coder-Next (Alibaba)

- Focus: code generation and software-engineering agents
- Notable checkpoint: 80B-class model with actively maintained quantized variants
- Best for: code assistants, repo-scale refactors, and agentic software-engineering workflows
### DeepSeek R1 (DeepSeek)

- Architecture: 671B MoE with 37B active parameters; trained end-to-end with GRPO reinforcement learning
- Training cost: approximately $6 million, a landmark efficiency achievement
- Specialization: reasoning, mathematics, coding, scientific problems
- Context: 128K tokens
- License: MIT (fully permissive, including fine-tuning and commercial use)
- Download: `deepseek-ai/DeepSeek-R1`
- Also available: DeepSeek-R1-Distill-Qwen-7B, DeepSeek-R1-Distill-Llama-8B (distilled smaller versions)
- Best for: reasoning-heavy tasks; studying how to train reasoning models (RL with GRPO)
### DeepSeek V3.2 (DeepSeek)

- Architecture: 685B MoE (37B active); successor to DeepSeek V3
- Strengths: DeepSeek's strongest base model for general-purpose tasks
- License: MIT
- Download: `deepseek-ai/DeepSeek-V3`
- Best for: base-model fine-tuning for general applications when maximum capability is needed
### Gemma 3 27B (Google)

- Context: 128K tokens
- Languages: 140+ languages natively supported
- Hardware: fits on a single consumer GPU (e.g. an RTX 4090 24GB, with quantization)
- License: Gemma Terms of Use (permissive commercial use)
- Download: `google/gemma-3-27b-it`
- Key feature: among the most capable models that fit on a single GPU
- Best for: multilingual applications; single-GPU production deployment; fine-tuning on consumer hardware
### Phi-4 (Microsoft)

- Size: 14B parameters
- Specialization: STEM (math, science, coding); trained on high-quality synthetic data
- License: MIT (fully permissive)
- Download: `microsoft/phi-4`
- Strengths: punches far above its weight on reasoning and math; excellent for structured outputs
- Best for: math tutoring, code generation, scientific Q&A; fine-tuning when training compute is limited
### Phi-4-Reasoning-Vision 15B (Microsoft)

- Size: 15B multimodal reasoning model
- Focus: image-text reasoning plus strong general reasoning behavior
- Best for: multimodal assistants that need better reasoning than small vision-language baselines
### Phi-4-mini (Microsoft)

- Size: 3.8B parameters
- License: MIT
- Download: `microsoft/phi-4-mini`
- Best for: on-device inference, mobile, or any case that needs a capable small model
### Quick Comparison

| Model | Intelligence Index | Params (active) | Context | License | Best For |
|---|---|---|---|---|---|
| GLM-5 (Zhipu AI) | 50 (#1 OW) | – | – | – | Top open-weight quality |
| Kimi K2.5 (Moonshot) | 47 (#2 OW) | – | – | – | Reasoning, multilingual |
| Qwen 3.5 397B-A17B | 45 (#3 OW) | 17B active | 128K+ | Apache 2.0 | Latest Qwen reasoning |
| Llama 4 Maverick | – | 400B MoE | 10M | Llama 4 | Ultra-long context |
| Llama 4 Scout | – | 17B active | 10M | Llama 4 | Fast, long-context, edge |
| Qwen 3 235B-A22B | – | 22B active | 128K | Apache 2.0 | Production OSS deployments |
| Qwen3-Coder-Next | – | 80B class | – | Apache 2.0 | Code and SWE agents |
| DeepSeek R1 | – | 37B active | 128K | MIT | Reasoning, math, coding |
| DeepSeek V3.2 | – | 37B active | 128K | MIT | General base model |
| Gemma 3 27B | – | 27B | 128K | Gemma ToU | Single GPU, multilingual |
| Phi-4 | – | 14B | 16K | MIT | STEM, math, coding |
| Phi-4-mini | – | 3.8B | 16K | MIT | On-device, mobile |

Intelligence Index (OW = open-weight rank) from artificialanalysis.ai. GLM-5 and Kimi K2.5 license/parameter details: see provider docs.
## 3. Best Models for Fine-tuning (March 2026)
Not all open-weight models are equally good starting points for fine-tuning. These are the recommended choices grouped by compute budget.
### Small Models (less than 8B parameters)

Best when you have limited GPU memory (less than 24GB) or need fast inference.

| Model | Params | License | Why Fine-tune It |
|---|---|---|---|
| Qwen2.5-7B-Instruct | 7B | Apache 2.0 | Best quality in class; excellent instruction following; tokenizer supports 100+ languages |
| Llama 3.2 3B | 3B | Llama 3.2 | Meta's smallest capable model; fast inference; widely supported |
| Phi-4-mini | 3.8B | MIT | Strong reasoning for its size; MIT license; good for STEM tasks |
| Gemma 3 4B | 4B | Gemma ToU | Google quality at 4B; 128K context; good multilingual support |

Recommended starter: `Qwen/Qwen2.5-7B-Instruct` (best quality, Apache 2.0 license, great tokenizer).
### Medium Models (8B to 14B parameters)

A good balance of quality and fine-tuning cost; fits in a 24GB GPU with QLoRA.

| Model | Params | License | Why Fine-tune It |
|---|---|---|---|
| Phi-4 | 14B | MIT | Best STEM/reasoning quality at 14B; MIT license |
| Qwen2.5-14B-Instruct | 14B | Apache 2.0 | Strong across all domains; excellent for structured-output fine-tuning |
| Gemma 3 12B | 12B | Gemma ToU | Google quality; 128K context; solid multilingual support |

Recommended starter: `microsoft/phi-4` for STEM/coding tasks; `Qwen/Qwen2.5-14B-Instruct` for general tasks.
### Large Models (32B and above)

Requires multi-GPU or A100/H100 hardware for fine-tuning, but produces the highest-quality specialized models.

| Model | Params | License | Why Fine-tune It |
|---|---|---|---|
| Qwen2.5-32B-Instruct | 32B | Apache 2.0 | Best quality at 32B; fits on 2x A100 with QLoRA |
| Llama 3.3 70B | 70B | Llama 3.3 | Meta's most capable dense model; widely supported by fine-tuning tooling |
| DeepSeek R1 70B Distill | 70B | MIT | Distilled from DeepSeek R1; strong reasoning; MIT license |

Recommended starter: `Qwen/Qwen2.5-32B-Instruct` if you have 2x A100; `meta-llama/Llama-3.3-70B-Instruct` for the widest tooling support.
### Fine-tuning Model Selection Decision Tree

```text
Do you have >= 2x A100/H100?
  Yes -> Qwen2.5-32B or Llama 3.3 70B
  No  -> Single A100/H100 (80GB)?
    Yes -> Qwen2.5-14B or Phi-4 (with QLoRA)
    No  -> Single 24GB GPU (e.g. RTX 4090)?
      Yes -> Qwen2.5-7B or Phi-4-mini (QLoRA)
      No  -> Phi-4-mini or Llama 3.2 3B (4-bit QLoRA)

Is your task math/coding/STEM?
  Yes -> Prefer Phi-4 or DeepSeek R1 distilled variants

Do you need an Apache 2.0 / MIT license?
  Yes -> Qwen2.5 (Apache 2.0) or Phi-4 / DeepSeek R1 (MIT)
      -> Avoid Llama 4 / Gemma (more restrictive licenses)
```
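The compute branch of the tree above can be encoded directly. A sketch with a hypothetical helper (`pick_finetune_model` is not from any library; thresholds follow the tree, with the STEM preference applied as an override):

```python
def pick_finetune_model(gpu_count: int, vram_gb: int, stem_task: bool = False) -> str:
    """Walk the decision tree: task preference first, then compute budget."""
    if stem_task and vram_gb >= 24:
        return "Phi-4 (or a DeepSeek R1 distill)"
    if gpu_count >= 2 and vram_gb >= 80:
        return "Qwen2.5-32B or Llama 3.3 70B"
    if vram_gb >= 80:                      # single A100/H100
        return "Qwen2.5-14B or Phi-4 (QLoRA)"
    if vram_gb >= 24:                      # single consumer GPU, e.g. RTX 4090
        return "Qwen2.5-7B or Phi-4-mini (QLoRA)"
    return "Phi-4-mini or Llama 3.2 3B (4-bit QLoRA)"

print(pick_finetune_model(1, 24))  # Qwen2.5-7B or Phi-4-mini (QLoRA)
```

The license question is left to a manual check, since license needs rarely fit a boolean flag in practice.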
## 4. Key Training Techniques (2025-2026)
### GRPO: Group Relative Policy Optimization

What it is: a reinforcement learning algorithm for training reasoning models without a critic network.

Why it matters: DeepSeek R1 was trained with GRPO, achieving o1-level reasoning at a fraction of the cost. GRPO samples multiple responses per prompt and uses group-relative rewards instead of a value-function baseline.

- Advantage over PPO: no separate critic model needed (halves training memory); more stable for LLM RLHF
- Implementation: available in TRL (`trl.GRPOTrainer`) and Unsloth
- Use when: training reasoning/math models from scratch, or fine-tuning with RL-based feedback
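The group-relative trick is easy to see in isolation: sample several completions per prompt, score them, and normalize each reward within its own group rather than against a learned value baseline. A toy sketch of just that step (the rewards are made up; a real run would come from a verifier or reward model):

```python
from statistics import mean, stdev

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    """Group-relative advantages: z-score each reward within its own group,
    replacing PPO's learned critic/value baseline."""
    mu = mean(group_rewards)
    sigma = stdev(group_rewards) or 1.0  # guard: all-equal rewards give zero spread
    return [(r - mu) / sigma for r in group_rewards]

# Four sampled answers to one math prompt, scored by a verifier (1 = correct):
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
# Correct answers get positive advantage, incorrect ones negative; they sum to ~0.
```

The policy gradient then upweights tokens from positive-advantage completions, which is why GRPO pairs so naturally with automatically checkable tasks like math and code.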
### LoRA / QLoRA / DoRA / RSLoRA

Parameter-efficient fine-tuning (PEFT) methods that update only a small fraction of model parameters.

| Method | Description | When to Use |
|---|---|---|
| LoRA | Low-Rank Adapters: insert small A*B matrices alongside frozen weights | Standard fine-tuning; 2-5% of params |
| QLoRA | LoRA on top of a 4-bit quantized base model | Limited GPU memory (fits 70B in 48GB) |
| DoRA | Decomposed LoRA: separates magnitude and direction updates | When LoRA underfits; slightly better quality |
| RSLoRA | Rank-Stabilized LoRA: scales the adapter update by alpha/sqrt(r) instead of alpha/r | High-rank LoRA (r >= 64) for better stability |

Current best practice: QLoRA with RSLoRA scaling and rank 16-64 for most fine-tuning tasks.
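To see why LoRA touches so few parameters, count them for one weight matrix: a frozen `d_out x d_in` layer gains only `r * (d_in + d_out)` trainable values. A sketch (the 4096 hidden size is loosely 7B-class; the sqrt factor is the RSLoRA scaling):

```python
import math

def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) beside a frozen d_out x d_in weight."""
    return r * d_in + d_out * r

d = 4096                                    # hidden size of a 7B-class model
full = d * d                                # frozen weight: ~16.8M params
lora = lora_trainable_params(d, d, r=16)    # 131,072 trainable params
print(f"trainable fraction of this matrix: {lora / full:.2%}")  # 0.78%

# Scaling applied to the adapter update, with alpha=16:
classic = 16 / 16              # plain LoRA: alpha / r
rslora = 16 / math.sqrt(16)    # RSLoRA: alpha / sqrt(r), stabler at high rank
```

Summed over all targeted projections, this is where the "2-5% of params" figure in the table comes from.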
### DPO / SimPO / KTO: Alignment Without PPO

Methods for aligning model outputs to human preferences without the complexity of PPO.

| Method | Description | When to Use |
|---|---|---|
| DPO (Direct Preference Optimization) | Fine-tune directly on (chosen, rejected) pairs | Standard alignment; simplest setup |
| SimPO (Simple Preference Optimization) | DPO variant with length normalization and a margin reward | Better than DPO for instruction following |
| KTO (Kahneman-Tversky Optimization) | Works with binary feedback (good/bad) rather than pairs | When you have binary labels, not ranked pairs |

Current best practice: SimPO slightly outperforms DPO on most benchmarks; use KTO if you only have thumbs-up/thumbs-down data.
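DPO's objective fits in a few lines: it increases the policy's preference margin for the chosen over the rejected completion, relative to a frozen reference model. A sketch for a single pair with made-up sequence log-probabilities:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO on one (chosen, rejected) pair:
    -log sigmoid(beta * (policy margin - reference margin))."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy already prefers the chosen answer more than the reference does -> low loss;
# identical policy and reference gives the chance-level loss ln(2) ~= 0.693.
print(dpo_loss(-10.0, -20.0, -12.0, -15.0))
```

SimPO's variation is to drop the reference model entirely and length-normalize the log-probabilities, adding a fixed target margin instead.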
### Unsloth: 2-5x Faster Fine-tuning

What it is: a library that rewrites CUDA kernels for transformer operations to be more memory-efficient and faster.

- Speedup: 2-5x faster training vs standard Hugging Face + PEFT; 50-70% less GPU memory
- Integration: drop-in replacement for Hugging Face `AutoModelForCausalLM.from_pretrained`
- Supports: LoRA, QLoRA, DoRA, GRPO, DPO, SFT; works with Qwen, Llama, Mistral, Phi, Gemma

Install: `pip install unsloth`

```python
# Unsloth replaces the standard HF model loading
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-7B-Instruct",
    max_seq_length=8192,
    load_in_4bit=True,   # QLoRA
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                # LoRA rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_rslora=True,     # RSLoRA scaling
)
```
## 5. Key Infrastructure (2025-2026)
### Inference Engines (ranked by throughput)

SGLang > vLLM > TGI > Ollama / llama.cpp

| Tool | Throughput | Best For | Install |
|---|---|---|---|
| SGLang | Highest (~16,200 tok/s on H100) | Production API, RAG, agents; RadixAttention caches shared prefixes | `pip install sglang` |
| vLLM | Very high (~12,500 tok/s on H100) | Production API, multi-LoRA serving, AMD support | `pip install vllm` |
| TGI | High (~9,800 tok/s on H100) | Hugging Face ecosystem, Inference Endpoints | Docker |
| Ollama | Medium | Local dev, Apple Silicon, edge, no GPU | Single binary |
| llama.cpp | Low-medium | Embedded, CPU-only, maximum portability | Compile from source |

Key 2025-2026 development: SGLang's RadixAttention provides automatic KV-cache reuse for shared prefixes (system prompts, RAG context, few-shot examples). For agent and RAG workloads, this means 5-10x faster time-to-first-token after the first request.
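The win from prefix caching is easy to quantify at the token level: any prompt prefix already in the cache skips prefill, so only the divergent suffix is recomputed. A toy accounting sketch (this illustrates the idea, not SGLang's radix-tree internals):

```python
def uncached_tokens(cached_prompts: list[list[str]], new_prompt: list[str]) -> int:
    """Tokens still needing prefill after longest-shared-prefix reuse."""
    best = 0
    for cached in cached_prompts:
        shared = 0
        for a, b in zip(cached, new_prompt):
            if a != b:
                break
            shared += 1
        best = max(best, shared)
    return len(new_prompt) - best

system = ["You", "are", "a", "helpful", "agent", "."]   # shared system prompt
req1 = system + ["Summarize", "doc", "A"]
req2 = system + ["Summarize", "doc", "B"]
print(uncached_tokens([req1], req2))  # 1 -- only the divergent final token
```

With multi-thousand-token system prompts or RAG context repeated across requests, that shared prefix dominates prefill cost, which is where the time-to-first-token speedup comes from.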
### Fine-tuning Frameworks (ranked by efficiency)

Unsloth > Axolotl > standard TRL

| Tool | Speedup | Best For |
|---|---|---|
| Unsloth | 2-5x faster, 50-70% less memory | Most fine-tuning tasks; LoRA, QLoRA, GRPO, DPO |
| Axolotl | 1.5-2x (config-driven) | Teams that prefer YAML config over code; multi-GPU |
| TRL (Hugging Face) | Baseline | Reference implementation; maximum compatibility |
### Agent Frameworks (2025-2026)

The agent ecosystem has consolidated around a few dominant tools:

| Tool | Description | Best For |
|---|---|---|
| MCP (Model Context Protocol) | Anthropic's open standard for connecting LLMs to tools and data sources | Universal tool/plugin standard; works across Claude, GPT, open models |
| OpenAI Agents SDK | Official SDK for building multi-agent systems; includes tracing and handoffs | Production agents with OpenAI models; clean abstraction |
| LangGraph 1.0 | Graph-based agent orchestration with persistent state | Complex multi-step agents, branching workflows, stateful agents |
| LlamaIndex Workflows | Event-driven agent workflows | RAG-heavy agents, document pipelines |

Key insight: MCP has become the standard "USB for AI tools": build an MCP server once and connect it to any MCP-compatible agent framework. In 2026, most enterprise tools (databases, APIs, file systems) have MCP servers available.
### Vector Databases (2025-2026)

| Database | Description | Best For |
|---|---|---|
| Pinecone | Managed cloud vector DB; serverless tier available | Production SaaS; zero ops overhead |
| Qdrant | Rust-based, high performance; excellent filtering | Self-hosted production; best performance/cost ratio |
| Chroma | Rewritten in Rust (2025); much faster than the original Python version | Local dev and small-to-medium production; easiest setup |
| pgvector | Postgres extension; vectors + SQL in one database | Existing Postgres users; no separate infrastructure |
| Weaviate | Feature-rich; built-in hybrid search (BM25 + vector) | Hybrid-search requirements; GraphQL API |

Current recommendation: Chroma for development (easiest), Qdrant for self-hosted production (best performance), pgvector if you already run Postgres.
## 6. What Changed Since February 2026
This section captures practical updates since the February snapshot so learners can prioritize what changed.
### Model Selection Updates (March 2026)

- Reasoning-first routing is now standard: use reasoning models (`o3`, `o4-mini`, the DeepSeek R1 family) for math/coding/planning tasks, and standard chat models for latency-sensitive UX.
- Very-long-context workflows matured: 1M+ token workflows are increasingly practical for repository analysis, long legal documents, and multimodal audit pipelines.
- The open-weight quality ceiling improved: the Qwen3, Llama 4, and DeepSeek families are now viable for many production tasks that previously required frontier APIs.
- License-aware model choice is now mandatory: teams increasingly split by policy, with Apache/MIT-first stacks for commercial redistribution and restricted-license stacks for internal-only deployments.
- This month's additions: GPT-5.4 (March 5, 2026) with 1.1M context, native computer use, and tool search; GLM-5 and Kimi K2.5 emerged as the top open-weight leaders on artificialanalysis.ai; the Claude 4.6 family, Qwen 3.5 / Qwen3-Coder-Next, and Phi-4-Reasoning-Vision also joined shortlists.
- Intelligence leaderboard snapshot (artificialanalysis.ai, 282 models ranked): Gemini 3.1 Pro Preview and GPT-5.4 tied at #1 (score 57); the top open-weight model is GLM-5 (score 50).
### Training and Alignment Updates (March 2026)

- SimPO and DPO remain the default alignment baselines for most teams; PPO-style stacks are mostly reserved for specialized research workflows.
- GRPO adoption increased for reasoning-tuned models and synthetic-curriculum training.
- Unsloth + TRL became a common default for small and medium fine-tuning projects thanks to speed and memory efficiency.
### Inference and Agent Stack Updates (March 2026)

- SGLang and vLLM remain the top two production inference choices; SGLang keeps an edge in high-throughput agent/RAG workloads via shared-prefix caching.
- MCP solidified as the cross-vendor tool protocol in agent ecosystems.
- LangGraph + OpenAI Agents SDK + MCP is a common production architecture for stateful, tool-using systems.
### March 2026 Practical Defaults

| Layer | March 2026 Default Recommendation |
|---|---|
| Fast API assistant | – |
| Hard reasoning | – |
| Best frontier API | – |
| Open-weight #1 (quality) | GLM-5 (Reasoning) |
| Open-weight general | Qwen3 235B-A22B or Qwen3.5 397B-A17B |
| Open-weight long context | Llama 4 Maverick / Scout |
| Small local model | Phi-4-mini / Gemma 3 4B |
| Fine-tuning starter | Qwen2.5-7B + QLoRA + RSLoRA |
| Inference server | SGLang (prod) / Ollama (local dev) |
| Agent runtime | LangGraph + MCP + OpenAI Agents SDK |
## 7. What to Learn in What Order
A structured learning path for March 2026. Follow this order to build solid foundations before tackling advanced topics.
### Phase 1: Foundations (Weeks 1-4)

Goal: understand what LLMs are and how to use them via API.

1. Python for AI: numpy, pandas, matplotlib basics; Jupyter notebooks
2. Prompt engineering: zero-shot, few-shot, chain-of-thought; system prompts
3. OpenAI / Anthropic APIs: calling GPT-5.4 and Claude Sonnet 4.6; streaming; function calling; computer use
4. Tokenization: how text becomes tokens; tiktoken; why context-window size matters
5. RAG basics: chunking documents, embedding models, Chroma, similarity search

Milestone: build a document Q&A chatbot using RAG with a frontier API model.
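The retrieval half of that milestone reduces to one operation: embed the chunks and the query, then rank by cosine similarity. A dependency-free sketch with toy 3-dimensional "embeddings" (a real pipeline gets vectors from an embedding model and stores them in something like Chroma):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy chunk embeddings (hand-picked for illustration):
chunks = {
    "refund policy":  [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.8, 0.2],
    "contact info":   [0.0, 0.2, 0.9],
}
query_vec = [0.85, 0.15, 0.05]  # embedding of "how do I get my money back?"

top = max(chunks, key=lambda name: cosine(chunks[name], query_vec))
print(top)  # refund policy -- this chunk is passed to the LLM as context
```

The chatbot then stuffs the top-k chunks into the prompt and asks the model to answer only from that context.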
### Phase 2: Open-Weight Models (Weeks 5-8)

Goal: run and understand open-weight models locally and in the cloud.

1. Hugging Face Transformers: `pipeline`, `AutoModelForCausalLM`, `AutoTokenizer`
2. Running models locally: Ollama (easiest), then llama.cpp
3. Quantization: understand BF16 vs GPTQ vs AWQ vs GGUF, and the trade-offs
4. Chat templates: ChatML, the Llama 3 template, the Qwen template; why they matter
5. vLLM / SGLang: run a production-grade API server; benchmark throughput
6. Model selection: when to use Qwen2.5-7B vs Llama 4 vs Gemma 3; the decision tree above

Milestone: self-host a Qwen2.5-7B model via SGLang and serve it behind a FastAPI endpoint.
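Chat templates from the list above are just string-serialization conventions, and getting them wrong silently degrades output. A hand-rolled sketch of the ChatML format used by Qwen-family models (for illustration only; in practice call `tokenizer.apply_chat_template` instead of formatting by hand):

```python
def to_chatml(messages: list[dict]) -> str:
    """Serialize a message list in ChatML, the template Qwen-family models expect."""
    out = []
    for m in messages:
        out.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    out.append("<|im_start|>assistant\n")  # generation prompt: model continues here
    return "\n".join(out)

prompt = to_chatml([
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "What is QLoRA?"},
])
print(prompt)
```

Feed a Llama-3-templated string to a Qwen model (or vice versa) and the special tokens are treated as ordinary text, which is why mismatched templates produce rambling or truncated answers.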
### Phase 3: Fine-tuning (Weeks 9-14)

Goal: adapt pre-trained models to specific tasks.

1. Dataset preparation: instruction format, chat format; quality over quantity
2. LoRA fundamentals: what low-rank adapters do mathematically; rank, alpha, target modules
3. QLoRA with Unsloth: fine-tune a 7B model in under 4GB VRAM
4. SFT (Supervised Fine-Tuning): `SFTTrainer` from TRL; data formatting; evaluation during training
5. DPO / SimPO alignment: preference datasets; `DPOTrainer`; when to align vs just SFT
6. GRPO for reasoning: training a model to think step by step with RL
7. Evaluation: ROUGE, BERTScore, LLM-as-judge; avoiding eval-data contamination

Milestone: fine-tune Qwen2.5-7B on a domain-specific dataset, align it with DPO, and evaluate with an LLM judge.
### Phase 4: Agents and Advanced RAG (Weeks 15-20)

Goal: build production-grade agentic systems.

1. Function calling / tool use: structured outputs, JSON mode, tool definitions
2. ReAct agents: the Reason + Act loop; building with the OpenAI Agents SDK
3. MCP (Model Context Protocol): write an MCP server; connect it to Claude Desktop or your own agent
4. Advanced RAG: hybrid search, reranking (ColBERT, cross-encoders), HyDE, RAPTOR
5. LangGraph 1.0: stateful agents, branching workflows, multi-agent handoffs
6. Agentic evaluation: trajectory evaluation, tool-use accuracy, multi-turn benchmarks

Milestone: build a multi-step research agent using LangGraph + MCP that can search the web, read documents, and write structured reports.
### Phase 5: Production and MLOps (Weeks 21-24)

Goal: deploy and maintain AI systems reliably.

1. Docker + GPU deployment: containerizing vLLM/SGLang; docker-compose with GPU support
2. Monitoring: structured logging; latency p95/p99; token usage; refusal rate
3. Cost optimization: quantization trade-offs, batching strategies, spot vs on-demand pricing
4. Continuous evaluation: A/B testing model updates; production feedback loops
5. Security: prompt-injection defense, PII detection, rate limiting, API key management
6. MLflow / Weights & Biases: experiment tracking, model registry, deployment tracking

Milestone: deploy your fine-tuned model to a cloud GPU with monitoring, a Docker container, cost tracking, and an A/B testing setup for comparing model versions.
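The p95/p99 numbers in the monitoring item are worth computing by hand once to avoid off-by-one surprises. A minimal nearest-rank percentile sketch (monitoring stacks apply the same idea over sliding windows; the latency values here are invented):

```python
import math

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Per-request latencies in ms; note how two slow outliers dominate the tail:
latencies_ms = [120, 95, 110, 480, 105, 98, 102, 990, 115, 101]
print(percentile(latencies_ms, 50), percentile(latencies_ms, 99))  # 105 990
```

The gap between p50 (105 ms) and p99 (990 ms) is exactly why tail percentiles, not averages, drive alerting for LLM services.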
## Summary: The 2026 AI Engineer Stack

| Layer | Tools (March 2026 Best Choices) |
|---|---|
| Frontier APIs | GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro Preview |
| Open models | Qwen3 235B-A22B, Llama 4 Maverick, DeepSeek R1 |
| Fine-tuning base | Qwen2.5-7B (small), Phi-4 (medium), Qwen2.5-32B (large) |
| Fine-tuning tools | Unsloth + TRL (SFT/DPO/GRPO) |
| Inference | SGLang (production), Ollama (local/dev) |
| Agents | MCP + OpenAI Agents SDK + LangGraph 1.0 |
| Vector DB | Qdrant (production), Chroma (dev), pgvector (Postgres users) |
| Experiment tracking | Weights & Biases or MLflow |
| Observability | Langfuse (open source) or LangSmith |

The pace of change in this field is fast, but the underlying principles (retrieval, fine-tuning, alignment, evaluation, deployment) remain stable. Master the fundamentals and you will adapt quickly as the specific tools evolve.