AI/ML Decision Matrices & Comparison Guides

Quick reference for choosing the right tools, models, and approaches for your AI/ML projects (April 2026)

LLM Model Selection

When should you use which LLM? (April 2026)

| Model | Best For | Pros | Cons | Cost (1M tokens) | When to Use |
|-------|----------|------|------|------------------|-------------|
| GPT-5.4 | Production, multimodal | Latest OpenAI flagship, vision + audio | Expensive | $2.50 in / $10 out | Complex tasks, multimodal, production apps |
| GPT-4.1 | Development, balanced | Great quality, 1M context window | Higher cost than mini tiers | $2 in / $8 out | General production, long-context tasks |
| GPT-4.1-mini | Development, testing | Fast, cheap, good quality | Not as capable | $0.40 in / $1.60 out | Development, high-volume tasks, chatbots |
| o3 | Research, reasoning | Top reasoning, complex problems | Slow, expensive | $10 in / $40 out | Math, coding, research, complex analysis |
| o4-mini | Coding, math | Fast reasoning, cost-effective | Limited general knowledge | $1.10 in / $4.40 out | Code generation, STEM problems |
| Claude Opus 4.6 | Complex coding, agents | Best at code, 200k context, tool use | Expensive | $15 in / $75 out | Agentic coding, deep analysis, complex tasks |
| Claude Sonnet 4.6 | Coding, analysis | Excellent at code, 200k context | API only | $3 in / $15 out | Code reviews, daily coding, balanced cost |
| Claude Haiku 4.5 | Speed, high volume | Fastest Claude, very cheap | Less capable | $0.80 in / $4 out | Chatbots, classification, high throughput |
| Gemini 3.1 Pro | Multimodal, long context | 2M context, video + audio understanding | API only | $1.25 in / $5 out | Extremely long documents, video, multimodal |
| DeepSeek R1 | Reasoning, self-hosted | Open-weight, strong reasoning | Requires GPU | Free (hosting cost) | Privacy + reasoning, cost-sensitive |
| DeepSeek V3.2 | General, self-hosted | Excellent quality, MoE efficient | Requires GPU | Free (hosting cost) | Self-hosted production, general tasks |
| Llama 4 (Scout/Maverick) | Self-hosted production | Free, MoE, 10M context (Scout) | Large models | Free (hosting cost) | Privacy, cost-sensitive, long context |
| Qwen 3 (0.6-235B) | Multilingual, MoE | Best open-source, thinking modes | Requires hosting | Free (hosting cost) | Non-English, code, self-hosting |
| Phi-4 | Small/edge, reasoning | Tiny but powerful (14B) | Limited vs larger models | Free (hosting cost) | Edge deployment, limited hardware |

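The per-token prices in the table translate directly into per-request costs. A minimal sketch (prices hard-coded from the table above; the dictionary keys are informal labels for this example, not official API model identifiers):

```python
# $ per 1M tokens (input, output), copied from the comparison table.
PRICES = {
    "gpt-5.4": (2.50, 10.00),
    "gpt-4.1": (2.00, 8.00),
    "gpt-4.1-mini": (0.40, 1.60),
    "o3": (10.00, 40.00),
    "o4-mini": (1.10, 4.40),
    "claude-opus-4.6": (15.00, 75.00),
    "claude-sonnet-4.6": (3.00, 15.00),
    "claude-haiku-4.5": (0.80, 4.00),
    "gemini-3.1-pro": (1.25, 5.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the table's listed prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000
```

For example, a call with 10k input tokens and 1k output tokens on Claude Sonnet 4.6 costs $0.03 + $0.015 = $0.045; the same call on Claude Opus 4.6 costs $0.225, five times more.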
Decision Tree:

Need multimodal (images/vision/audio)?
├─ Yes → GPT-5.4, Gemini 3.1 Pro, or Claude Sonnet 4.6
└─ No → Need complex reasoning?
    ├─ Yes → o3 (research) or o4-mini (coding) or DeepSeek R1 (self-hosted)
    └─ No → Need self-hosting?
        ├─ Yes → Qwen 3, Llama 4, or DeepSeek V3.2
        └─ No → Budget conscious?
            ├─ Yes → GPT-4.1-mini, Claude Haiku 4.5, or Gemini 3.1 Flash
            └─ No → Claude Sonnet 4.6, GPT-5.4, or Gemini 3.1 Pro
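The tree reads top to bottom: the first matching question wins. The same logic as an illustrative Python sketch (branch order copied verbatim from the tree above):

```python
def choose_llm(multimodal: bool, reasoning: bool,
               self_hosted: bool, budget: bool) -> list[str]:
    """Walk the decision tree; the first matching branch returns its shortlist."""
    if multimodal:
        return ["GPT-5.4", "Gemini 3.1 Pro", "Claude Sonnet 4.6"]
    if reasoning:
        return ["o3", "o4-mini", "DeepSeek R1"]
    if self_hosted:
        return ["Qwen 3", "Llama 4", "DeepSeek V3.2"]
    if budget:
        return ["GPT-4.1-mini", "Claude Haiku 4.5", "Gemini 3.1 Flash"]
    return ["Claude Sonnet 4.6", "GPT-5.4", "Gemini 3.1 Pro"]
```

Note the priority ordering: a multimodal requirement overrides everything else, so a budget-conscious multimodal project still lands on the multimodal shortlist.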

Vision Models

Choosing between vision-language models (April 2026)

| Model | Best For | Pros | Cons | Cost | When to Use |
|-------|----------|------|------|------|-------------|
| GPT-5.4 | Complex understanding | Best quality, multimodal native | Expensive | $2.50+/1M in | Complex scene understanding, OCR+reasoning |
| Gemini 3.1 Pro | Long video, multimodal | 2M context, video + audio | API only | $1.25/1M in | Video analysis, long documents, multimodal |
| Claude Sonnet 4.6 | Document + image analysis | Great at charts, PDFs, code screenshots | API only | $3/1M in | Document analysis, visual code review |
| CLIP / SigLIP | Embeddings, classification | Fast, zero-shot, embeddings | Limited understanding | Free (self-host) | Image search, zero-shot classification |
| Qwen2.5-VL | Advanced self-hosting | Best open-source VLM quality | Large model | Free (hosting) | High-quality self-hosted VQA |
| LLaVA-OneVision | Self-hosted VQA | Open-source, customizable | Requires GPU | Free (hosting) | Privacy needs, custom fine-tuning |
| InternVL 3 | Research, benchmarks | Top open-source on many benchmarks | Complex setup | Free (hosting) | Research, multilingual vision |

Use Case Guide:

  • Image Search/Similarity: CLIP or SigLIP (fast, efficient)

  • Image Captioning: GPT-5.4 (best quality) or Qwen2.5-VL (self-hosted)

  • Visual Question Answering: Gemini 3.1 Pro (production) or Qwen2.5-VL (self-hosted)

  • OCR + Understanding: GPT-5.4 or Claude Sonnet 4.6

  • Video Understanding: Gemini 3.1 Pro (native video support)

  • Zero-shot Classification: CLIP or SigLIP
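The use case guide doubles as a lookup table. A small sketch (the task keys are informal labels invented for this example; the "open" set marks the models from the table that can be self-hosted):

```python
# First entry in each list = the guide's default pick for that task.
VISION_PICKS = {
    "image_search": ["CLIP", "SigLIP"],
    "captioning": ["GPT-5.4", "Qwen2.5-VL"],
    "vqa": ["Gemini 3.1 Pro", "Qwen2.5-VL"],
    "ocr_understanding": ["GPT-5.4", "Claude Sonnet 4.6"],
    "video": ["Gemini 3.1 Pro"],
    "zero_shot_classification": ["CLIP", "SigLIP"],
}

OPEN_MODELS = {"CLIP", "SigLIP", "Qwen2.5-VL", "LLaVA-OneVision", "InternVL 3"}

def pick_vision_model(task: str, self_hosted: bool = False) -> str:
    """Return the guide's pick, preferring an open model when self-hosting."""
    options = VISION_PICKS[task]
    if self_hosted:
        for model in options:
            if model in OPEN_MODELS:
                return model
    return options[0]
```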

Fine-tuning Methods

Which fine-tuning approach should you use?

| Method | Parameters Trained | VRAM (7B model) | Training Time | Quality | When to Use |
|--------|--------------------|-----------------|---------------|---------|-------------|
| Full Fine-tuning | 100% | 40GB+ | Days | Best | Research, unlimited resources |
| LoRA (r=64) | 2-5% | 8-12GB | Hours | Excellent | Most production use cases |
| QLoRA | 2-5% | 4-6GB | Hours | Very good | Limited GPU, good quality needed |
| DoRA | 2-5% | 8-12GB | Hours | Better than LoRA | Best quality with adapters |
| Adapters | <1% | 6-8GB | Hours | Good | Quick experimentation |
| Prompt Tuning | <0.1% | Minimal | Minutes | Fair | Extremely limited resources |

Decision Matrix:

GPU VRAM Available:
├─ >40GB → Full fine-tuning (best quality)
├─ 16-40GB → LoRA r=64 or DoRA (recommended)
├─ 8-16GB → LoRA r=32 or QLoRA
└─ <8GB → QLoRA (4-bit) or Prompt Tuning

Quality Requirements:
├─ Critical → Full fine-tuning or DoRA
├─ High → LoRA r=64 or DoRA
├─ Medium → LoRA r=32 or QLoRA
└─ Basic → Adapters or Prompt Tuning

2026 Best Practices:

  • Default choice: LoRA with r=64, RSLoRA enabled

  • Budget GPU: QLoRA with 4-bit quantization (NF4)

  • Best quality: DoRA with r=64

  • Fastest training: Unsloth (2-5x speedup over standard HF)

  • Fastest iteration: Adapters or prompt tuning
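The VRAM side of the decision matrix is mechanical enough to encode as a threshold function (an illustrative sketch; thresholds copied from the matrix above, treating VRAM as the hard constraint):

```python
def choose_method(vram_gb: float) -> str:
    """Pick a fine-tuning method from available GPU VRAM, per the matrix."""
    if vram_gb > 40:
        return "Full fine-tuning"
    if vram_gb >= 16:
        return "LoRA r=64 or DoRA"
    if vram_gb >= 8:
        return "LoRA r=32 or QLoRA"
    return "QLoRA (4-bit) or Prompt Tuning"
```

A typical consumer card (24GB) therefore lands on LoRA r=64 or DoRA, matching the "default choice" above.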

Vector Databases

Comparing vector database options

| Database | Best For | Pros | Cons | Deployment | When to Use |
|----------|----------|------|------|------------|-------------|
| ChromaDB | Development, prototyping | Easy setup, Python-native | Basic features | Embedded/Server | Quick prototypes, learning |
| Qdrant | Production, performance | Fast, feature-rich, Rust | More complex | Docker/Cloud | Production apps, high performance |
| Weaviate | Hybrid search | Built-in vectorization, GraphQL | Resource-heavy | Docker/Cloud | Hybrid search, complex schemas |
| Milvus | Scale, big data | Highly scalable, distributed | Complex setup | Kubernetes | Large-scale (millions+ vectors) |
| Pinecone | Managed service | Fully managed, easy | Expensive, vendor lock-in | Cloud only | Quick production, no ops |
| pgvector | Existing PostgreSQL | PostgreSQL extension, familiar | Limited features | PostgreSQL | Already using Postgres |
| FAISS | Research, offline | Fast similarity search | No persistence layer | In-memory | Research, benchmarking |

Selection Guide:

By Scale:

  • <100K vectors: ChromaDB (simplest)

  • 100K-1M vectors: Qdrant or Weaviate

  • 1M-10M vectors: Qdrant or Milvus

  • >10M vectors: Milvus or Pinecone

By Use Case:

  • Learning/Prototyping: ChromaDB

  • Production RAG: Qdrant

  • Hybrid (keyword + vector): Weaviate

  • Existing Postgres: pgvector

  • No DevOps: Pinecone

  • Maximum scale: Milvus
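The scale tiers above can be sketched as a shortlist function (illustrative only; boundaries copied from the "By Scale" list):

```python
def choose_vector_db(n_vectors: int) -> list[str]:
    """Scale-based shortlist from the selection guide above."""
    if n_vectors < 100_000:
        return ["ChromaDB"]
    if n_vectors < 1_000_000:
        return ["Qdrant", "Weaviate"]
    if n_vectors < 10_000_000:
        return ["Qdrant", "Milvus"]
    return ["Milvus", "Pinecone"]
```

Scale is only the first filter; within a tier, the "By Use Case" list breaks ties (e.g. Weaviate over Qdrant when hybrid keyword + vector search matters).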

Embedding Models

Choosing the right embedding model (April 2026)

| Model | Type | Dimensions | MTEB | Cost | When to Use |
|-------|------|------------|------|------|-------------|
| Gemini Embedding 001 | API | 3072 (MRL) | 68.32 (#1) | ~$0.004/1K chars | Best quality, cheapest API |
| Cohere Embed v4 | API | 1024 | 65.2 | $0.12/1M tokens | Enterprise, 128K context, multilingual |
| text-embedding-3-large | API | 3072 (MRL) | 64.6 | $0.13/1M | OpenAI ecosystem |
| Voyage 3.5 | API | 256-2048 (MRL) | ~64 | $0.06/1M | Best value API, 200M free tokens |
| text-embedding-3-small | API | 1536 | ~62 | $0.02/1M | Budget API |
| Qwen3-Embedding-8B | Local | 32-7168 (MRL) | ~64 / 70.58 MMTEB | Free | Best open-source, 100+ langs |
| BGE-M3 | Local | 1024 | 63.0 | Free | Hybrid retrieval (dense+sparse) |
| all-mpnet-base-v2 | Local | 768 | ~59 | Free | General purpose, local |
| all-MiniLM-L6-v2 | Local | 384 | 56.3 | Free | Speed critical, prototyping |

Decision Factors:

By Requirement:

  • Best Quality (API): Gemini Embedding (#1 MTEB) or Voyage 3.5

  • Best Quality (local): Qwen3-Embedding-8B

  • Cheapest API: Gemini Embedding (~free) then OpenAI Small ($0.02/1M)

  • Best Speed: all-MiniLM-L6-v2 (local) or Gemini Embedding (API)

  • Multilingual: Qwen3-Embedding (100+ langs, MMTEB #1) or Cohere v4

  • Multimodal: Gemini Embedding (all modalities) or Jina v4 (text+images+PDFs)

  • Long context: Cohere v4 (128K tokens) or Jina v4/Voyage (32K)

  • Domain-specific: Voyage (code-3, law-2, finance-2)

Typical Use Cases:

  • RAG chatbots: Gemini Embedding, Voyage 3.5, or all-mpnet-base-v2

  • Semantic search: Qwen3-Embedding (local) or Gemini Embedding (API)

  • Document Q&A: Cohere v4 (long context) or Gemini Embedding

  • Fast prototyping: all-MiniLM-L6-v2 or Gemini free tier

  • Hybrid search: BGE-M3 (dense + sparse + multi-vector)
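For the per-token API models, a one-liner turns the table's prices into a corpus-wide cost estimate. A sketch (prices hard-coded from the table above; Gemini is omitted because it is priced per 1K characters, and the local models are free to run; dictionary keys are informal labels for this example):

```python
# $ per 1M tokens, from the embedding model table.
EMBED_PRICES = {
    "text-embedding-3-large": 0.13,
    "cohere-embed-v4": 0.12,
    "voyage-3.5": 0.06,
    "text-embedding-3-small": 0.02,
}

def corpus_cost(model: str, n_docs: int, tokens_per_doc: int) -> float:
    """Dollar cost to embed a corpus once at the table's listed prices."""
    return n_docs * tokens_per_doc * EMBED_PRICES[model] / 1_000_000
```

Embedding 10,000 documents of 500 tokens each costs $0.10 with text-embedding-3-small versus $0.65 with text-embedding-3-large, which is why the "small" tiers dominate budget RAG setups.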

Image Generation

Choosing image generation solutions (April 2026)

| Solution | Quality | Speed | Cost | Control | When to Use |
|----------|---------|-------|------|---------|-------------|
| GPT-5.4 (native) | Excellent | Fast | API pricing | Low | Text+image generation in one model |
| DALL-E 3 | Excellent | Fast | $0.04-0.12/image | Low | Quick results, no setup |
| Midjourney v7 | Excellent | Medium | $10-60/month | Medium | Best artistic quality |
| FLUX 1.1 Pro | Best | Medium | API or self-host | High | Best open-source, photorealism |
| Stable Diffusion 3.5 | Excellent | Medium | Free (GPU cost) | High | Production self-hosted |
| Ideogram 3 | Excellent | Fast | API | Medium | Best text rendering in images |
| SDXL Turbo | Very good | Very fast | Free (GPU cost) | High | Fast iteration, prototyping |

Decision Tree:

Need absolute best quality?
├─ Yes, willing to pay → Midjourney v7 or FLUX 1.1 Pro
├─ Yes, self-hosted → FLUX 1.1 dev
└─ No → Need very fast generation?
    ├─ Yes → SDXL Turbo or DALL-E 3
    └─ No → Have GPU?
        ├─ Yes → SD 3.5 or FLUX
        └─ No → DALL-E 3 or Ideogram 3 (API)
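As with the LLM tree, the first matching branch wins when reading top to bottom. An illustrative sketch of the same logic:

```python
def choose_image_model(best_quality: bool, self_hosted: bool,
                       fast: bool) -> list[str]:
    """Encode the image-generation decision tree; first match wins."""
    if best_quality:
        return ["FLUX 1.1 dev"] if self_hosted else ["Midjourney v7", "FLUX 1.1 Pro"]
    if fast:
        return ["SDXL Turbo", "DALL-E 3"]
    if self_hosted:
        return ["SD 3.5", "FLUX"]
    return ["DALL-E 3", "Ideogram 3"]
```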

By Use Case:

  • Product mockups: DALL-E 3 or Ideogram 3 (fast, reliable)

  • Artistic projects: Midjourney v7 or FLUX 1.1

  • Photorealistic images: FLUX 1.1 Pro

  • Text in images: Ideogram 3 (best text rendering)

  • High-volume generation: SDXL Turbo (self-hosted)

  • Custom fine-tuning: FLUX or SD 3.5

  • Production apps: FLUX 1.1 Pro or API (DALL-E)

Deployment Options

Where and how to deploy your AI models

| Option | Best For | Pros | Cons | Cost | When to Use |
|--------|----------|------|------|------|-------------|
| OpenAI API | Quick production | No infrastructure, reliable | Expensive at scale, data privacy | Pay per token | MVP, low-medium volume |
| Anthropic (Claude) | Production apps | Great models, reliable | Limited models | Pay per token | Production chatbots |
| Cloud GPU (AWS/GCP) | Self-hosted production | Full control, scalable | Complex setup, cost management | $1-5/hour | High-volume production |
| Modal | Serverless GPU | Auto-scaling, easy | Limited control | Pay per second | Bursty workloads |
| Replicate | Model hosting | Easy deployment, many models | Higher cost | Pay per prediction | Quick deployment |
| HuggingFace Inference | Quick hosting | Many models, easy | Limited free tier | Free/Pay | Testing, demos |
| Local (Ollama) | Development, privacy | Free, private, fast iteration | Limited by hardware | Free (power cost) | Development, sensitive data |
| vLLM | Self-hosted serving | Very fast, efficient | Requires setup | Hosting cost | Production self-hosting |
| TGI | HuggingFace models | Optimized for HF models | HF ecosystem only | Hosting cost | HuggingFace models |

By Requirements:

Privacy Critical:

  • Local with Ollama or self-hosted vLLM

Cost Optimization:

  1. Low volume (<1M tokens/month): OpenAI API

  2. Medium (1-10M): Replicate or Claude

  3. High (>10M): Self-hosted on cloud GPU

Speed Requirements:

  • Fastest: vLLM (self-hosted) or OpenAI API

  • Low latency: Edge deployment (TensorFlow Lite, ONNX)

  • Batch processing: Cloud GPU with large batches

Development vs Production:

  • Development: Ollama (local) or OpenAI API

  • Production: vLLM (self-hosted) or OpenAI/Claude API

  • Hybrid: API for dev, self-hosted for production
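The cost-optimization tiers above can be sketched as a monthly-volume function (illustrative only; cutoffs copied from the numbered list):

```python
def deployment_for_volume(tokens_per_month: int) -> str:
    """Cost-optimization pick from monthly token volume, per the tiers above."""
    if tokens_per_month < 1_000_000:
        return "OpenAI API"
    if tokens_per_month <= 10_000_000:
        return "Replicate or Claude"
    return "Self-hosted on cloud GPU"
```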

Quick Decision Guides

"I want to build a RAG chatbot"

  1. Choose LLM:

    • Development: GPT-4.1-mini or Claude Haiku 4.5

    • Production (budget): Qwen 3 8B (self-hosted) or Gemini 3.1 Flash

    • Production (quality): Claude Sonnet 4.6 or GPT-5.4

  2. Choose Embeddings:

    • API (best): Gemini Embedding (cheapest + highest quality)

    • API (ecosystem): text-embedding-3-small or Voyage 3.5

    • Local: Qwen3-Embedding or all-mpnet-base-v2

  3. Choose Vector DB:

    • Learning: ChromaDB

    • Production: Qdrant

  4. Deployment:

    • MVP: OpenAI/Anthropic/Google API

    • Scale: vLLM + Cloud GPU
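Steps 1-4 compose into a retrieve-then-prompt loop. A stdlib-only sketch with stand-in 2-D vectors (in a real system the vectors come from one of the embedding models above and live in a vector DB; `retrieve` and `build_prompt` are hypothetical helper names, not library functions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, index, k=2):
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item["vec"]),
                    reverse=True)
    return [item["text"] for item in ranked[:k]]

def build_prompt(question, chunks):
    """Stuff the retrieved chunks into the LLM prompt as grounding context."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

In production, `index` would be a ChromaDB or Qdrant collection, `query_vec` the output of the chosen embedding model, and the prompt would go to the chosen LLM; only the retrieve-then-prompt shape stays the same.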

"I need to fine-tune a model"

  1. Check if you really need fine-tuning:

    • Try prompt engineering first

    • Try RAG (retrieval augmented generation)

    • Only fine-tune if the above don't work

  2. Choose method:

    • Have 16GB+ VRAM → LoRA r=64 (use Unsloth for 2-5x speedup)

    • Have 8-16GB VRAM → QLoRA

    • Need best quality → DoRA

  3. Choose base model:

    • General: Qwen 3 (0.6-32B) or Llama 4 Scout

    • Coding: Qwen 3-Coder or DeepSeek Coder V2

    • Chat: Qwen 3-Instruct or Llama 4

    • Tiny/edge: Phi-4 (14B) or Qwen 3 (0.6B-4B)

"I need image understanding"

  1. Task type:

    • Classification → CLIP or SigLIP

    • Detailed Q&A → GPT-5.4 or Gemini 3.1 Pro

    • Self-hosted → Qwen2.5-VL or LLaVA-OneVision

    • OCR + Understanding → Claude Sonnet 4.6 or GPT-5.4

    • Video → Gemini 3.1 Pro

Last Updated: April 2026
Repository: zero-to-ai

For detailed tutorials on any of these topics, see the relevant phase notebooks in the repository.