Vision-Language Models: GPT-4V, LLaVA & Gemini Vision¶

Models that understand both images and text: the foundation of modern multimodal AI.

What Are Vision-Language Models?¶

Vision-Language Models (VLMs) accept image + text as input and produce text as output.

Architecture: Image encoder (ViT) + projection layer + LLM decoder

Image ──→ [ViT Encoder] ──→ [Projection] ─┐
                                          ├─→ [LLM] ──→ Text response
Text prompt ──────────────────────────────┘
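Concretely, the pipeline above can be sketched with toy numpy shapes. The dimensions below (576 patch tokens, ViT width 1024, LLM width 4096) roughly match LLaVA 1.5 but are illustrative assumptions, and the "encoder" here is just random numbers standing in for a real ViT:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative; roughly LLaVA-1.5-like)
NUM_PATCHES, VIT_DIM, LLM_DIM = 576, 1024, 4096

def vit_encode(image: np.ndarray) -> np.ndarray:
    """Stand-in for a ViT: maps an image to per-patch embeddings."""
    return rng.standard_normal((NUM_PATCHES, VIT_DIM))

# Projection layer: maps vision features into the LLM's embedding space
W_proj = rng.standard_normal((VIT_DIM, LLM_DIM)) * 0.01

def project(patches: np.ndarray) -> np.ndarray:
    return patches @ W_proj

image = np.zeros((336, 336, 3))                    # dummy RGB image
text_tokens = rng.standard_normal((12, LLM_DIM))   # dummy prompt embeddings

visual_tokens = project(vit_encode(image))
# The LLM sees image tokens and text tokens in one sequence
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (588, 4096): 576 image tokens + 12 text tokens
```

The key idea is only the shapes: the projection makes image patches look like ordinary token embeddings, so the decoder needs no architectural changes.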

Key models in 2026:

| Model | Provider | Key Strength |
|---|---|---|
| GPT-4V / GPT-4o | OpenAI | Best general VQA, document understanding |
| Claude 3.5 Sonnet | Anthropic | Long context + vision, chart reading |
| Gemini 1.5 Pro | Google | 1M token context with video |
| LLaVA 1.6 | Open source | Local deployment, fine-tunable |
| Qwen-VL | Alibaba | Strong multilingual + vision |
| Phi-3 Vision | Microsoft | Lightweight, runs on CPU |

# Install dependencies
# !pip install openai anthropic pillow requests

1. GPT-4V via OpenAI API¶

import base64
import requests
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY env var

def encode_image_base64(image_path: str) -> str:
    """Encode a local image to base64 for API submission."""
    with open(image_path, 'rb') as f:
        return base64.b64encode(f.read()).decode('utf-8')

def ask_about_image(image_path: str, question: str, model: str = 'gpt-4o') -> str:
    """Ask GPT-4V a question about a local image."""
    image_data = encode_image_base64(image_path)
    ext = Path(image_path).suffix.lower().lstrip('.')
    media_type = {'jpg': 'jpeg', 'jpeg': 'jpeg', 'png': 'png', 'gif': 'gif', 'webp': 'webp'}.get(ext, 'jpeg')

    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                'role': 'user',
                'content': [
                    {'type': 'text', 'text': question},
                    {'type': 'image_url', 'image_url': {'url': f'data:image/{media_type};base64,{image_data}'}}
                ]
            }
        ],
        max_tokens=500
    )
    return response.choices[0].message.content

def ask_about_image_url(image_url: str, question: str, model: str = 'gpt-4o') -> str:
    """Ask GPT-4V a question about an image at a URL."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                'role': 'user',
                'content': [
                    {'type': 'text', 'text': question},
                    {'type': 'image_url', 'image_url': {'url': image_url}}
                ]
            }
        ],
        max_tokens=500
    )
    return response.choices[0].message.content

# Test with a URL
test_url = 'https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png'
answer = ask_about_image_url(test_url, 'What objects do you see in this image? Describe briefly.')
print('GPT-4o response:', answer)
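Full-resolution photos inflate upload size, latency, and token cost. A minimal sketch using Pillow (from the install cell above) that downscales before encoding; the 1024-pixel cap and JPEG quality of 90 are assumptions to tune per use case:

```python
import base64
import io
from PIL import Image

def encode_image_resized(image_path: str, max_side: int = 1024) -> str:
    """Downscale so the longest side is <= max_side, then base64-encode as JPEG."""
    img = Image.open(image_path).convert('RGB')
    scale = max_side / max(img.size)
    if scale < 1:
        img = img.resize((round(img.width * scale), round(img.height * scale)))
    buf = io.BytesIO()
    img.save(buf, format='JPEG', quality=90)
    return base64.b64encode(buf.getvalue()).decode('utf-8')
```

Drop this in for `encode_image_base64` above (and hardcode `media_type = 'jpeg'`, since the helper always re-encodes as JPEG).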

2. Claude Vision (Anthropic)¶

import anthropic
import base64
from pathlib import Path

claude = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY

def ask_claude_about_image(image_path: str, question: str) -> str:
    """Ask Claude Vision about a local image."""
    with open(image_path, 'rb') as f:
        image_data = base64.b64encode(f.read()).decode('utf-8')

    ext = Path(image_path).suffix.lower().lstrip('.')
    media_type = f'image/{"jpeg" if ext in ["jpg", "jpeg"] else ext}'

    message = claude.messages.create(
        model='claude-sonnet-4-6',
        max_tokens=500,
        messages=[
            {
                'role': 'user',
                'content': [
                    {'type': 'image', 'source': {'type': 'base64', 'media_type': media_type, 'data': image_data}},
                    {'type': 'text', 'text': question}
                ]
            }
        ]
    )
    return message.content[0].text

print('Claude Vision function defined; pass a local image path to use it.')
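Claude's Messages API accepts several image blocks in a single turn, which is handy for side-by-side comparisons. A small hypothetical helper, mirroring the block shape used above, that builds a multi-image content list:

```python
import base64
from pathlib import Path

MEDIA_TYPES = {'jpg': 'image/jpeg', 'jpeg': 'image/jpeg', 'png': 'image/png',
               'gif': 'image/gif', 'webp': 'image/webp'}

def build_claude_content(image_paths: list, question: str) -> list:
    """Build Messages API content blocks: N images followed by one text question."""
    blocks = []
    for path in image_paths:
        ext = Path(path).suffix.lower().lstrip('.')
        data = base64.b64encode(Path(path).read_bytes()).decode('utf-8')
        blocks.append({
            'type': 'image',
            'source': {'type': 'base64',
                       'media_type': MEDIA_TYPES.get(ext, 'image/jpeg'),
                       'data': data},
        })
    blocks.append({'type': 'text', 'text': question})
    return blocks
```

Usage: pass the result as the user message content, e.g. `claude.messages.create(model=..., max_tokens=500, messages=[{'role': 'user', 'content': build_claude_content(paths, 'Compare these charts.')}])`.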

3. Use Cases: Document Understanding¶

# Common VLM tasks
VLM_USE_CASES = {
    'Visual QA':         'What is happening in this image?',
    'Chart reading':     'Extract all data points from this chart as a JSON object.',
    'OCR + reasoning':   'Read the text in this document and summarize the key points.',
    'Code from diagram': 'Convert this UML diagram to Python class definitions.',
    'Receipt parsing':   'Extract: vendor, date, total, and line items from this receipt.',
    'Medical imaging':   'Describe any anomalies visible in this X-ray (educational only).',
    'Product catalog':   'Extract: product name, price, and features from this product image.',
    'Accessibility':     'Write detailed alt text for this image for screen reader users.',
}

print('Common VLM prompts by use case:')
for use_case, prompt in VLM_USE_CASES.items():
    shown = prompt if len(prompt) <= 60 else prompt[:60] + '...'
    print(f'  {use_case:20s} → "{shown}"')
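Several of these prompts ask for structured output. Extraction gets noticeably more reliable when the prompt pins down an exact schema rather than a loose field list; a small hypothetical helper illustrating the pattern:

```python
import json

def extraction_prompt(task: str, schema: dict) -> str:
    """Build a prompt asking a VLM to return JSON matching a fixed schema."""
    return (
        f'{task}\n'
        'Respond with ONLY a JSON object matching this schema '
        '(no prose, no markdown fences):\n'
        f'{json.dumps(schema, indent=2)}'
    )

# Example: the receipt-parsing use case above, made schema-explicit
receipt_schema = {
    'vendor': 'string',
    'date': 'YYYY-MM-DD',
    'total': 'number',
    'line_items': [{'name': 'string', 'price': 'number'}],
}
prompt = extraction_prompt('Extract the fields below from this receipt image.',
                           receipt_schema)
print(prompt)
```

The resulting string slots directly into any of the question parameters above; always `json.loads` the response defensively, since models occasionally add prose anyway.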

4. LLaVA: Open Source, Local Deployment¶

# LLaVA runs locally via Ollama; no API key needed
# Setup: ollama pull llava

import requests
import base64

def ask_llava_local(image_path: str, question: str, model: str = 'llava') -> str:
    """Ask LLaVA (running locally via Ollama) about an image."""
    with open(image_path, 'rb') as f:
        image_b64 = base64.b64encode(f.read()).decode('utf-8')

    response = requests.post(
        'http://localhost:11434/api/generate',
        json={
            'model': model,
            'prompt': question,
            'images': [image_b64],
            'stream': False
        }
    )
    return response.json()['response']

print('LLaVA local function defined.')
print('To use: ollama pull llava && ollama serve')
print('Then call: ask_llava_local("image.jpg", "What is in this image?")')
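With both an API backend and a local one defined, a thin router can decide per request, e.g. keeping sensitive images on-device. This is only a sketch of the dispatch pattern; the lambda backends are placeholders for `ask_about_image` and `ask_llava_local` from this notebook:

```python
def make_vqa_router(api_fn, local_fn):
    """Return a question-asking function that routes private images to the
    local VLM and everything else to the (usually stronger) API model."""
    def ask(image_path: str, question: str, private: bool = False) -> str:
        backend = local_fn if private else api_fn
        return backend(image_path, question)
    return ask

# Stand-in backends for demonstration; in practice pass
# api_fn=ask_about_image, local_fn=ask_llava_local
ask = make_vqa_router(
    api_fn=lambda path, q: f'[api] {q}',
    local_fn=lambda path, q: f'[local] {q}',
)
print(ask('scan.jpg', 'What is the total?', private=True))   # routed locally
print(ask('cat.jpg', 'What breed is this?'))                 # routed to the API
```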

5. Model Comparison¶

| Model | Speed | Cost | Quality | Local? |
|---|---|---|---|---|
| GPT-4o | Fast | $$$ | ★★★★★ | No |
| Claude 3.5 Sonnet | Fast | $$$ | ★★★★★ | No |
| Gemini 1.5 Flash | Very fast | $ | ★★★★ | No |
| LLaVA 1.6 (7B) | Slow | Free | ★★★ | ✅ Yes |
| Phi-3 Vision (4B) | Fast | Free | ★★★ | ✅ Yes |
| Qwen-VL (7B) | Medium | Free | ★★★★ | ✅ Yes |

Rule of thumb: Use GPT-4o/Claude for production, LLaVA/Phi-3 for local dev/privacy.
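The $$$ column can be made concrete for GPT-4o-class models: OpenAI bills image input by 512-pixel tiles (85 base tokens plus 170 per tile at high detail, per their published scheme at the time of writing; verify current numbers before budgeting):

```python
import math

def gpt4o_image_tokens(width: int, height: int, detail: str = 'high') -> int:
    """Estimate input tokens for one image under OpenAI's tiling rules."""
    if detail == 'low':
        return 85  # low detail is a flat rate regardless of size
    # 1) Scale to fit within 2048x2048, 2) scale shortest side to 768
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # 3) Count 512px tiles: 85 base + 170 per tile
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

print(gpt4o_image_tokens(1024, 1024))   # 765 (4 tiles)
print(gpt4o_image_tokens(4096, 8192))   # 1105 (6 tiles)
```

This also explains why the downscaling helper earlier pays off: shrinking an image below one tile caps its cost near the 85-token floor.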

Exercises¶

  1. Take a photo of a receipt and use GPT-4V to extract the items and total as JSON.

  2. Use Claude Vision to describe a chart from a research paper.

  3. Run LLaVA locally with Ollama and compare its output to GPT-4o on the same image.

  4. Build a simple image Q&A app: take a URL, ask a question, return the answer.