Vision-Language Models: GPT-4V, LLaVA & Gemini Vision¶

Models that understand both images and text: the foundation of modern multimodal AI.

What Are Vision-Language Models?¶

Vision-Language Models (VLMs) accept image + text as input and produce text as output.

Architecture: Image encoder (ViT) + projection layer + LLM decoder

Image ──→ [ViT Encoder] ──→ [Projection] ─┐
                                          ├─→ [LLM] ──→ Text response
Text prompt ──────────────────────────────┘
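Concretely, the pipeline above can be sketched with toy numpy shapes. The dimensions below (576 patch tokens, ViT width 1024, LLM width 4096) roughly match LLaVA 1.5 but are illustrative assumptions, and the "encoder" here is just random numbers standing in for a real ViT:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative; roughly LLaVA-1.5-like)
NUM_PATCHES, VIT_DIM, LLM_DIM = 576, 1024, 4096

def vit_encode(image: np.ndarray) -> np.ndarray:
    """Stand-in for a ViT: maps an image to per-patch embeddings."""
    return rng.standard_normal((NUM_PATCHES, VIT_DIM))

# Projection layer: maps vision features into the LLM's embedding space
W_proj = rng.standard_normal((VIT_DIM, LLM_DIM)) * 0.01

def project(patches: np.ndarray) -> np.ndarray:
    return patches @ W_proj

image = np.zeros((336, 336, 3))                    # dummy RGB image
text_tokens = rng.standard_normal((12, LLM_DIM))   # dummy prompt embeddings

visual_tokens = project(vit_encode(image))
# The LLM sees image tokens and text tokens in one sequence
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (588, 4096): 576 image tokens + 12 text tokens
```

The key idea is only the shapes: the projection makes image patches look like ordinary token embeddings, so the decoder needs no architectural changes.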

Key models in 2026:

| Model | Provider | Key Strength |
|---|---|---|
| GPT-4V / GPT-4o | OpenAI | Best general VQA, document understanding |
| Claude 3.5 Sonnet | Anthropic | Long context + vision, chart reading |
| Gemini 1.5 Pro | Google | 1M token context with video |
| LLaVA 1.6 | Open source | Local deployment, fine-tunable |
| Qwen-VL | Alibaba | Strong multilingual + vision |
| Phi-3 Vision | Microsoft | Lightweight, runs on CPU |

# Install dependencies
# !pip install openai anthropic pillow requests

1. GPT-4V via OpenAI API¶

import base64
import requests
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY env var

def encode_image_base64(image_path: str) -> str:
    """Encode a local image to base64 for API submission."""
    with open(image_path, 'rb') as f:
        return base64.b64encode(f.read()).decode('utf-8')

def ask_about_image(image_path: str, question: str, model: str = 'gpt-4o') -> str:
    """Ask GPT-4V a question about a local image."""
    image_data = encode_image_base64(image_path)
    ext = Path(image_path).suffix.lower().lstrip('.')
    media_type = {'jpg': 'jpeg', 'jpeg': 'jpeg', 'png': 'png', 'gif': 'gif', 'webp': 'webp'}.get(ext, 'jpeg')

    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                'role': 'user',
                'content': [
                    {'type': 'text', 'text': question},
                    {'type': 'image_url', 'image_url': {'url': f'data:image/{media_type};base64,{image_data}'}}
                ]
            }
        ],
        max_tokens=500
    )
    return response.choices[0].message.content

def ask_about_image_url(image_url: str, question: str, model: str = 'gpt-4o') -> str:
    """Ask GPT-4V a question about an image at a URL."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                'role': 'user',
                'content': [
                    {'type': 'text', 'text': question},
                    {'type': 'image_url', 'image_url': {'url': image_url}}
                ]
            }
        ],
        max_tokens=500
    )
    return response.choices[0].message.content

# Test with a URL
test_url = 'https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png'
answer = ask_about_image_url(test_url, 'What objects do you see in this image? Describe briefly.')
print('GPT-4o response:', answer)
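Full-resolution photos inflate upload size, latency, and token cost. A minimal sketch using Pillow (from the install cell above) that downscales before encoding; the 1024-pixel cap and JPEG quality of 90 are assumptions to tune per use case:

```python
import base64
import io
from PIL import Image

def encode_image_resized(image_path: str, max_side: int = 1024) -> str:
    """Downscale so the longest side is <= max_side, then base64-encode as JPEG."""
    img = Image.open(image_path).convert('RGB')
    scale = max_side / max(img.size)
    if scale < 1:
        img = img.resize((round(img.width * scale), round(img.height * scale)))
    buf = io.BytesIO()
    img.save(buf, format='JPEG', quality=90)
    return base64.b64encode(buf.getvalue()).decode('utf-8')
```

Drop this in for `encode_image_base64` above (and hardcode `media_type = 'jpeg'`, since the helper always re-encodes as JPEG).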

2. Claude Vision (Anthropic)¶

import anthropic
import base64
from pathlib import Path

claude = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY

def ask_claude_about_image(image_path: str, question: str) -> str:
    """Ask Claude Vision about a local image."""
    with open(image_path, 'rb') as f:
        image_data = base64.b64encode(f.read()).decode('utf-8')

    ext = Path(image_path).suffix.lower().lstrip('.')
    media_type = f'image/{"jpeg" if ext in ["jpg", "jpeg"] else ext}'

    message = claude.messages.create(
        model='claude-sonnet-4-6',
        max_tokens=500,
        messages=[
            {
                'role': 'user',
                'content': [
                    {'type': 'image', 'source': {'type': 'base64', 'media_type': media_type, 'data': image_data}},
                    {'type': 'text', 'text': question}
                ]
            }
        ]
    )
    return message.content[0].text

print('Claude Vision function defined; pass a local image path to use it.')
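Claude's Messages API accepts several image blocks in a single turn, which is handy for side-by-side comparisons. A small hypothetical helper, mirroring the block shape used above, that builds a multi-image content list:

```python
import base64
from pathlib import Path

MEDIA_TYPES = {'jpg': 'image/jpeg', 'jpeg': 'image/jpeg', 'png': 'image/png',
               'gif': 'image/gif', 'webp': 'image/webp'}

def build_claude_content(image_paths: list, question: str) -> list:
    """Build Messages API content blocks: N images followed by one text question."""
    blocks = []
    for path in image_paths:
        ext = Path(path).suffix.lower().lstrip('.')
        data = base64.b64encode(Path(path).read_bytes()).decode('utf-8')
        blocks.append({
            'type': 'image',
            'source': {'type': 'base64',
                       'media_type': MEDIA_TYPES.get(ext, 'image/jpeg'),
                       'data': data},
        })
    blocks.append({'type': 'text', 'text': question})
    return blocks
```

Usage: pass the result as the user message content, e.g. `claude.messages.create(model=..., max_tokens=500, messages=[{'role': 'user', 'content': build_claude_content(paths, 'Compare these charts.')}])`.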

3. Use Cases: Document Understanding¶

# Common VLM tasks
VLM_USE_CASES = {
    'Visual QA':         'What is happening in this image?',
    'Chart reading':     'Extract all data points from this chart as a JSON object.',
    'OCR + reasoning':   'Read the text in this document and summarize the key points.',
    'Code from diagram': 'Convert this UML diagram to Python class definitions.',
    'Receipt parsing':   'Extract: vendor, date, total, and line items from this receipt.',
    'Medical imaging':   'Describe any anomalies visible in this X-ray (educational only).',
    'Product catalog':   'Extract: product name, price, and features from this product image.',
    'Accessibility':     'Write detailed alt text for this image for screen reader users.',
}

print('Common VLM prompts by use case:')
for use_case, prompt in VLM_USE_CASES.items():
    shown = prompt if len(prompt) <= 60 else prompt[:60] + '...'
    print(f'  {use_case:20s} → "{shown}"')
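Several of these prompts ask for structured output. Extraction gets noticeably more reliable when the prompt pins down an exact schema rather than a loose field list; a small hypothetical helper illustrating the pattern:

```python
import json

def extraction_prompt(task: str, schema: dict) -> str:
    """Build a prompt asking a VLM to return JSON matching a fixed schema."""
    return (
        f'{task}\n'
        'Respond with ONLY a JSON object matching this schema '
        '(no prose, no markdown fences):\n'
        f'{json.dumps(schema, indent=2)}'
    )

# Example: the receipt-parsing use case above, made schema-explicit
receipt_schema = {
    'vendor': 'string',
    'date': 'YYYY-MM-DD',
    'total': 'number',
    'line_items': [{'name': 'string', 'price': 'number'}],
}
prompt = extraction_prompt('Extract the fields below from this receipt image.',
                           receipt_schema)
print(prompt)
```

The resulting string slots directly into any of the question parameters above; always `json.loads` the response defensively, since models occasionally add prose anyway.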

4. LLaVA: Open Source, Local Deployment¶

# LLaVA runs locally via Ollama; no API key needed
# Setup: ollama pull llava

import requests
import base64

def ask_llava_local(image_path: str, question: str, model: str = 'llava') -> str:
    """Ask LLaVA (running locally via Ollama) about an image."""
    with open(image_path, 'rb') as f:
        image_b64 = base64.b64encode(f.read()).decode('utf-8')

    response = requests.post(
        'http://localhost:11434/api/generate',
        json={
            'model': model,
            'prompt': question,
            'images': [image_b64],
            'stream': False
        }
    )
    return response.json()['response']

print('LLaVA local function defined.')
print('To use: ollama pull llava && ollama serve')
print('Then call: ask_llava_local("image.jpg", "What is in this image?")')
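With both an API backend and a local one defined, a thin router can decide per request, e.g. keeping sensitive images on-device. This is only a sketch of the dispatch pattern; the lambda backends are placeholders for `ask_about_image` and `ask_llava_local` from this notebook:

```python
def make_vqa_router(api_fn, local_fn):
    """Return a question-asking function that routes private images to the
    local VLM and everything else to the (usually stronger) API model."""
    def ask(image_path: str, question: str, private: bool = False) -> str:
        backend = local_fn if private else api_fn
        return backend(image_path, question)
    return ask

# Stand-in backends for demonstration; in practice pass
# api_fn=ask_about_image, local_fn=ask_llava_local
ask = make_vqa_router(
    api_fn=lambda path, q: f'[api] {q}',
    local_fn=lambda path, q: f'[local] {q}',
)
print(ask('scan.jpg', 'What is the total?', private=True))   # routed locally
print(ask('cat.jpg', 'What breed is this?'))                 # routed to the API
```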

5. Model Comparison¶

| Model | Speed | Cost | Quality | Local? |
|---|---|---|---|---|
| GPT-4o | Fast | $$$ | ★★★★★ | No |
| Claude 3.5 Sonnet | Fast | $$$ | ★★★★★ | No |
| Gemini 1.5 Flash | Very fast | $ | ★★★★ | No |
| LLaVA 1.6 (7B) | Slow | Free | ★★★ | ✅ Yes |
| Phi-3 Vision (4B) | Fast | Free | ★★★ | ✅ Yes |
| Qwen-VL (7B) | Medium | Free | ★★★★ | ✅ Yes |

Rule of thumb: Use GPT-4o/Claude for production, LLaVA/Phi-3 for local dev/privacy.
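The $$$ column can be made concrete for GPT-4o-class models: OpenAI bills image input by 512-pixel tiles (85 base tokens plus 170 per tile at high detail, per their published scheme at the time of writing; verify current numbers before budgeting):

```python
import math

def gpt4o_image_tokens(width: int, height: int, detail: str = 'high') -> int:
    """Estimate input tokens for one image under OpenAI's tiling rules."""
    if detail == 'low':
        return 85  # low detail is a flat rate regardless of size
    # 1) Scale to fit within 2048x2048, 2) scale shortest side to 768
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # 3) Count 512px tiles: 85 base + 170 per tile
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

print(gpt4o_image_tokens(1024, 1024))   # 765 (4 tiles)
print(gpt4o_image_tokens(4096, 8192))   # 1105 (6 tiles)
```

This also explains why the downscaling helper earlier pays off: shrinking an image below one tile caps its cost near the 85-token floor.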

Exercises¶

  1. Take a photo of a receipt and use GPT-4V to extract the items and total as JSON.

  2. Use Claude Vision to describe a chart from a research paper.

  3. Run LLaVA locally with Ollama and compare its output to GPT-4o on the same image.

  4. Build a simple image Q&A app: take a URL, ask a question, return the answer.