Vision-Language Models: GPT-4V, LLaVA & Gemini Vision¶
Models that understand both images and text: the foundation of modern multimodal AI.
What Are Vision-Language Models?¶
Vision-Language Models (VLMs) accept image + text as input and produce text as output.
Architecture: Image encoder (ViT) + projection layer + LLM decoder
Image ──▶ [ViT Encoder] ──▶ [Projection] ──┐
                                           ├──▶ [LLM] ──▶ Text response
Text prompt ───────────────────────────────┘
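The data flow above can be sketched in a few lines of NumPy. The dimensions and the random "features" are purely illustrative stand-ins, not those of any real model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from any real model)
n_patches, vit_dim, llm_dim, n_text_tokens = 256, 1024, 4096, 12

# 1. ViT encoder: image -> one feature vector per patch (stand-in: random)
patch_features = rng.standard_normal((n_patches, vit_dim))

# 2. Projection layer: map vision features into the LLM's embedding space
W_proj = rng.standard_normal((vit_dim, llm_dim)) * 0.01
image_tokens = patch_features @ W_proj  # shape (256, 4096)

# 3. Concatenate image tokens with text-prompt embeddings; the LLM decoder
#    then attends over both and generates the text response
text_tokens = rng.standard_normal((n_text_tokens, llm_dim))
llm_input = np.concatenate([image_tokens, text_tokens], axis=0)

print(llm_input.shape)  # (268, 4096)
```

The key idea: after projection, image patches are just extra "tokens" in the LLM's input sequence.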
Key models in 2026:

| Model | Provider | Key Strength |
|---|---|---|
| GPT-4V / GPT-4o | OpenAI | Best general VQA, document understanding |
| Claude 3.5 Sonnet | Anthropic | Long context + vision, chart reading |
| Gemini 1.5 Pro | Google | 1M token context with video |
| LLaVA 1.6 | Open source | Local deployment, fine-tunable |
| Qwen-VL | Alibaba | Strong multilingual + vision |
| Phi-3 Vision | Microsoft | Lightweight, runs on CPU |
# Install dependencies
# !pip install openai anthropic pillow requests
1. GPT-4V via OpenAI API¶
import base64
import requests
from pathlib import Path
from openai import OpenAI
client = OpenAI() # uses OPENAI_API_KEY env var
def encode_image_base64(image_path: str) -> str:
"""Encode a local image to base64 for API submission."""
with open(image_path, 'rb') as f:
return base64.b64encode(f.read()).decode('utf-8')
def ask_about_image(image_path: str, question: str, model: str = 'gpt-4o') -> str:
"""Ask GPT-4V a question about a local image."""
image_data = encode_image_base64(image_path)
ext = Path(image_path).suffix.lower().lstrip('.')
media_type = {'jpg': 'jpeg', 'jpeg': 'jpeg', 'png': 'png', 'gif': 'gif', 'webp': 'webp'}.get(ext, 'jpeg')
response = client.chat.completions.create(
model=model,
messages=[
{
'role': 'user',
'content': [
{'type': 'text', 'text': question},
{'type': 'image_url', 'image_url': {'url': f'data:image/{media_type};base64,{image_data}'}}
]
}
],
max_tokens=500
)
return response.choices[0].message.content
def ask_about_image_url(image_url: str, question: str, model: str = 'gpt-4o') -> str:
"""Ask GPT-4V a question about an image at a URL."""
response = client.chat.completions.create(
model=model,
messages=[
{
'role': 'user',
'content': [
{'type': 'text', 'text': question},
{'type': 'image_url', 'image_url': {'url': image_url}}
]
}
],
max_tokens=500
)
return response.choices[0].message.content
# Test with a URL
test_url = 'https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png'
answer = ask_about_image_url(test_url, 'What objects do you see in this image? Describe briefly.')
print('GPT-4o response:', answer)
2. Claude Vision (Anthropic)¶
import anthropic
import base64
from pathlib import Path
claude = anthropic.Anthropic() # uses ANTHROPIC_API_KEY
def ask_claude_about_image(image_path: str, question: str) -> str:
"""Ask Claude Vision about a local image."""
with open(image_path, 'rb') as f:
image_data = base64.b64encode(f.read()).decode('utf-8')
ext = Path(image_path).suffix.lower().lstrip('.')
media_type = f'image/{"jpeg" if ext in ["jpg", "jpeg"] else ext}'
message = claude.messages.create(
model='claude-sonnet-4-6',
max_tokens=500,
messages=[
{
'role': 'user',
'content': [
{'type': 'image', 'source': {'type': 'base64', 'media_type': media_type, 'data': image_data}},
{'type': 'text', 'text': question}
]
}
]
)
return message.content[0].text
print('Claude Vision function defined; pass a local image path to use it.')
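Claude's Messages API accepts several image blocks in a single user turn, which is handy for "compare these two charts" prompts. A sketch of a helper that builds the image blocks (this helper is our own; only the block shapes follow the API):

```python
import base64
from pathlib import Path


def build_claude_image_blocks(image_paths: list) -> list:
    """Build base64 image content blocks for a claude.messages.create call.

    Returns a list of {'type': 'image', 'source': {...}} dicts; append a
    {'type': 'text', ...} block with your question before sending.
    """
    blocks = []
    for path in image_paths:
        ext = Path(path).suffix.lower().lstrip('.')
        media_type = f'image/{"jpeg" if ext in ("jpg", "jpeg") else ext}'
        with open(path, 'rb') as f:
            data = base64.b64encode(f.read()).decode('utf-8')
        blocks.append({
            'type': 'image',
            'source': {'type': 'base64', 'media_type': media_type, 'data': data},
        })
    return blocks
```

Usage: `content = build_claude_image_blocks(['q1.png', 'q2.png']) + [{'type': 'text', 'text': 'Which chart shows faster growth?'}]`.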
3. Use Cases: Document Understanding¶
# Common VLM tasks
VLM_USE_CASES = {
'Visual QA': 'What is happening in this image?',
'Chart reading': 'Extract all data points from this chart as a JSON object.',
'OCR + reasoning': 'Read the text in this document and summarize the key points.',
'Code from diagram': 'Convert this UML diagram to Python class definitions.',
'Receipt parsing': 'Extract: vendor, date, total, and line items from this receipt.',
'Medical imaging': 'Describe any anomalies visible in this X-ray (educational only).',
'Product catalog': 'Extract: product name, price, and features from this product image.',
'Accessibility': 'Write detailed alt text for this image for screen reader users.',
}
print('Common VLM prompts by use case:')
for use_case, prompt in VLM_USE_CASES.items():
    shown = prompt if len(prompt) <= 60 else prompt[:60] + '...'
    print(f'  {use_case:20s} → "{shown}"')
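For structured tasks like receipt or chart extraction, models often wrap the requested JSON in a markdown code fence. A small parser (an illustrative helper, not part of any SDK) makes such replies machine-readable:

```python
import json
import re


def parse_json_response(text: str) -> dict:
    """Extract a JSON object from a model reply, tolerating ```json fences."""
    match = re.search(r'```(?:json)?\s*(.*?)\s*```', text, re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload)


# Simulated model reply to the 'Receipt parsing' prompt above
reply = '```json\n{"vendor": "ACME", "total": 12.5}\n```'
print(parse_json_response(reply))  # {'vendor': 'ACME', 'total': 12.5}
```

Feed the output of `ask_about_image(...)` straight into `parse_json_response` when your prompt asks for JSON.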
4. LLaVA – Open Source, Local Deployment¶
# LLaVA runs locally via Ollama – no API key needed
# Setup: ollama pull llava
import requests
import base64
def ask_llava_local(image_path: str, question: str, model: str = 'llava') -> str:
"""Ask LLaVA (running locally via Ollama) about an image."""
with open(image_path, 'rb') as f:
image_b64 = base64.b64encode(f.read()).decode('utf-8')
response = requests.post(
'http://localhost:11434/api/generate',
json={
'model': model,
'prompt': question,
'images': [image_b64],
'stream': False
}
)
return response.json()['response']
print('LLaVA local function defined.')
print('To use: ollama pull llava && ollama serve')
print('Then call: ask_llava_local("image.jpg", "What is in this image?")')
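Before calling `ask_llava_local`, it's worth checking that the Ollama server is actually up; otherwise `requests.post` fails with a connection error. A quick probe against Ollama's `/api/tags` endpoint (which lists pulled models):

```python
import requests


def ollama_available(base_url: str = 'http://localhost:11434', timeout: float = 2.0) -> bool:
    """Return True if an Ollama server is reachable at base_url."""
    try:
        return requests.get(f'{base_url}/api/tags', timeout=timeout).status_code == 200
    except requests.RequestException:
        # Connection refused / timeout -> no server running
        return False


print('Ollama running:', ollama_available())
```

Guard your calls with `if ollama_available(): ...` and fall back to a hosted API otherwise.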
5. Model Comparison¶
| Model | Speed | Cost | Quality | Local? |
|---|---|---|---|---|
| GPT-4o | Fast | $$$ | ★★★★★ | No |
| Claude 3.5 Sonnet | Fast | $$$ | ★★★★★ | No |
| Gemini 1.5 Flash | Very fast | $ | ★★★★ | No |
| LLaVA 1.6 (7B) | Slow | Free | ★★★ | ✅ Yes |
| Phi-3 Vision (4B) | Fast | Free | ★★★ | ✅ Yes |
| Qwen-VL (7B) | Medium | Free | ★★★★ | ✅ Yes |
Rule of thumb: Use GPT-4o/Claude for production, LLaVA/Phi-3 for local dev/privacy.
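The rule of thumb can be encoded as a tiny dispatcher (the criteria and returned model names are this guide's, not an official API):

```python
def choose_vlm(private_data: bool = False, budget_sensitive: bool = False) -> str:
    """Pick a VLM per the rule of thumb above."""
    if private_data:
        return 'llava'             # local via Ollama; data never leaves the machine
    if budget_sensitive:
        return 'gemini-1.5-flash'  # cheapest hosted option in the table
    return 'gpt-4o'                # default production choice


print(choose_vlm(private_data=True))      # llava
print(choose_vlm(budget_sensitive=True))  # gemini-1.5-flash
print(choose_vlm())                       # gpt-4o
```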
Exercises¶
1. Take a photo of a receipt and use GPT-4V to extract the items and total as JSON.
2. Use Claude Vision to describe a chart from a research paper.
3. Run LLaVA locally with Ollama and compare its output to GPT-4o on the same image.
4. Build a simple image Q&A app: take a URL, ask a question, return the answer.