Claude Vision Patterns
Quick Guide: Use type: "image" content blocks for images (base64, URL, or file_id) and type: "document" content blocks for PDFs. Supported image formats: JPEG, PNG, GIF, WebP. Placing images before text in the content array improves results. Token cost formula: tokens = (width * height) / 750. Images are auto-resized if the long edge exceeds 1568px or the image exceeds ~1600 tokens. PDFs use type: "document" with media_type: "application/pdf". No OCR library needed -- Claude reads text directly from images and PDFs.
<critical_requirements>
CRITICAL: Before Using This Skill
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, import type, named constants)
(You MUST use type: "image" for images and type: "document" for PDFs -- they are different content block types)
(You MUST place images and documents BEFORE text in the content array -- Claude performs better with visual content first)
(You MUST always provide max_tokens in every request -- it is required and has no default)
(You MUST iterate over response.content blocks -- never assume a single text block in the response)
(You MUST use named constants for max_tokens, token budgets, and pixel limits -- no magic numbers)
</critical_requirements>
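The last three requirements can be sketched together as follows. This is a minimal illustration under stated assumptions, not the SDK's own types: MAX_OUTPUT_TOKENS, ContentBlock, RequestParams, buildRequestParams, and collectText are hypothetical names, the token budget is an arbitrary example value, and the model id is supplied by the caller rather than assumed here.

```typescript
// Hypothetical named constant (no magic numbers); tune for your workload.
const MAX_OUTPUT_TOKENS = 1024;

// Simplified local shapes, assumed for illustration; the SDK's types are richer.
interface ContentBlock {
  type: string;
  text?: string;
  source?: unknown;
}

interface RequestParams {
  model: string;
  max_tokens: number;
  messages: { role: "user" | "assistant"; content: ContentBlock[] }[];
}

// max_tokens is always set explicitly because the API requires it.
function buildRequestParams(model: string, content: ContentBlock[]): RequestParams {
  return {
    model,
    max_tokens: MAX_OUTPUT_TOKENS,
    messages: [{ role: "user", content }],
  };
}

// Walk every block in the response instead of reading content[0].text.
function collectText(blocks: ContentBlock[]): string {
  const parts: string[] = [];
  for (const block of blocks) {
    if (block.type === "text" && typeof block.text === "string") {
      parts.push(block.text);
    }
  }
  return parts.join("\n");
}
```

Non-text blocks are skipped rather than treated as errors, so the helper keeps working if a response mixes text with other block types.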
Auto-detection: Claude vision, image analysis, image input, base64 image, URL image, type image, type document, media_type image/jpeg, media_type image/png, image/webp, image/gif, application/pdf, PDF processing, document extraction, multimodal, multi-image, image comparison, chart analysis, screenshot analysis, image understanding, visual content, vision API
When to use:
- Sending images to Claude for analysis, description, or data extraction
- Processing PDF documents for text extraction, chart analysis, or summarization
- Comparing multiple images in a single request
- Extracting structured data from screenshots, receipts, charts, or forms
- Building document processing pipelines with Claude
- Estimating token costs for image-heavy workloads
Key patterns covered:
- Image input via base64, URL, and Files API
- PDF document input and processing
- Multi-image requests and comparison patterns
- Image + text prompting best practices
- Token cost estimation and image sizing
- Structured data extraction from visual content
- Multi-turn vision conversations
- Prompt caching with images and PDFs
When NOT to use:
- General Claude API usage without images or documents -- use the general Anthropic SDK patterns instead
- Image generation or editing -- Claude is understanding-only, it cannot create or modify images
- Identifying specific people in images -- Claude refuses to name people (Anthropic policy)
- Medical diagnostic imaging (CTs, MRIs) -- not designed for clinical diagnosis
Examples Index
- Core: Image & PDF Input -- Base64, URL, file_id, PDF input, multi-image, token estimation
- Extraction & Prompting -- Structured extraction, comparison, prompting best practices, caching
- Quick API Reference -- Content block types, supported formats, size limits, token formula
<philosophy>
Philosophy
Claude's vision capabilities treat images and documents as first-class content blocks alongside text. There is no separate "vision API" -- you add image or document blocks to the same Messages API you already use for text.
Core principles:
- Images are content blocks, not attachments -- Images and PDFs are content blocks in a message's content array, interleaved with text. Even when the source is a URL or a Files API file_id, the image still enters the request as a content block.
- Image-first ordering -- Place images before text in the content array. This mirrors how "documents first, query last" improves text prompts. Claude processes visual content better when it sees the image before the question.
- No OCR needed -- Claude reads text directly from images and PDFs. You do not need to pre-extract text with an OCR library. For PDFs, Claude processes both the extracted text and a rendered image of each page.
- Token costs scale with pixels -- Image tokens are proportional to resolution: tokens = (width * height) / 750. Downsizing images before sending saves tokens without losing meaningful detail for most use cases.
- PDFs are dual-processed -- Each PDF page is converted to an image AND has its text extracted. Claude sees both, giving it access to visual layout and textual content.
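The token formula in the principles above can be sketched as a small estimator. This is a rough illustration only: PIXELS_PER_TOKEN, MAX_LONG_EDGE_PX, Dimensions, fitToLongEdge, and estimateImageTokens are hypothetical names, the result is rounded up as an assumption, and only the 1568px long-edge resize is modeled (the approximate ~1600-token cap is not).

```typescript
// Named constants taken from the figures quoted in this guide.
const PIXELS_PER_TOKEN = 750;
const MAX_LONG_EDGE_PX = 1568;

interface Dimensions {
  width: number;
  height: number;
}

// Scale the image down (preserving aspect ratio) if its long edge is too large.
function fitToLongEdge(size: Dimensions, maxLongEdge: number = MAX_LONG_EDGE_PX): Dimensions {
  const longEdge = Math.max(size.width, size.height);
  if (longEdge <= maxLongEdge) {
    return size;
  }
  const scale = maxLongEdge / longEdge;
  return {
    width: Math.round(size.width * scale),
    height: Math.round(size.height * scale),
  };
}

// tokens = (width * height) / 750, applied after the long-edge resize.
function estimateImageTokens(size: Dimensions): number {
  const fitted = fitToLongEdge(size);
  return Math.ceil((fitted.width * fitted.height) / PIXELS_PER_TOKEN);
}
```

For example, a 1000x1000 image is estimated at 1334 tokens, while a 3136x784 image is first scaled to 1568x392 and estimated at 820 tokens.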
When to use vision:
- Analyzing screenshots, photos, charts, diagrams, or infographics
- Extracting data from forms, receipts, invoices, or tables
- Processing PDF documents for summarization, extraction, or analysis
- Comparing multiple images (before/after, A/B testing, design review)
- Understanding visual context that text alone cannot capture
When NOT to use:
- Pure text tasks with no visual component -- vision adds unnecessary token cost
- Tasks requiring pixel-perfect spatial precision -- Claude's spatial reasoning is approximate
- Identifying specific people -- Claude refuses to name individuals (Anthropic policy)
- Replacing professional medical imaging analysis (CTs, MRIs, X-rays)
</philosophy>
<patterns>
Core Patterns
Pattern 1: Base64 Image Input
Read a local file, encode to base64, send as type: "image" content block. Image block before text block.
// Image block first, text prompt second
content: [
  {
    type: "image",
    source: { type: "base64", media_type: "image/png", data: imageData },
  },
  { type: "text", text: "Describe what you see in this image." },
];
Why good: Image before text improves results, explicit media_type, structured content blocks
// BAD: base64 as text string -- Claude cannot interpret raw base64
content: "What's in this image? " + imageData;
Why bad: Passing base64 as text string instead of image content block, Claude cannot interpret raw base64 text as an image
See: examples/core.md for full runnable examples with base64, URL, and Files API
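As a minimal sketch of the pattern above, the helper below builds the image-first content array from data that has already been base64-encoded. All names here (ImageMediaType, MEDIA_TYPE_BY_EXTENSION, mediaTypeForFile, buildImageContent) are hypothetical, and inferring the media type from the file extension is an assumption made for illustration; reading and encoding the file is left to the caller.

```typescript
type ImageMediaType = "image/jpeg" | "image/png" | "image/gif" | "image/webp";

// Extension lookup for the four supported image formats.
const MEDIA_TYPE_BY_EXTENSION = new Map<string, ImageMediaType>([
  ["jpg", "image/jpeg"],
  ["jpeg", "image/jpeg"],
  ["png", "image/png"],
  ["gif", "image/gif"],
  ["webp", "image/webp"],
]);

interface ImageBlock {
  type: "image";
  source: { type: "base64"; media_type: ImageMediaType; data: string };
}

interface TextBlock {
  type: "text";
  text: string;
}

// Infer the media type from the file extension; reject unsupported formats.
function mediaTypeForFile(fileName: string): ImageMediaType {
  const extension = fileName.split(".").pop()?.toLowerCase() ?? "";
  const mediaType = MEDIA_TYPE_BY_EXTENSION.get(extension);
  if (mediaType === undefined) {
    throw new Error(`Unsupported image extension: ${fileName}`);
  }
  return mediaType;
}

// Returns the content array with the image block before the text block.
function buildImageContent(
  fileName: string,
  base64Data: string,
  prompt: string,
): [ImageBlock, TextBlock] {
  return [
    {
      type: "image",
      source: { type: "base64", media_type: mediaTypeForFile(fileName), data: base64Data },
    },
    { type: "text", text: prompt },
  ];
}
```

Returning a fixed two-element tuple keeps the image-first ordering visible in the type itself.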
Pattern 2: URL vs Base64 vs Files API
Three source types for images. Choose based on where your image lives.
// URL source -- simplest, smallest payload
source: { type: "url", url: "https://example.com/chart.png" }
// Base64 source -- local files
source: { type: "base64", media_type: "image/jpeg", data: base64String }
// Files API source (beta) -- upload once, reuse across requests
source: { type: "file", file_id: "file_abc123" }
When to use: URL for hosted images, base64 for local files, Files API for multi-turn or repeated use
See: examples/core.md for full examples of each source type
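One way to encode the choice between the three source types is a discriminated union keyed on where the image lives. This is a sketch under assumptions: ImageLocation, ImageSource, and toImageSource are hypothetical names, and the source shapes simply mirror the three fragments shown above.

```typescript
// The three source shapes shown in this pattern.
type ImageSource =
  | { type: "url"; url: string }
  | { type: "base64"; media_type: string; data: string }
  | { type: "file"; file_id: string };

// Hypothetical description of where an image currently lives.
type ImageLocation =
  | { kind: "hosted"; url: string }
  | { kind: "local"; mediaType: string; base64Data: string }
  | { kind: "uploaded"; fileId: string };

// Map each location to the matching source type.
function toImageSource(location: ImageLocation): ImageSource {
  switch (location.kind) {
    case "hosted":
      return { type: "url", url: location.url };
    case "local":
      return { type: "base64", media_type: location.mediaType, data: location.base64Data };
    case "uploaded":
      return { type: "file", file_id: location.fileId };
  }
}
```

Because the switch is exhaustive over ImageLocation, adding a fourth location kind later produces a compile error until it is handled.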
Pattern 3: PDF Document Input
PDFs use type: "document" -- different from type: "image". This is the most common mistake.
// Correct: type "document" for PDFs
{ type: "document", source: { type: "base64", media_type: "application/pdf", data: pdfData } }
// WRONG: type "image" for PDFs -- causes API errors
{ type: "image", source: { type: "base64", media_type: "application/pdf", data: pdfData } }
Why good: type: "document" enables dual processing (text extraction + page rendering)
Why bad: Using type: "image" for PDFs causes API errors. PDFs require type: "document".
See: examples/core.md for base64 and URL PDF examples, examples/extraction.md for PDF caching
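To guard against the image/document mix-up described above, the block type can be derived from the media type instead of being written by hand. The following is a minimal sketch; PDF_MEDIA_TYPE, IMAGE_MEDIA_TYPES, VisualBlock, and toVisualBlock are hypothetical names introduced for illustration.

```typescript
const PDF_MEDIA_TYPE = "application/pdf";
const IMAGE_MEDIA_TYPES = new Set(["image/jpeg", "image/png", "image/gif", "image/webp"]);

interface VisualBlock {
  type: "image" | "document";
  source: { type: "base64"; media_type: string; data: string };
}

// Pick "document" for PDFs and "image" for supported image formats.
function toVisualBlock(mediaType: string, base64Data: string): VisualBlock {
  const source = { type: "base64" as const, media_type: mediaType, data: base64Data };
  if (mediaType === PDF_MEDIA_TYPE) {
    return { type: "document", source };
  }
  if (IMAGE_MEDIA_TYPES.has(mediaType)) {
    return { type: "image", source };
  }
  // Anything else is rejected locally rather than sent to the API.
  throw new Error(`Unsupported media type: ${mediaType}`);
}
```

Failing fast on unknown media types surfaces the mistake at the call site rather than as an API error.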
Pattern 4: Multiple Images with Labels
Label images with text blocks so Claude can reference them clearly.
content: [
  { type: "text", text: "Image 1:" },
  {
    type: "image",
    source: { type: "base64", media_type: "image/jpeg", data: image1 },
  },
  { type: "text", text: "Image 2:" },
  {
    type: "image",
    source: { type: "base64", medi