Claude Vision Patterns

Quick Guide: Use type: "image" content blocks for images (base64, URL, or file_id) and type: "document" content blocks for PDFs. Supported image formats: JPEG, PNG, GIF, WebP. Placing images before text in the content array improves results. Approximate token cost: tokens = (width * height) / 750. Images are automatically downscaled if the long edge exceeds 1568px or the image would cost more than ~1600 tokens. PDFs use type: "document" with media_type: "application/pdf". No OCR library is needed -- Claude reads text directly from images and PDFs.

<critical_requirements>

CRITICAL: Before Using This Skill

All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, import type, named constants)

(You MUST use type: "image" for images and type: "document" for PDFs -- they are different content block types)

(You MUST place images and documents BEFORE text in the content array -- Claude performs better with visual content first)

(You MUST always provide max_tokens in every request -- it is required and has no default)

(You MUST iterate over response.content blocks -- never assume a single text block in the response)

(You MUST use named constants for max_tokens, token budgets, and pixel limits -- no magic numbers)

</critical_requirements>
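The request-shape requirements above (always send max_tokens, use named constants, iterate over response.content) can be sketched as follows. This is a minimal illustration, not SDK code: ResponseBlock is a simplified stand-in for the response block types, and MAX_OUTPUT_TOKENS, buildParams, and collectText are hypothetical names with an arbitrary token budget.

```typescript
// Hypothetical named constant: max_tokens is required and has no default.
const MAX_OUTPUT_TOKENS = 1024;

// Simplified stand-in for a response content block (assumption, not the SDK type).
interface ResponseBlock {
  type: string;
  text?: string;
}

// Build request parameters; the model name is supplied by the caller.
function buildParams(model: string, content: unknown[]) {
  return {
    model,
    max_tokens: MAX_OUTPUT_TOKENS,
    messages: [{ role: "user" as const, content }],
  };
}

// Walk every block instead of assuming a single text block at index 0.
function collectText(blocks: ResponseBlock[]): string {
  const parts: string[] = [];
  for (const block of blocks) {
    if (block.type === "text" && typeof block.text === "string") {
      parts.push(block.text);
    }
  }
  return parts.join("\n");
}
```

Non-text blocks are skipped rather than treated as errors, so the helper keeps working if a response mixes block types.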

Auto-detection: Claude vision, image analysis, image input, base64 image, URL image, type image, type document, media_type image/jpeg, media_type image/png, image/webp, image/gif, application/pdf, PDF processing, document extraction, multimodal, multi-image, image comparison, chart analysis, screenshot analysis, image understanding, visual content, vision API

When to use:

Sending images to Claude for analysis, description, or data extraction
Processing PDF documents for text extraction, chart analysis, or summarization
Comparing multiple images in a single request
Extracting structured data from screenshots, receipts, charts, or forms
Building document processing pipelines with Claude
Estimating token costs for image-heavy workloads

Key patterns covered:

Image input via base64, URL, and Files API
PDF document input and processing
Multi-image requests and comparison patterns
Image + text prompting best practices
Token cost estimation and image sizing
Structured data extraction from visual content
Multi-turn vision conversations
Prompt caching with images and PDFs

When NOT to use:

General Claude API usage without images or documents -- use the general Anthropic SDK patterns instead
Image generation or editing -- Claude only analyzes images; it cannot create or modify them
Identifying specific people in images -- Claude refuses to name people (Anthropic policy)
Medical diagnostic imaging (CTs, MRIs) -- not designed for clinical diagnosis

Examples Index

Core: Image & PDF Input -- Base64, URL, file_id, PDF input, multi-image, token estimation
Extraction & Prompting -- Structured extraction, comparison, prompting best practices, caching
Quick API Reference -- Content block types, supported formats, size limits, token formula

<philosophy>

Philosophy

Claude's vision capabilities treat images and documents as first-class content blocks alongside text. There is no separate "vision API" -- you add image or document blocks to the same Messages API you already use for text.

Core principles:

Images are content blocks, not attachments -- Images and PDFs are content blocks in a message's content array, interleaved with text, rather than out-of-band attachments.
Image-first ordering -- Place images before text in the content array. This mirrors the "documents first, query last" guidance for text prompts: Claude processes visual content better when it sees the image before the question.
No OCR needed -- Claude reads text directly from images and PDFs. You do not need to pre-extract text with an OCR library. For PDFs, Claude processes both the extracted text and a rendered image of each page.
Token costs scale with pixels -- Image tokens are proportional to resolution: tokens = (width * height) / 750. Downsizing images before sending saves tokens without losing meaningful detail for most use cases.
PDFs are dual-processed -- Each PDF page is converted to an image AND has its text extracted. Claude sees both, giving it access to visual layout and textual content.
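To make the token-cost principle concrete, here is a small estimator built on the formula above. It is a sketch under stated assumptions: it models only the 1568px long-edge limit (not the ~1600-token cap), rounding up is my choice, and the names PIXELS_PER_TOKEN, MAX_LONG_EDGE_PX, fitLongEdge, and estimateImageTokens are illustrative.

```typescript
// Named constants taken from the formula and resize limit stated in this guide.
const PIXELS_PER_TOKEN = 750;
const MAX_LONG_EDGE_PX = 1568;

interface Size {
  width: number;
  height: number;
}

// Scale down so the long edge fits the limit, preserving aspect ratio.
function fitLongEdge(size: Size): Size {
  const longEdge = Math.max(size.width, size.height);
  if (longEdge <= MAX_LONG_EDGE_PX) {
    return size;
  }
  const scale = MAX_LONG_EDGE_PX / longEdge;
  return {
    width: Math.round(size.width * scale),
    height: Math.round(size.height * scale),
  };
}

// Approximate tokens = (width * height) / 750, applied after the resize.
function estimateImageTokens(size: Size): number {
  const fitted = fitLongEdge(size);
  return Math.ceil((fitted.width * fitted.height) / PIXELS_PER_TOKEN);
}

console.log(estimateImageTokens({ width: 1000, height: 1000 })); // 1334
console.log(estimateImageTokens({ width: 200, height: 200 })); // 54
```

Treat the result as a budgeting estimate only; the usage reported in the API response is the authoritative count.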

When to use vision:

Analyzing screenshots, photos, charts, diagrams, or infographics
Extracting data from forms, receipts, invoices, or tables
Processing PDF documents for summarization, extraction, or analysis
Comparing multiple images (before/after, A/B testing, design review)
Understanding visual context that text alone cannot capture

When NOT to use:

Pure text tasks with no visual component -- vision adds unnecessary token cost
Tasks requiring pixel-perfect spatial precision -- Claude's spatial reasoning is approximate
Identifying specific people -- Claude refuses to name individuals (Anthropic policy)
Replacing professional medical imaging analysis (CTs, MRIs, X-rays)

</philosophy>

Core Patterns

Pattern 1: Base64 Image Input

Read a local file, encode to base64, send as type: "image" content block. Image block before text block.

// Image block first, text prompt second (fragment of a user message)
content: [
  {
    type: "image",
    source: { type: "base64", media_type: "image/png", data: imageData },
  },
  { type: "text", text: "Describe what you see in this image." },
];

Why good: The image block precedes the text block, media_type is explicit, and the content is a structured array of typed blocks

// BAD: base64 as text string -- Claude cannot interpret raw base64
content: "What's in this image? " + imageData;

Why bad: The base64 data is concatenated into a text string, so Claude receives it as plain text and cannot interpret it as an image

See: examples/core.md for full runnable examples with base64, URL, and Files API
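As a companion to the fragment above, this sketch derives media_type from a file name and assembles the image-first content array. The helper names (mediaTypeFor, buildImageContent) and block interfaces are hypothetical, and the extension lookup is a convenience I am assuming, not something the API requires.

```typescript
type ImageMediaType = "image/jpeg" | "image/png" | "image/gif" | "image/webp";

interface ImageBlock {
  type: "image";
  source: { type: "base64"; media_type: ImageMediaType; data: string };
}

interface TextBlock {
  type: "text";
  text: string;
}

// Supported formats listed in this guide, keyed by file extension.
const MEDIA_TYPE_BY_EXTENSION = new Map<string, ImageMediaType>([
  ["jpg", "image/jpeg"],
  ["jpeg", "image/jpeg"],
  ["png", "image/png"],
  ["gif", "image/gif"],
  ["webp", "image/webp"],
]);

// Map a file name to a media type, rejecting unsupported extensions.
function mediaTypeFor(fileName: string): ImageMediaType {
  const extension = fileName.split(".").pop()?.toLowerCase() ?? "";
  const mediaType = MEDIA_TYPE_BY_EXTENSION.get(extension);
  if (mediaType === undefined) {
    throw new Error(`Unsupported image extension: ${extension}`);
  }
  return mediaType;
}

// Image block first, text prompt second.
function buildImageContent(
  fileName: string,
  base64Data: string,
  prompt: string,
): [ImageBlock, TextBlock] {
  return [
    {
      type: "image",
      source: { type: "base64", media_type: mediaTypeFor(fileName), data: base64Data },
    },
    { type: "text", text: prompt },
  ];
}
```

In Node, the base64 string would typically come from readFileSync(path).toString("base64"); the helper takes it as an argument so it stays independent of the file system.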

Pattern 2: URL vs Base64 vs Files API

Three source types for images. Choose based on where your image lives.

// URL source -- simplest, smallest payload
source: { type: "url", url: "https://example.com/chart.png" }

// Base64 source -- local files
source: { type: "base64", media_type: "image/jpeg", data: base64String }

// Files API source (beta) -- upload once, reuse across requests
source: { type: "file", file_id: "file_abc123" }

When to use: URL for hosted images, base64 for local files, Files API for multi-turn or repeated use

See: examples/core.md for full examples of each source type
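The "choose based on where your image lives" rule can be modeled as a discriminated union. In this sketch, ImageSource mirrors the three source shapes shown above, while ImageOrigin and toImageSource are hypothetical names introduced for illustration.

```typescript
// The three source shapes shown in this pattern.
type ImageSource =
  | { type: "url"; url: string }
  | { type: "base64"; media_type: string; data: string }
  | { type: "file"; file_id: string };

// Hypothetical description of where an image currently lives.
type ImageOrigin =
  | { kind: "hosted"; url: string }
  | { kind: "local"; mediaType: string; base64Data: string }
  | { kind: "uploaded"; fileId: string };

// Map each origin to the matching source type.
function toImageSource(origin: ImageOrigin): ImageSource {
  switch (origin.kind) {
    case "hosted":
      return { type: "url", url: origin.url };
    case "local":
      return { type: "base64", media_type: origin.mediaType, data: origin.base64Data };
    case "uploaded":
      return { type: "file", file_id: origin.fileId };
  }
}

const source = toImageSource({ kind: "hosted", url: "https://example.com/chart.png" });
console.log(JSON.stringify(source)); // {"type":"url","url":"https://example.com/chart.png"}
```

Because the Files API source is marked beta here, check the current API documentation for any opt-in it requires before relying on the "uploaded" branch.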

Pattern 3: PDF Document Input

PDFs use type: "document" -- different from type: "image". Sending a PDF as type: "image" is the most common mistake.

// Correct: type "document" for PDFs
{ type: "document", source: { type: "base64", media_type: "application/pdf", data: pdfData } }

// WRONG: type "image" for PDFs -- causes API errors
{ type: "image", source: { type: "base64", media_type: "application/pdf", data: pdfData } }

Why good: type: "document" enables dual processing (text extraction + page rendering)

Why bad: Using type: "image" for PDFs causes API errors. PDFs require type: "document".

See: examples/core.md for base64 and URL PDF examples, examples/extraction.md for PDF caching
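One way to avoid the image/document mix-up is to derive the block type from the media type in a single place. The sketch below assumes base64 input only; toVisualBlock, VisualBlock, and the constants are illustrative names rather than SDK exports.

```typescript
const PDF_MEDIA_TYPE = "application/pdf";
const IMAGE_MEDIA_TYPES = new Set(["image/jpeg", "image/png", "image/gif", "image/webp"]);

interface Base64Source {
  type: "base64";
  media_type: string;
  data: string;
}

interface VisualBlock {
  type: "image" | "document";
  source: Base64Source;
}

// Pick "document" for PDFs and "image" for supported image formats.
function toVisualBlock(mediaType: string, base64Data: string): VisualBlock {
  const source: Base64Source = { type: "base64", media_type: mediaType, data: base64Data };
  if (mediaType === PDF_MEDIA_TYPE) {
    return { type: "document", source };
  }
  if (IMAGE_MEDIA_TYPES.has(mediaType)) {
    return { type: "image", source };
  }
  throw new Error(`Unsupported media type: ${mediaType}`);
}

console.log(toVisualBlock("application/pdf", "JVBERi0=").type); // document
console.log(toVisualBlock("image/png", "iVBORw0=").type); // image
```

Failing fast on unknown media types surfaces the problem locally instead of as an API error.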

Pattern 4: Multiple Images with Labels

Label images with text blocks so Claude can reference them clearly.

content: [
  { type: "text", text: "Image 1:" },
  {
    type: "image",
    source: { type: "base64", media_type: "image/jpeg", data: image1 },
  },
  { type: "text", text: "Image 2:" },
  {
    type: "image",
    source: { type: "base64", medi
