Web Search

Web search, scraping, and content extraction for AI coding agents. Zero API keys required. Five tools organized in fallback chains: WebSearch and Crawl4AI as primary, Jina as secondary, duckduckgo-search and WebFetch as fallbacks. Use when your agent needs web information -- finding pages, extracting content, or conducting research.

Terminology used in this file:

Playwright: Browser automation framework used by Crawl4AI for JavaScript-rendered pages.
SPA: Single-page application; content is rendered dynamically in JavaScript.
MCP: Model Context Protocol, a standard for exposing tool servers to AI agents.

Setup

python3 -m pip install crawl4ai duckduckgo-search
crawl4ai-setup

Claude Code: copy this skill folder into .claude/skills/web-search/
Codex CLI: append this SKILL.md content to your project's root AGENTS.md

For the full installation walkthrough (prerequisites, verification, troubleshooting), see references/installation-guide.md.

Staying Updated

This skill ships with an UPDATES.md changelog and UPDATE-GUIDE.md for your AI agent.

After installing, tell your agent: "Check UPDATES.md in the web-search skill for any new features or changes."

When updating, tell your agent: "Read UPDATE-GUIDE.md and apply the latest changes from UPDATES.md."

Follow UPDATE-GUIDE.md so customized local files are diffed before any overwrite.

Quick Start

Run this minimal fallback-safe sequence:

# 1) Find candidate pages
python3 -c "from duckduckgo_search import DDGS; import json; print(json.dumps(DDGS().text('your query', max_results=5), indent=2))"

# 2) Extract one page quickly (no local deps)
curl -s "https://r.jina.ai/http://example.com/article" | head -80

# 3) Escalate to Crawl4AI if JS rendering is needed
crwl https://example.com/app --f markdown --bypass-cache

Use this routing rule: search with WebSearch first, extract with Jina/WebFetch for simple pages, escalate to Crawl4AI for JS-heavy targets.

Decision Tree

Need info from the web?
  |
  +-- Need to SEARCH for pages/answers?
  |     +-- Default first choice --> WebSearch (built-in, zero setup)
  |     +-- WebSearch unavailable? --> Jina s.jina.ai (no key needed)
  |     +-- Both fail? --> duckduckgo-search Python lib (emergency fallback)
  |
  +-- Need to EXTRACT content from a known URL?
  |     +-- JS-heavy SPA, dynamic content? --> Crawl4AI crwl (full browser rendering)
  |     +-- Simple text page (article, docs, blog)? --> Jina r.jina.ai (fast, no install)
  |     +-- Jina/Crawl4AI unavailable? --> WebFetch (built-in, AI-summarized)
  |     +-- Need structured data extraction? --> Crawl4AI with extraction strategy
  |     +-- Multiple URLs in batch? --> Crawl4AI batch mode
  |
  +-- Need DEEP RESEARCH (search + extract + combine)?
        --> WebSearch to find URLs --> Crawl4AI/Jina extract each --> synthesize

Rule of thumb: WebSearch for finding, Jina for reading, Crawl4AI for rendering.

Tool Reference

WebSearch (Built-in) -- Primary Search

What: Claude Code built-in web search tool. Returns search results with links and snippets. Install required: None (built-in to Claude Code) Strengths: Zero setup, zero API keys, integrated into agent workflow, always available Weaknesses: No direct SDK/CLI access (tool-only), results are search-result blocks not raw JSON

# Invoked as a Claude Code tool:
WebSearch(query="your search query")

# Supports domain filtering:
WebSearch(query="your query", allowed_domains=["docs.python.org"])
WebSearch(query="your query", blocked_domains=["pinterest.com"])

Returns: Search result blocks with titles, URLs, and content snippets.

WebFetch (Built-in) -- Fallback URL Extraction

What: Claude Code built-in URL fetcher. Fetches page content, converts HTML to markdown, processes with AI. Install required: None (built-in to Claude Code) Strengths: Zero setup, AI-processed output, handles redirects, 15-min cache Weaknesses: Cannot handle authenticated/private URLs, may summarize large content

# Invoked as a Claude Code tool:
WebFetch(url="https://example.com/page", prompt="Extract the main content")

Limitations:

Will fail for authenticated URLs (Google Docs, Confluence, Jira, private GitHub)
HTTP auto-upgraded to HTTPS
Large content may be summarized rather than returned in full
When redirected to a different host, returns redirect URL instead of content

Crawl4AI -- JS-Rendering Web Scraper

What: Open-source scraper with full Playwright browser rendering. Outputs LLM-friendly markdown. Install required: pip install crawl4ai && crawl4ai-setup Strengths: Full JS rendering, handles SPAs, batch crawling, structured extraction Weaknesses: Requires Playwright install, heavier than Jina

# CLI (simplest)
crwl https://example.com
crwl https://example.com -o markdown

# Python API
from crawl4ai import AsyncWebCrawler
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url='https://example.com')
    print(result.markdown)

Jina Reader/Search -- Zero-Install Extraction & Search

What: URL-to-markdown converter and search via HTTP API. No install needed -- just curl. API key: Not required. JINA_API_KEY is optional and only increases rate limits. Strengths: Zero install, fast (~1s), works everywhere curl works, search + extract in one service Weaknesses: No JS rendering, rate limited without API key

# Read a URL (returns markdown)
curl -s 'https://r.jina.ai/https://example.com'

# Search (returns search results)
curl -s 'https://s.jina.ai/your+search+query'

# With API key (higher rate limits, optional)
curl -s -H "Authorization: Bearer $JINA_API_KEY" 'https://r.jina.ai/https://example.com'

duckduckgo-search -- Emergency Search Fallback

What: Python library for DuckDuckGo search. Zero API keys, zero registration. Install required: pip install duckduckgo-search Strengths: Completely free, no API key, no rate limit concerns, reliable fallback Weaknesses: Less AI-optimized results than WebSearch, Python-only

from duckduckgo_search import DDGS
results = DDGS().text("your query", max_results=5)
for r in results:
    print(r['title'], r['href'], r['body'])

# One-liner from CLI
python3 -c "from duckduckgo_search import DDGS; import json; print(json.dumps(DDGS().text('your query', max_results=5), indent=2))"

Core Workflows

Pattern 1: Quick Web Search

When: Need factual answers or find relevant pages

Use WebSearch: WebSearch(query="your query here")
Parse results: each result has title, URL, and content snippet
Fallback: curl -s 'https://s.jina.ai/your+query+here'
Emergency: python3 -c "from duckduckgo_search import DDGS; ..."

Pattern 2: URL Content Extraction

When: Have a URL, need its content as clean text/markdown

a) JS-heavy site: crwl URL (Crawl4AI, full rendering) b) Lightweight static page: curl -s 'https://r.jina.ai/URL' (Jina) c) Both fail: WebFetch(url="URL", prompt="Extract the main content")

Decision: Is it a SPA/JS-heavy? Use Crawl4AI. Static content? Use Jina first. If output is empty/broken, escalate.

Pattern 3: Deep Research

When: Need comprehensive research on a topic with multiple sources

WebSearch to find relevant pages
Pick top 3-5 URLs from results
Extract each with Crawl4AI or Jina
If any extraction fails (JS site), use the other tool
Synthesize extracted content into research summary

Token budget: ~5K per extracted page, budget 25K total for 5 pages

Pattern 4: Batch URL Scraping

When: Need content from multiple URLs (5+)

import asyncio
from crawl4ai import AsyncWebCrawler
urls = ['url1', 'url2', 'url3']

async def batch():
    async with AsyncWebCrawler() as crawler:
        for url in urls:

web-search

How to add

Drop this on your repo README

Related skills

understand-dashboard

understand-chat

understand-domain

dev-browser

Get new Pesquisa e Web skills every Monday