Web Scraper

Overview

Intelligent multi-strategy web scraping. Extracts structured data from web pages (tables, lists, prices). Supports pagination, monitoring, and CSV/JSON export.

When to Use This Skill

When the user mentions "scraper" or related topics
When the user mentions "scraping" or related topics
When the user mentions "extrair dados web" (Portuguese for "extract web data") or related topics
When the user mentions "web scraping" or related topics
When the user mentions "raspar dados" (Portuguese for "scrape data") or related topics
When the user mentions "coletar dados site" (Portuguese for "collect site data") or related topics

Do Not Use This Skill When

The task is unrelated to web scraping
A simpler, more specific tool can handle the request
The user needs general-purpose assistance without domain expertise

How It Works

Execute phases in strict order. Each phase feeds the next.

1. CLARIFY  ->  2. RECON  ->  3. STRATEGY  ->  4. EXTRACT  ->  5. TRANSFORM  ->  6. VALIDATE  ->  7. FORMAT

Never skip Phase 1 or Phase 2. They prevent wasted effort and failed extractions.

Fast path: If user provides URL + clear data target + the request is simple (single page, one data type), compress Phases 1-3 into a single action: fetch, classify, and extract in one WebFetch call. Still validate and format.
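
The phase ordering and the fast path can be sketched as a small planner. This is only an illustration: `Phase`, `plan_steps`, and its four boolean parameters are hypothetical names, and folding Phase 4 into the combined first step is one reading of "fetch, classify, and extract in one WebFetch call".

```python
from enum import IntEnum


class Phase(IntEnum):
    CLARIFY = 1
    RECON = 2
    STRATEGY = 3
    EXTRACT = 4
    TRANSFORM = 5
    VALIDATE = 6
    FORMAT = 7


def plan_steps(has_url, has_clear_target, single_page, single_data_type):
    """Group the phases into the steps that will actually be executed, in order."""
    fast_path = has_url and has_clear_target and single_page and single_data_type
    if fast_path:
        # Assumption: the single WebFetch call covers phases 1-3 plus extraction.
        head = [[Phase.CLARIFY, Phase.RECON, Phase.STRATEGY, Phase.EXTRACT]]
        tail = [Phase.TRANSFORM, Phase.VALIDATE, Phase.FORMAT]
    else:
        # Default path: every phase is its own step, in strict order.
        head = []
        tail = list(Phase)
    return head + [[phase] for phase in tail]
```

On the default path no phase is dropped, so Phases 1 and 2 always run; the fast path only merges them into a single call.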

Capabilities

Multi-strategy: WebFetch (static), Browser automation (JS-rendered), Bash/curl (APIs), WebSearch (discovery)
Extraction modes: table, list, article, product, contact, FAQ, pricing, events, jobs, custom
Output formats: Markdown tables (default), JSON, CSV
Pagination: auto-detect and follow (page numbers, infinite scroll, load-more)
Multi-URL: extract same structure across sources with comparison and diff
Validation: confidence ratings (HIGH/MEDIUM/LOW) on every extraction
Auto-escalation: when WebFetch fails silently (for example, it returns empty or placeholder content), fall back to Browser automation automatically
Data transforms: cleaning, normalization, deduplication, enrichment
Differential mode: detect changes between scraping runs
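
As a rough illustration of differential mode, two runs can be compared by keying each row on a stable identifier. The function name `diff_runs` and the assumption that every row is a dict carrying a unique key field are hypothetical; the skill does not specify how runs are stored or matched.

```python
def diff_runs(previous, current, key):
    """Compare two scraping runs (lists of dicts) on a unique key field."""
    prev = {row[key]: row for row in previous}
    curr = {row[key]: row for row in current}
    # Rows whose key appears only in the new run.
    added = [curr[k] for k in curr if k not in prev]
    # Rows whose key appears only in the old run.
    removed = [prev[k] for k in prev if k not in curr]
    # Rows present in both runs whose contents differ, as (old, new) pairs.
    changed = [(prev[k], curr[k]) for k in curr if k in prev and prev[k] != curr[k]]
    return {"added": added, "removed": removed, "changed": changed}
```

Because rows are indexed by key, duplicate keys within one run collapse to the last occurrence, so deduplication should happen before diffing.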

Web Scraper

Multi-strategy web data extraction with intelligent approach selection, automatic fallback escalation, data transformation, and structured output.

Phase 1: Clarify

Establish extraction parameters before touching any URL.

Required Parameters

Parameter	Question to Resolve	Default
Target URL(s)	Which page(s) to scrape?	(required)
Data Target	What specific data to extract?	(required)
Output Format	Markdown table, JSON, CSV, or text?	Markdown table
Scope	Single page, paginated, or multi-URL?	Single page

Optional Parameters

Parameter	Question to Resolve	Default
Pagination	Follow pagination? Max pages?	No, 1 page
Max Items	Maximum number of items to collect?	Unlimited
Filters	Data to exclude or include?	None
Sort Order	How to sort results?	Source order
Save Path	Save to file? Which path?	Display only
Language	Respond in which language?	User's language
Diff Mode	Compare with previous run?	No
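
The two parameter tables above could be captured in a single record with the stated defaults. This is a sketch, not part of the skill: the class name `ExtractionRequest`, the field names, and the use of `None` to mean "unlimited", "display only", or "user's language" are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExtractionRequest:
    # Required parameters
    urls: list[str]
    data_target: str
    # Required parameters that have defaults
    output_format: str = "markdown"      # "markdown", "json", "csv", or "text"
    scope: str = "single_page"           # "single_page", "paginated", or "multi_url"
    # Optional parameters
    follow_pagination: bool = False
    max_pages: int = 1
    max_items: Optional[int] = None      # None means unlimited
    filters: list[str] = field(default_factory=list)
    sort_order: str = "source"           # keep the order found on the page
    save_path: Optional[str] = None      # None means display only
    language: Optional[str] = None       # None means reply in the user's language
    diff_mode: bool = False

    def __post_init__(self):
        # The two parameters without defaults must be resolved before Phase 2.
        if not self.urls:
            raise ValueError("at least one target URL is required")
        if not self.data_target.strip():
            raise ValueError("a data target is required")
```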

Clarification Rules

If user provides a URL and clear data target, proceed directly to Phase 2. Do NOT ask unnecessary questions.
If request is ambiguous (e.g. "scrape this site"), ask ONLY: "What specific data do you want me to extract from this page?"
Default to Markdown table output. Mention alternatives only if relevant.
Accept requests in any language. Always respond in the user's language.
If user says "everything" or "all data", perform recon first, then present what's available and let user choose.
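
Since Markdown tables are the default and JSON and CSV are the alternatives, a minimal formatter might look like the following. The function name `render` and the assumption that every row is a dict with the same keys are illustrative only.

```python
import csv
import io
import json


def render(rows, fmt="markdown"):
    """Render a list of uniform dicts as a Markdown table, JSON, or CSV string."""
    if not rows:
        return ""
    headers = list(rows[0].keys())
    if fmt == "json":
        return json.dumps(rows, ensure_ascii=False, indent=2)
    if fmt == "csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=headers, lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
        return buffer.getvalue()
    if fmt == "markdown":
        def cell(value):
            # Escape pipes so a value cannot break the table layout.
            return str(value).replace("|", "\\|")

        lines = [
            "| " + " | ".join(headers) + " |",
            "| " + " | ".join("---" for _ in headers) + " |",
        ]
        for row in rows:
            lines.append("| " + " | ".join(cell(row.get(h, "")) for h in headers) + " |")
        return "\n".join(lines)
    raise ValueError(f"unsupported format: {fmt}")
```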

Discovery Mode

When user has a topic but no specific URL:

Use WebSearch to find the most relevant pages
Present top 3-5 URLs with descriptions
Let user choose which to scrape, or scrape all
Proceed to Phase 2 with selected URL(s)

Example: "find and extract pricing data for CRM tools" -> WebSearch("CRM tools pricing comparison 2026") -> Present top results -> User selects -> Extract

Phase 2: Reconnaissance

Analyze the target page before extraction.

Step 2.1: Initial Fetch

Use WebFetch to retrieve and analyze the page structure:

WebFetch(
  url = TARGET_URL,
  prompt = "Analyze this page structure and report:
    1. Page type: article, product listing, search results, data table,
       directory, dashboard, API docs, FAQ, pricing page, job board, events, or other
    2. Main content structure: tables, ordered/unordered lists, card grid, free-form text,
       accordion/collapsible sections, tabs
    3. Approximate number of distinct data items visible
    4. JavaScript rendering indicators: empty containers, loading spinners,
       SPA framework markers (React root, Vue app, Angular), minimal HTML with heavy JS
    5. Pagination: next/prev links, page numbers, load-more buttons,
       infinite scroll indicators, total results count
    6. Data density: how much structured, extractable data exists
    7. List the main data fields/columns available for extraction
    8. Embedded structured data: JSON-LD, microdata, OpenGraph tags
    9. Available download links: CSV, Excel, PDF, API endpoints"
)
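
Item 8 of the prompt asks about embedded structured data. When raw HTML is available (for example, from a curl fetch), JSON-LD blocks can be pulled out with the standard library alone. The names `JsonLdParser` and `extract_json_ld` are hypothetical, and this sketch silently skips blocks that are not valid JSON.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script contents may arrive in several chunks, so accumulate them.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Malformed block: ignore it rather than fail the recon.


def extract_json_ld(html):
    """Return every JSON-LD object found in an HTML string."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    return parser.items
```

If a page exposes the target data as JSON-LD, extracting it directly is usually more reliable than parsing the visible markup.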

Step 2.2: Evaluate Fetch Quality

Signal	Interpretation	Action
Rich content with data clearly visible	Static page	Strategy A (WebFetch)
Empty containers, "loading...", minimal text	JS-rendered	Strategy B (Browser)
Login wall, CAPTCHA, 403/401 response	Blocked	Report to user
Content present but poorly structured	Needs precision	Strategy B (Browser)
JSON or XML response body	API endpoint	Strategy C (Bash/curl)
Download links for CSV/Excel available	Direct data file	Strategy C (download)
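
The table above can be read as a decision function. The sketch below is one possible encoding: the order in which the rows are checked, the 200-character threshold for "minimal text", and the marker strings are all assumptions, and `choose_action` and its parameters are hypothetical names.

```python
from enum import Enum


class Action(Enum):
    STRATEGY_A_WEBFETCH = "Strategy A (WebFetch)"
    STRATEGY_B_BROWSER = "Strategy B (Browser)"
    STRATEGY_C_CURL = "Strategy C (Bash/curl)"
    STRATEGY_C_DOWNLOAD = "Strategy C (download)"
    REPORT_TO_USER = "Report to user"


API_MEDIA_TYPES = {"application/json", "application/xml", "text/xml"}
MIN_TEXT_CHARS = 200  # Assumed cutoff for "minimal text"


def choose_action(status_code, content_type, visible_text,
                  has_download_links=False, well_structured=True):
    """Map recon signals to the action listed in the fetch-quality table."""
    text = visible_text.strip().lower()
    # Blocked: auth errors or a CAPTCHA challenge.
    if status_code in (401, 403) or "captcha" in text:
        return Action.REPORT_TO_USER
    # API endpoint: the body is JSON or XML rather than HTML.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type in API_MEDIA_TYPES or media_type.endswith("+json"):
        return Action.STRATEGY_C_CURL
    # A direct data file beats scraping the rendered page.
    if has_download_links:
        return Action.STRATEGY_C_DOWNLOAD
    # JS-rendered: almost no text, or only a loading placeholder.
    if len(text) < MIN_TEXT_CHARS or "loading..." in text:
        return Action.STRATEGY_B_BROWSER
    # Content is present but needs precise, selector-level extraction.
    if not well_structured:
        return Action.STRATEGY_B_BROWSER
    return Action.STRATEGY_A_WEBFETCH
```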

Step 2.3: Content Classification

Classify into an extraction mode:

Mode	Indicators	Examples
`table`	HTML `<table>`, grid layout with headers	Price comparison, statistics, specs
`list`	Repeated similar elements, card grids	Search results, product listings
`article`	Long-form text with headings/paragraphs	Blog post, news article, docs
`product`	Product name, price, specs, images, rating	E-commerce product page
`contact`	Names, emails, phones, addresses, roles	Team page, staff directory
`faq`	Question-answer pairs, accordions	FAQ page, help center
`pricing`	Plan names, prices, features, tiers	SaaS pricing page
`events`	Dates, locations, titles, descriptions	Event listings, conferences
`jobs`	Titles, companies, locations, salaries	Job boards, career pages
`custom`	User-specified CSS selectors or fields	Anything not matching above

Record: page type, extraction mode, JS rendering needed (yes/no), available fields, structured data present (JSON-LD etc.).

If user asked for "everything", present the available fields and let them choose.
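
The recon record described above could be held in a small structure that later phases read from. This is a sketch with hypothetical names (`ExtractionMode`, `ReconRecord`, `field_menu`); the mode values mirror the classification table.

```python
from dataclasses import dataclass, field
from enum import Enum


class ExtractionMode(Enum):
    TABLE = "table"
    LIST = "list"
    ARTICLE = "article"
    PRODUCT = "product"
    CONTACT = "contact"
    FAQ = "faq"
    PRICING = "pricing"
    EVENTS = "events"
    JOBS = "jobs"
    CUSTOM = "custom"


@dataclass
class ReconRecord:
    page_type: str
    mode: ExtractionMode
    needs_js: bool
    fields: list[str] = field(default_factory=list)
    structured_data: list[str] = field(default_factory=list)  # e.g. ["JSON-LD"]

    def field_menu(self):
        """Numbered list of available fields, for users who asked for "everything"."""
        return "\n".join(f"{i}. {name}" for i, name in enumerate(self.fields, start=1))
```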

Phase 3: Strategy Selection

Choose the extraction approach based on recon results.
