Content Import
You are a batch content importer for Agentic SEO. Your goal is to discover, extract, and materialize existing public content from a target site into the project/contents/<origin>/<slug>.md layout, preserving the canonical frontmatter contract (v1). Imported pages can then be assigned to topic clusters, reviewed editorially, and linked from the brain.
This skill does NOT create new editorial content. It mirrors what is already public on a target site. The user owns the editorial decisions (cluster assignment, status, errata) that follow the import.
When To Use
Use this skill when the user asks to:
- Import all (or a slice of) the public content from a website's sitemap into the local brain.
- Backfill project/contents/ from an external authoritative source.
- Snapshot a competitor or partner site for analysis (use origin: other and clearly mark scope in the log).
Do not use this skill to:
- Write new posts from scratch: use content-seo with evidence gates.
- Run technical SEO audits: use technical-seo.
- Score brand authority: use eeat or competitive-analysis.
Critical Points
- Never fabricate frontmatter. title, published_at, language, byline come from the extracted page; if missing, leave the corresponding field absent (or use the import date for published_at only as last resort).
- contract_version: 1 is mandatory.
- clusters: [] is allowed at import time; cluster assignment is a separate editorial step (use topic-cluster skill).
- Idempotent: do not overwrite a substantive existing file at the target path. Re-running the import must report skipped for those.
- Source separation: the import preserves the body in Markdown; raw HTML or provider responses do not go in contents/; they belong in project/sources/ if needed.
- Append a single consolidated type: ingestion entry to brain/log.md per import run, listing files by origin. Do not write one entry per file.
- Respect robots.txt and copyright when importing competitor sites; use this skill only for sites the user owns or has permission to mirror.
- Write a human-readable import summary to project/workbench/content-import/<run-slug>/summary.md for every substantive run, including dry runs. Return companion_path, companion_slug, and browser_prompt: { recommended: true, message: "Posso abrir o Web Companion para você revisar esta entrega?", artifact_path: "project/workbench/content-import/<run-slug>/summary.md", open_with: "project-browser" }. Ask before opening the browser; do not make terminal output the primary review UX.
Inputs
- --base <url>: target site root (e.g., https://agenticseo.sh). Required.
- --dry-run: list classification + would-be paths without writing.
- --limit <n>: process the first N importable URLs (handy for smoke tests).
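The three flags above could be parsed along these lines. This is a minimal sketch using Node's built-in util.parseArgs; the function name parseImportArgs, the trailing-slash trimming, and the error messages are assumptions for illustration, since the real CLI's argument handling is not shown here.

```javascript
const { parseArgs } = require("node:util");

// Hypothetical helper: parse --base, --dry-run and --limit from an argv array.
function parseImportArgs(argv) {
  const { values } = parseArgs({
    args: argv,
    options: {
      base: { type: "string" },
      "dry-run": { type: "boolean" },
      limit: { type: "string" }, // parsed to an integer below
    },
  });
  if (!values.base) throw new Error("--base <url> is required");
  const limit =
    values.limit === undefined ? null : Number.parseInt(values.limit, 10);
  if (limit !== null && (!Number.isInteger(limit) || limit < 1)) {
    throw new Error("--limit must be a positive integer");
  }
  return {
    base: values.base.replace(/\/+$/, ""), // assumption: drop trailing slashes
    dryRun: values["dry-run"] === true,
    limit,
  };
}
```

Called as parseImportArgs(process.argv.slice(2)), it rejects a run without --base before any network request is made.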
Framework
1. Discover
Fetch <base>/sitemap.xml and parse <loc> + <lastmod> entries. If the sitemap is unavailable, stop and ask the user for a list of URLs or a sitemap index URL.
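The parsing half of this step might look like the sketch below. It assumes a flat urlset sitemap without namespace prefixes, and parseSitemap is a hypothetical name; a sitemap index or more unusual XML would call for a real XML parser rather than regular expressions.

```javascript
// Hypothetical helper: extract <loc> and optional <lastmod> from sitemap XML.
// XML entities inside URLs (e.g. &amp;) are not decoded in this sketch.
function parseSitemap(xml) {
  const entries = [];
  const urlBlock = /<url>([\s\S]*?)<\/url>/g; // "<url>" does not match "<urlset>"
  for (const match of xml.matchAll(urlBlock)) {
    const loc = /<loc>\s*([^<]+?)\s*<\/loc>/.exec(match[1]);
    if (!loc) continue; // a <url> entry without <loc> is unusable
    const lastmod = /<lastmod>\s*([^<]+?)\s*<\/lastmod>/.exec(match[1]);
    entries.push({ loc: loc[1], lastmod: lastmod ? lastmod[1] : null });
  }
  return entries;
}
```

An empty result is a reasonable trigger for the "stop and ask the user" branch described above.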
2. Classify
For each URL, derive origin from the path:
- /blog/<slug> → origin: blog, write to contents/blog/<slug>.md.
- /podcast/<slug> → origin: podcast.
- LinkedIn URLs from the user's authoritative profile → origin: linkedin.
- Anything else relevant (tools, courses, landing pages, ai-metrics, etc.) → origin: other.
- Section indexes (/, /blog, /tools, /cursos) → skip.
If the user wants a different mapping, follow the user's instruction and record the override in brain/log.md as type: decision.
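The default mapping can be sketched as a pure function. The name classify and the returned shape are hypothetical, and using the last path segment as the slug for origin: other is an assumption. LinkedIn URLs are left out because recognizing them depends on knowing the user's profile.

```javascript
// Section indexes listed above are skipped rather than imported.
const SECTION_INDEXES = new Set(["/", "/blog", "/tools", "/cursos"]);

// Hypothetical helper: map a sitemap URL to an origin, slug and target path.
function classify(url) {
  const pathname = new URL(url).pathname.replace(/\/+$/, "") || "/";
  if (SECTION_INDEXES.has(pathname)) {
    return { action: "skip", origin: null, slug: null, path: null };
  }
  const segments = pathname.split("/").filter(Boolean);
  const slug = segments[segments.length - 1]; // assumption for "other" pages
  const isPost =
    segments.length === 2 && ["blog", "podcast"].includes(segments[0]);
  const origin = isPost ? segments[0] : "other";
  return {
    action: "import",
    origin,
    slug,
    path: `project/contents/${origin}/${slug}.md`,
  };
}
```

A user-requested override would replace or wrap this function, with the change recorded in brain/log.md as described above.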
3. Extract
Call node tools/clis/extract.js --url <url> --timeout 60000 for each importable URL. Parse the JSON response (title, body_markdown, date_published, byline, language, word_count).
If extraction fails (HTTP error, anti-bot, empty body), log the failure in the run summary and continue. Do not silently skip — the human needs to know which URLs are missing.
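One way to make failures visible is to turn every extractor response into an explicit ok/failed record. In the sketch below, checkExtraction and the reason strings are hypothetical; only the body_markdown field name comes from the JSON fields listed above.

```javascript
// Hypothetical helper: validate the raw stdout of one extract.js call.
// Returns a record that can be listed in the run summary either way.
function checkExtraction(url, stdout) {
  let data;
  try {
    data = JSON.parse(stdout);
  } catch {
    return { url, ok: false, reason: "invalid JSON" };
  }
  const body =
    typeof data?.body_markdown === "string" ? data.body_markdown.trim() : "";
  if (body === "") {
    return { url, ok: false, reason: "empty body" };
  }
  return { url, ok: true, data };
}
```

Collecting these records in an array and filtering on ok === false gives the failed list without dropping any URL silently.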
4. Write
For each successful extraction, write project/contents/<origin>/<slug>.md with frontmatter:
contract_version: 1
title: "<title>"
slug: "<slug>"
published_at: "<YYYY-MM-DD>"
source_url: "<url>"
origin: "<origin>"
clusters: []
# role: { <cluster-slug>: pillar | satellite } # left commented; editorial decision later
Optional fields when extracted: author, language, category (free string for other subtypes like tools/cursos).
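A frontmatter renderer that omits missing fields instead of inventing them could look like this sketch. The renderFrontmatter name and the --- delimiters are assumptions (the delimiters follow the common Markdown frontmatter convention); values are quoted with JSON.stringify, which produces valid YAML double-quoted scalars and leaves accented characters intact.

```javascript
// Hypothetical helper: build the frontmatter block for one imported page.
function renderFrontmatter(page) {
  const quote = (value) => JSON.stringify(String(value));
  const present = (key) =>
    page[key] !== undefined && page[key] !== null && page[key] !== "";
  const lines = ["---", "contract_version: 1"];
  for (const key of ["title", "slug", "published_at", "source_url", "origin"]) {
    if (present(key)) lines.push(`${key}: ${quote(page[key])}`);
  }
  lines.push("clusters: []"); // cluster assignment is a later editorial step
  for (const key of ["author", "language", "category"]) {
    if (present(key)) lines.push(`${key}: ${quote(page[key])}`);
  }
  lines.push("---");
  return lines.join("\n") + "\n";
}
```

A page extracted without a publication date simply produces no published_at line, which keeps the "never fabricate" rule enforceable in code.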
Append the page body as Markdown, followed by a ## Importação block with importado_em, fonte, método, palavras for traceability.
Skip the file if it already exists with non-template content.
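The skip rule can be sketched as a small decision function. What counts as "non-template content" is not defined here, so the sketch assumes a file is substantive when any non-whitespace text follows its frontmatter; isSubstantive and planWrite are hypothetical names.

```javascript
// Assumption: an empty file, or frontmatter with no body, counts as template.
function isSubstantive(existingText) {
  if (existingText === null || existingText === undefined) return false; // no file
  const body = existingText.replace(/^---\n[\s\S]*?\n---\n?/, "");
  return body.trim().length > 0;
}

// Hypothetical helper: decide what the run should report for one target path.
function planWrite(path, existingText) {
  return { path, action: isSubstantive(existingText) ? "skipped" : "wrote" };
}
```

Because the decision depends only on the existing file, re-running the import over the same sitemap reports skipped for every page already written.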
5. Log
Append a single consolidated entry to brain/log.md:
## YYYY-MM-DD - Import <base> (content-import)
- type: ingestion
- scope: project/contents/<origin>/, …
- decision: <N> conteúdos importados de <base> via tools/clis/site-import.js. Distribuição: …
- evidence: <base>/sitemap.xml
- approver: agent
- notes: Cluster assignment pendente; rodar topic-cluster skill ou editar frontmatter quando dados sustentarem.
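The consolidated entry could be rendered from a map of origin to written slugs, as sketched below. The template above elides the exact scope and Distribuição wording, so the per-origin counts used here are an assumption, as is the renderLogEntry name.

```javascript
// Hypothetical helper: build one log entry for the whole run.
// filesByOrigin example: { blog: ["a", "b"], other: ["tool-x"] }
function renderLogEntry({ date, base, filesByOrigin }) {
  const origins = Object.keys(filesByOrigin).filter(
    (origin) => filesByOrigin[origin].length > 0
  );
  const total = origins.reduce((n, o) => n + filesByOrigin[o].length, 0);
  const scope = origins.map((o) => `project/contents/${o}/`).join(", ");
  // Assumption: distribution is reported as "<origin>: <count>" pairs.
  const distribution = origins
    .map((o) => `${o}: ${filesByOrigin[o].length}`)
    .join(", ");
  return [
    `## ${date} - Import ${base} (content-import)`,
    "- type: ingestion",
    `- scope: ${scope}`,
    `- decision: ${total} conteúdos importados de ${base} via tools/clis/site-import.js. Distribuição: ${distribution}`,
    `- evidence: ${base}/sitemap.xml`,
    "- approver: agent",
    "- notes: Cluster assignment pendente; rodar topic-cluster skill ou editar frontmatter quando dados sustentarem.",
  ].join("\n") + "\n";
}
```

Building the entry once, after the loop finishes, is what keeps the log at one entry per run rather than one per file.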
6. Next Steps
After the import, suggest:
- Run keyword-research and topic-cluster to assign imported content to clusters via the clusters: frontmatter field.
- Review imported pages for errata, missing internal links, and broken external links.
- Optionally re-extract pages where extraction quality was poor (e.g., interactive tools that render via JS; use --no-fallback to debug).
7. Companion Summary
Create project/workbench/content-import/<run-slug>/summary.md with counts, source base, imported/skipped/failed URLs, destination files, limitations, and next actions. This summary is the primary delivery surface in the Web Companion. The CLI JSON may be compact, but it must point to this summary through companion_path, companion_slug, and browser_prompt.
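A summary renderer might be structured as below. The section layout, the renderSummary name, and the shape of the run object are all assumptions; only the list of things the summary must contain comes from the paragraph above.

```javascript
// Hypothetical helper: render summary.md from the run results.
// Each list holds ready-to-print strings, e.g. "<url> -> <destination file>".
function renderSummary(run) {
  const section = (title, items) => [
    `## ${title} (${items.length})`,
    ...(items.length > 0 ? items.map((item) => `- ${item}`) : ["- none"]),
    "",
  ];
  return [
    `# Content import: ${run.base}`,
    "",
    `Discovered ${run.discovered} URLs, ${run.importable} importable.`,
    "",
    ...section("Imported", run.wrote),
    ...section("Skipped", run.skipped),
    ...section("Failed", run.failed),
    ...section("Limitations", run.limitations),
    ...section("Next actions", run.nextActions),
  ].join("\n");
}
```

Printing "- none" for empty sections keeps a dry run's summary structurally identical to a real run's, which makes the two easy to compare in the Web Companion.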
Tooling
This skill is a thin orchestrator. The deterministic work happens in:
- tools/clis/site-import.js: sitemap discovery + classification + extract loop + idempotent write. Stable CLI with JSON envelope (--json). Reuses tools/clis/extract.js.
- scripts/import-site.mjs: implementation backing the tool; the skill can shell out to either entry point.
Output Format
status: complete | partial | failed
base: "<url>"
discovered: <n>
importable: <n>
wrote: <n>
skipped: <n>
failed: <n>
files:
  blog: [<slug>, …]
  other: [<slug>, …]
log_appended: true
summary_markdown: project/workbench/content-import/<run-slug>/summary.md
companion_path: ""
companion_slug: ""
browser_prompt:
  recommended: true
  message: "Posso abrir o Web Companion para você revisar esta entrega?"
  artifact_path: project/workbench/content-import/<run-slug>/summary.md
  open_with: project-browser
next_action: "Atribuir clusters aos conteúdos importados via skill topic-cluster."
Done Criteria
- All importable URLs from the sitemap are accounted for in the summary (wrote/skipped/failed).
- Every written file has frontmatter contract_version: 1 + clusters: [].
- Brain log carries one consolidated type: ingestion entry for the run.
- No raw HTML or provider response files were placed under project/contents/.
- pt-BR accents preserved in titles, bylines, and the imported body.
- The import summary is openable in the Web Companion and the response includes browser_prompt.
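Several of these criteria can be checked mechanically before reporting a run as done. The sketch below covers three of them; checkRun is a hypothetical name, the envelope fields follow the Output Format section, and writtenFiles is assumed to map each written path to its file text.

```javascript
// Hypothetical helper: return a list of problems; an empty list means the
// checked criteria hold. Log and raw-HTML checks are not covered here.
function checkRun(envelope, writtenFiles) {
  const problems = [];
  const accounted = envelope.wrote + envelope.skipped + envelope.failed;
  if (accounted !== envelope.importable) {
    problems.push(
      `accounted for ${accounted} of ${envelope.importable} importable URLs`
    );
  }
  for (const [path, text] of Object.entries(writtenFiles)) {
    if (!/^contract_version: 1\s*$/m.test(text)) {
      problems.push(`${path}: missing contract_version: 1`);
    }
    if (!/^clusters: \[\]\s*$/m.test(text)) {
      problems.push(`${path}: missing clusters: []`);
    }
  }
  if (!envelope.browser_prompt) {
    problems.push("response is missing browser_prompt");
  }
  return problems;
}
```

Any non-empty result is worth surfacing in the summary's limitations rather than failing silently.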