mk:web-to-markdown

Fetch arbitrary URLs and return clean markdown with injection defense.

When to use

User provides a URL in chat: "summarize https://example.com/blog" → skill fires automatically
Agent needs an arbitrary external page that is NOT in curated docs (use mk:docs-finder for libraries/frameworks)
External references during research, intake, investigation, or planning — with the --wtm-accept-risk delegation gate

When NOT to use

Library or framework documentation → use mk:docs-finder (Context7, Context Hub, WebSearch)
Interactive browser testing → use mk:agent-browser
Playwright test automation → use mk:playwright-cli
Fetching sensitive/internal URLs → use the host runtime's built-in WebFetch tool (proxied by the runtime vendor)

Invocation patterns

1. Direct user invocation (no flag)

User: "fetch https://docs.example.com/api and explain the auth flow"
Agent: [invokes mk:web-to-markdown directly]

2. Cross-skill delegation (requires `--wtm-accept-risk`)

mk:research → mk:web-to-markdown --wtm-accept-risk <url>
mk:intake → mk:web-to-markdown --wtm-accept-risk <url>

Other skills MUST pass --wtm-accept-risk to delegate. Without it, the skill refuses the call and returns ERROR: cross-skill delegation requires --wtm-accept-risk flag. This forces conscious crossing of the trust boundary and creates an audit trail.

3. docs-finder priority override (`--wtm-approve`)

mk:docs-finder --wtm-approve <url>
# → skips Context7 / chub / WebSearch tiers
# → goes directly to mk:web-to-markdown

Used when the user knows the target URL is not in any curated index and wants to skip the wasted hops.

Security model

See references/security.md for the full threat model, attack surface, and defense architecture.

Non-negotiable defenses:

SSRF guard: scheme allowlist (http/https only), private/loopback/link-local IP block, redirect re-validation
10MB response size cap with streaming read + lxml huge_tree=False
DATA boundary wrapping on EVERY return (including previews)
Injection scanner: 50+ patterns + encoding detection (base64/ROT13/Unicode/zero-width) + context-flood WARN
HARD_STOP on injection hit — content quarantined, no programmatic override, manual user inspection required
Secret scrub on content AND URL BEFORE any disk write
privacy-block.sh hook-layer enforcement of SSRF + cache/manifest read blocks
injection-audit.py post-write library scan (called from persist_fetch.persist via scan_file import, not CLI)

Gotchas

Playwright is opt-in. Default is static fetch only. JS-rendered pages return an error pointing to .claude/scripts/bin/setup-workflow --system-deps. This is intentional — 200MB Chromium download is not worth the 5% of pages that need it.
robots.txt is respected with a 24h cache. Some doc sites disallow scraping; skill honors this. Override requires manual user action.
Fetch persistence grows unbounded in v1. Manual cleanup via rm -rf .claude/cache/web-fetches/*. v2 will add TTL auto-cleanup.
Reports may contain PII. Secret-scrub catches credentials but does NOT catch names, emails, user IDs in page body text. Treat cached reports as sensitive.
Injection STOP has no bypass. If the scanner halts a fetch, no flag reopens it. The user must manually inspect the quarantine file.
Slug is sha256-hashed path. Filenames don't carry path-embedded tokens — good for security, annoying for ls-based discovery. Use the manifest index.jsonl (behind privacy-block) to search.

Files

SKILL.md — this file (entrypoint + frontmatter)
references/security.md — master security spec (threat model, defenses, enforcement layers)
references/gotchas.md
scripts/fetch_as_markdown.py
scripts/persist_fetch.py
scripts/injection_detect.py
scripts/requirements.txt —
tests/test_smoke_real_urls.py

Dependencies

Static fetch (always):

.claude/skills/.venv/bin/pip install -r scripts/requirements.txt
# requests, readability-lxml, html2text, lxml, charset-normalizer

JS rendering (opt-in via .claude/scripts/bin/setup-workflow --system-deps):

.claude/skills/.venv/bin/pip install playwright==1.58.0
.claude/skills/.venv/bin/playwright install chromium  # ~200MB one-time

To enable JS rendering at runtime, set MEOWKIT_WEB_FETCH_JS=1 before invoking the skill. All three gates must be open: Playwright installed, MEOWKIT_WEB_FETCH_JS=1, and js=True per-call argument. See references/security.md for the three-layer JS gate spec.

mk:web-to-markdown

Cómo agregar

Pega en el README de tu repo

Skills relacionadas

doc-coauthoring

algorithmic-art

seo-aeo-blog-writer

wordpress-centric-high-seo-optimized-blogwriting-skill

Recibe nuevas skills de Escrita e Conteúdo todos los lunes

mk:web-to-markdown

When to use

When NOT to use

Invocation patterns

1. Direct user invocation (no flag)

2. Cross-skill delegation (requires `--wtm-accept-risk`)

3. docs-finder priority override (`--wtm-approve`)

Security model

Gotchas

Files

Dependencies

Comentarios · Sin comentarios

Cómo agregar

Pega en el README de tu repo

Skills relacionadas

doc-coauthoring

algorithmic-art

seo-aeo-blog-writer

wordpress-centric-high-seo-optimized-blogwriting-skill

Recibe nuevas skills de Escrita e Conteúdo todos los lunes

mk:web-to-markdown

When to use

When NOT to use

Invocation patterns

1. Direct user invocation (no flag)

2. Cross-skill delegation (requires --wtm-accept-risk)

3. docs-finder priority override (--wtm-approve)

Security model

Gotchas

Files

Dependencies

Comentarios · Sin comentarios

2. Cross-skill delegation (requires `--wtm-accept-risk`)

3. docs-finder priority override (`--wtm-approve`)