mk:web-to-markdown
Fetch arbitrary URLs and return clean markdown with injection defense.
When to use
- User provides a URL in chat: "summarize https://example.com/blog" → skill fires automatically
- Agent needs an arbitrary external page that is NOT in curated docs (use mk:docs-finder for libraries/frameworks)
- External references during research, intake, investigation, or planning, with the --wtm-accept-risk delegation gate
When NOT to use
- Library or framework documentation → use mk:docs-finder (Context7, Context Hub, WebSearch)
- Interactive browser testing → use mk:agent-browser
- Playwright test automation → use mk:playwright-cli
- Fetching sensitive/internal URLs → use the host runtime's built-in WebFetch tool (proxied by the runtime vendor)
Invocation patterns
1. Direct user invocation (no flag)
User: "fetch https://docs.example.com/api and explain the auth flow"
Agent: [invokes mk:web-to-markdown directly]
2. Cross-skill delegation (requires --wtm-accept-risk)
mk:research → mk:web-to-markdown --wtm-accept-risk <url>
mk:intake → mk:web-to-markdown --wtm-accept-risk <url>
Other skills MUST pass --wtm-accept-risk to delegate. Without it, the skill refuses the call and returns ERROR: cross-skill delegation requires --wtm-accept-risk flag. This forces conscious crossing of the trust boundary and creates an audit trail.
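The delegation rule above can be sketched as a small gate function. This is a minimal illustration, not the skill's actual code: the function name `check_delegation` and the idea of passing the caller's name as an argument are assumptions; only the flag and the error string come from the text above.

```python
from typing import Optional

RISK_FLAG = "--wtm-accept-risk"
REFUSAL = "ERROR: cross-skill delegation requires --wtm-accept-risk flag"


def check_delegation(caller: Optional[str], args: list[str]) -> Optional[str]:
    """Return the refusal message if the call must be rejected, else None.

    `caller` is None for a direct user invocation; otherwise it is the
    name of the delegating skill (hypothetical parameter, for illustration).
    """
    if caller is None:
        return None  # direct user invocation needs no flag
    if RISK_FLAG in args:
        return None  # the delegating skill explicitly accepted the risk
    return REFUSAL


if __name__ == "__main__":
    print(check_delegation("mk:research", ["https://example.com"]))
    print(check_delegation("mk:research", [RISK_FLAG, "https://example.com"]))
```

Keeping the check to an explicit flag means every delegated fetch is visible in the calling skill's command line, which is what makes the audit trail possible.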
3. docs-finder priority override (--wtm-approve)
mk:docs-finder --wtm-approve <url>
# → skips Context7 / chub / WebSearch tiers
# → goes directly to mk:web-to-markdown
Used when the user knows the target URL is not in any curated index and wants to skip the wasted hops.
Security model
See references/security.md for the full threat model, attack surface, and defense architecture.
Non-negotiable defenses:
- SSRF guard: scheme allowlist (http/https only), private/loopback/link-local IP block, redirect re-validation
- 10MB response size cap with streaming read + lxml huge_tree=False
- DATA boundary wrapping on EVERY return (including previews)
- Injection scanner: 50+ patterns + encoding detection (base64/ROT13/Unicode/zero-width) + context-flood WARN
- HARD_STOP on injection hit — content quarantined, no programmatic override, manual user inspection required
- Secret scrub on content AND URL BEFORE any disk write
- privacy-block.sh hook-layer enforcement of SSRF + cache/manifest read blocks
- injection-audit.py post-write library scan (called from persist_fetch.persist via scan_file import, not CLI)
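As an illustration of the SSRF guard in the list above, the following sketch applies a scheme allowlist and a private/loopback/link-local block using only the standard library. The function names and the choice to pass already-resolved addresses as a parameter are assumptions made for this example; the real guard is specified in references/security.md and may differ.

```python
import ipaddress
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def is_blocked_ip(address: str) -> bool:
    """True for private, loopback, or link-local addresses (IPv4 or IPv6)."""
    ip = ipaddress.ip_address(address)
    return ip.is_private or ip.is_loopback or ip.is_link_local


def check_url(url: str, resolved_ips: list[str]) -> bool:
    """Return True only if the URL passes the scheme and IP checks.

    `resolved_ips` is what DNS returned for the host (hypothetical
    parameter; a caller would typically obtain it via socket.getaddrinfo).
    """
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        return False
    if not parts.hostname or not resolved_ips:
        return False  # fail closed when there is nothing to validate
    return not any(is_blocked_ip(ip) for ip in resolved_ips)
```

Redirect re-validation would amount to calling the same check again on every hop's target URL before following it, rather than trusting the first URL only.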
Gotchas
- Playwright is opt-in. Default is static fetch only. JS-rendered pages return an error pointing to .claude/scripts/bin/setup-workflow --system-deps. This is intentional: a 200MB Chromium download is not worth the 5% of pages that need it.
- robots.txt is respected with a 24h cache. Some doc sites disallow scraping; the skill honors this. Override requires manual user action.
- Fetch persistence grows unbounded in v1. Manual cleanup via rm -rf .claude/cache/web-fetches/*. v2 will add TTL auto-cleanup.
- Reports may contain PII. Secret-scrub catches credentials but does NOT catch names, emails, or user IDs in page body text. Treat cached reports as sensitive.
- Injection STOP has no bypass. If the scanner halts a fetch, no flag reopens it. The user must manually inspect the quarantine file.
- Slug is sha256-hashed path. Filenames don't carry path-embedded tokens, which is good for security but annoying for ls-based discovery. Use the manifest index.jsonl (behind privacy-block) to search.
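To make the slug gotcha concrete, here is a minimal sketch of hashing the URL path so that tokens embedded in it never reach the filename. The exact slug layout used by the skill is not stated here, so the `host-digest` shape, the 16-character truncation, and the function name `slug_for` are all assumptions.

```python
import hashlib
from urllib.parse import urlsplit


def slug_for(url: str, digest_chars: int = 16) -> str:
    """Build a filename-safe slug whose path part is a SHA-256 digest.

    The host stays readable; the path and query are replaced by a hash
    prefix (illustrative layout, not the skill's documented format).
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path = f"{path}?{parts.query}"
    digest = hashlib.sha256(path.encode("utf-8")).hexdigest()
    return f"{parts.hostname}-{digest[:digest_chars]}"


if __name__ == "__main__":
    print(slug_for("https://example.com/reset/SECRET-TOKEN?x=1"))
```

Because the digest is one-way, a listing of the cache directory reveals which hosts were fetched but not which paths, which is why the manifest is the practical way to look up a specific fetch.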
Files
- SKILL.md: this file (entrypoint + frontmatter)
- references/security.md: master security spec (threat model, defenses, enforcement layers)
- references/gotchas.md
- scripts/fetch_as_markdown.py
- scripts/persist_fetch.py
- scripts/injection_detect.py
- scripts/requirements.txt
- tests/test_smoke_real_urls.py
Dependencies
Static fetch (always):
.claude/skills/.venv/bin/pip install -r scripts/requirements.txt
# requests, readability-lxml, html2text, lxml, charset-normalizer
JS rendering (opt-in via .claude/scripts/bin/setup-workflow --system-deps):
.claude/skills/.venv/bin/pip install playwright==1.58.0
.claude/skills/.venv/bin/playwright install chromium # ~200MB one-time
To enable JS rendering at runtime, set MEOWKIT_WEB_FETCH_JS=1 before invoking the skill. All three gates must be open: Playwright is installed, MEOWKIT_WEB_FETCH_JS=1 is set, and the call passes js=True. See references/security.md for the three-layer JS gate spec.
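The three gates can be pictured as a single boolean check. The sketch below is an assumption about how such a check might look, not the skill's implementation: the function name and the injectable `env` and `playwright_installed` parameters are hypothetical and exist only to make the logic easy to test.

```python
import importlib.util
import os
from typing import Mapping, Optional


def js_rendering_allowed(
    js: bool,
    env: Optional[Mapping[str, str]] = None,
    playwright_installed: Optional[bool] = None,
) -> bool:
    """Return True only when all three JS-rendering gates are open."""
    if env is None:
        env = os.environ
    if playwright_installed is None:
        # Gate 1: the playwright package is importable in this environment.
        playwright_installed = importlib.util.find_spec("playwright") is not None
    return all([
        bool(playwright_installed),              # gate 1: Playwright installed
        env.get("MEOWKIT_WEB_FETCH_JS") == "1",  # gate 2: environment opt-in
        bool(js),                                # gate 3: per-call argument
    ])
```

Requiring all three means that installing Playwright alone never changes fetch behavior; a page is rendered with a browser only when the environment and the individual call both ask for it.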