
crawl-wechat

Category: Search and Web

Crawl and extract WeChat public account (微信公众号) articles into structured data and clean markdown. Use this skill whenever the user wants to scrape, crawl, read, extract, or fetch content from a WeChat article URL (mp.weixin.qq.com). Also trigger when the user mentions 微信公众号文章, 抓取微信文章, or 爬取公众号, or provides a WeChat article link and wants its content extracted. This skill handles the tricky parts: spoof

Author: gxcsoccer · License: MIT

Crawl WeChat Articles

This skill extracts content from WeChat public account articles using the crawl4ai library. WeChat articles require special handling because they check the User-Agent header, render content dynamically, and use lazy-loading for images.

When to use

  • User provides a mp.weixin.qq.com/s/... URL and wants its content
  • User asks to scrape/crawl/extract/read a WeChat (微信) article
  • User wants to batch-process multiple WeChat article links
  • User needs the article in markdown or structured format
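
A quick way to decide whether a link qualifies is to check its host and path. The helper below is a minimal sketch and not part of the bundled script; the name `is_wechat_article_url` is hypothetical, and accepting both `/s/<id>` and `/s?...` paths is an assumption about the link forms WeChat uses.

```python
from urllib.parse import urlparse

WECHAT_HOST = "mp.weixin.qq.com"


def is_wechat_article_url(url: str) -> bool:
    """Return True if the URL looks like an article link on mp.weixin.qq.com."""
    parsed = urlparse(url.strip())
    if parsed.scheme not in ("http", "https"):
        return False
    if (parsed.hostname or "").lower() != WECHAT_HOST:
        return False
    # Assumed link forms: short links (/s/<id>) and long links (/s?__biz=...).
    return parsed.path == "/s" or parsed.path.startswith("/s/")
```

Links that fail this check are probably not something the skill can handle, so it is cheaper to reject them before starting a browser.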

Setup (run once before first use)

Before running the script, ensure dependencies are installed:

pip install crawl4ai aiohttp && crawl4ai-setup

If crawl4ai is already importable and the browser backend is ready, skip this step. When the script fails with ModuleNotFoundError or browser-related errors, run the commands above to fix it.

How it works

Run the bundled script to crawl a WeChat article:

python <skill-dir>/scripts/crawl_wechat.py <URL> [--download-images] [--save-html] [--save-markdown] [--output-dir DIR]

The script outputs a JSON summary to stdout and optionally saves the full HTML and/or markdown to files.
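
If the script is driven from another Python program, the command line above can be assembled and its output parsed as follows. This is a sketch under two assumptions: that stdout contains only the JSON summary, and that the script exits non-zero on failure. `build_command` and `run_crawl` are hypothetical helper names; the flags are the ones listed above.

```python
import json
import subprocess
import sys


def build_command(script, url, *, download_images=False, save_html=False,
                  save_markdown=False, output_dir=None):
    """Assemble the argv list for crawl_wechat.py from the documented flags."""
    cmd = [sys.executable, str(script), url]
    if download_images:
        cmd.append("--download-images")
    if save_html:
        cmd.append("--save-html")
    if save_markdown:
        cmd.append("--save-markdown")
    if output_dir is not None:
        cmd += ["--output-dir", str(output_dir)]
    return cmd


def run_crawl(script, url, **options) -> dict:
    """Run the script and decode the JSON summary it prints to stdout."""
    result = subprocess.run(
        build_command(script, url, **options),
        capture_output=True, text=True, check=True,
    )
    # Assumption: nothing but the JSON summary is written to stdout.
    return json.loads(result.stdout)
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain `&` or `?`.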

Key technical details

  1. User-Agent spoofing: The script sets MicroMessenger/8.0.43 in the UA string so WeChat serves the full article instead of a "please open in WeChat" block.

  2. Dynamic wait: Uses wait_for="css:#js_content" to ensure the article body has fully rendered before scraping.

  3. Lazy-image fix: WeChat uses data-src for lazy-loaded images. The script injects JS to copy data-src to src so the markdown generator can pick up real image URLs.

  4. Structured extraction: Uses JsonCssExtractionStrategy with a schema targeting WeChat's DOM structure (#activity-name for title, #js_name for author, #publish_time for date, #js_content for body).

  5. Clean markdown with images: Uses DefaultMarkdownGenerator to produce readable markdown. SVG placeholder images and data-URI artifacts are cleaned out, preserving only real article images inline with the text.

  6. Image hotlink protection: WeChat images on mmbiz.qpic.cn block requests with non-QQ referrers. Use --download-images to download all images locally with the correct Referer header, automatically replacing remote URLs with local paths in both HTML and markdown output.
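
Steps 3 and 5 above can be illustrated with standard-library Python. The JavaScript string is a guess at what the injected snippet might look like, and `clean_markdown` is a hypothetical approximation of the cleanup pass; neither is taken from the bundled script. The regex assumes a data URI contains no unescaped `)`, which may not hold for every SVG placeholder.

```python
import re

# Hypothetical JS for step 3: copy each lazy image's data-src into src.
LAZY_IMAGE_JS = """
document.querySelectorAll('img[data-src]').forEach((img) => {
  img.setAttribute('src', img.getAttribute('data-src'));
});
"""

# Markdown images whose target is a data: URI (SVG placeholders and similar).
_DATA_URI_IMAGE = re.compile(r"!\[[^\]]*\]\(\s*data:[^)]*\)")


def clean_markdown(markdown: str) -> str:
    """Drop data-URI images and collapse the blank lines they leave behind."""
    cleaned = _DATA_URI_IMAGE.sub("", markdown)
    return re.sub(r"\n{3,}", "\n\n", cleaned).strip()
```

Images with ordinary http(s) URLs are left untouched, so real article images stay inline with the text.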

Extracted fields

| Field | Description |
| --- | --- |
| title | Article title |
| author | Public account name |
| publish_time | Publication timestamp |
| account_desc | Account description/bio |
| markdown | Clean markdown of article body |
| html | Raw HTML of article body |
| url | Final URL after any redirects |
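
For reference, the fields whose selectors are named in this document can be written as a schema dictionary in the shape crawl4ai's JsonCssExtractionStrategy accepts. This is a sketch, not the script's actual schema: the `baseSelector` value and the schema name are assumptions, and `account_desc` is omitted because its selector is not documented here.

```python
# Schema in the name/baseSelector/fields shape used by crawl4ai's
# JsonCssExtractionStrategy. Selectors come from the technical notes above.
WECHAT_ARTICLE_SCHEMA = {
    "name": "WeChatArticle",   # assumption: any descriptive label works
    "baseSelector": "body",    # assumption: one article per page
    "fields": [
        {"name": "title", "selector": "#activity-name", "type": "text"},
        {"name": "author", "selector": "#js_name", "type": "text"},
        {"name": "publish_time", "selector": "#publish_time", "type": "text"},
        {"name": "html", "selector": "#js_content", "type": "html"},
    ],
}


def selector_for(field_name: str) -> str:
    """Look up the CSS selector configured for a field in the schema."""
    for field in WECHAT_ARTICLE_SCHEMA["fields"]:
        if field["name"] == field_name:
            return field["selector"]
    raise KeyError(field_name)
```

The `markdown` and `url` fields are not part of a CSS schema like this; they would come from the markdown generator and the crawl result respectively.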

Example usage

Single article with images downloaded locally:

python <skill-dir>/scripts/crawl_wechat.py "https://mp.weixin.qq.com/s/xxx" --download-images --save-markdown --output-dir ./output

For programmatic use in Python:

from crawl_wechat import crawl_wechat_article
import asyncio

article = asyncio.run(crawl_wechat_article(
    "https://mp.weixin.qq.com/s/...",
    images_dir="./output/images",  # download images locally
))
print(article["title"])
print(article["markdown"])  # images reference local paths
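
The local image download used in both examples depends on sending a Referer that WeChat's image host accepts. The sketch below shows the idea with the standard library only. It is not the script's implementation: the Referer value, the hash-based file names, and the use of the `wx_fmt` query parameter for the extension are all assumptions, and every function name is hypothetical.

```python
import hashlib
import re
import urllib.request
from pathlib import Path
from urllib.parse import parse_qs, urlparse

WECHAT_REFERER = "https://mp.weixin.qq.com/"  # assumption: an accepted referrer
IMAGE_URL = re.compile(r"https?://mmbiz\.qpic\.cn/[^\s)\"'>]+")


def build_image_request(url: str) -> urllib.request.Request:
    """Build a request carrying the Referer the image host expects."""
    headers = {"Referer": WECHAT_REFERER, "User-Agent": "Mozilla/5.0"}
    return urllib.request.Request(url, headers=headers)


def local_name(url: str) -> str:
    """Derive a stable file name; the extension is read from wx_fmt if present."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:16]
    ext = parse_qs(urlparse(url).query).get("wx_fmt", ["jpg"])[0]
    return f"{digest}.{ext}"


def _fetch(url: str) -> bytes:
    with urllib.request.urlopen(build_image_request(url), timeout=30) as resp:
        return resp.read()


def localize_images(markdown: str, images_dir, fetch=_fetch) -> str:
    """Download every mmbiz.qpic.cn image and point the markdown at local files."""
    images_dir = Path(images_dir)
    images_dir.mkdir(parents=True, exist_ok=True)
    # Longest URLs first, so one URL that prefixes another is not clobbered.
    for url in sorted(set(IMAGE_URL.findall(markdown)), key=len, reverse=True):
        target = images_dir / local_name(url)
        target.write_bytes(fetch(url))
        markdown = markdown.replace(url, target.as_posix())
    return markdown
```

The `fetch` parameter exists so the rewriting logic can be exercised without network access.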

Limitations

  • Requires a valid, non-expired WeChat article URL — cannot search or list articles from an account
  • High-frequency crawling may trigger WeChat's anti-bot measures (CAPTCHAs, IP blocks)
  • Some temporary share links expire after a period

How to add

/plugin marketplace add gxcsoccer/wechat-article-crawler

The exact command may vary depending on the repository. Check the README on GitHub.
