Crawl WeChat Articles
This skill extracts content from WeChat public account articles using the crawl4ai library. WeChat articles require special handling because they check the User-Agent header, render content dynamically, and use lazy-loading for images.
When to use
- User provides an mp.weixin.qq.com/s/... URL and wants its content
- User asks to scrape/crawl/extract/read a WeChat (微信) article
- User wants to batch-process multiple WeChat article links
- User needs the article in markdown or structured format
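The first trigger above can be checked mechanically. The following is a minimal sketch, not part of the bundled script: the helper name is_wechat_article_url is hypothetical, and accepting the long-form /s?__biz=... path in addition to /s/... is an assumption.

```python
from urllib.parse import urlparse


def is_wechat_article_url(url: str) -> bool:
    """Return True if url looks like a WeChat public account article link.

    Hypothetical helper: the bundled script may validate URLs differently.
    """
    # Accept scheme-less input such as "mp.weixin.qq.com/s/abc".
    candidate = url if "://" in url else "https://" + url
    parsed = urlparse(candidate)
    # Assumption: articles live under /s/<id> (short form) or /s?... (long form).
    path_ok = parsed.path == "/s" or parsed.path.startswith("/s/")
    return parsed.hostname == "mp.weixin.qq.com" and path_ok
```

A check like this only confirms the shape of the link; it cannot tell whether the article still exists or the link has expired.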
Setup (run once before first use)
Before running the script, ensure dependencies are installed:
pip install crawl4ai aiohttp && crawl4ai-setup
If crawl4ai is already importable and the browser backend is ready, skip this step. When the script fails with ModuleNotFoundError or browser-related errors, run the commands above to fix it.
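One way to decide whether the install step can be skipped is to test importability first. This is a sketch under assumptions: the helper name missing_dependencies is hypothetical, and it only checks that the Python packages can be found; it does not verify that the browser backend installed by crawl4ai-setup is ready.

```python
import importlib.util


def missing_dependencies(modules=("crawl4ai", "aiohttp")):
    """Return the names of modules that cannot be located for import.

    Hypothetical helper: it does not check the browser backend.
    """
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_dependencies()
    if missing:
        print("Run the setup commands; missing:", ", ".join(missing))
    else:
        print("Python dependencies found.")
```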
How it works
Run the bundled script to crawl a WeChat article:
python <skill-dir>/scripts/crawl_wechat.py <URL> [--download-images] [--save-html] [--save-markdown] [--output-dir DIR]
The script outputs a JSON summary to stdout and optionally saves the full HTML and/or markdown to files.
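When calling the script from another program, the command line and the JSON summary can be handled as in the sketch below. The flags are the ones listed above; the helper names build_command and run_crawl are hypothetical, and the sketch assumes stdout contains nothing but the JSON summary.

```python
import json
import subprocess
import sys


def build_command(script_path, url, output_dir=None, download_images=False,
                  save_html=False, save_markdown=False):
    """Assemble the argument list for crawl_wechat.py (hypothetical helper)."""
    cmd = [sys.executable, script_path, url]
    if download_images:
        cmd.append("--download-images")
    if save_html:
        cmd.append("--save-html")
    if save_markdown:
        cmd.append("--save-markdown")
    if output_dir:
        cmd += ["--output-dir", output_dir]
    return cmd


def run_crawl(script_path, url, **options):
    """Run the script and parse its stdout as JSON.

    Assumption: stdout holds only the JSON summary. A non-zero exit
    status raises subprocess.CalledProcessError.
    """
    completed = subprocess.run(
        build_command(script_path, url, **options),
        capture_output=True, text=True, check=True,
    )
    return json.loads(completed.stdout)
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain characters such as & or ?.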
Key technical details
- User-Agent spoofing: The script sets MicroMessenger/8.0.43 in the UA string so WeChat serves the full article instead of a "please open in WeChat" block.
- Dynamic wait: Uses wait_for="css:#js_content" to ensure the article body has fully rendered before scraping.
- Lazy-image fix: WeChat uses data-src for lazy-loaded images. The script injects JS to copy data-src → src so the markdown generator can pick up real image URLs.
- Structured extraction: Uses JsonCssExtractionStrategy with a schema targeting WeChat's DOM structure (#activity-name for title, #js_name for author, #publish_time for date, #js_content for body).
- Clean markdown with images: Uses DefaultMarkdownGenerator to produce readable markdown. SVG placeholder images and data-URI artifacts are cleaned out, preserving only real article images inline with the text.
- Image hotlink protection: WeChat images on mmbiz.qpic.cn block requests with non-QQ referrers. Use --download-images to download all images locally with the correct Referer header, automatically replacing remote URLs with local paths in both HTML and markdown output.
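The lazy-image, cleanup, and hotlink points above can be approximated offline with the standard library. This is a sketch, not the bundled script's code: it rewrites an HTML string with regular expressions instead of injecting JS into the page, and the function names, the regexes, and the Referer value https://mp.weixin.qq.com/ are all assumptions.

```python
import re
import urllib.request

IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
DATA_SRC = re.compile(r'\bdata-src\s*=\s*"([^"]*)"', re.IGNORECASE)
# Match a plain src attribute, but not the "src" inside "data-src".
PLAIN_SRC = re.compile(r'(?<![\w-])src\s*=\s*"[^"]*"', re.IGNORECASE)
DATA_URI_IMAGE = re.compile(r"!\[[^\]]*\]\(\s*data:[^)]*\)")


def promote_lazy_src(html: str) -> str:
    """Copy each img tag's data-src value into its src attribute."""
    def fix(match):
        tag = match.group(0)
        lazy = DATA_SRC.search(tag)
        if lazy is None:
            return tag  # not a lazy-loaded image
        new_src = f'src="{lazy.group(1)}"'
        if PLAIN_SRC.search(tag):
            # Replace the placeholder src with the real URL.
            return PLAIN_SRC.sub(lambda _m: new_src, tag, count=1)
        # No src yet: add one right after "<img".
        return tag[:4] + " " + new_src + tag[4:]
    return IMG_TAG.sub(fix, html)


def strip_data_uri_images(markdown: str) -> str:
    """Drop markdown images whose target is a data: URI (placeholders)."""
    return DATA_URI_IMAGE.sub("", markdown)


def build_image_request(url: str) -> urllib.request.Request:
    """Prepare an image request carrying a WeChat Referer header.

    Assumption: https://mp.weixin.qq.com/ is an accepted referrer.
    """
    return urllib.request.Request(
        url, headers={"Referer": "https://mp.weixin.qq.com/"}
    )
```

The request object can be passed to urllib.request.urlopen to fetch the image bytes. Regex-based HTML rewriting is fragile (single-quoted attributes are not handled here), so a real implementation would more likely work on the live DOM, as the script does.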
Extracted fields
| Field | Description |
|---|---|
| title | Article title |
| author | Public account name |
| publish_time | Publication timestamp |
| account_desc | Account description/bio |
| markdown | Clean markdown of article body |
| html | Raw HTML of article body |
| url | Final URL after any redirects |
Example usage
Single article with images downloaded locally:
python <skill-dir>/scripts/crawl_wechat.py "https://mp.weixin.qq.com/s/xxx" --download-images --save-markdown --output-dir ./output
For programmatic use in Python:
import asyncio

# Assumes <skill-dir>/scripts is on sys.path so crawl_wechat is importable.
from crawl_wechat import crawl_wechat_article

article = asyncio.run(crawl_wechat_article(
    "https://mp.weixin.qq.com/s/...",
    images_dir="./output/images",  # download images locally
))
print(article["title"])
print(article["markdown"])  # images reference local paths
Limitations
- Requires a valid, non-expired WeChat article URL; it cannot search or list articles from an account
- High-frequency crawling may trigger WeChat's anti-bot measures (CAPTCHAs, IP blocks)
- Some temporary share links expire after a period
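For batch processing, limiting concurrency and pausing between requests is one way to lower the chance of tripping the anti-bot measures mentioned above. The sketch below is an assumption-laden illustration: crawl_many is a hypothetical helper, the default concurrency and delay are arbitrary, and nothing here guarantees that WeChat will not block the crawl.

```python
import asyncio


async def crawl_many(urls, crawl_fn, concurrency=2, delay=1.0):
    """Crawl urls with at most `concurrency` requests in flight.

    crawl_fn is any async callable taking a URL. Returns a list of
    (url, result_or_exception) pairs in the same order as urls.
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def crawl_one(url):
        async with semaphore:
            try:
                return url, await crawl_fn(url)
            except Exception as exc:  # e.g. expired link or block page
                return url, exc
            finally:
                # Hold the slot briefly so requests are spaced out.
                await asyncio.sleep(delay)

    return await asyncio.gather(*(crawl_one(url) for url in urls))
```

With the function from the programmatic example, this could be called as asyncio.run(crawl_many(urls, crawl_wechat_article)). Failures are returned rather than raised, so one expired link does not abort the rest of the batch.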