Taoguba Crawler
This skill runs the Taoguba (tgb.cn) article crawlers located in the project root.
Prerequisites
- Python 3 with requests, beautifulsoup4, and python-dotenv installed
- A .env file in the project root containing COOKIE and, optionally, USER_AGENT
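As a rough illustration of how the two .env values might feed into request headers, here is a minimal sketch. The function name build_headers and the fallback DEFAULT_USER_AGENT are hypothetical and are not taken from the crawler scripts. It accepts any mapping, so it can be exercised without a real .env file.

```python
from typing import Mapping

# Hypothetical fallback; the real scripts may use a different default.
DEFAULT_USER_AGENT = "Mozilla/5.0"


def build_headers(env: Mapping[str, str]) -> dict:
    """Build HTTP headers from COOKIE (required) and USER_AGENT (optional)."""
    cookie = env.get("COOKIE", "").strip()
    if not cookie:
        # Fail early: a missing cookie is the usual reason for empty results.
        raise RuntimeError("COOKIE is missing; add it to .env in the project root")
    user_agent = env.get("USER_AGENT", "").strip() or DEFAULT_USER_AGENT
    return {"Cookie": cookie, "User-Agent": user_agent}
```

In a script, one would typically call load_dotenv() from python-dotenv first and then pass os.environ to build_headers.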
Available Crawlers
1. BBS Crawler (crawler_bbs.py)
Crawls the forum board at tgb.cn/bbs/1/1 using HTML scraping.
python crawler_bbs.py
- Extracts the article list by parsing a.overhide.mw300 elements
- Gets each article's main post and author replies
- Downloads images and embeds them as base64 in HTML
- Outputs: output/bbs_YYYY-MM-DD.json and output/bbs_YYYY-MM-DD_HHMMSS.html
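Embedding an image as base64 usually means turning the downloaded bytes into a data URI and using it as the src of an img tag. The sketch below shows that idea only. The helper names to_data_uri and img_tag are hypothetical, and the default MIME type is an assumption. The actual scripts may detect the type differently.

```python
import base64


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a data URI suitable for an <img> src."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def img_tag(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Return an <img> element with the image embedded inline."""
    return f'<img src="{to_data_uri(image_bytes, mime_type)}">'
```

Inlining images this way keeps the HTML file self-contained, at the cost of a larger file (base64 adds roughly one third to the image size).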
2. Home Crawler (crawler_home.py)
Crawls the homepage recommendations via the JSON API (/newIndex/getZh).
python crawler_home.py
- Fetches articles from the JSON API (default 2 pages)
- Same content extraction and HTML generation as BBS crawler
- Outputs: output/home_YYYY-MM-DD.json and output/home_YYYY-MM-DD_HHMMSS.html
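Both crawlers write a date-stamped JSON file and a timestamped HTML file. The naming pattern can be sketched as follows. The function output_paths and its parameters are hypothetical, introduced only to illustrate the pattern.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional, Tuple


def output_paths(prefix: str, now: Optional[datetime] = None,
                 out_dir: str = "output") -> Tuple[Path, Path]:
    """Return (json_path, html_path) for a crawler prefix such as 'bbs' or 'home'."""
    now = now or datetime.now()
    base = Path(out_dir)
    # JSON is stamped per day; HTML also carries the time of the run.
    json_path = base / f"{prefix}_{now:%Y-%m-%d}.json"
    html_path = base / f"{prefix}_{now:%Y-%m-%d_%H%M%S}.html"
    return json_path, html_path
```

Under this pattern, repeated runs on the same day map to the same JSON filename, while each run gets its own HTML file.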
Common Workflow
To run both crawlers:
python crawler_bbs.py && python crawler_home.py
Key Implementation Details
- Authentication: Both scripts read COOKIE from .env via python-dotenv
- Rate limiting: A 0.5-1 s delay between requests to avoid being blocked
- Image handling: Images are downloaded and embedded as base64 in the HTML output
- Article content: Extracts the main post (#first) and author replies (.comment-data with author badge)
- Output directory: All results are saved to the output/ folder
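The rate limiting described above can be approximated with a randomized pause between requests. This is a minimal sketch assuming a uniform delay in the 0.5-1 s range. The name polite_sleep and the injectable sleep parameter are hypothetical and exist only to make the idea testable.

```python
import random
import time
from typing import Callable


def polite_sleep(min_s: float = 0.5, max_s: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> float:
    """Pause for a random duration between min_s and max_s seconds."""
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay
```

Calling polite_sleep() after each page or article request spaces the traffic out. Randomizing the delay avoids a perfectly regular request rhythm.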
Scripts
The crawler scripts are bundled in scripts/:
- scripts/crawler_bbs.py - BBS forum crawler (HTML scraping)
- scripts/crawler_home.py - Homepage crawler (JSON API)
To run the bundled scripts directly:
python scripts/crawler_bbs.py
python scripts/crawler_home.py
Troubleshooting
- If no articles are returned, check that .env contains a valid COOKIE value
- If image downloads fail, the HTML will show error messages inline
- Network timeouts default to 10-15 seconds per request