Taoguba Crawler

This skill runs the Taoguba (tgb.cn) article crawlers located in the project root.

Prerequisites

Python 3 with requests, beautifulsoup4, and python-dotenv installed
A .env file in the project root containing COOKIE and optionally USER_AGENT
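
As a rough illustration of how these two settings could become request headers, here is a minimal sketch. It is not the crawlers' actual code: the `build_headers` helper and the fallback User-Agent string are assumptions. It reads from any mapping, so it also works with `os.environ` after python-dotenv's `load_dotenv()` has loaded the .env file.

```python
import os

# Hypothetical fallback; the real scripts may use a different default.
DEFAULT_USER_AGENT = "Mozilla/5.0"


def build_headers(env=None):
    """Build HTTP headers from COOKIE and the optional USER_AGENT setting."""
    env = os.environ if env is None else env
    cookie = env.get("COOKIE", "").strip()
    if not cookie:
        # Fail early: without a cookie the site is unlikely to return articles.
        raise ValueError("COOKIE is missing or empty; check your .env file")
    return {
        "Cookie": cookie,
        "User-Agent": env.get("USER_AGENT", "").strip() or DEFAULT_USER_AGENT,
    }
```

Failing early on an empty COOKIE gives a clearer error than an empty article list later in the run.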

Available Crawlers

1. BBS Crawler (`crawler_bbs.py`)

Crawl the forum board at tgb.cn/bbs/1/1 using HTML scraping.

python crawler_bbs.py

Extracts the article list by parsing a.overhide.mw300 elements
Gets each article's main post and author replies
Downloads images and embeds them as base64 in HTML
Outputs: output/bbs_YYYY-MM-DD.json and output/bbs_YYYY-MM-DD_HHMMSS.html
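
The output naming pattern can be reproduced with a few lines of standard-library code. This is a sketch under assumptions: the `output_paths` helper and its parameters are hypothetical, and the real script may build its file names differently.

```python
from datetime import datetime
from pathlib import Path


def output_paths(prefix, now=None, out_dir="output"):
    """Return (json_path, html_path) following the naming pattern above."""
    now = now or datetime.now()
    day = now.strftime("%Y-%m-%d")            # one JSON file per day
    stamp = now.strftime("%Y-%m-%d_%H%M%S")   # one HTML file per run
    base = Path(out_dir)
    return base / f"{prefix}_{day}.json", base / f"{prefix}_{stamp}.html"
```

With this pattern, running a crawler twice on the same day yields two HTML files but the same JSON file name.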

2. Home Crawler (`crawler_home.py`)

Crawl the homepage recommendations via JSON API (/newIndex/getZh).

python crawler_home.py

Fetches articles from the JSON API (default 2 pages)
Uses the same content extraction and HTML generation as the BBS crawler
Outputs: output/home_YYYY-MM-DD.json and output/home_YYYY-MM-DD_HHMMSS.html
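
Fetching a fixed number of pages might look like the sketch below. The request parameters and response shape of /newIndex/getZh are not documented here, so the HTTP call is left to a hypothetical `fetch_page` callable that returns a list of article dicts. The `id` field used to skip duplicates is also an assumption.

```python
def collect_articles(fetch_page, pages=2):
    """Call fetch_page(1..pages) and merge the results, skipping duplicates."""
    seen = set()
    articles = []
    for page in range(1, pages + 1):
        items = fetch_page(page)
        if not items:
            break  # an empty page suggests there is nothing more to fetch
        for item in items:
            key = item.get("id")  # "id" is an assumed field name
            if key is not None and key in seen:
                continue
            seen.add(key)
            articles.append(item)
    return articles
```

Passing the fetcher in as a callable keeps the loop testable without network access.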

Common Workflow

To run both crawlers:

python crawler_bbs.py && python crawler_home.py

Key Implementation Details

Authentication: Both scripts read COOKIE from .env via python-dotenv
Rate limiting: 0.5-1s delay between requests to avoid being blocked
Image handling: Images are downloaded and embedded as base64 in the HTML output
Article content: Extracts main post (#first) and author replies (.comment-data with author badge)
Output directory: All results are saved to the output/ folder
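
Two of the details above, the request delay and the base64 image embedding, can be sketched with the standard library alone. The helper names are hypothetical, and whether the real scripts randomize the delay or use a fixed value within the 0.5-1s range is an assumption.

```python
import base64
import random
import time


def polite_delay(low=0.5, high=1.0, sleep=time.sleep):
    """Pause for a random interval between low and high seconds."""
    delay = random.uniform(low, high)
    sleep(delay)
    return delay


def to_data_uri(image_bytes, mime_type="image/jpeg"):
    """Encode raw image bytes as a base64 data URI for an <img src>."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

Embedding images this way makes the HTML file self-contained, at the cost of roughly a one-third size increase over the raw image bytes.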

Scripts

The crawler scripts are bundled in scripts/:

scripts/crawler_bbs.py - BBS forum crawler (HTML scraping)
scripts/crawler_home.py - Homepage crawler (JSON API)

To run the bundled scripts directly:

python scripts/crawler_bbs.py
python scripts/crawler_home.py

Troubleshooting

If no articles are returned, check that .env contains a valid COOKIE value
If image downloads fail, the HTML will show error messages inline
Network timeouts default to 10-15 seconds per request
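
The inline error behaviour for failed images could be implemented along these lines. This is a sketch rather than the scripts' actual code: `image_tag`, the `download` callable, the `img-error` class name, and the fixed image/jpeg MIME type are all assumptions.

```python
import base64
import html


def image_tag(url, download):
    """Return an <img> tag with embedded data, or an inline error note."""
    try:
        data = download(url)  # hypothetical callable returning raw bytes
    except Exception as exc:  # broad on purpose: any failure becomes a note
        message = html.escape(f"Image failed to load: {url} ({exc})")
        return f'<span class="img-error">{message}</span>'
    encoded = base64.b64encode(data).decode("ascii")
    return f'<img src="data:image/jpeg;base64,{encoded}">'
```

Catching the failure per image means one broken or timed-out download does not abort the whole article.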
