Paper Full-text Harvest

Name: paper-fulltext-harvest
Rating: 5 (43 reviews)
Author: jxtse

Pipeline for downloading academic paper full-text at scale. Handles the three classes of sources that exist in 2026:

Publisher TDM APIs (Elsevier / Wiley / Springer) — for paywalled content where the institution has a subscription
OA aggregators (Unpaywall / OpenAlex / Crossref) — for Open Access copies regardless of publisher
Browser fallback (logged-in user profile) — for paywalled publishers without a TDM API (ACS / RSC / IEEE / AIP / IOP / APS / T&F / many CN journals)

The publisher router (auto_paper_download/publishers.py) recognises 25 DOI prefixes across 19 families, each annotated with the right downstream path (TDM client / OA aggregator / browser fallback) and a support tier. The router is shared with the standalone auto-paper-harvester CLI — see SUPPORTED_PUBLISHERS.md there for the full per-publisher table.

Decision tree

Have a DOI list?
├── DOIs from Elsevier (10.1016, 10.1006, 10.1011)
│   └── Use ElsevierClient (TDM XML API)             → §1
├── DOIs from Wiley (10.1002, 10.1111)
│   └── Use WileyClient (TDM PDF API)                → §1
├── DOIs from Springer/Nature (10.1007, 10.1038, 10.1186, 10.1147)
│   ├── OA papers → SpringerClient OA API            → §1
│   └── Subscription papers → fall through to OA/browser
├── Browser-only publishers without TDM API
│   (10.1021 ACS, 10.1039 RSC, 10.1126 Science, 10.1109 IEEE,
│    10.1063 AIP, 10.1088/10.1143 IOP, 10.1103 APS, 10.1146 Annual Reviews,
│    10.1080 T&F, 10.1116 AVS, 10.1149 ECS, 10.1364 Optica, 10.3938 KPS)
│   ├── Try OA first via Unpaywall/OpenAlex          → §2
│   └── Last resort: browser fallback                → §3
├── OA-leaning publishers (10.1073 PNAS, 10.3762 Beilstein)
│   └── OpenAlex/Unpaywall usually works             → §2
└── Mixed list (typical case)
    └── Use the orchestrated CLI (handles all of the above) → §0
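
The prefix-based branching in the tree above can be sketched as a small lookup table. This is a minimal illustration, not the contents of auto_paper_download/publishers.py: the names `Route`, `PREFIX_ROUTES` and `route_doi` are hypothetical, only a subset of the prefixes is listed, and sending unknown prefixes to the OA path is an assumption.

```python
from typing import NamedTuple


class Route(NamedTuple):
    family: str
    path: str  # "tdm", "oa", or "oa_then_browser"


# Prefixes copied from the decision tree above (a subset, for illustration).
PREFIX_ROUTES = {
    "10.1016": Route("Elsevier", "tdm"),
    "10.1002": Route("Wiley", "tdm"),
    "10.1111": Route("Wiley", "tdm"),
    "10.1007": Route("Springer", "tdm"),
    "10.1021": Route("ACS", "oa_then_browser"),
    "10.1039": Route("RSC", "oa_then_browser"),
    "10.1109": Route("IEEE", "oa_then_browser"),
    "10.1073": Route("PNAS", "oa"),
    "10.3762": Route("Beilstein", "oa"),
}


def route_doi(doi: str) -> Route:
    """Return the route for a DOI; unknown prefixes default to the OA path (assumption)."""
    prefix = doi.strip().lower().split("/", 1)[0]
    return PREFIX_ROUTES.get(prefix, Route("unknown", "oa"))
```

For example, `route_doi("10.1021/acs.example")` returns the ACS entry, which means "try OA first, then the browser fallback".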

§0. Quick start (orchestrated CLI)

For a typical mixed list of DOIs from a Web of Science / Scopus export:

# Setup once
cp scripts/.env.example .env
# Edit .env to fill API keys (see §4 "Configuration")

# Run
python -m auto_paper_download \
    --savedrecs your_export.xls \
    --output-dir ./downloads/ \
    --delay 2.0

The CLI:

Parses DOIs from WoS savedrecs exports (or pass multiple --savedrecs)
Routes each DOI to the right client by prefix
Handles rate limiting and retries
Prints a per-publisher success summary at the end
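
The retry behaviour mentioned in the list above is not spelled out here, so the following is only a sketch of one common approach: retry with exponential backoff. The helper name `with_retries`, the attempt count and the base delay are assumptions, not the CLI's actual settings.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def with_retries(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call fetch() up to `attempts` times, doubling the pause after each failure."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                return None  # give up; the caller records the DOI as failed
            sleep(base_delay * (2 ** attempt))
    return None
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.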

For a resume-safe Elsevier bulk download (the most common large run, e.g. 5000+ Elsevier DOIs):

python scripts/redownload_elsevier.py \
    --excel papers.xlsx \
    --output-dir ./elsevier_xml/ \
    --resume \
    --long-pause-every 200 \
    --long-pause-sec 300
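
The source of redownload_elsevier.py is not shown here, so the following is only a guess at what flags like --resume, --long-pause-every and --long-pause-sec usually imply: skip files already on disk, and take a long break after every N new downloads. The function `resume_bulk`, the `.xml` file naming and the `download` callback are hypothetical.

```python
import time
from pathlib import Path
from typing import Callable, Iterable, List


def resume_bulk(
    dois: Iterable[str],
    output_dir: Path,
    download: Callable[[str, Path], None],
    long_pause_every: int = 200,
    long_pause_sec: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Download DOIs not yet on disk, pausing after every N new downloads."""
    output_dir.mkdir(parents=True, exist_ok=True)
    downloaded: List[str] = []
    for doi in dois:
        target = output_dir / (doi.replace("/", "_") + ".xml")
        if target.exists():
            continue  # resume: this DOI was finished in an earlier run
        download(doi, target)
        downloaded.append(doi)
        if len(downloaded) % long_pause_every == 0:
            sleep(long_pause_sec)
    return downloaded
```

Because completion is judged by the file on disk, an interrupted run can simply be restarted with the same arguments.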

§1. Publisher TDM APIs

Read references/tdm-apis.md for full per-publisher details.

Quick reference:

| Publisher | API | Auth env var | Output | Rate limit |
| --- | --- | --- | --- | --- |
| Elsevier | `api.elsevier.com/content/article/doi/{DOI}?view=FULL` | `ELSEVIER_API_KEY` + `ELSEVIER_INSTTOKEN` | XML (full-text) | ~5 req/sec |
| Wiley | `api.wiley.com/onlinelibrary/tdm/v1/articles/{DOI}` | `WILEY_TDM_TOKEN` | PDF | 3 req/sec hard cap |
| Springer (OA) | `api.springernature.com/openaccess/json` | `SPRINGER_API_KEY` | JSON+text | 1 req/sec free |
| Crossref TDM | URL from `link[]` field with `intended-application: text-mining` | `CR_CLICKTHROUGH_TOKEN` | varies | varies |

Critical: All TDM APIs require institutional IP allowlisting — must run from the institution's network or VPN. Test with one DOI before bulk runs.
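
The rate limits in the quick-reference table can be respected with a simple per-publisher throttle. The sketch below is illustrative only: `Throttle` and `MIN_INTERVAL` are hypothetical names, and the intervals are derived from the table's figures rather than taken from the clients' code.

```python
import time
from typing import Callable, Dict

# Minimum seconds between requests, derived from the rate limits in the table above.
MIN_INTERVAL: Dict[str, float] = {
    "elsevier": 1 / 5,  # ~5 req/sec
    "wiley": 1 / 3,     # 3 req/sec hard cap
    "springer": 1.0,    # 1 req/sec on the free tier
}


class Throttle:
    """Blocks just long enough to keep each publisher under its request rate."""

    def __init__(
        self,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._clock = clock
        self._sleep = sleep
        self._last: Dict[str, float] = {}

    def wait(self, publisher: str) -> float:
        """Sleep if needed before the next request; return the seconds slept."""
        now = self._clock()
        last = self._last.get(publisher)
        pause = 0.0
        if last is not None:
            pause = max(0.0, MIN_INTERVAL[publisher] - (now - last))
        if pause:
            self._sleep(pause)
        self._last[publisher] = now + pause
        return pause
```

Calling `throttle.wait("wiley")` before each Wiley request keeps the spacing at or above one third of a second.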

Instantiate clients directly:

from pathlib import Path

from auto_paper_download.clients import ElsevierClient, WileyClient

elsevier = ElsevierClient()  # reads env vars
xml_path = elsevier.download_structured_full_text(
    doi="10.1016/j.ces.2025.123003",
    article_dir=Path("downloads/10.1016_j.ces.2025.123003"),
)

wiley = WileyClient()
pdf_path = wiley.download_pdf(
    doi="10.1002/anie.202500001",
    article_dir=Path("downloads/10.1002_anie.202500001"),
)
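
The article_dir values in the example above appear to be the DOI with its slash replaced by an underscore. Assuming that convention holds, a helper such as the hypothetical `article_dir_for` below avoids hand-writing the paths:

```python
from pathlib import Path


def article_dir_for(doi: str, root: Path = Path("downloads")) -> Path:
    """Map a DOI to a per-article directory by replacing '/' with '_'."""
    return root / doi.strip().replace("/", "_")
```

With this, `article_dir_for("10.1002/anie.202500001")` yields the same directory used in the Wiley call above.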

§2. OA fallback (Unpaywall / OpenAlex / Crossref)

For papers that may have OA copies regardless of publisher.

from pathlib import Path

from auto_paper_download.clients import UnpaywallClient, OpenAlexClient, CrossrefClient

doi = "10.1002/anie.202500001"  # example DOI reused from §1

# Unpaywall: best OA PDF URL
up = UnpaywallClient()
pdf_path = up.download_pdf(doi=doi, article_dir=Path("downloads/.."))

# OpenAlex: alternative OA source
oa = OpenAlexClient()
pdf_path = oa.download_pdf(doi=doi, article_dir=Path("downloads/.."))

# Crossref: tries to find publisher PDF link
cr = CrossrefClient()
pdf_path = cr.download_pdf(doi=doi, article_dir=Path("downloads/.."))

Always validate downloaded PDFs: the first 4 bytes must be %PDF and the file must be larger than 50 KB. The clients in this skill do this automatically.
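
For files obtained outside the bundled clients, the same check can be reproduced by hand. This is a sketch of the rule as stated, not the clients' own code; `looks_like_pdf` is a hypothetical name, and reading "50KB" as 50 × 1024 bytes is an assumption.

```python
from pathlib import Path

MIN_PDF_BYTES = 50 * 1024  # the 50 KB floor stated above (assumed to mean KiB)


def looks_like_pdf(path: Path, min_bytes: int = MIN_PDF_BYTES) -> bool:
    """True if the file starts with the %PDF magic bytes and exceeds min_bytes."""
    if not path.is_file() or path.stat().st_size <= min_bytes:
        return False
    with path.open("rb") as fh:
        return fh.read(4) == b"%PDF"
```

The size floor catches the common failure where a login or error page is saved under a .pdf name.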

Expected hit rate for OA fallback: 40-60% on a generic chemistry/biology list. Recent papers (published after 2023) have higher OA rates.

§3. Browser fallback (paywalled, no TDM)

For publishers where no API is available but the user has institutional access (behind Cloudflare or SSO) through browser cookies. This is the slowest path; use it only after exhausting §1–§2.

Two routes — pick one

| | Route A: OpenClaw `browser` tool | Route B: `auto-paper-harvester` v0.2+ CLI |
| --- | --- | --- |
| What | Drive the user's running Chrome via the agent's `browser` capability with `profile="user"` | Standalone CLI with built-in Playwright `launch_persistent_context` |
| Setup | None; reuses whatever Chrome the user is logged into | `pip install 'auto-paper-download[browser]' && playwright install chromium` |
| Cookies | User's existing daily-driver Chrome cookies (zero re-login) | Dedicated isolated profile; user logs into SSO once on first run |
| Selectors | Per-publisher CSS in `references/browser-fallback.md` (ACS / Wiley / RSC / T&F / Nature / AIP / CN journals) | Per-family selectors baked into `browser_fallback.py` (14 publisher families) |
| Best for | Agent workflows where the user is actively at the keyboard, fewer DOIs (< 100), or one-off rescue runs | Unattended bulk runs (1000+ DOIs), CI/headless servers, when you don't want to lock the user's Chrome |
| Cost | Ties up user's Chrome for ~5 s/paper | Spawns its own Chromium; user's browser stays free |
| Surface area | Lives in this skill (`references/browser-fallback.md` + `browser` tool) | Lives in the `auto-paper-harvester` repo (separate install) |

Decision rule:

Default to Route A inside this skill (zero install; it leverages the session the user already has).
Recommend Route B when the run is large (> 500 DOIs), runs unattended, or the user's Chrome shouldn't be locked.

Both routes feed into the same downstream validation (PDF magic bytes, file size).
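
The decision rule can be written as a small helper. The function name and parameters below are hypothetical; the 500-DOI threshold is the one given in the rule.

```python
def choose_browser_route(
    n_dois: int,
    unattended: bool = False,
    keep_chrome_free: bool = False,
) -> str:
    """Return "A" (OpenClaw browser tool) or "B" (auto-paper-harvester CLI)."""
    if n_dois > 500 or unattended or keep_chrome_free:
        return "B"
    return "A"
```

A 40-DOI rescue run with the user at the keyboard resolves to Route A; a 2000-DOI overnight job resolves to Route B.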

Route A details — OpenClaw `browser` tool

Read references/browser-fallback.md before starting. It covers:

How to drive the user's logged-in Chrome via the OpenClaw browser tool with profile="user"
Per-publisher CSS selectors for ACS, Wiley, RSC, T&F, Springer, Nature, AIP, and 3 major Chinese journals
Cloudflare detection + retry strategy
Single-tab reuse pattern (don't open a new tab per DOI — leaks)
Kill-switch via /tmp/stop_scrape
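
The single-tab reuse and kill-switch ideas from the list above can be combined in one loop. This is a sketch, not the procedure in references/browser-fallback.md: `scrape_loop` and the `fetch_in_tab` callback (which would navigate the one reused tab and report success) are hypothetical.

```python
from pathlib import Path
from typing import Callable, Iterable, List

STOP_FILE = Path("/tmp/stop_scrape")


def scrape_loop(
    dois: Iterable[str],
    fetch_in_tab: Callable[[str], bool],
    stop_file: Path = STOP_FILE,
) -> List[str]:
    """Fetch DOIs one at a time in a single reused tab; stop if stop_file appears."""
    done: List[str] = []
    for doi in dois:
        if stop_file.exists():
            break  # kill-switch: creating the stop file halts before the next DOI
        if fetch_in_tab(doi):
            done.append(doi)
    return done
```

Running `touch /tmp/stop_scrape` in another terminal stops the loop at the next DOI boundary; remove the file before the next run.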

Route B details — auto-paper-harvester CLI

# One-time install (separate from this skill)
git clone https://github.com/jxtse/auto-paper-harvester.git
cd auto-paper-harvester
pip install -e '.[browser]' && playwright install chromium

# Run with same DOI file you'd otherwise feed to this skill
python -m auto_paper_download --savedrecs your_export.xls --use-browser-fallback

It routes every DOI through the same publisher TDM → OA → browser chain as this skill, with the browser pass running automatically against any DOI on which the API pipeline failed. See [its README](https://github.com/jxtse/auto-paper-harveste
