Paper Full-text Harvest
Pipeline for downloading academic paper full-text at scale. Handles the three classes of sources that exist in 2026:
- Publisher TDM APIs (Elsevier / Wiley / Springer) — for paywalled content where the institution has a subscription
- OA aggregators (Unpaywall / OpenAlex / Crossref) — for Open Access copies regardless of publisher
- Browser fallback (logged-in user profile) — for paywalled publishers without a TDM API (ACS / RSC / IEEE / AIP / IOP / APS / T&F / many CN journals)
The publisher router (auto_paper_download/publishers.py) recognises 25 DOI prefixes across 19 families, each annotated with the right downstream path (TDM client / OA aggregator / browser fallback) and a support tier. The router is shared with the standalone auto-paper-harvester CLI; see SUPPORTED_PUBLISHERS.md there for the full per-publisher table.
Decision tree
Have a DOI list?
├── DOIs from Elsevier (10.1016, 10.1006, 10.1011)
│ └── Use ElsevierClient (TDM XML API) → §1
├── DOIs from Wiley (10.1002, 10.1111)
│ └── Use WileyClient (TDM PDF API) → §1
├── DOIs from Springer/Nature (10.1007, 10.1038, 10.1186, 10.1147)
│ ├── OA papers → SpringerClient OA API → §1
│ └── Subscription papers → fall through to OA/browser
├── Browser-only publishers without TDM API
│ (10.1021 ACS, 10.1039 RSC, 10.1126 Science, 10.1109 IEEE,
│ 10.1063 AIP, 10.1088/10.1143 IOP, 10.1103 APS, 10.1146 Annual Reviews,
│ 10.1080 T&F, 10.1116 AVS, 10.1149 ECS, 10.1364 Optica, 10.3938 KPS)
│ ├── Try OA first via Unpaywall/OpenAlex → §2
│ └── Last resort: browser fallback → §3
├── OA-leaning publishers (10.1073 PNAS, 10.3762 Beilstein)
│ └── OpenAlex/Unpaywall usually works → §2
└── Mixed list (typical case)
└── Use the orchestrated CLI (handles all of the above) → §0
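The decision tree above boils down to a lookup on the DOI prefix. The following is a minimal sketch of that idea, not the actual contents of publishers.py: the `Route` type, the `PREFIX_ROUTES` table, the path labels, and the `route_doi` function are hypothetical names, and the table holds only a handful of the prefixes listed in the tree.

```python
from typing import NamedTuple


class Route(NamedTuple):
    family: str
    path: str  # "tdm", "oa", or "oa_then_browser" (labels are assumptions)


# Hypothetical subset of the routing table, copied from the decision tree.
PREFIX_ROUTES = {
    "10.1016": Route("Elsevier", "tdm"),
    "10.1002": Route("Wiley", "tdm"),
    "10.1111": Route("Wiley", "tdm"),
    "10.1021": Route("ACS", "oa_then_browser"),
    "10.1039": Route("RSC", "oa_then_browser"),
    "10.1073": Route("PNAS", "oa"),
}


def route_doi(doi: str) -> Route:
    """Return the route for a DOI based on the part before the first slash."""
    prefix = doi.strip().lower().split("/", 1)[0]
    # Assumption: unknown prefixes are sent to the OA aggregators.
    return PREFIX_ROUTES.get(prefix, Route("unknown", "oa"))
```

Sending unknown prefixes to the OA aggregators is a choice made for this sketch; the real router may handle them differently.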
§0. Quick start (orchestrated CLI)
For a typical mixed list of DOIs from Web of Science / Scopus export:
# Setup once
cp scripts/.env.example .env
# Edit .env to fill API keys (see §4 "Configuration")
# Run
python -m auto_paper_download \
--savedrecs your_export.xls \
--output-dir ./downloads/ \
--delay 2.0
The CLI:
- Parses DOIs from WoS savedrecs (or pass multiple --savedrecs)
- Routes each DOI to the right client by prefix
- Handles rate limiting + retries
- Prints a per-publisher success summary at the end
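To illustrate what "rate limiting + retries" can look like, here is a small sketch of a retry wrapper with exponential backoff. It is not the CLI's actual retry policy: `call_with_retries` and its parameters are hypothetical, and using the `--delay` value as the base delay is an assumption made for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(); on failure wait delay, 2*delay, 4*delay, ... and try again."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the backoff schedule can be checked without actually waiting.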
For a resume-safe Elsevier bulk run (the most common large run, e.g. 5000+ Elsevier DOIs):
python scripts/redownload_elsevier.py \
--excel papers.xlsx \
--output-dir ./elsevier_xml/ \
--resume \
--long-pause-every 200 \
--long-pause-sec 300
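The flags above suggest two behaviours: skip DOIs that already have output, and take a long pause after every N downloads. The sketch below shows one way such a loop could work; it is not the code of redownload_elsevier.py. The `run_resumable` function, the `fetch` callback, and the file-naming scheme (slashes replaced by underscores, `.xml` suffix) are assumptions for illustration.

```python
import time
from pathlib import Path
from typing import Callable, Iterable


def run_resumable(
    dois: Iterable[str],
    output_dir: Path,
    fetch: Callable[[str, Path], None],
    long_pause_every: int = 200,
    long_pause_sec: float = 300,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Download missing DOIs only; return how many were fetched in this run."""
    fetched = 0
    for doi in dois:
        # Assumed naming scheme: 10.1016/j.x -> 10.1016_j.x.xml
        target = output_dir / f"{doi.replace('/', '_')}.xml"
        if target.exists():
            continue  # resume: already downloaded in an earlier run
        fetch(doi, target)
        fetched += 1
        if fetched % long_pause_every == 0:
            sleep(long_pause_sec)  # long pause after every N new downloads
    return fetched
```

Counting only new downloads toward the pause means a resumed run does not sleep while skipping files it already has.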
§1. Publisher TDM APIs
Read references/tdm-apis.md for full per-publisher details.
Quick reference:
| Publisher | API | Auth env var | Output | Rate limit |
|---|---|---|---|---|
| Elsevier | api.elsevier.com/content/article/doi/{DOI}?view=FULL | ELSEVIER_API_KEY + ELSEVIER_INSTTOKEN | XML (full-text) | ~5 req/sec |
| Wiley | api.wiley.com/onlinelibrary/tdm/v1/articles/{DOI} | WILEY_TDM_TOKEN | PDF | 3 req/sec hard cap |
| Springer (OA) | api.springernature.com/openaccess/json | SPRINGER_API_KEY | JSON+text | 1 req/sec free |
| Crossref TDM | URL from link[] field with intended-application: text-mining | CR_CLICKTHROUGH_TOKEN | varies | varies |
Critical: All TDM APIs require institutional IP allowlisting — must run from the institution's network or VPN. Test with one DOI before bulk runs.
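The rate limits in the table can be enforced client-side with a minimum interval between requests. This is a sketch under that assumption, not the throttling used by the clients in this skill; the `Throttle` class and the `RATE_LIMITS` mapping are hypothetical, with the numbers copied from the table.

```python
import time
from typing import Callable, Optional

# Requests per second, taken from the quick-reference table.
RATE_LIMITS = {"elsevier": 5.0, "wiley": 3.0, "springer": 1.0}


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(
        self,
        requests_per_sec: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 1.0 / requests_per_sec
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time the last request was allowed

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        pause = 0.0
        if self._last is not None:
            pause = max(0.0, self.min_interval - (now - self._last))
            if pause > 0:
                self._sleep(pause)
        self._last = now + pause
        return pause
```

A typical use would be to create one `Throttle(RATE_LIMITS["wiley"])` per publisher and call `wait()` before each request.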
Instantiate clients directly:
from pathlib import Path

from auto_paper_download.clients import ElsevierClient, WileyClient

elsevier = ElsevierClient()  # reads env vars
xml_path = elsevier.download_structured_full_text(
    doi="10.1016/j.ces.2025.123003",
    article_dir=Path("downloads/10.1016_j.ces.2025.123003"),
)

wiley = WileyClient()  # reads env vars
pdf_path = wiley.download_pdf(
    doi="10.1002/anie.202500001",
    article_dir=Path("downloads/10.1002_anie.202500001"),
)
§2. OA fallback (Unpaywall / OpenAlex / Crossref)
For papers that may have OA copies regardless of publisher.
from pathlib import Path

from auto_paper_download.clients import UnpaywallClient, OpenAlexClient, CrossrefClient

doi = "10.1002/anie.202500001"  # any DOI string; this one is the Wiley example from §1

# Unpaywall: best OA PDF URL
up = UnpaywallClient()
pdf_path = up.download_pdf(doi=doi, article_dir=Path("downloads/.."))

# OpenAlex: alternative OA source
oa = OpenAlexClient()
pdf_path = oa.download_pdf(doi=doi, article_dir=Path("downloads/.."))

# Crossref: tries to find publisher PDF link
cr = CrossrefClient()
pdf_path = cr.download_pdf(doi=doi, article_dir=Path("downloads/.."))
Always validate downloaded PDFs: First 4 bytes must be %PDF and file size > 50KB. The clients in this skill do this automatically.
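For files obtained outside these clients, the same two checks can be applied by hand. The function below is a minimal sketch of them; `looks_like_pdf` and `MIN_PDF_BYTES` are illustrative names, and reading "50KB" as 50 × 1024 bytes is an assumption.

```python
from pathlib import Path

MIN_PDF_BYTES = 50 * 1024  # assumption: "50KB" read as 50 * 1024 bytes


def looks_like_pdf(path: Path, min_bytes: int = MIN_PDF_BYTES) -> bool:
    """True if the file is larger than min_bytes and starts with b'%PDF'."""
    if not path.is_file() or path.stat().st_size <= min_bytes:
        return False
    with path.open("rb") as fh:
        return fh.read(4) == b"%PDF"
```

The size check catches truncated downloads, and the magic-byte check catches HTML login or error pages saved with a .pdf extension.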
Expected hit rate for OA fallback: 40-60% on a generic chemistry/biology list. Papers published after 2023 tend to have higher OA rates.
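Since no single aggregator finds every OA copy, a common pattern is to try the sources in order and stop at the first hit. The sketch below shows that pattern with plain callables. `first_successful` is a hypothetical helper, and it assumes each downloader either returns a path, returns None, or raises on failure, which may not match how the real clients report a miss.

```python
from pathlib import Path
from typing import Callable, Iterable, Optional, Tuple

# A downloader takes (doi, article_dir) and returns the saved path or None.
Downloader = Callable[[str, Path], Optional[Path]]


def first_successful(
    doi: str,
    article_dir: Path,
    sources: Iterable[Tuple[str, Downloader]],
) -> Optional[Tuple[str, Path]]:
    """Try each source in order; return (source_name, path) for the first hit."""
    for name, download in sources:
        try:
            path = download(doi, article_dir)
        except Exception:
            continue  # assumption: any error means "no copy here", try the next
        if path is not None:
            return name, path
    return None
```

With the clients from this section, the `sources` list could wrap each client's `download_pdf` in a small lambda, for example `("unpaywall", lambda doi, d: up.download_pdf(doi=doi, article_dir=d))`.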
§3. Browser fallback (paywalled, no TDM)
For publishers where no API is available but the user has institutional access (Cloudflare/SSO) through browser cookies. This is the slowest path; use it only after exhausting §1–§2.
Two routes — pick one
| | Route A: OpenClaw browser tool | Route B: auto-paper-harvester v0.2+ CLI |
|---|---|---|
| What | Drive the user's running Chrome via the agent's browser capability with profile="user" | Standalone CLI with built-in Playwright launch_persistent_context |
| Setup | None — reuses whatever Chrome the user is logged into | pip install 'auto-paper-download[browser]' && playwright install chromium |
| Cookies | User's existing daily-driver Chrome cookies (zero re-login) | Dedicated isolated profile; user logs into SSO once on first run |
| Selectors | Per-publisher CSS in references/browser-fallback.md (ACS / Wiley / RSC / T&F / Nature / AIP / CN journals) | Per-family selectors baked into browser_fallback.py (14 publisher families) |
| Best for | Agent workflows where the user is actively at the keyboard, fewer DOIs (< 100), or one-off rescue runs | Unattended bulk runs (1000+ DOIs), CI/headless servers, when you don't want to lock the user's Chrome |
| Cost | Ties up user's Chrome for ~5 s/paper | Spawns its own Chromium; user's browser stays free |
| Surface area | Lives in this skill (references/browser-fallback.md + browser tool) | Lives in the auto-paper-harvester repo (separate install) |
Decision rule:
- Default to Route A inside this skill (zero install, leverages session the user already has).
- Recommend Route B when the run is large (> 500 DOIs), runs unattended, or the user's Chrome shouldn't be locked.
Both routes feed into the same downstream validation (PDF magic bytes, file size).
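The decision rule can be written as a small function. This is only a restatement of the two bullets above; `choose_browser_route` and its parameter names are made up for illustration.

```python
def choose_browser_route(
    n_dois: int,
    unattended: bool = False,
    keep_chrome_free: bool = False,
) -> str:
    """Return "A" (OpenClaw browser tool) or "B" (auto-paper-harvester CLI)."""
    if n_dois > 500 or unattended or keep_chrome_free:
        return "B"
    return "A"  # default: zero install, reuses the user's existing session
```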
Route A details — OpenClaw browser tool
Read references/browser-fallback.md before starting. It covers:
- How to drive the user's logged-in Chrome via the OpenClaw browser tool with profile="user"
- Per-publisher CSS selectors for ACS, Wiley, RSC, T&F, Springer, Nature, AIP, and 3 major Chinese journals
- Cloudflare detection + retry strategy
- Single-tab reuse pattern (don't open a new tab per DOI — leaks)
- Kill-switch via /tmp/stop_scrape
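One plausible reading of the kill-switch is a sentinel file: the scrape loop checks for /tmp/stop_scrape before each DOI and stops as soon as the file exists. The sketch below assumes that behaviour; references/browser-fallback.md is the authority on how it actually works. `scrape_until_stopped` and `fetch_one` are hypothetical names, and the browser interaction itself is left to the callback.

```python
from pathlib import Path
from typing import Callable, Iterable, List

STOP_FILE = Path("/tmp/stop_scrape")


def scrape_until_stopped(
    dois: Iterable[str],
    fetch_one: Callable[[str], None],
    stop_file: Path = STOP_FILE,
) -> List[str]:
    """Process DOIs one at a time; stop as soon as the stop file appears."""
    processed: List[str] = []
    for doi in dois:
        if stop_file.exists():
            break  # assumed kill-switch: the user ran `touch /tmp/stop_scrape`
        fetch_one(doi)  # e.g. navigate the single reused tab and save the PDF
        processed.append(doi)
    return processed
```

Checking before each DOI rather than mid-download keeps the stop clean: the paper in progress finishes and no partial file is left behind.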
Route B details — auto-paper-harvester CLI
# One-time install (separate from this skill)
git clone https://github.com/jxtse/auto-paper-harvester.git
cd auto-paper-harvester
pip install -e '.[browser]' && playwright install chromium
# Run with same DOI file you'd otherwise feed to this skill
python -m auto_paper_download --savedrecs your_export.xls --use-browser-fallback
It routes every DOI through the same publisher TDM → OA → browser chain as this skill, with the browser pass running automatically against any DOI the API pipeline failed. See [its README](https://github.com/jxtse/auto-paper-harveste