paper-fetch

Fetch the PDF for a paper given a DOI (or title). Tries multiple sources in priority order and stops at the first hit.

Resolution order

1. Unpaywall: https://api.unpaywall.org/v2/{doi}?email=$UNPAYWALL_EMAIL, read best_oa_location.url_for_pdf (skipped if UNPAYWALL_EMAIL not set)
2. Semantic Scholar: https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}?fields=openAccessPdf,externalIds
3. arXiv: if externalIds.ArXiv present, https://arxiv.org/pdf/{arxiv_id}.pdf
4. PubMed Central OA: if PMCID present, https://www.ncbi.nlm.nih.gov/pmc/articles/{pmcid}/pdf/
5. bioRxiv / medRxiv: if DOI prefix is 10.1101, query https://api.biorxiv.org/details/{server}/{doi} for the latest version PDF URL
6. Publisher direct (institutional mode only, PAPER_FETCH_INSTITUTIONAL=1): DOI-prefix → publisher PDF template (Nature / Science / Wiley / Springer / ACS / PNAS / NEJM / Sage / T&F / Elsevier). The caller's own subscription IP / cookies / EZproxy are what authorize the fetch; unauthorized responses fail the %PDF check and fall through to step 7.
7. Sci-Hub mirrors (on by default; disable with PAPER_FETCH_NO_SCIHUB=1): last-resort fallback. Tries the mirror list in PAPER_FETCH_SCIHUB_MIRRORS (or built-in defaults sci-hub.ru, sci-hub.st, sci-hub.su, sci-hub.box, sci-hub.red, sci-hub.al, sci-hub.mk, sci-hub.ee) in order; on full miss, scrapes https://www.sci-hub.pub/ once per process for fresh mirrors. CAPTCHA / missing-paper pages have no PDF iframe and fall through silently.
8. Otherwise → report failure with title/authors so the user can request via ILL
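
The "priority order, stop at the first hit" behavior above can be sketched as a loop over source callables. This is a minimal illustration, not the code in scripts/fetch.py: the names Source and first_hit, the stub sources, and the returned dictionary shape are all assumptions made for the example.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical type: a source takes a DOI and returns raw bytes, or None on a miss.
Source = Callable[[str], Optional[bytes]]


def first_hit(doi: str, sources: Iterable[Tuple[str, Source]]) -> dict:
    """Try each (name, source) pair in order and stop at the first PDF payload."""
    tried = []
    for name, fetch in sources:
        tried.append(name)
        try:
            data = fetch(doi)
        except Exception:
            data = None  # a failing source falls through to the next one
        # Only payloads that start with the %PDF magic bytes count as a hit.
        if data and data.startswith(b"%PDF"):
            return {"success": True, "source": name, "sources_tried": tried}
    return {"success": False, "source": None, "sources_tried": tried}


# Stub sources standing in for real network calls (illustrative only).
demo_sources = [
    ("unpaywall", lambda doi: None),                                # miss
    ("semantic_scholar", lambda doi: b"<html>login page</html>"),   # fails %PDF check
    ("arxiv", lambda doi: b"%PDF-1.7 demo"),                        # hit
    ("pmc", lambda doi: b"%PDF-1.4 demo"),                          # never reached
]
result = first_hit("10.1234/example", demo_sources)
print(result["source"], result["sources_tried"])
```

Later sources are never called once an earlier one succeeds, which is why sources_tried in the envelope can be shorter than the full chain.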

CloakBrowser fallback (download layer, opt-in — PAPER_FETCH_CLOAK=1). This is not a separate source: it sits at the download chokepoint, so it applies to any of the sources above. When a resolved PDF URL is blocked by Cloudflare — HTTP 403/429, or a "Just a moment…" HTML interstitial served in place of the file — and the operator opted in, the URL is retried through CloakBrowser (a stealth Chromium that passes the JS challenge) via the cloak_pdf.py companion. Bytes it returns are re-validated through the same %PDF magic-byte + 50 MB checks; on success the result carries via: "cloak". Off by default, fails closed (missing CloakBrowser → silent fall-through), and the agent cannot opt in — see CloakBrowser access below.
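
The byte validation mentioned above (the %PDF magic bytes plus the 50 MB check) might look like the sketch below. It assumes that "50 MB" is an upper bound and that it means 50 * 1024 * 1024 bytes; the function name and the reason strings are illustrative and are not the tool's error codes.

```python
# Assumption: the 50 MB check is a maximum size, read here as 50 MiB.
MAX_PDF_BYTES = 50 * 1024 * 1024


def validate_pdf_bytes(data: bytes, max_bytes: int = MAX_PDF_BYTES):
    """Return (ok, reason) for a downloaded payload.

    Reason strings are hypothetical labels chosen for this example.
    """
    if not data.startswith(b"%PDF"):
        # HTML interstitials, login pages, and error bodies land here.
        return False, "not_a_pdf"
    if len(data) > max_bytes:
        return False, "too_large"
    return True, "ok"


print(validate_pdf_bytes(b"%PDF-1.7\n..."))
print(validate_pdf_bytes(b"<html>Just a moment...</html>"))
```

Running the same check on every download path means a payload is judged by its bytes, not by which layer fetched it.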

If only a title is given, pass it directly via --title "<title>". Resolution chain:

1. Crossref query.title: primary; covers all major journal/conference DOIs
2. Semantic Scholar /paper/search/match: fallback when Crossref's top match is low-confidence (match_score < 40) or the gap to the runner-up is < 3. Critically, S2 covers arXiv-only preprints (no Crossref DOI). When S2 surfaces a paper that has only an arXiv id, the canonical 10.48550/arXiv.<id> is synthesized so the download chain stays uniform.
3. Crossref's best guess (low-confidence): used only when both resolvers struggled. The result envelope sets meta.title_resolution.low_confidence: true plus a low_confidence_reason (score_below_threshold / ambiguous_runner_up) so an agent can either bail or confirm via --dry-run.

Either way the resolved DOI, the winning resolver, the full resolvers_tried list, and the top candidate matches are all surfaced under meta.title_resolution.
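
The two low-confidence conditions in the chain above (top score below 40, or a gap to the runner-up below 3) can be expressed as a small classifier. This is a sketch under assumptions: the function name is hypothetical, the "no_candidates" label is invented for the empty case, and checking the score threshold before the gap is a guess about precedence.

```python
from typing import List, Optional


def low_confidence_reason(
    scores: List[float], min_score: float = 40.0, min_gap: float = 3.0
) -> Optional[str]:
    """Classify Crossref candidate scores, sorted from best to worst.

    Returns None when the top match looks trustworthy, otherwise a reason.
    """
    if not scores:
        return "no_candidates"  # hypothetical label, not from the schema
    top = scores[0]
    if top < min_score:
        return "score_below_threshold"
    if len(scores) > 1 and top - scores[1] < min_gap:
        return "ambiguous_runner_up"
    return None


for candidate_scores in ([82.1, 40.0], [35.0, 20.0], [55.0, 53.5]):
    print(candidate_scores, low_confidence_reason(candidate_scores))
```

A non-None result is the point at which the chain would hand off to the Semantic Scholar fallback.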

If semanticscholar-skill is registered, it can serve as a richer pre-step for title → DOI resolution — useful when you also need relevance ranking, snippet search, or citation context, not just a DOI. The agent writes a Python script using the skill's match_title() to read externalIds.DOI, then runs paper-fetch <doi>. When the result has only an ArXiv id (no DOI), synthesize 10.48550/arXiv.<ArXiv> and pass that to paper-fetch.
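
The DOI-or-arXiv rule above can be sketched as a helper that takes an externalIds mapping. How that mapping is obtained from match_title() is not shown here, because the skill's return shape is not documented in this section; the helper name is hypothetical.

```python
from typing import Mapping, Optional


def doi_from_external_ids(external_ids: Mapping[str, str]) -> Optional[str]:
    """Prefer a real DOI; otherwise synthesize the arXiv DOI; else None."""
    doi = external_ids.get("DOI")
    if doi:
        return doi
    arxiv_id = external_ids.get("ArXiv")
    if arxiv_id:
        # Canonical arXiv DOI form described above: 10.48550/arXiv.<id>
        return f"10.48550/arXiv.{arxiv_id}"
    return None


print(doi_from_external_ids({"DOI": "10.1038/s41586-021-03819-2"}))
print(doi_from_external_ids({"ArXiv": "1706.03762"}))
```

The returned string can then be passed as the positional DOI argument to scripts/fetch.py.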

When only the DOI is needed, --title is the single-command path — paper-fetch's built-in Crossref → S2 chain handles most cases.

Usage

python scripts/fetch.py <DOI> [options]
python scripts/fetch.py --title "<paper title>" [options]
python scripts/fetch.py --batch <FILE|-> [options]
python scripts/fetch.py schema           # machine-readable self-description

Flags

The flags below are the ones an agent composes in normal use. For the complete contract — including --dry-run, --pretty, --stream, --overwrite, --timeout, --version, plus parameter types and exit-code mappings — run python scripts/fetch.py schema (machine-readable, drift-checked via schema_version).

Flag	Default	Description
`doi`	—	DOI to fetch (positional). Use `-` to read a single DOI from stdin
`--title TITLE`	—	Paper title; resolved to a DOI via Crossref, with Semantic Scholar as a fallback, before download. Mutually exclusive with positional DOI / `--batch`
`--batch FILE`	—	File with one DOI per line for bulk download. Use `-` to read from stdin
`--out DIR`	`pdfs`	Output directory
`--format`	auto	`json` for agents, `text` for humans. Auto-detects: `json` when stdout is not a TTY, `text` when it is
`--idempotency-key KEY`	—	Safe-retry key. Re-running with the same key replays the original envelope from `<out>/.paper-fetch-idem/` without network I/O
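
As a sketch of how an agent might compose these flags, the helper below builds an argument list from the table above. It only constructs the list and does not run anything. The helper name is hypothetical, and treating positional DOI and --batch as mutually exclusive with each other is an assumption; the table only states the exclusion for --title.

```python
import sys
from typing import List, Optional


def build_argv(
    doi: Optional[str] = None,
    title: Optional[str] = None,
    batch: Optional[str] = None,
    out: str = "pdfs",
    idempotency_key: Optional[str] = None,
) -> List[str]:
    """Compose a fetch.py command line using only flags from the table."""
    if sum(x is not None for x in (doi, title, batch)) != 1:
        raise ValueError("provide exactly one of doi, title, batch")
    argv = [sys.executable, "scripts/fetch.py"]
    if doi is not None:
        argv.append(doi)
    elif title is not None:
        argv += ["--title", title]
    else:
        argv += ["--batch", batch]
    # Request JSON explicitly so the output does not depend on TTY detection.
    argv += ["--out", out, "--format", "json"]
    if idempotency_key:
        argv += ["--idempotency-key", idempotency_key]
    return argv


print(build_argv(doi="10.1038/s41586-021-03819-2", idempotency_key="run-001")[1:])
```

Passing the result to subprocess.run as a list avoids shell quoting problems with titles that contain spaces or quotes.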

Agent discovery: `schema` subcommand

python scripts/fetch.py schema

Emits a complete machine-readable description of the CLI on stdout (no network). Includes cli_version, schema_version, parameter types, exit codes, error codes, envelope shapes, and environment variables. Agents should read this once, cache it against schema_version, and re-read when the cached version drifts.
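
The cache-and-refresh advice above could be implemented along these lines. The sketch assumes the schema output exposes schema_version as a top-level key, which this section does not confirm; the function name is hypothetical. The envelope's meta.schema_version field is taken from the output contract.

```python
from typing import Optional


def schema_is_stale(cached_schema: Optional[dict], envelope: dict) -> bool:
    """True when the cached schema should be re-read.

    Assumption: the cached schema carries a top-level "schema_version".
    """
    if cached_schema is None:
        return True  # nothing cached yet
    seen = envelope.get("meta", {}).get("schema_version")
    # Without a version in the envelope there is nothing to compare against.
    return seen is not None and seen != cached_schema.get("schema_version")


cached = {"schema_version": "1.9.0"}
print(schema_is_stale(cached, {"meta": {"schema_version": "1.9.0"}}))
print(schema_is_stale(cached, {"meta": {"schema_version": "1.10.0"}}))
```

Checking each envelope this way lets an agent notice drift without calling the schema subcommand before every fetch.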

Output contract

stdout emits a single JSON envelope. Every envelope carries a meta slot.

Success (all DOIs resolved):

{
  "ok": true,
  "data": {
    "results": [
      {
        "doi": "10.1038/s41586-021-03819-2",
        "success": true,
        "source": "unpaywall",
        "pdf_url": "https://www.nature.com/articles/s41586-021-03819-2.pdf",
        "file": "pdfs/Jumper_2021_Highly_accurate_protein_structure_predic.pdf",
        "meta": {"title": "Highly accurate protein structure prediction with AlphaFold", "year": 2021, "author": "Jumper"},
        "sources_tried": ["unpaywall"]
      }
    ],
    "summary": {"total": 1, "succeeded": 1, "failed": 0},
    "next": []
  },
  "meta": {
    "request_id": "req_a908f5156fc1",
    "latency_ms": 2036,
    "schema_version": "1.9.0",
    "cli_version": "0.13.1",
    "sources_tried": ["unpaywall"]
  }
}

Partial (batch mode — some DOIs failed, exit code reflects the failure class):

{
  "ok": "partial",
  "data": {
    "results": [
      { "doi": "10.1038/s41586-021-03819-2", "success": true, "source": "unpaywall", ... },
      {
        "doi": "10.1234/nonexistent",
        "success": false,
        "source": null,
        "pdf_url": null,
        "file": null,
        "meta": {},
        "sources_tried": ["unpaywall", "semantic_scholar"],
        "error": {
          "code": "not_found",
          "message": "No open-access PDF found",
          "retryable": true,
          "retry_after_hours": 168,
          "reason": "OA availability changes over time; retry after embargo lifts or preprint appears"
        }
      }
    ],
    "summary": {"total": 2, "succeeded": 1, "failed": 1},
    "next": ["paper-fetch 10.1234/nonexistent --out pdfs"]
  },
  "meta": { ... }
}

The next slot is an array of suggested follow-up commands; re-invoking them retries only the failed subset. Combine with --idempotency-key to make the whole batch safely retryable without re-downloading the items that already succeeded.
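
A consumer of the partial envelope might pull out the failed items and the suggested follow-ups as sketched below. The field names come from the example above; the helper names and the idea of waiting for the longest retry_after_hours are assumptions made for illustration.

```python
import json


def failed_items(envelope: dict):
    """Yield (doi, error) for every per-item failure in a result envelope."""
    for item in envelope.get("data", {}).get("results", []):
        if not item.get("success"):
            yield item.get("doi"), item.get("error", {})


def retry_plan(envelope: dict) -> dict:
    """Collect the suggested commands and the longest advertised retry delay."""
    delays = [
        err["retry_after_hours"]
        for _, err in failed_items(envelope)
        if err.get("retryable") and "retry_after_hours" in err
    ]
    return {
        "commands": envelope.get("data", {}).get("next", []),
        "wait_hours": max(delays) if delays else None,
    }


# A trimmed copy of the partial envelope shown above.
raw = json.dumps({
    "ok": "partial",
    "data": {
        "results": [
            {"doi": "10.1038/s41586-021-03819-2", "success": True},
            {"doi": "10.1234/nonexistent", "success": False,
             "error": {"code": "not_found", "retryable": True,
                       "retry_after_hours": 168}},
        ],
        "next": ["paper-fetch 10.1234/nonexistent --out pdfs"],
    },
})
envelope = json.loads(raw)
plan = retry_plan(envelope)
print(plan)
```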

Failure (bad arguments, exit code 3):

{
  "ok": false,
  "error": {
    "code": "validation_error",
    "message": "Provide a DOI or --batch file",
    "retryable": false
  },
  "meta": { ... }
}
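
Because ok is true, false, or the string "partial", a plain truthiness test would treat a partial result as a success. The sketch below branches on the three shapes explicitly; the function names and the returned labels are hypothetical.

```python
def envelope_status(envelope: dict) -> str:
    """Map the envelope's ok field to one of three labels."""
    ok = envelope.get("ok")
    if ok is True:
        return "success"
    if ok == "partial":
        return "partial"  # note: "partial" is truthy, so `if ok:` would misread it
    return "failure"


def top_level_retryable(envelope: dict) -> bool:
    """True only for a failure envelope whose error is marked retryable."""
    if envelope_status(envelope) != "failure":
        return False
    return bool(envelope.get("error", {}).get("retryable"))


bad_args = {
    "ok": False,
    "error": {"code": "validation_error",
              "message": "Provide a DOI or --batch file",
              "retryable": False},
}
print(envelope_status(bad_args), top_level_retryable(bad_args))
print(envelope_status({"ok": "partial", "data": {}}))
```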

Per-item skipped (destination al
