Semantic Scholar Search Workflow
Search academic papers via the Semantic Scholar API using a structured 4-phase workflow.
Critical rule: NEVER make multiple sequential Bash calls for API requests. Always write ONE Python script that runs all searches, then execute it once. All rate limiting is handled inside s2.py automatically.
Phase 1: Understand & Plan
Parse the user's intent and choose a search strategy:
Decision Tree
Default to search_bulk(). Per Semantic Scholar's own docs, bulk search is preferred over relevance search for most cases because relevance search is more resource-intensive. Use search_relevance() only when you need TLDR fields or author/citation details inline.
| User wants... | Strategy | Function |
|---|---|---|
| Broad topic exploration | Bulk search (preferred) | search_bulk() with build_bool_query() |
| Need TLDR / inline author details | Relevance search | search_relevance() |
| Precise technical terms, exact phrases | Bulk search with boolean operators | search_bulk() with build_bool_query() |
| Specific passages or methods | Snippet search | search_snippets() |
| Known paper by title | Title match | match_title() |
| Known paper by DOI/PMID/ArXiv | Direct lookup | get_paper() |
| Papers citing a known work | Citation traversal | get_citations() |
| Related to one paper | Single-seed recommendations | find_similar() |
| Related to multiple papers | Multi-seed recommendations | recommend() |
| Find a researcher | Author search | search_authors() |
| Researcher's profile | Author details | get_author() |
| Researcher's publications | Author papers | get_author_papers() |
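The "Direct lookup" row relies on prefixed identifiers such as DOI:..., ARXIV:... or PMID:..., as used in the worked examples. The sketch below shows one way a raw identifier might be normalized before a lookup. normalize_paper_id and its regular expressions are hypothetical and not part of s2.py; the patterns are rough heuristics, not a full validation of each ID scheme.

```python
import re

# Hypothetical helper (not part of s2.py): guess the ID prefix for a raw
# identifier so it can be passed to a direct lookup such as get_paper().
_PREFIXED_RE = re.compile(r"^(DOI|ARXIV|PMID|PMCID|CorpusId|URL):", re.IGNORECASE)
_DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")          # e.g. 10.1038/nature14539
_ARXIV_RE = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")  # e.g. 2010.11929
_PMID_RE = re.compile(r"^\d{1,8}$")                 # digits only


def normalize_paper_id(raw: str) -> str:
    """Return the identifier with a DOI:/ARXIV:/PMID: prefix when its shape is recognizable."""
    value = raw.strip()
    if _PREFIXED_RE.match(value):
        return value  # already prefixed, leave untouched
    if _DOI_RE.match(value):
        return f"DOI:{value}"
    if _ARXIV_RE.match(value):
        return f"ARXIV:{value}"
    if _PMID_RE.match(value):
        return f"PMID:{value}"
    # Unknown shape: probably a title, which belongs with match_title() instead.
    raise ValueError(f"Unrecognized identifier: {raw!r}")


print(normalize_paper_id("10.1038/nature14539"))
print(normalize_paper_id("2010.11929"))
```

Raising on unrecognized input, rather than guessing, keeps title strings from being sent to the direct-lookup path by mistake.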
Query Construction Rules
- Ambiguous terms (e.g., "stem cells" could mean mesenchymal or stem-like T cells): Use build_bool_query() with exact phrases and exclusions
  - Example: build_bool_query(phrases=["stem-like T cells"], required=["CD4", "TCF7"], excluded=["mesenchymal", "hematopoietic stem cell"])
- Multi-context queries (e.g., "topic X in cancer AND autoimmunity"): Plan separate searches, deduplicate with deduplicate()
- Broad topics: Use search_relevance() with filters (year, venue, fieldsOfStudy, minCitationCount)
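To make the boolean-query rule concrete, here is a minimal sketch of how phrases, required terms, and exclusions might be combined into one query string. It assumes the bulk-search syntax in which double quotes mark a phrase, + marks a required term, and - negates a term. sketch_bool_query is a hypothetical stand-in; the string that the real build_bool_query() emits may differ.

```python
def sketch_bool_query(phrases=(), required=(), excluded=()):
    """Hypothetical stand-in for build_bool_query(); the real output may differ."""

    def quote(term):
        # Multi-word terms are quoted so they are treated as a single phrase.
        return f'"{term}"' if " " in term else term

    parts = [f'"{p}"' for p in phrases]          # exact phrases
    parts += [f"+{quote(t)}" for t in required]  # terms that must appear
    parts += [f"-{quote(t)}" for t in excluded]  # terms that must not appear
    return " ".join(parts)


query = sketch_bool_query(
    phrases=["stem-like T cells"],
    required=["CD4", "TCF7"],
    excluded=["mesenchymal", "hematopoietic stem cell"],
)
print(query)
# "stem-like T cells" +CD4 +TCF7 -mesenchymal -"hematopoietic stem cell"
```

Quoting the multi-word exclusion matters: without the quotes, only the first word would be negated and the remaining words would become ordinary search terms.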
Plan Filters
| Filter | Use when |
|---|---|
| year="2020-" | Recent work only |
| publication_date="2024-01-01:2024-06-30" | Precise date range (YYYY-MM-DD) |
| fields_of_study="Medicine" | Restrict to domain |
| min_citations=10 | Only established papers |
| pub_types="Review" | Find reviews/meta-analyses |
| pub_types="ClinicalTrial" | Clinical trials only |
| open_access=True | Only open access papers |
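For orientation, the sketch below shows how these keyword filters could translate into raw query parameters of the paper search endpoints. The parameter names on the right-hand side (year, publicationDateOrYear, fieldsOfStudy, minCitationCount, publicationTypes, openAccessPdf) are those of the Semantic Scholar Graph API. The mapping itself and the function sketch_filter_params are assumptions, since s2.py's internals are not shown here.

```python
def sketch_filter_params(year=None, publication_date=None, fields_of_study=None,
                         min_citations=None, pub_types=None, open_access=False):
    """Assumed mapping from skill-style keyword filters to raw API query parameters."""
    params = {}
    if year:
        params["year"] = year                               # e.g. "2020-" or "2018-2022"
    if publication_date:
        params["publicationDateOrYear"] = publication_date  # e.g. "2024-01-01:2024-06-30"
    if fields_of_study:
        params["fieldsOfStudy"] = fields_of_study           # e.g. "Medicine"
    if min_citations is not None:
        params["minCitationCount"] = str(min_citations)
    if pub_types:
        params["publicationTypes"] = pub_types              # e.g. "Review"
    if open_access:
        params["openAccessPdf"] = ""                        # flag-style parameter, no value
    return params


print(sketch_filter_params(year="2020-", min_citations=10, pub_types="Review"))
```

Only the filters that are actually set end up in the request, so an unfiltered search sends no extra parameters.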
Checkpoint: Before proceeding, verify: (1) search strategy matches user intent, (2) filters are appropriate, (3) query is specific enough to avoid irrelevant results.
Phase 2: Execute Search
Write ONE Python script that begins with the standard prelude below, then runs all searches:
# --- Standard prelude (use in every script) ---
import sys, os, glob
_candidates = [
    os.path.expanduser("~/.claude/skills/semanticscholar-skill"),
    os.path.expanduser("~/.openclaw/skills/semanticscholar-skill"),
    *glob.glob(os.path.expanduser("~/.claude/plugins/**/semanticscholar-skill"), recursive=True),
    *glob.glob(os.path.expanduser("~/.codex/skills/semanticscholar-skill")),
    ".",
]
SKILL_DIR = next((p for p in _candidates if os.path.isfile(os.path.join(p, "s2.py"))), None)
if SKILL_DIR is None:
    raise RuntimeError("Cannot locate semanticscholar-skill (s2.py not found)")
sys.path.insert(0, SKILL_DIR)
from s2 import *
# --- end prelude ---
# Build precise query
q = build_bool_query(
    phrases=["stem-like T cells"],
    required=["CD4", "IBD"],
    excluded=["mesenchymal"]
)
papers = search_bulk(q, max_results=30, year="2018-", fields_of_study="Medicine")
papers = deduplicate(papers)
print(format_results(papers, "Stem-like CD4 T cells in IBD"))
Save to /tmp/s2_search.py, then run with python3 /tmp/s2_search.py in a single Bash call. Rate limiting, retries, and backoff are automatic inside s2.py.
No API key: The skill works without S2_API_KEY. When the key is absent or invalid, s2.py automatically switches to unauthenticated mode (no x-api-key header) and widens the request gap to 5 s. Per S2 docs, anonymous calls share a global 1000 req/s pool across all unauthenticated users and can be "further throttled during periods of heavy use", so the conservative 5 s gap guards against that throttling even though the steady-state pool is generous. If you still see sustained 429s, raise _MIN_GAP to 10 s. Keep max_results ≤ 30 per search and combine fewer searches per script. S2 recommends including an API key on every request; get one at https://www.semanticscholar.org/product/api#api-key-form.
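The request gap and 429 handling described above can be pictured with the following minimal sketch. It is not the code inside s2.py: MinGapThrottle, call_with_backoff, and their parameters are hypothetical names, and only the idea of a minimum gap (the role _MIN_GAP plays) plus exponential backoff on HTTP 429 is taken from the text.

```python
import time


class MinGapThrottle:
    """Enforce a minimum gap between calls (hypothetical; s2.py's internals may differ)."""

    def __init__(self, min_gap, clock=time.monotonic, sleep=time.sleep):
        self.min_gap = min_gap
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first call

    def wait(self):
        """Sleep just long enough that consecutive calls are min_gap seconds apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_gap - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def call_with_backoff(request, throttle, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call request(), which returns (status, body); retry on 429 with exponential backoff."""
    for attempt in range(max_retries + 1):
        throttle.wait()
        status, body = request()
        if status != 429:
            return status, body
        if attempt < max_retries:
            sleep(base_delay * 2 ** attempt)  # 1 s, 2 s, 4 s, ...
    return status, body  # still throttled after all retries


# Usage with a stand-in request function; a real script would issue an HTTP call here.
throttle = MinGapThrottle(min_gap=0.01)
print(call_with_backoff(lambda: (200, "ok"), throttle))
```

Injecting the clock and sleep functions is a design choice that makes the timing logic testable without real delays.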
Checkpoint: Verify the script ran successfully (no exceptions) and returned results. If 0 results, broaden the query or relax filters before presenting.
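When one script runs several searches, the same paper often appears in more than one result list. The sketch below illustrates the kind of merging that deduplicate() is used for in the script above. sketch_deduplicate and its title fallback are assumptions about reasonable behavior, not the actual implementation in s2.py, and the sample records are made up.

```python
def sketch_deduplicate(papers):
    """Drop repeated papers, keeping the first occurrence (hypothetical stand-in)."""
    seen = set()
    unique = []
    for paper in papers:
        # Prefer the stable paperId; fall back to a normalized title when it is missing.
        key = paper.get("paperId") or (paper.get("title") or "").strip().lower()
        if key and key in seen:
            continue
        if key:
            seen.add(key)
        unique.append(paper)  # records with no usable key are kept as-is
    return unique


# Made-up records standing in for two overlapping result lists.
cancer_hits = [{"paperId": "a1", "title": "Paper A"}, {"paperId": "b2", "title": "Paper B"}]
autoimmunity_hits = [{"paperId": "b2", "title": "Paper B"}, {"paperId": "c3", "title": "Paper C"}]
merged = sketch_deduplicate(cancer_hits + autoimmunity_hits)
print([p["paperId"] for p in merged])
# ['a1', 'b2', 'c3']
```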
Worked Examples
Each example below assumes the standard prelude from Phase 2 is at the top of the script.
Example 1: Author workflow — "Find papers by Yann LeCun on self-supervised learning"
authors = search_authors("Yann LeCun", max_results=5)
print(format_authors(authors))
# Use the first match's ID to get their papers
author_id = authors[0]["authorId"]
papers = get_author_papers(author_id, max_results=50)
# Filter locally for topic
ssl_papers = [p for p in papers if "self-supervised" in (p.get("title") or "").lower()]
print(format_results(ssl_papers, "Yann LeCun - Self-Supervised Learning"))
Example 2: Citation chain with intent — "Who cited the Transformer paper and how did they use it?"
paper = get_paper("DOI:10.48550/arXiv.1706.03762")
print(f"Title: {paper['title']}, Citations: {paper['citationCount']}")
# Citation envelopes carry contextsWithIntent — keep them, don't flatten.
citing = get_citations(paper["paperId"], max_results=50)
citing.sort(key=lambda c: (c.get("citingPaper") or {}).get("citationCount") or 0, reverse=True)
print(format_citations(citing, max_items=10)) # renders intent labels + context snippet
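As a complement to format_citations(), the sketch below tallies how often each intent label occurs across citation envelopes. It assumes each envelope carries a contextsWithIntent list whose items have context and intents keys, which matches the citations endpoint of the Graph API as far as this editor is aware. tally_intents is a hypothetical helper, and the sample envelopes and intent labels are invented for illustration.

```python
from collections import Counter


def tally_intents(citations):
    """Count intent labels across citation envelopes (hypothetical helper, not in s2.py)."""
    counts = Counter()
    for envelope in citations:
        # contextsWithIntent may be missing or None for some citations.
        for item in envelope.get("contextsWithIntent") or []:
            for intent in item.get("intents") or []:
                counts[intent] += 1
    return counts


# Invented sample data in the assumed envelope shape.
sample = [
    {"citingPaper": {"title": "Paper X"},
     "contextsWithIntent": [{"context": "We build on ...", "intents": ["methodology"]}]},
    {"citingPaper": {"title": "Paper Y"},
     "contextsWithIntent": [{"context": "Prior work ...", "intents": ["background", "methodology"]}]},
    {"citingPaper": {"title": "Paper Z"}, "contextsWithIntent": None},
]
print(tally_intents(sample).most_common())
```

A tally like this gives a quick answer to "how did they use it?" before reading individual context snippets.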
Example 3: Multi-seed recommendations with BibTeX export — "Find papers like these two but not about NLP"
recs = recommend(
    positive_ids=["DOI:10.1038/nature14539", "ARXIV:2010.11929"],
    negative_ids=["ARXIV:1706.03762"],
    limit=20
)
print(format_results(recs, "Vision papers like Deep Learning & ViT, excluding NLP"))
# Export BibTeX for top results
bib_data = batch_papers([r["paperId"] for r in recs[:10]], fields="title,citationStyles")
print(export_bibtex(bib_data))
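The export step depends on the citationStyles field, which the API returns as an object containing a bibtex string. A minimal sketch of how such records could be joined into one .bib text follows. sketch_export_bibtex is a hypothetical stand-in for export_bibtex(), whose real behavior may differ, and the sample records are invented.

```python
def sketch_export_bibtex(papers):
    """Join the BibTeX entries found in paper records (hypothetical stand-in)."""
    entries = []
    for paper in papers:
        if not paper:
            continue  # batch lookups can yield empty slots for unknown IDs
        bibtex = (paper.get("citationStyles") or {}).get("bibtex")
        if bibtex:
            entries.append(bibtex.strip())
    return "\n\n".join(entries)


# Invented records in the assumed response shape.
sample_records = [
    {"title": "Paper A", "citationStyles": {"bibtex": "@article{a2020,\n  title={Paper A}\n}\n"}},
    None,
    {"title": "Paper B", "citationStyles": None},
    {"title": "Paper C", "citationStyles": {"bibtex": "@article{c2021,\n  title={Paper C}\n}"}},
]
print(sketch_export_bibtex(sample_records))
```

Skipping records without a bibtex string, instead of failing, keeps one incomplete record from blocking the whole export.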
Phase 3: Summarize & Present
- Use format_results() for consistent output (summary table + top-10 details)
- If user's language is Chinese, present summaries in Chinese
- Always note total results count and search strategy used
- Highlight most relevant papers based on the user's specific question
Phase 4: User Interaction Loop
After presenting results, always offer these options:
- Translate: titles/summaries to Chinese (or other language)
- Details: full abstract for specific paper numbers
- Refine: narrow or expand search with different terms/filters
- Similar: find papers similar to a specific result (find_similar())
- Citations: who cited a specific paper and how (get_citations() + format_citations() for intent labels)
- Export: save results via export_bibtex(), export_markdown(), or export_json()
- Done: end search session
Loop until user says done. Each follow-up uses the same single-script pattern.
Additional Resources
- S2folks GitHub — Official Semantic Scholar code examples: https://github.com/allenai/s2-folks
- Postman Collection — No-code API testing: linked from https://www.semanticscholar.org/product/api/tutorial
- **API Documentati