Paper2Excel
Summarize every PDF paper in one folder into a fixed schema, then export one single-sheet .xlsx file. Keep summaries short, comparable, and strictly based on information inside the PDF.
Workflow
- Confirm the input is one folder containing text-based .pdf files.
- Run python3 scripts/paper2excel.py check-deps before the first use.
- If dependencies are missing, run python3 scripts/paper2excel.py install-deps --target /tmp/paper2excel_deps, then invoke later commands with PYTHONPATH=/tmp/paper2excel_deps.
- Run python3 scripts/paper2excel.py extract <folder> --output <extracted.json> to collect file names and extracted text.
- Read the generated JSON and summarize papers one by one with the schema in this skill.
- Save the structured rows as JSON.
- Run python3 scripts/paper2excel.py write-xlsx <rows.json> --output <paper_summaries.xlsx> to generate the workbook.
Process only the current folder by default. Do not recurse into subdirectories unless the user explicitly asks for it.
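The non-recursive default can be illustrated with a short sketch. This is not the code inside scripts/paper2excel.py; the function name list_pdfs and its recursive flag are hypothetical and only show the intended behavior.

```python
from pathlib import Path


def list_pdfs(folder, recursive=False):
    """Return the PDF paths found in `folder`, sorted by path.

    Hypothetical helper: subdirectories are skipped unless `recursive` is True.
    """
    root = Path(folder)
    # glob() stays in the top-level folder; rglob() also descends into subfolders.
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(
        p for p in candidates
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Calling list_pdfs("/path/to/papers") would then return only the PDFs directly inside that folder, while recursive=True corresponds to the case where the user explicitly asks to include subdirectories.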
Output Schema
Create one row per paper with exactly these fields:
- title
- publish+time
- keywords
- 研究现状
- motivation
- insight
- method
- 实验结论
- limitation
- other
You may also keep source_file in the intermediate JSON for traceability, but the final Excel should prioritize the fields above unless the user asks for extra columns.
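A minimal sketch of how one intermediate row could be built so that every schema field is always present. It assumes the rows file is a JSON list of objects keyed by field name, which this document does not confirm; make_row and FIELDS are hypothetical names.

```python
import json

# Column order for the final sheet, copied from the schema above.
FIELDS = [
    "title", "publish+time", "keywords", "研究现状", "motivation",
    "insight", "method", "实验结论", "limitation", "other",
]


def make_row(values, source_file=""):
    """Build one row with every schema field present; missing ones stay empty."""
    unknown = set(values) - set(FIELDS)
    if unknown:
        raise KeyError(f"unexpected fields: {sorted(unknown)}")
    row = {"source_file": source_file}  # kept only for traceability
    row.update({name: values.get(name, "") for name in FIELDS})
    return row


if __name__ == "__main__":
    rows = [make_row({"title": "Example Paper Title"}, source_file="example.pdf")]
    # ensure_ascii=False keeps the Chinese field names and values readable.
    print(json.dumps(rows, ensure_ascii=False, indent=2))
```

Starting from empty strings matches the rule of preferring empty values over guesses when the PDF does not state something.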
Field Rules
- title: Use the paper title in English from the PDF.
- publish+time: Use only information stated in the PDF. Prefer Venue Year, for example AAAI 2024. If only the venue is known, write only the venue. If only the year is known, write only the year. If neither is known, leave it empty.
- keywords: Write exactly 3 English keywords or short phrases, separated by semicolons.
- 研究现状 (state of the research): About 30 Chinese characters.
- motivation: About 30 Chinese characters.
- insight: About 30 Chinese characters.
- method: About 100 Chinese characters.
- 实验结论 (experimental conclusions): About 40 Chinese characters.
- limitation: About 30 Chinese characters.
- other: About 40 Chinese characters. Use it for one other interesting point that does not fit naturally into the other fields.
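The field rules lend themselves to a quick sanity check before writing the workbook. The sketch below is illustrative only: check_row, LENGTH_TARGETS, and the 50% tolerance used to interpret "about N characters" are assumptions, not part of the skill's scripts.

```python
# Approximate length targets, taken from the field rules above.
LENGTH_TARGETS = {
    "研究现状": 30, "motivation": 30, "insight": 30, "method": 100,
    "实验结论": 40, "limitation": 30, "other": 40,
}


def check_row(row, tolerance=0.5):
    """Return a list of warnings for one row; an empty list means nothing was flagged."""
    warnings = []
    raw_keywords = row.get("keywords", "")
    keywords = [k.strip() for k in raw_keywords.split(";") if k.strip()]
    if raw_keywords and len(keywords) != 3:
        warnings.append(f"keywords: expected 3, found {len(keywords)}")
    for name, target in LENGTH_TARGETS.items():
        text = row.get(name, "")
        if not text:
            continue  # empty is allowed when the PDF lacks the information
        # "About N characters" is read as N plus or minus tolerance * N (assumed).
        if abs(len(text) - target) > target * tolerance:
            warnings.append(f"{name}: {len(text)} characters, target is about {target}")
    return warnings
```

The check only reports warnings and never edits a row, so the decision to shorten or expand a field stays with whoever writes the summary.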
Summarization Rules
- Write title, publish+time, and keywords in English.
- Write all other fields in concise Chinese.
- Base every field only on the PDF itself. Do not use web search or outside knowledge.
- Prefer empty strings over guesses when information is missing.
- Do not copy long sentences from the paper. Compress into short, high-density statements.
- Keep each field self-contained and avoid repeating the same point across multiple fields.
- Treat other as a supplementary highlight, not a duplicate of insight or 实验结论.
Extraction Guidance
- Use the extracted text JSON as the working source.
- Prefer the PDF metadata title when it is clean; otherwise infer the title from the first strong title-like line on the first page.
- Use the first page and conclusion-related sections to recover title, publish+time, and 实验结论.
- Use abstract, introduction, related work, method, experiments, and limitation/future-work passages to fill the remaining fields.
- If extraction quality is poor for one file, keep the row conservative instead of hallucinating.
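The title guidance above can be sketched as a small heuristic. Everything here is an assumption for illustration: the names looks_clean and pick_title, the thresholds, and the patterns treated as "not clean" are hypothetical and are not taken from scripts/paper2excel.py.

```python
import re


def looks_clean(title):
    """Heuristic: reject empty, very short, or filename-like metadata titles."""
    title = (title or "").strip()
    if len(title) < 8:
        return False
    if re.search(r"\.(pdf|docx?|tex|dvi)$", title, re.IGNORECASE):
        return False
    if title.lower().startswith(("microsoft word", "untitled")):
        return False
    return True


def pick_title(metadata_title, first_page_text):
    """Prefer a clean metadata title; otherwise take the first title-like line."""
    if looks_clean(metadata_title):
        return metadata_title.strip()
    for line in first_page_text.splitlines():
        line = line.strip()
        # Skip short fragments and lines that look like identifiers or links.
        if len(line.split()) >= 3 and not re.match(r"(arxiv|doi|https?:)", line, re.IGNORECASE):
            return line
    return ""  # conservative: leave the title empty rather than guess
```

Returning an empty string when nothing qualifies keeps the row conservative, in line with the rule against hallucinating on poorly extracted files.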
Scripts
- scripts/paper2excel.py check-deps: Check whether pypdf and openpyxl are available.
- scripts/paper2excel.py install-deps: Install missing packages into a target directory such as /tmp/paper2excel_deps.
- scripts/paper2excel.py extract: Scan one folder, extract text from each PDF, and save JSON for downstream summarization.
- scripts/paper2excel.py write-xlsx: Convert structured JSON rows into one single-sheet .xlsx file.
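For orientation, a dependency check of the kind check-deps performs can be written with the standard library alone. This is a sketch under the assumption that the check amounts to "can the package be imported"; missing_packages is a hypothetical name and the real subcommand may work differently.

```python
import importlib.util


def missing_packages(names=("pypdf", "openpyxl")):
    """Return the packages from `names` that cannot be found on the import path."""
    # find_spec() returns None for a top-level module that is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("missing:", ", ".join(missing) if missing else "none")
```

Because find_spec consults sys.path, packages installed into a target directory are found only when that directory is on the path, which is why later commands are invoked with PYTHONPATH set.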
Example
python3 scripts/paper2excel.py check-deps
python3 scripts/paper2excel.py install-deps --target /tmp/paper2excel_deps
PYTHONPATH=/tmp/paper2excel_deps python3 scripts/paper2excel.py extract /path/to/papers --output /tmp/papers.json
After summarizing into /tmp/paper_rows.json:
PYTHONPATH=/tmp/paper2excel_deps python3 scripts/paper2excel.py write-xlsx /tmp/paper_rows.json --output /tmp/paper_summaries.xlsx