YouTube — Transcript Extraction & Content Reformatting
YouTube video URL → timestamped transcript → summary / chapters / thread / blog / quotes
Language
All process output to user (progress updates, process notifications) follows the user's language.
Objective
Extract the full transcript from a YouTube video's built-in transcript panel, then transform it into the output format the user requests.
Prerequisites
- Target YouTube video page is already open in the browser:
https://www.youtube.com/watch?v={VIDEO_ID}
Pre-execution Checks
1. Tool Readiness
If browser-act has been confirmed available in the current session → skip this step.
Invoke browser-act via the Skill tool to load its usage instructions. If installation or configuration issues arise, follow its guidance to resolve them, then retry.
Capability Components
This Skill's operational boundary = what the user can manually do in their browser. It only reads data already displayed to the user on the page, never bypassing authentication or access controls. JS code is encapsulated in Python files under the scripts/ directory, invoked via eval "$(python scripts/xxx.py)". Use the bash tool for execution.
DOM: Check transcript availability and list languages
eval "$(python scripts/get-languages.py)"
No parameters. Reads ytInitialPlayerResponse from the current page.
Output example:
{
  "available_languages": [
    {"code": "en", "name": "English", "kind": "manual", "is_auto": false},
    {"code": "en", "name": "English (auto-generated)", "kind": "asr", "is_auto": true}
  ],
  "count": 2
}
Returns {"error": true, "message": "..."} when transcripts are disabled or page is not a YouTube video.
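The language list can be inspected programmatically before opening the panel. The sketch below assumes the JSON shape shown in the output example above; `pick_track` is a hypothetical helper, not part of the scripts/ directory. It only reports which track exists and does not switch the panel language.

```python
import json

def pick_track(languages_json: str, preferred_code: str = "en"):
    """Return the best caption track for preferred_code, or None.

    Hypothetical helper: assumes the JSON shape from the get-languages.py
    output example (available_languages with code / kind / is_auto).
    """
    data = json.loads(languages_json)
    if data.get("error"):
        # Transcripts disabled, or the page is not a YouTube video.
        return None
    tracks = [
        t for t in data.get("available_languages", [])
        if t.get("code") == preferred_code
    ]
    # Manual captions sort before auto-generated ones (False < True).
    tracks.sort(key=lambda t: bool(t.get("is_auto")))
    return tracks[0] if tracks else None
```

If the result is None, report that to the user rather than proceeding to the panel step.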
DOM: Open transcript panel
eval "$(python scripts/open-transcript-panel.py)"
No parameters. Clicks the "Show transcript" button below the video (handles multiple UI language variants automatically for robustness).
Must call wait stable after this to allow the panel to fully load.
Output example:
{"success": true, "label": "内容转文字"}
DOM: Extract all transcript segments
eval "$(python scripts/extract-transcript-segments.py)"
No parameters. Scrolls the open transcript panel to trigger lazy loading for long videos, then extracts all segments.
Output example:
{
  "segment_count": 24,
  "segments": [
    {"ts": "0:18", "text": "We're no strangers to love"},
    {"ts": "0:27", "text": "You know the rules and so do I"}
  ],
  "full_text": "We're no strangers to love You know the rules...",
  "timestamped_text": "0:18 We're no strangers to love\n0:27 You know the rules..."
}
Composite: Full transcript fetch workflow
1. navigate https://www.youtube.com/watch?v={VIDEO_ID} → wait stable
2. eval "$(python scripts/get-languages.py)": confirm transcripts are available; note the language list
3. eval "$(python scripts/open-transcript-panel.py)": open the panel
4. wait stable: wait for panel content to load
5. eval "$(python scripts/extract-transcript-segments.py)": extract all segments
Use timestamped_text from the output as input for the Transform step below.
Transform: Content Reformatting
After fetching the transcript, transform it based on what the user requests. If the user did not specify a format, default to the Full Document — output all five sections in order.
- Summary: Concise 5–10 sentence overview of the entire video
- Chapters: Group by topic shifts, output timestamped chapter list
- Thread: Twitter/X thread format — numbered posts, each under 280 characters
- Blog post: Full article with title, H2 sections per major topic, key quotes, and takeaways
- Quotes: Notable quotes with their timestamps
Default Full Document output order (when no specific format is requested):
- Summary
- Chapters
- Thread
- Blog Post
- Quotes
Workflow
- Fetch transcript using the Composite component above.
- Validate: confirm segment_count >= 1. If empty, tell the user the video has transcripts disabled.
- Chunk if needed: if full_text exceeds ~50,000 characters, split timestamped_text into overlapping chunks (~40K characters with 2K overlap) and summarize each chunk before merging.
- Transform into the requested format(s) using the timestamped_text field. If no format specified, produce all five sections.
- Verify: re-read the output for coherence, correct timestamps (if chapters), and completeness before presenting.
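The chunking step can be sketched as follows. This is a minimal illustration using the sizes quoted above; `chunk_transcript` is a hypothetical helper name, and it splits on raw character offsets, so a chunk boundary may fall in the middle of a transcript line.

```python
def chunk_transcript(
    full_text: str,
    timestamped_text: str,
    threshold: int = 50_000,
    chunk_size: int = 40_000,
    overlap: int = 2_000,
) -> list[str]:
    """Split timestamped_text into overlapping chunks when full_text is long.

    Hypothetical helper. Returns a single-element list when full_text is at or
    under the threshold, so callers can always iterate over the result.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    if len(full_text) <= threshold:
        return [timestamped_text]

    chunks = []
    step = chunk_size - overlap  # each chunk re-reads the last `overlap` chars
    start = 0
    while start < len(timestamped_text):
        chunks.append(timestamped_text[start:start + chunk_size])
        if start + chunk_size >= len(timestamped_text):
            break  # the chunk just added already reaches the end
        start += step
    return chunks
```

Each chunk would then be summarized on its own and the partial summaries merged, as described in the Chunk step.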
Example — Chapters Output
0:00 Introduction — host opens with the problem statement
3:45 Background — prior work and why existing solutions fall short
12:20 Core method — walkthrough of the proposed approach
24:10 Results — benchmark comparisons and key takeaways
31:55 Q&A — audience questions on scalability and next steps
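For the Verify step, chapter timestamps can be checked mechanically. The sketch below assumes each chapter line starts with a timestamp in M:SS or H:MM:SS form followed by a space, as in the example above; both function names are hypothetical.

```python
def ts_to_seconds(ts: str) -> int:
    """Convert 'M:SS' or 'H:MM:SS' to a number of seconds."""
    seconds = 0
    for part in ts.strip().split(":"):
        seconds = seconds * 60 + int(part)
    return seconds

def chapters_in_order(chapter_lines: list[str]) -> bool:
    """Return True when chapter timestamps are strictly increasing."""
    times = [
        ts_to_seconds(line.split(maxsplit=1)[0])
        for line in chapter_lines
        if line.strip()  # ignore blank lines
    ]
    return all(a < b for a, b in zip(times, times[1:]))
```

A False result suggests a chapter was misplaced or a timestamp was copied incorrectly from the transcript.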
Example — Thread Output
1/ Just watched an incredible video on [topic]. Key takeaways 🧵
2/ First insight: [point]. This matters because [reason].
3/ The surprising part: [finding]. Most assume [belief], but this shows otherwise.
4/ Practical takeaway: [action].
5/ Full video: [URL]
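The "under 280 characters" rule for thread posts can be checked before presenting. This is a rough sketch with a hypothetical helper name; `len()` counts Unicode code points, which only approximates how X counts characters (emoji and URLs are weighted differently), so treat it as a conservative first pass.

```python
def overlong_posts(posts: list[str], limit: int = 280) -> list[int]:
    """Return 1-based positions of posts that are not under the limit."""
    return [
        position
        for position, post in enumerate(posts, start=1)
        if len(post) >= limit
    ]
```

Any post reported here should be shortened or split into two numbered posts.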
Error Handling
- Transcripts disabled: get-languages.py returns error; tell user and suggest checking if captions are available on the video page
- Private/unavailable video: page will not load correctly; relay the error and ask user to verify the URL
- Transcript button not found: usually means the user is not on a video page, or the page hasn't finished loading; navigate to the URL and retry
- No segments after panel opens: retry open-transcript-panel.py + wait stable + extract-transcript-segments.py once
Known Limitations
- Language selection: the transcript panel shows the language YouTube defaults to for the user's region. Switching to a specific language requires changing the caption language in the player's CC settings first; automatic language switching is not implemented.
- Auto-generated transcripts (kind: asr) may have lower accuracy than manual captions.
- Videos that require login to view will not have a transcript panel accessible.
Execution Efficiency
- Batch orchestration: Write a bash script to loop through video URLs serially within a single session: for each video, navigate to it, run the composite workflow (three script calls plus the wait stable steps), save the result, then move on to the next. Do not parallelize within one browser. To increase throughput for large batches, open multiple stealth browser sessions and distribute URLs across them.
- Test before batch execution: After writing a batch script, first test with 1–2 videos to confirm the full workflow runs correctly; only then run the full batch.
- Reduce redundant pre-operations: Pre-execution checks (tool readiness) only need to run once per session; skip them for subsequent videos in the same batch.
- Error resumption: Save each video's result immediately after extraction; on failure, resume from the failed video rather than starting over.
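The save-immediately and resume behaviour can be sketched as below. The one-file-per-video layout ({video_id}.json in an output directory) and both function names are assumptions for illustration; the same bookkeeping could live directly in the bash batch script.

```python
import json
from pathlib import Path

def pending_ids(video_ids: list[str], out_dir: str) -> list[str]:
    """Return the video IDs that have no saved result yet, in input order."""
    out = Path(out_dir)
    return [v for v in video_ids if not (out / f"{v}.json").exists()]

def save_result(video_id: str, result: dict, out_dir: str) -> Path:
    """Write one video's result to {out_dir}/{video_id}.json.

    Writes to a temporary file first and then renames it, so an interrupted
    run does not leave a half-written result that looks complete.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    final_path = out / f"{video_id}.json"
    tmp_path = out / f"{video_id}.json.tmp"
    tmp_path.write_text(json.dumps(result, ensure_ascii=False), encoding="utf-8")
    tmp_path.replace(final_path)
    return final_path
```

On restart, looping over `pending_ids(...)` instead of the full list resumes from the first video that has no saved result.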
Success Criteria
segment_count >= 1 AND full_text length > 0
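The criterion above can be expressed as a small check on the raw output of extract-transcript-segments.py. This is a sketch assuming the JSON shape shown in the output example; `extraction_succeeded` is a hypothetical name.

```python
import json

def extraction_succeeded(raw_output: str) -> bool:
    """Return True when segment_count >= 1 and full_text is non-empty."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # output was not JSON at all
    if not isinstance(data, dict) or data.get("error"):
        return False
    segment_count = int(data.get("segment_count") or 0)
    full_text = data.get("full_text") or ""
    return segment_count >= 1 and len(full_text) > 0
```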
Experience Notes
Path: {working-directory}/browser-act-skill-forge-memories/youtube-content-youtube-transcript.memory.md
Before execution: If the file exists, read it first — it records unexpected situations encountered during past executions (e.g., a strategy has become ineffective); adjust strategy order accordingly.
After execution: If an unexpected situation is encountered (strategy became ineffective, page redesigned, anti-scraping upgraded, better path discovered), append a line:
{YYYY-MM-DD}: {what happened} → {conclusion}
Normal execution does not write to the file.
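Appending a note in the format above could look like this. It is a minimal sketch with a hypothetical helper name; the caller supplies the memory file path given under Path, and should call it only when something unexpected happened.

```python
from datetime import date
from pathlib import Path

def append_note(memory_path: str, what_happened: str, conclusion: str, today=None) -> str:
    """Append one '{YYYY-MM-DD}: {what happened} → {conclusion}' line."""
    day = (today or date.today()).isoformat()
    line = f"{day}: {what_happened} → {conclusion}\n"
    path = Path(memory_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Append mode keeps earlier notes intact.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(line)
    return line
```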