Dataset Generator
This skill is a tool-native dataset pipeline for Codex, Antigravity, and Claude Code.
- Use the IDE's own tools for browsing, reading, search, and reasoning.
- Use local Python scripts for deterministic normalization, state tracking, verification, deduplication, and export.
- Do not call external LLM-provider APIs as part of this skill.
Command surface
dataset generate "<request>" [--count <n>]
dataset collect "<topic or query>" [--urls url1 url2] [--paths ./dir]
dataset verify <path/to/file>
dataset audit [<path/to/file>]
dataset export --format <openai|huggingface|csv|jsonl|all> [--schema-file path] [--split 0.1]
If dataset generate does not include a size, default to 500 records.
If dataset collect does not include --max-results, default to 10 results per query.
Core architecture
- sub-skills/ contains the cognitive instructions.
- scripts/ contains deterministic helpers.
- resources/internal-schema/canonical_schema.json is the fixed pipeline backbone.
- resources/target-schemas/ contains preset export profiles.
- resources/templates/custom_flat_schema.json is the starting point for custom headers.
Fixed vs flexible schema
- The canonical internal schema is fixed.
- The final export schema is not universal and must be chosen per user request.
- For custom CSV or flat JSONL headers, create or update a schema file and pass it to scripts/export.py.
Read sub-skills/dataset-strategy.md first whenever the target output schema is not already obvious.
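To make the fixed-versus-flexible split concrete, here is a minimal sketch of projecting canonical records onto custom flat headers. It does not reproduce the real format of custom_flat_schema.json or the behavior of scripts/export.py, neither of which is shown here. The header map, the helper names, and the assumption that canonical records carry instruction, response, and metadata fields are all illustrative.

```python
import csv
import io


def get_path(record, dotted_path):
    """Resolve a dotted path such as 'metadata.label'; return '' if missing."""
    value = record
    for key in dotted_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return ""
        value = value[key]
    return value


def to_flat_csv(records, header_map):
    """Render records as CSV using a hypothetical {header: dotted_path} map."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header_map.keys())
    for record in records:
        writer.writerow([get_path(record, path) for path in header_map.values()])
    return buffer.getvalue()


# Hypothetical export profile: the canonical record stays the same,
# only the outward-facing headers change.
example_header_map = {
    "prompt": "instruction",
    "answer": "response",
    "label": "metadata.label",
}
```

The canonical record is never rewritten; a different header map yields a different export from the same stored data.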
Workflow selection
1. dataset generate
Use this when the user wants a new dataset or wants source material structured into one.
1. Read sub-skills/dataset-strategy.md and explicitly decide:
   - request type
   - task_type
   - source_type
   - target export schema
   - target effective example count
   - whether this is a fresh run or a resume
If the user does not specify a size, set the target effective example count to 500.
2. If existing runs may matter, inspect the SQLite state before generating:
python3 -c "from scripts.utils.db import initialize_database, get_connection, list_runs; initialize_database(); conn = get_connection(); print([dict(row) for row in list_runs(conn, limit=5)]); conn.close()"
If there is a relevant unfinished or recent run, ask whether to resume or start fresh.
3. Choose the source route:
   - Topic-driven synthetic generation:
     - Read sub-skills/seed-generator.md.
     - Draft canonical JSONL records and import them with --source-type generated.
     - If the requested count is large, work in batches until the target count is reached instead of stopping after the first small draft.
   - URL or reference-material structuring:
     - Read sub-skills/local-collector.md.
     - First: try the IDE's native search/browsing tools to collect material directly.
     - Fallback: if IDE tools are unavailable or the collection is large, run:
       python3 scripts/collect.py --urls <url1> [url2 ...] --tool-context <context>
     - Draft canonical JSONL from the collected output and import with --source-type url_reference.
   - Existing dataset restructuring:
     - Read sub-skills/seed-generator.md.
     - Normalize the source dataset into canonical JSONL and import it with --source-type raw_dataset.
   - Internet-research dataset building:
     - Read sub-skills/local-collector.md.
     - First: use the IDE's native search tools to find evidence, draft canonical records, and import.
     - Fallback: if IDE tools are unavailable or the target record count requires broad crawling, run:
       python3 scripts/collect.py --query "<topic>" --max-results 10 --tool-context <context>
     - The collector outputs workspace/collected_<timestamp>.jsonl; the agent then drafts proper instruction/response records and imports them with --source-type internet_research.
     - If the user does not specify a size, continue collecting and drafting until 500 records are planned or imported.
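The four routes above can be summarized as a small lookup table. This is a sketch for orientation only: the dictionary and function names are illustrative and not part of the skill's scripts. The command it builds uses only the scripts/generate.py flags that this document itself lists for the manual import path.

```python
# Route table transcribed from the source-route list; names are illustrative.
SOURCE_ROUTES = {
    "generated": {
        "use_for": "topic-driven synthetic generation",
        "read_first": "sub-skills/seed-generator.md",
    },
    "url_reference": {
        "use_for": "URL or reference-material structuring",
        "read_first": "sub-skills/local-collector.md",
    },
    "raw_dataset": {
        "use_for": "existing dataset restructuring",
        "read_first": "sub-skills/seed-generator.md",
    },
    "internet_research": {
        "use_for": "internet-research dataset building",
        "read_first": "sub-skills/local-collector.md",
    },
}


def import_command(source_type, drafts_path, tool_context):
    """Build the manual import command for a route as an argv list."""
    if source_type not in SOURCE_ROUTES:
        raise ValueError(f"unknown source type: {source_type}")
    return [
        "python3", "scripts/generate.py",
        "--input", drafts_path,
        "--source-type", source_type,
        "--tool-context", tool_context,
    ]
```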
4. Load draft records into SQLite:
Preferred automated path when you already have planned batch files:
python3 scripts/build_loop.py --batch <drafts_batch_01.jsonl> --batch <drafts_batch_02.jsonl> --plan-file <coverage_plan.json> --source-type <generated|url_reference|raw_dataset|internet_research> --tool-context <codex|claude|antigravity> [--review-file <review.jsonl>] [--verify-min-response-length 5]
This orchestrates import-time dedup, optional verify/dedup, and a coverage check after every batch.
For short-label classification corpora, lower --verify-min-response-length so labels like VULNERABLE are not rejected by the generic heuristic floor.
If the coverage plan sets require_review_file: true, build_loop.py will fail fast unless --review-file is provided so semantic judging runs during the build.
After each batch, build_loop.py writes workspace/build_loop_progress.json with batches_done, last_coverage, and a drift object (drift_score, drift_flag, new_gaps, resolved_gaps). Read this file to check progress between batches. If drift_flag: true, inspect the new gaps before sending the next batch. Use record_history.py to append a lineage snapshot to workspace/record_history.jsonl at any point.
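A between-batch check of the progress file might look like the sketch below. It assumes the drift object is stored under a top-level "drift" key and that new_gaps is a list. The field names come from the description above, but the exact layout is an assumption, so adjust it to the real file.

```python
import json
from pathlib import Path


def load_progress(path="workspace/build_loop_progress.json"):
    """Read the progress file written by build_loop.py after each batch."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize_progress(progress):
    """Reduce the progress payload to what matters before the next batch."""
    drift = progress.get("drift", {})  # assumed key for the drift object
    return {
        "batches_done": progress.get("batches_done"),
        "needs_inspection": bool(drift.get("drift_flag")),
        "new_gaps": drift.get("new_gaps", []),
    }
```

If needs_inspection is true, review new_gaps before drafting the next batch so the drafts target the missing buckets.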
Manual import path:
python3 scripts/generate.py --input <drafts.jsonl> --source-type <generated|url_reference|raw_dataset|internet_research> --tool-context <codex|claude|antigravity> --dedup-threshold 0.85
Imported drafts are promoted into the runnable pipeline with status raw_generated unless they are explicit placeholder seeds.
When --dedup-threshold is used, near-duplicates are marked deduped immediately instead of inflating the raw count.
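The effect of an import-time threshold can be illustrated with a small sketch. The similarity metric used by scripts/generate.py is not described here, so this example substitutes difflib.SequenceMatcher from the standard library and compares only the instruction field. Both choices are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def mark_near_duplicates(records, threshold=0.85):
    """Return copies of records with status 'deduped' for near-duplicates."""
    kept_texts = []
    result = []
    for record in records:
        text = record["instruction"].strip().lower()
        is_duplicate = any(
            SequenceMatcher(None, text, kept).ratio() >= threshold
            for kept in kept_texts
        )
        if not is_duplicate:
            kept_texts.append(text)
        status = "deduped" if is_duplicate else "raw_generated"
        result.append({**record, "status": status})
    return result
```

Because duplicates are labeled at import time, only the records that still carry raw_generated count toward the effective total.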
If the user is intentionally building red-team, security, pentest, prompt-injection, jailbreak, or system-prompt-leak training data, default to injection-tolerant import behavior. The scripts now auto-enable this for matching requests, and you can still pass --allow-injections explicitly for clarity. Use --enforce-security-flags only when you want strict flagging even on those corpora.
For untrusted sources, normalization also strips hostile control characters and may add metadata.security_flags plus metadata.requires_manual_review.
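As a rough sketch of this kind of normalization, the example below drops Unicode control and format characters (keeping newlines and tabs) and flags the record when anything was removed. The flag name control_characters_removed and the choice to clean only the instruction field are hypothetical. The actual rules in the skill's scripts may differ.

```python
import unicodedata


def strip_control_chars(text):
    """Remove control (Cc) and format (Cf) characters except newline and tab."""
    cleaned = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    return cleaned, cleaned != text


def normalize_untrusted(record):
    """Return a cleaned copy of the record, flagged if characters were removed."""
    cleaned, changed = strip_control_chars(record["instruction"])
    metadata = dict(record.get("metadata", {}))
    if changed:
        flags = set(metadata.get("security_flags", []))
        flags.add("control_characters_removed")  # hypothetical flag name
        metadata["security_flags"] = sorted(flags)
        metadata["requires_manual_review"] = True
    return {**record, "instruction": cleaned, "metadata": metadata}
```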
For generation requests, do not treat a small sample as the finished dataset unless the user explicitly asked for a small sample, prototype, or test run. Do not treat the raw imported count as success. The generation loop is complete only when the post-dedup effective count and per-bucket coverage targets are met.
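The completion rule can be expressed as a short check. This is a simplified sketch, not the logic of scripts/coverage.py: it assumes each record has a status field and a metadata dict, and it treats group_minimums as a flat mapping for a single metadata key, which may be simpler than a real coverage plan.

```python
from collections import Counter


def generation_complete(records, target_effective_count, group_minimums, group_key="label"):
    """Return (done, gaps) based on post-dedup count and per-bucket minimums."""
    # Deduped records do not count toward the effective total.
    effective = [r for r in records if r.get("status") != "deduped"]
    counts = Counter(r.get("metadata", {}).get(group_key) for r in effective)
    gaps = {
        group: minimum - counts.get(group, 0)
        for group, minimum in group_minimums.items()
        if counts.get(group, 0) < minimum
    }
    done = len(effective) >= target_effective_count and not gaps
    return done, gaps
```

A run with enough raw records can still be incomplete: the gaps mapping shows how many more effective records each bucket needs.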
4B. If you are not using build_loop.py, measure effective progress after each import batch before drafting the next batch:
python3 scripts/coverage.py --from-status raw_generated --from-status augmented --from-status verified_pass --threshold 0.85 --plan-file <coverage_plan.json>
The coverage plan should define:
- target_effective_count
- max_share_per_group
- group_minimums keyed by metadata paths such as metadata.subtopic, metadata.context_type, metadata.response_shape, or metadata.label
- optional required_fields for metadata or provenance paths that every kept record must carry
- optional joint_group_rules for multi-axis balance such as difficulty x label or persona x response_shape
- optional provenance rules such as a minimum real_world share and required reference fields for real-world records
- optional response_length rules to cap median answer size or the share of oversized responses
- optional response_structure rules to prevent one dominant JSON or text skeleton from taking over the corpus
- optional response_prefix limits to prevent one repeated opening from dominating the corpus
- optional model_visibility rules to customize export-time sanitization for model-visible instruction and context without dropping audit metadata. If omitted, export applies a conservative bu