Dataset Generator

This skill is a tool-native dataset pipeline for Codex, Antigravity, and Claude Code.

Use the IDE's own tools for browsing, reading, search, and reasoning.
Use local Python scripts for deterministic normalization, state tracking, verification, deduplication, and export.
Do not call external LLM-provider APIs as part of this skill.

Command surface

dataset generate "<request>" [--count <n>]
dataset collect "<topic or query>" [--urls url1 url2] [--paths ./dir]
dataset verify <path/to/file>
dataset audit [<path/to/file>]
dataset export --format <openai|huggingface|csv|jsonl|all> [--schema-file path] [--split 0.1]

If dataset generate does not include a size, default to 500 records. If dataset collect does not include --max-results, default to 10 results per query.

Core architecture

sub-skills/ contains the cognitive instructions.
scripts/ contains deterministic helpers.
resources/internal-schema/canonical_schema.json is the fixed pipeline backbone.
resources/target-schemas/ contains preset export profiles.
resources/templates/custom_flat_schema.json is the starting point for custom headers.

Fixed vs flexible schema

The canonical internal schema is fixed.
The final export schema is not universal and must be chosen per user request.
For custom CSV or flat JSONL headers, create or update a schema file and pass it to scripts/export.py.

Read sub-skills/dataset-strategy.md first whenever the target output schema is not already obvious.

Workflow selection

1. `dataset generate`

Use this when the user wants a new dataset or wants source material structured into one.

Read sub-skills/dataset-strategy.md and explicitly decide:
- request type
- task_type
- source_type
- target export schema
- target effective example count
- whether this is a fresh run or a resume

If the user does not specify a size, set the target effective example count to 500. 2. If existing runs may matter, inspect the SQLite state before generating:

python3 -c "from scripts.utils.db import initialize_database, get_connection, list_runs; initialize_database(); conn = get_connection(); print([dict(row) for row in list_runs(conn, limit=5)]); conn.close()"

If there is a relevant unfinished or recent run, ask whether to resume or start fresh.

Choose the source route:

Topic-driven synthetic generation:
- Read sub-skills/seed-generator.md.
- Draft canonical JSONL records and import them with --source-type generated.
- If the requested count is large, work in batches until the target count is reached instead of stopping after the first small draft.
URL or reference-material structuring:
- Read sub-skills/local-collector.md.
- First: try the IDE's native search/browsing tools to collect material directly.
- Fallback: if IDE tools are unavailable or the collection is large, run:
```
python3 scripts/collect.py --urls <url1> [url2 ...] --tool-context <context>
```
- Draft canonical JSONL from the collected output and import with --source-type url_reference.
Existing dataset restructuring:
- Read sub-skills/seed-generator.md.
- Normalize the source dataset into canonical JSONL and import it with --source-type raw_dataset.
Internet-research dataset building:
- Read sub-skills/local-collector.md.
- First: use the IDE's native search tools to find evidence, draft canonical records, and import.
- Fallback: if IDE tools are unavailable or the target record count requires broad crawling, run:
```
python3 scripts/collect.py --query "<topic>" --max-results 10 --tool-context <context>
```
- The collector outputs workspace/collected_<timestamp>.jsonl; the agent then drafts proper instruction/response records and imports them with --source-type internet_research.
- If the user does not specify a size, continue collecting and drafting until 500 records are planned or imported.

Load draft records into SQLite:

Preferred automated path when you already have planned batch files:

python3 scripts/build_loop.py --batch <drafts_batch_01.jsonl> --batch <drafts_batch_02.jsonl> --plan-file <coverage_plan.json> --source-type <generated|url_reference|raw_dataset|internet_research> --tool-context <codex|claude|antigravity> [--review-file <review.jsonl>] [--verify-min-response-length 5]

This orchestrates import-time dedup, optional verify/dedup, and a coverage check after every batch. For short-label classification corpora, lower --verify-min-response-length so labels like VULNERABLE are not rejected by the generic heuristic floor. If the coverage plan sets require_review_file: true, build_loop.py will fail fast unless --review-file is provided so semantic judging runs during the build.

After each batch, build_loop.py writes workspace/build_loop_progress.json with batches_done, last_coverage, and a drift object (drift_score, drift_flag, new_gaps, resolved_gaps). Read this file to check progress between batches. If drift_flag: true, inspect the new gaps before sending the next batch. Use record_history.py to append a lineage snapshot to workspace/record_history.jsonl at any point.

Manual import path:

python3 scripts/generate.py --input <drafts.jsonl> --source-type <generated|url_reference|raw_dataset|internet_research> --tool-context <codex|claude|antigravity> --dedup-threshold 0.85

Imported drafts are promoted into the runnable pipeline with status raw_generated unless they are explicit placeholder seeds. When --dedup-threshold is used, near-duplicates are marked deduped immediately instead of inflating the raw count.

If the user is intentionally building red-team, security, pentest, prompt-injection, jailbreak, or system-prompt-leak training data, default to injection-tolerant import behavior. The scripts now auto-enable this for matching requests, and you can still pass --allow-injections explicitly for clarity. Use --enforce-security-flags only when you want strict flagging even on those corpora. For untrusted sources, normalization also strips hostile control characters and may add metadata.security_flags plus metadata.requires_manual_review.

For generation requests, do not treat a small sample as the finished dataset unless the user explicitly asked for a small sample, prototype, or test run. Do not treat the raw imported count as success. The generation loop is complete only when the post-dedup effective count and per-bucket coverage targets are met.

4B. If you are not using build_loop.py, measure effective progress after each import batch before drafting the next batch:

python3 scripts/coverage.py --from-status raw_generated --from-status augmented --from-status verified_pass --threshold 0.85 --plan-file <coverage_plan.json>

The coverage plan should define:

target_effective_count
max_share_per_group
group_minimums keyed by metadata paths such as metadata.subtopic, metadata.context_type, metadata.response_shape, or metadata.label
optional required_fields for metadata or provenance paths that every kept record must carry
optional joint_group_rules for multi-axis balance such as difficulty x label or persona x response_shape
optional provenance rules such as a minimum real_world share and required reference fields for real-world records
optional response_length rules to cap median answer size or the share of oversized responses
optional response_structure rules to prevent one dominant JSON or text skeleton from taking over the corpus
optional response_prefix limits to prevent one repeated opening from dominating the corpus
optional model_visibility rules to customize export-time sanitization for model-visible instruction and context without dropping audit metadata. If omitted, export applies a conservative bu

dataset-generator

Dataset Generator

Command surface

Core architecture

Fixed vs flexible schema

Workflow selection

1. `dataset generate`

Como adicionar

Comentários · Nenhum comentário

Dataset Generator

Command surface

Core architecture

Fixed vs flexible schema

Workflow selection

1. dataset generate

Como adicionar

Comentários · Nenhum comentário

1. `dataset generate`