Meta-Optimize: Outer-Loop Harness Optimization for ARIS

Analyze accumulated usage logs and propose optimizations for: $ARGUMENTS

Privilege boundary — this skill is a READ-ONLY PRODUCER

meta-optimize proposes; it does not land. The mutation of the skill corpus is the exclusive job of a separate, human-invoked skill: /meta-apply. This split is structural, not advisory — it is why a missed instruction cannot let this loop apply its own patch (the self-acquittal failure mode):

No Write/Edit tool. This skill has no Write or Edit tool, so it cannot modify a SKILL.md, a shared reference, or any other corpus file through those frictionless mutators. Its only outputs are the REPORT and staged patch files, written under .aris/meta/ (a scratch area, never the corpus).
No apply step. There is no in-skill "apply the patch" path (see Step 6). The producer ends by staging approved patches for /meta-apply; a human must then invoke /meta-apply to land them. That human action is the landing gate.
Bash writes to the corpus are filtered, not impossible; be honest about the layers. What IS fully closed is the accidental, in-flow self-acquittal: this skill has no Write/Edit and no apply step, so an honest run cannot slip into editing the corpus. As defense-in-depth, install the corpus_write_guard PreToolUse hook (like meta_logging.json), which DENIES the common Bash shell-writes (>, tee, sed -i, cp/mv, touch, open(...,'w')) to corpus paths. This is a blacklist, NOT a complete sandbox: deliberately obscured Bash writes (git apply, patch, $var or absolute paths, language file APIs) are not all caught. Full structural prevention would require either removing this skill's Bash access or adding an FS sandbox. Both are over-built for a producer that is not yet load-bearing, so they are deferred until the gate carries real auto-modification volume (a brick-3 trigger). The intended backstop against a deliberate write is detection, not prevention: a corpus change with no valid, current provenance stamp (content-hash mismatch) would be catchable by a pre-push integrity check. That verifier is NOT yet built (provenance.py has content_hash but no integrity-check subcommand, and no pre-push hook runs one). So today the deliberate-write case is neither prevented nor actively detected; track the integrity verifier as a follow-up before this producer goes load-bearing. This skill's legitimate Bash writes go only to .aris/meta/.
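To make the blacklist limitation concrete, here is a minimal sketch of a pattern-based deny check. It is not the actual corpus_write_guard hook: the function name `is_denied`, the corpus prefixes (`skills/`, `shared-references/`), and every regex are hypothetical, chosen only to show why such a filter catches common shell writes and misses obscured ones.

```python
import re

# Hypothetical corpus path prefixes; the real guard's path list is not shown in this document.
CORPUS = r"(?:skills|shared-references)/\S+"

# Hypothetical deny patterns for common shell writes that target a corpus path.
DENY_PATTERNS = [
    re.compile(r">>?\s*" + CORPUS),                        # redirection with > or >>
    re.compile(r"\btee\b(?:\s+-\S+)*\s+" + CORPUS),        # tee [-a] <corpus path>
    re.compile(r"\bsed\s+-i\b.*" + CORPUS),                # in-place sed on a corpus file
    re.compile(r"\b(?:cp|mv)\b.*\s" + CORPUS + r"\s*$"),   # cp/mv whose destination is in the corpus
    re.compile(r"\btouch\s+" + CORPUS),                    # touch <corpus path>
]


def is_denied(command: str) -> bool:
    """Return True if the Bash command matches any deny pattern."""
    return any(pattern.search(command) for pattern in DENY_PATTERNS)
```

Under these assumed patterns, `is_denied("echo x > skills/foo/SKILL.md")` is True and `is_denied("echo x > .aris/meta/REPORT.md")` is False, while `is_denied("git apply .aris/meta/patches/001.diff")` is also False even though that command could modify the corpus. That last case is the gap the paragraph above describes.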

See shared-references/acceptance-gate.md: a loop can DRIVE (propose, review) same-model, but the ACQUITTAL that lands a change must be cross-model (Step 4 jury) and the landing must be a separate human-gated act (/meta-apply).

Context

ARIS is a research harness — a system of skills, bridges, workflows, and artifact contracts that wraps around LLMs to orchestrate research. This skill implements a prototype outer loop that observes how the harness is used and proposes improvements to the harness itself (not to the research artifacts it produces).

Inspired by Meta-Harness (Lee et al., 2026): the key insight is that harness design matters as much as model weights, and harness engineering can be partially automated by logging execution traces and using them to guide improvements.

What This Skill Optimizes (Harness Components)

| Component | Example | Optimizable? |
|-----------|---------|--------------|
| SKILL.md prompts | Reviewer instructions, quality gates, step descriptions | Yes |
| Default parameters | `difficulty: medium`, `MAX_ROUNDS: 4`, `threshold: 6/10` | Yes |
| Convergence rules | When to stop the review loop, retry counts | Yes |
| Workflow ordering | Skill chain sequence within a workflow | Yes |
| Artifact schemas | What fields go in EXPERIMENT_LOG.md, idea-stage/IDEA_REPORT.md | Cautious |
| MCP bridge config | Which reviewer model, routing rules | No (infra) |

Not optimized: The research artifacts themselves (papers, code, experiments). That's what the regular workflows do.

Prerequisites

Logging must be active. Copy templates/claude-hooks/meta_logging.json into your project's .claude/settings.json (or merge the hooks section).
Sufficient data. At least 5 complete workflow runs logged in .aris/meta/events.jsonl. The skill will check and warn if insufficient; Step 0 uses the count of skill_invoke events as its proxy for this.

Workflow

Step 0: Check Data Availability

EVENTS_FILE=".aris/meta/events.jsonl"
if [ ! -f "$EVENTS_FILE" ]; then
    echo "ERROR: No event log found at $EVENTS_FILE"
    echo "Enable logging first: copy templates/claude-hooks/meta_logging.json into .claude/settings.json"
    exit 1
fi

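EVENT_COUNT=$(wc -l < "$EVENTS_FILE")
# grep -c already prints 0 when nothing matches but exits non-zero, so use "|| true".
# An "|| echo 0" fallback would append a second 0 and break the numeric -lt test below.
SKILL_INVOCATIONS=$(grep -c '"skill_invoke"' "$EVENTS_FILE" || true)
SESSIONS=$(grep -c '"session_start"' "$EVENTS_FILE" || true)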

echo "📊 Event log: $EVENT_COUNT events, $SKILL_INVOCATIONS skill invocations, $SESSIONS sessions"

if [ "$SKILL_INVOCATIONS" -lt 5 ]; then
    echo "⚠️  Insufficient data (<5 skill invocations). Continue using ARIS normally and re-run later."
    exit 0
fi

Step 1: Analyze Usage Patterns

Read .aris/meta/events.jsonl and compute:

Frequency analysis:

Which skills are invoked most often?
Which slash commands do users type most?
What parameter overrides are most common? (These suggest bad defaults.)
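The frequency questions above can be sketched as a single pass over the log. The event schema is not specified in this document, so the field names `event`, `skill`, and `overrides` are assumptions; only the strings `skill_invoke` and `session_start` are known to appear in events.jsonl.

```python
import json
from collections import Counter


def load_events(lines):
    """Parse JSONL lines into dicts, skipping blank or malformed lines."""
    events = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a partially written trailing line
        if isinstance(obj, dict):
            events.append(obj)
    return events


def frequency_summary(events):
    """Count invocations per skill and overrides per (skill, parameter)."""
    invocations = Counter()
    overrides = Counter()
    for ev in events:
        if ev.get("event") != "skill_invoke":      # assumed field name
            continue
        skill = ev.get("skill", "<unknown>")        # assumed field name
        invocations[skill] += 1
        for param in ev.get("overrides") or {}:     # assumed field name
            overrides[(skill, param)] += 1
    return invocations, overrides


def override_rates(invocations, overrides):
    """Fraction of a skill's invocations that overrode each parameter."""
    return {key: n / invocations[key[0]] for key, n in overrides.items()}
```

A typical call would be `invocations, overrides = frequency_summary(load_events(open(".aris/meta/events.jsonl")))`. A high override rate for one parameter is the "bad default" signal mentioned above.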

Failure analysis:

Which tools fail most often? In which skills?
What error patterns repeat? (OOM, import, compilation, timeout)
How many auto-debug retries per workflow run?
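One way to group repeated error patterns is to bucket each failure message into the categories listed above. This is a rough sketch under assumptions: the event name `tool_failure`, the fields `skill`, `tool`, and `error`, and the regexes themselves are all hypothetical and would need tuning against real log messages.

```python
import re
from collections import Counter

# Hypothetical patterns for the four categories named above; first match wins.
ERROR_PATTERNS = [
    ("oom", re.compile(r"out of memory|\boom\b", re.IGNORECASE)),
    ("import", re.compile(r"ModuleNotFoundError|ImportError|No module named")),
    ("compilation", re.compile(r"SyntaxError|compilation (?:failed|error)|undefined reference",
                               re.IGNORECASE)),
    ("timeout", re.compile(r"timed out|\btimeout\b", re.IGNORECASE)),
]


def classify_error(message: str) -> str:
    """Map an error message to a coarse category, or "other" if nothing matches."""
    for label, pattern in ERROR_PATTERNS:
        if pattern.search(message):
            return label
    return "other"


def failure_table(events):
    """Count failures per (skill, tool, category), most frequent first."""
    counts = Counter()
    for ev in events:
        if ev.get("event") != "tool_failure":       # assumed event name
            continue
        key = (
            ev.get("skill", "<unknown>"),            # assumed field name
            ev.get("tool", "<unknown>"),             # assumed field name
            classify_error(ev.get("error", "")),     # assumed field name
        )
        counts[key] += 1
    return counts.most_common()
```

The rows returned by `failure_table` map directly onto the "which tools fail most often, in which skills" question.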

Convergence analysis (for auto-review-loop):

Average rounds to reach threshold
Score trajectory shape (fast improvement? plateau? oscillation?)
Which review round catches the most critical issues?
Do users override difficulty mid-run?
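The first two convergence metrics can be illustrated on a plain list of per-round review scores. How scores are stored in the log is not specified here, so the input shape, the function names, and the `eps` tolerance of 0.5 points are assumptions; the shape labels are a simple heuristic, not a definition used by auto-review-loop.

```python
def rounds_to_threshold(scores, threshold):
    """Return the 1-based round at which the score first meets the threshold, else None."""
    for round_number, score in enumerate(scores, start=1):
        if score >= threshold:
            return round_number
    return None


def average_rounds(runs, threshold):
    """Mean rounds-to-threshold over the runs that reached it, or None if none did."""
    reached = [r for r in (rounds_to_threshold(s, threshold) for s in runs) if r is not None]
    return sum(reached) / len(reached) if reached else None


def trajectory_shape(scores, eps=0.5):
    """Classify a score trajectory with a simple heuristic (eps is an assumed tolerance)."""
    if len(scores) < 2:
        return "insufficient"
    deltas = [later - earlier for earlier, later in zip(scores, scores[1:])]
    went_up = any(d >= eps for d in deltas)
    went_down = any(d <= -eps for d in deltas)
    if went_up and went_down:
        return "oscillation"
    if not went_up and not went_down:
        return "plateau"
    return "improving" if went_up else "declining"
```

For example, `trajectory_shape([5, 7, 5.5, 7])` is classified as "oscillation" because the scores move by at least `eps` in both directions, and `rounds_to_threshold([4, 5.5, 7], 6)` returns 3.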

Human intervention analysis:

Where do users interrupt with manual prompts during workflows?
What manual corrections do users make most? (These indicate skill gaps.)

Present findings as a structured summary table.

Step 2: Identify Optimization Targets

Based on Step 1, rank optimization opportunities by expected impact:

## Optimization Opportunities (ranked)

| # | Target | Signal | Proposed Change | Expected Impact |
|---|--------|--------|-----------------|-----------------|
| 1 | auto-review-loop default threshold | Users override to 7/10 in 60% of runs | Change default from 6/10 to 7/10 | Fewer manual overrides |
| 2 | experiment-bridge retry count | 40% of runs hit max retries on OOM | Add OOM-specific recovery (reduce batch size) | Fewer failed experiments |
| 3 | paper-write de-AI patterns | Users manually fix "delve" in 80% of runs | Add "delve" to default watchword list | Fewer manual edits |

If $ARGUMENTS specifies a target skill, focus analysis on that skill only. If $ARGUMENTS is empty or "all", analyze all skills with sufficient data.
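The ranking and the $ARGUMENTS filter could look roughly like the sketch below. The `Opportunity` type, the impact proxy (signal rate times logged runs), and the per-skill `min_runs` cutoff are illustrative assumptions; this document does not define how expected impact is scored.

```python
from dataclasses import dataclass


@dataclass
class Opportunity:
    skill: str           # skill the change would touch
    target: str          # what would change, e.g. "default threshold"
    signal_rate: float   # fraction of runs showing the signal, e.g. 0.6
    runs: int            # logged runs for this skill

    @property
    def impact(self) -> float:
        # Assumed proxy for expected impact: estimated number of affected runs.
        return self.signal_rate * self.runs


def select_and_rank(opportunities, arguments: str, min_runs: int = 5):
    """Filter by the requested skill ("" or "all" keeps every skill), then rank by impact."""
    wanted = arguments.strip()
    selected = [
        opp for opp in opportunities
        if opp.runs >= min_runs                     # skip skills without sufficient data
        and (wanted in ("", "all") or opp.skill == wanted)
    ]
    return sorted(selected, key=lambda opp: opp.impact, reverse=True)
```

With this proxy, a moderate signal on a heavily used skill outranks a strong signal on a rarely used one, which is one reasonable reading of "expected impact" but not the only one.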

Step 3: Generate Patch Proposals

For each optimization target, generate a concrete diff:

--- a/skills/auto-review-loop/SKILL.md
+++ b/skills/auto-review-loop/SKILL.md
@@ -15,7 +15,7 @@
 ## Constants
 
-- **SCORE_THRESHOLD = 6** — Minimum review score to accept.
+- **SCORE_THRESHOLD = 7** — Minimum review score to accept. (Raised based on usage data: 60% of users overrode to 7+.)
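A diff in this format can be produced with Python's standard `difflib` without touching the corpus file itself. This is a minimal sketch: `make_patch` and `stage_patch` are hypothetical helper names, and the `patches` subdirectory under .aris/meta/ is an assumption (the document only says staged patches live under .aris/meta/).

```python
import difflib
from pathlib import Path


def make_patch(rel_path: str, old_text: str, new_text: str) -> str:
    """Build a unified diff between the current and proposed text of one corpus file."""
    diff = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=f"a/{rel_path}",
        tofile=f"b/{rel_path}",
    )
    return "".join(diff)


def stage_patch(patch: str, name: str, meta_dir: str = ".aris/meta") -> Path:
    """Write the patch to the scratch area only; the corpus file is never opened for writing."""
    out = Path(meta_dir) / "patches" / name   # "patches" subdirectory is an assumption
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(patch, encoding="utf-8")
    return out
```

The proposed text is built in memory from a read of the corpus file, so the only write is the staged patch, consistent with the producer-only boundary described at the top.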

Rules for patch generation:

One patch per optimization target
Each patch must include a comment explaining WHY (with data from the log)
Patches must be minimal — change only what the data supports
Never change artifact schemas or MCP bridge config in v1
Never change behavior that would break existing user workflows
Anti-self-poisoning screen (see shared-references/capture-antipatterns.md): run a proposed patch's rationale through tools/capture_filter.py (resolve via the canonical chain). NEVER propose a change tha
