holdout-evaluator

Name: holdout-evaluator
Rating: 5 (5 reviews)
Author: synaptiai

Validate agent work output against hidden holdout scenarios using LLM-as-Judge evaluation, producing mapped feedback (referencing visible criteria only) and telemetry records saved to $HOME/.ai-first-kit/. Cross-references the agent's self-review evidence table against actual files to detect claims without evidence. Use when the user says 'validate holdouts', 'test gates against holdouts', 'run ho
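Here is a minimal Python sketch of the evidence cross-check. It assumes a hypothetical evidence table in which each row pairs a claim with a repo-relative file path. `EvidenceRow`, `find_unsupported_claims`, and `demo` are illustrative names and are not part of the skill. A claim is flagged when its cited file is missing or empty, which is likely a cruder test than the one the skill actually applies.

```python
"""Minimal sketch: flag self-review claims whose cited evidence is missing.

Assumes a hypothetical evidence table where each row pairs a claim with the
file the agent cites as proof; the skill's real table format may differ.
"""
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class EvidenceRow:
    claim: str          # what the agent says it did
    evidence_path: str  # repo-relative file cited as proof (hypothetical field)


def find_unsupported_claims(rows: list[EvidenceRow], repo_root: Path) -> list[str]:
    """Return claims whose cited file is missing or empty under repo_root."""
    unsupported = []
    for row in rows:
        target = repo_root / row.evidence_path
        # Missing or zero-length files count as "claim without evidence".
        if not target.is_file() or target.stat().st_size == 0:
            unsupported.append(row.claim)
    return unsupported


def demo() -> list[str]:
    """Build a throwaway repo and check three sample claims against it."""
    with tempfile.TemporaryDirectory() as d:
        root = Path(d)
        (root / "tests").mkdir()
        (root / "tests" / "test_auth.py").write_text("def test_login():\n    pass\n")
        (root / "NOTES.md").write_text("")
        rows = [
            EvidenceRow("Added login tests", "tests/test_auth.py"),
            EvidenceRow("Updated auth docs", "docs/auth.md"),
            EvidenceRow("Recorded design notes", "NOTES.md"),
        ]
        return find_unsupported_claims(rows, root)


print(demo())  # ['Updated auth docs', 'Recorded design notes']
```

Checking file existence and size catches only the most blatant unsupported claims. A fuller review would also confirm that the cited file actually contains what the claim describes.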

Updated 2 months ago

How to add

/plugin marketplace add synaptiai/synapti-marketplace

The exact command may vary by repository. Check the README on GitHub.

#llm #ai #test

Related skills

understand-dashboard

64.4k1

Launch the interactive web dashboard to visualize a codebase's knowledge graph.

Research and Web · by Lum1104

understand-chat

64.4k

Use when you need to ask questions about a codebase or understand code using a knowledge graph.

Research and Web · by Lum1104

understand-domain

64.4k

Extract business domain knowledge from a codebase and generate an interactive domain flow graph. Works standalone (lightweight scan) or derives from an existing /understand knowledge graph.

Research and Web · #ai · by Lum1104

dev-browser

63k

Automates browser interactions with persistent page state. Use for navigating websites, filling forms, taking screenshots, extracting web data, testing web apps, or automating browser workflows.

Research and Web · #test · by code-yeongyu

Holdout Evaluator

You are a Quality Gate Judge — you evaluate agent work output against hidden holdout scenarios that the executing agent never sees. Your core insight: visible gate criteria tell agents WHAT to check, but holdout scenarios test WHETHER they genuinely understand the criteria or are just checking boxes.
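The split between visible criteria and hidden scenarios can be sketched as a small data model. This illustration rests on several assumptions. `HoldoutScenario`, `evaluate`, and `substring_judge` are hypothetical names. `substring_judge` is an offline stand-in for the LLM-as-Judge call, included only so the example runs without a model.

```python
"""Minimal sketch of hidden holdout scenarios keyed to visible gate criteria.

Assumptions (not from the skill itself): each hidden scenario names the visible
criterion it probes, and the LLM-as-Judge call is a pluggable function.
"""
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class HoldoutScenario:
    scenario_id: str
    visible_criterion: str  # the criterion the executing agent can see
    probe: str              # hidden question, never shown to the agent
    expected_behavior: str  # hidden answer key used by the judge


# Judge signature: (scenario, agent output) -> pass/fail.
Judge = Callable[[HoldoutScenario, str], bool]


def substring_judge(scenario: HoldoutScenario, output: str) -> bool:
    """Offline stand-in for an LLM judge: pass if the expected behavior appears verbatim."""
    return scenario.expected_behavior.lower() in output.lower()


def evaluate(output: str, scenarios: list[HoldoutScenario], judge: Judge) -> dict[str, bool]:
    """Return pass/fail per scenario id; a real judge would prompt an LLM with probe and output."""
    return {s.scenario_id: judge(s, output) for s in scenarios}


scenarios = [
    HoldoutScenario("H1", "Handles auth errors",
                    "What happens with an expired token?", "refresh the token"),
    HoldoutScenario("H2", "Tests cover edge cases",
                    "Is an empty password rejected?", "empty password"),
]
agent_output = "On 401 we refresh the token and retry once."
print(evaluate(agent_output, scenarios, substring_judge))  # {'H1': True, 'H2': False}
```

Passing the judge in as a parameter means a real LLM client can replace `substring_judge` without changing the evaluation loop.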

You operate as an independent evaluator, never revealing holdout scenario content to the executing agent. Your output has two layers: a detailed layer for telemetry (which

[Description truncated. See the full README on GitHub.]
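The two output layers can be sketched as two functions. One writes full detail to telemetry, and the other reports back to the agent using visible criterion names only. The result-dict keys, the `holdout-telemetry.jsonl` file name, and both function names are assumptions. Only the `$HOME/.ai-first-kit/` directory comes from the skill's description.

```python
"""Minimal sketch of the two output layers: full telemetry vs. mapped feedback.

Assumptions: result dicts use hypothetical keys (scenario_id, visible_criterion,
passed), and telemetry is appended as JSON Lines to a hypothetical file name
inside $HOME/.ai-first-kit/.
"""
import json
from pathlib import Path
from typing import Optional


def mapped_feedback(results: list[dict]) -> list[str]:
    """Agent-facing layer: mention visible criteria only, never scenario ids or probes."""
    failed = sorted({r["visible_criterion"] for r in results if not r["passed"]})
    return [f"Revisit criterion: {c}" for c in failed]


def write_telemetry(results: list[dict], base_dir: Optional[Path] = None) -> Path:
    """Telemetry layer: append every result, including holdout ids, to a JSONL file."""
    base = base_dir if base_dir is not None else Path.home() / ".ai-first-kit"
    base.mkdir(parents=True, exist_ok=True)
    path = base / "holdout-telemetry.jsonl"  # hypothetical file name
    with path.open("a", encoding="utf-8") as fh:
        for r in results:
            fh.write(json.dumps(r) + "\n")
    return path


results = [
    {"scenario_id": "H1", "visible_criterion": "Handles auth errors", "passed": True},
    {"scenario_id": "H2", "visible_criterion": "Tests cover edge cases", "passed": False},
    {"scenario_id": "H3", "visible_criterion": "Tests cover edge cases", "passed": False},
]
print(mapped_feedback(results))  # ['Revisit criterion: Tests cover edge cases']
```

Calling `write_telemetry(results)` would append one JSON line per result under `~/.ai-first-kit/`. Pass `base_dir` to redirect the output, for example in tests. Because the two failed holdouts share a criterion, the agent sees a single deduplicated hint and learns nothing about how many hidden scenarios exist.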
