Published skills
pipeline-eval
A system-level evaluation framework for multi-stage LLM pipelines, scoring the entire pipeline across 8 dimensions including input/output quality and prompt design. It complements `deepeval` by evaluating the pipeline architecture itself, rather than single content artifacts.
deepeval
A BCG-calibrated evaluation framework for LLM agent outputs, built on a Claude-native judge and a 4-tier stack. It includes an 8-dimension BCG rubric, a 10-signal novelty stack, and an adversarial Skeptic Agent. It supports daily, weekly, and 30-day cadences and integrates into any Claude Code project without API keys.
pipeline-eval
A system-level evaluation framework for multi-stage LLM pipelines that scores the entire pipeline across 8 dimensions, such as input/output quality, prompt design, and fact-grounding. It complements `deepeval` by evaluating the pipeline architecture itself, rather than single content artifacts.
deepeval
A BCG-calibrated evaluation framework for LLM agent outputs, with a Claude-native judge that calls no external API. Its 4-tier stack includes an 8-dimension BCG rubric and an adversarial Skeptic Agent, and it integrates into any Claude Code project without API keys.
deepeval
A BCG-calibrated evaluation framework for LLM agent outputs, featuring a Codex-native judge and a 4-tier stack with an 8-dimension BCG rubric and a 10-signal novelty stack. It includes an adversarial Skeptic Agent for sycophancy and ambiguity probes, supports daily, weekly, and 30-day cadences, and integrates into any Codex project without API keys.