Overview

This skill provides tools to add structured evaluation results to Hugging Face model cards. It supports multiple methods for adding evaluation data:

Extracting existing evaluation tables from README content
Importing benchmark scores from Artificial Analysis
Running custom model evaluations with vLLM or accelerate backends (lighteval/inspect-ai)

When to Use

You need to add structured evaluation results to a Hugging Face model card.
You want to import benchmark data or run custom evaluations with vLLM, lighteval, or inspect-ai.
You are preparing leaderboard-compatible model-index metadata for a model release.

Integration with HF Ecosystem

Model Cards: Updates model-index metadata for leaderboard integration
Artificial Analysis: Direct API integration for benchmark imports
Papers with Code: Compatible with their model-index specification
Jobs: Run evaluations directly on Hugging Face Jobs with uv integration
vLLM: Efficient GPU inference for custom model evaluation
lighteval: Hugging Face's evaluation library with vLLM/accelerate backends
inspect-ai: UK AI Safety Institute's evaluation framework

Version

1.3.0

Dependencies

Core Dependencies

huggingface_hub>=0.26.0
markdown-it-py>=3.0.0
python-dotenv>=1.2.1
pyyaml>=6.0.3
requests>=2.32.5
re (Python standard library, no installation needed)

Inference Provider Evaluation

inspect-ai>=0.3.0
inspect-evals
openai

vLLM Custom Model Evaluation (GPU required)

lighteval[accelerate,vllm]>=0.6.0
vllm>=0.4.0
torch>=2.0.0
transformers>=4.40.0
accelerate>=0.30.0

Note: vLLM dependencies are installed automatically via PEP 723 script headers when using uv run.

IMPORTANT: Using This Skill

⚠️ CRITICAL: Check for Existing PRs Before Creating New Ones

Before creating ANY pull request with --create-pr, you MUST check for existing open PRs:

uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"

If open PRs exist:

DO NOT create a new PR - this creates duplicate work for maintainers
Warn the user that open PRs already exist
Show the user the existing PR URLs so they can review them
Only proceed if the user explicitly confirms they want to create another PR

This prevents spamming model repositories with duplicate evaluation PRs.
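
The check above can also be sketched in Python. This is a minimal illustration, not how get-prs is implemented: it assumes the discussion objects expose is_pull_request, status, num, and title (as huggingface_hub's Discussion does), and DiscussionStub, open_pull_requests, and fetch_open_prs are hypothetical names introduced here. The sample data is made up.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class DiscussionStub:
    """Stand-in for a Hub discussion (hypothetical; only the fields used here)."""
    num: int
    title: str
    status: str            # e.g. "open", "closed", "merged", "draft"
    is_pull_request: bool


def open_pull_requests(discussions: Iterable) -> List:
    """Keep only discussions that are pull requests and still open."""
    return [d for d in discussions if d.is_pull_request and d.status == "open"]


def fetch_open_prs(repo_id: str) -> List:
    """Query the Hub for open PRs on a repo (needs network access)."""
    # Imported lazily so the filter above can be used and tested offline.
    from huggingface_hub import HfApi
    return open_pull_requests(HfApi().get_repo_discussions(repo_id=repo_id))


# Made-up sample data for illustration only.
sample = [
    DiscussionStub(1, "Add evaluation results", "open", True),
    DiscussionStub(2, "Question about license", "open", False),
    DiscussionStub(3, "Old eval PR", "merged", True),
]
blocking = open_pull_requests(sample)
for pr in blocking:
    print(f"Open PR #{pr.num}: {pr.title}")
```

If blocking is non-empty, stop and show the list to the user before using --create-pr.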

All paths are relative to the directory containing this SKILL.md file. Before running any script, first cd to that directory or use the full path.

Use --help for the latest workflow guidance. Works with plain Python or uv run:

uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py inspect-tables --help
uv run scripts/evaluation_manager.py extract-readme --help

Key workflow (matches CLI help):

get-prs → check for existing open PRs first
inspect-tables → find table numbers/columns
extract-readme --table N → prints YAML by default
add --apply (push) or --create-pr to write changes

Core Capabilities

1. Inspect and Extract Evaluation Tables from README

Inspect Tables: Use inspect-tables to see all tables in a README with structure, columns, and sample rows
Parse Markdown Tables: Accurate parsing using markdown-it-py (ignores code blocks and examples)
Table Selection: Use --table N to extract from a specific table (required when multiple tables exist)
Format Detection: Recognize common formats (benchmarks as rows, columns, or comparison tables with multiple models)
Column Matching: Automatically identify model columns/rows; prefer --model-column-index (index from inspect output). Use --model-name-override only with exact column header text.
YAML Generation: Convert selected table to model-index YAML format
Task Typing: --task-type sets the task.type field in model-index output (e.g., text-generation, summarization)
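
To make the extraction steps concrete, here is a simplified sketch of turning a "benchmarks as rows" table into model-index results. It is an illustration under stated assumptions, not the skill's code: the skill parses with markdown-it-py, while this sketch uses a naive pipe splitter. The table contents, the "accuracy" metric type, and the lowercase dataset slug are all made up for the example.

```python
import json
from typing import Dict, List


def parse_markdown_table(text: str) -> List[List[str]]:
    """Split a simple pipe table into rows of cells, skipping the |---| separator."""
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        if all(c and set(c) <= set("-:") for c in cells):
            continue  # header/body separator row
        rows.append(cells)
    return rows


def table_to_results(rows: List[List[str]], model_column_index: int, task_type: str) -> List[Dict]:
    """Treat each body row as one benchmark; read the score from the chosen column."""
    results = []
    for row in rows[1:]:  # rows[0] is the header row
        benchmark, raw = row[0], row[model_column_index]
        try:
            value = float(raw.rstrip("%"))
        except ValueError:
            continue  # skip non-numeric cells such as "n/a"
        results.append({
            "task": {"type": task_type},
            # Dataset slug and metric type are assumptions for this sketch.
            "dataset": {"name": benchmark, "type": benchmark.lower().replace(" ", "_")},
            "metrics": [{"name": benchmark, "type": "accuracy", "value": value}],
        })
    return results


# Made-up README table for illustration only.
readme_table = """
| Benchmark | my-model | baseline |
|-----------|----------|----------|
| MMLU      | 71.3     | 68.0     |
| GSM8K     | 55.2%    | n/a      |
"""
rows = parse_markdown_table(readme_table)
model_index = [{"name": "my-model", "results": table_to_results(rows, 1, "text-generation")}]
print(json.dumps({"model-index": model_index}, indent=2))
```

Selecting the column by index mirrors the --model-column-index preference above: the index stays unambiguous even when header text contains extra formatting.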

2. Import from Artificial Analysis

API Integration: Fetch benchmark scores directly from Artificial Analysis
Automatic Formatting: Convert API responses to model-index format
Metadata Preservation: Maintain source attribution and URLs
PR Creation: Automatically create pull requests with evaluation updates

3. Model-Index Management

YAML Generation: Create properly formatted model-index entries
Merge Support: Add evaluations to existing model cards without overwriting
Validation: Ensure compliance with Papers with Code specification
Batch Operations: Process multiple models efficiently
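
One way to picture "merge without overwriting" is the sketch below. The merge policy is an assumption made for illustration (results are matched on task type plus dataset type, and existing metric values always win), and result_key and merge_results are hypothetical helpers, not functions from the skill. All scores are made up.

```python
from copy import deepcopy
from typing import Dict, List, Tuple


def result_key(result: Dict) -> Tuple[str, str]:
    """Identify a result by task type and dataset type (assumed merge key)."""
    return (result["task"]["type"], result["dataset"]["type"])


def merge_results(existing: List[Dict], incoming: List[Dict]) -> List[Dict]:
    """Return existing results plus new ones; existing metric values are never replaced."""
    merged = deepcopy(existing)
    by_key = {result_key(r): r for r in merged}
    for new in incoming:
        current = by_key.get(result_key(new))
        if current is None:
            copied = deepcopy(new)
            merged.append(copied)
            by_key[result_key(copied)] = copied
            continue
        # Same task/dataset already present: add only metric types not seen yet.
        known = {m["type"] for m in current["metrics"]}
        current["metrics"].extend(
            deepcopy(m) for m in new["metrics"] if m["type"] not in known
        )
    return merged


# Made-up entries for illustration only.
existing = [{
    "task": {"type": "text-generation"},
    "dataset": {"name": "MMLU", "type": "mmlu"},
    "metrics": [{"type": "accuracy", "value": 71.3}],
}]
incoming = [
    {"task": {"type": "text-generation"},
     "dataset": {"name": "MMLU", "type": "mmlu"},
     "metrics": [{"type": "accuracy", "value": 99.9}, {"type": "f1", "value": 0.7}]},
    {"task": {"type": "text-generation"},
     "dataset": {"name": "GSM8K", "type": "gsm8k"},
     "metrics": [{"type": "accuracy", "value": 55.2}]},
]
merged = merge_results(existing, incoming)
print(len(merged), "results after merge")
```

The function works on copies, so the original list can still be diffed against the merged one before anything is pushed.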

4. Run Evaluations on HF Jobs (Inference Providers)

Inspect-AI Integration: Run standard evaluations using the inspect-ai library
UV Integration: Seamlessly run Python scripts with ephemeral dependencies on HF infrastructure
Zero-Config: No Dockerfiles or Space management required
Hardware Selection: Configure CPU or GPU hardware for the evaluation job
Secure Execution: Handles API tokens safely via secrets passed through the CLI

5. Run Custom Model Evaluations with vLLM (NEW)

⚠️ Important: This approach is only possible on devices with uv installed and sufficient GPU memory.

Benefits: No need to use the hf_jobs() MCP tool; scripts can be run directly in the terminal.
When to use: The user is working directly on a local device with a GPU available.

Before running the script

Check that the script path exists
Check that uv is installed
Check that a GPU is available with nvidia-smi
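
These three checks can be automated with a small preflight helper. This is a sketch, and preflight is a hypothetical function rather than part of the skill. Finding nvidia-smi on PATH only suggests that an NVIDIA driver is present; it does not confirm that the GPU has enough memory for the model.

```python
import shutil
from pathlib import Path
from typing import Callable, List, Optional


def preflight(
    script_path: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    if not Path(script_path).is_file():
        problems.append(f"script not found: {script_path}")
    if which("uv") is None:
        problems.append("uv is not on PATH")
    if which("nvidia-smi") is None:
        problems.append("nvidia-smi is not on PATH (no NVIDIA driver or GPU?)")
    return problems


# Path taken from the "Running the script" step; adjust to the script you run.
issues = preflight("scripts/train_sft_example.py")
for issue in issues:
    print("preflight:", issue)
```

The which parameter exists only so the lookup can be swapped out in tests; normal use relies on the shutil.which default.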

Running the script

uv run scripts/train_sft_example.py

Features

vLLM Backend: High-performance GPU inference (5-10x faster than standard HF methods)
lighteval Framework: Hugging Face's evaluation library with Open LLM Leaderboard tasks
inspect-ai Framework: UK AI Safety Institute's evaluation library
Standalone or Jobs: Run locally or submit to HF Jobs infrastructure

Usage Instructions

The skill includes Python scripts in scripts/ to perform operations.

Prerequisites

Preferred: use uv run (PEP 723 header auto-installs deps)
Or install manually: pip install huggingface-hub markdown-it-py python-dotenv pyyaml requests
Set the HF_TOKEN environment variable to a token with write access
For Artificial Analysis: set the AA_API_KEY environment variable
.env is loaded automatically if python-dotenv is installed

Method 1: Extract from README (CLI workflow)

Recommended flow (matches --help):

# 1) Inspect tables to get table numbers and column hints
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model"

# 2) Extract a specific table (prints YAML by default)
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model" \
  --table 1 \
  [--model-column-index <column index shown by inspect-tables>] \
  [--model-name-override "<column header/model name>"]  # use exact header text if you can't use the index

# 3) Apply changes (push or PR)
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model" \
  --table 1 \
  --apply       # push directly
# or
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model" \
  --table 1 \
  --create-pr   # open a PR

Validation checklist:

YAML is printed by default; compare against the README table before applying.
Prefer --model-column-index; if using --model-name-override, the column header text must be exact.
For transposed tables (models as rows), ensure only one row is extracted.
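
When driving this workflow from Python, the extract-readme command line can be assembled programmatically. The sketch below uses only flags shown in this document. build_extract_command is a hypothetical helper, and treating --apply and --create-pr as mutually exclusive is an assumption based on how they are presented above.

```python
import shlex
from typing import List, Optional


def build_extract_command(
    repo_id: str,
    table: int,
    model_column_index: Optional[int] = None,
    model_name_override: Optional[str] = None,
    write_mode: Optional[str] = None,  # None (print YAML only), "apply", or "create-pr"
) -> List[str]:
    """Assemble the argv list for one extract-readme invocation."""
    if write_mode not in (None, "apply", "create-pr"):
        raise ValueError(f"unknown write_mode: {write_mode!r}")
    cmd = ["uv", "run", "scripts/evaluation_manager.py", "extract-readme",
           "--repo-id", repo_id, "--table", str(table)]
    # Prefer the column index; fall back to the exact header text.
    if model_column_index is not None:
        cmd += ["--model-column-index", str(model_column_index)]
    elif model_name_override is not None:
        cmd += ["--model-name-override", model_name_override]
    if write_mode is not None:
        cmd.append(f"--{write_mode}")
    return cmd


preview = build_extract_command("username/model", table=1, model_column_index=2)
print(shlex.join(preview))
```

Leaving write_mode unset keeps the command in its default print-only mode, which matches the first checklist item: review the YAML before applying.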

Method 2: Import from Artificial Analysis

Fetch benchmark scores from Artificial Analysis API and add them to a model card.

Basic Usage:

AA_API_KEY="your-api-key" uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "username/model-name"

With Environment File:

# Create .env file
echo "AA_API_KEY=your-api-key" >> .env
echo "HF_TOKEN=your-hf-token" >> .env
