evaluating-llms-harness

Name: evaluating-llms-harness
Rating: 5 (4 reviews)
Author: immacualate

Evaluates LLMs across 60+ academic benchmarks such as MMLU and HumanEval. It is widely used by EleutherAI, HuggingFace, and major labs to benchmark model quality, compare models, and track training progress. It supports HuggingFace, vLLM, and API-based model backends.

4 stars

Updated last month

License: MIT

How to add

/plugin marketplace add immacualate/claude-forge

The exact command may vary by repository. Check the README on GitHub.

#llm #ai #api

lm-evaluation-harness - LLM Benchmarking

Quick start

lm-evaluation-harness evaluates LLMs across 60+ academic benchmarks using standardized prompts and metrics.

Installation:

pip install lm-eval

Evaluate any HuggingFace model:

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu,gsm8k,hellaswag \
  --device cuda:0 \
  --batch_size 8
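
When sweeping several models or task sets from a script, it can help to assemble the same command programmatically. The sketch below is one possible approach, not part of lm-evaluation-harness: `build_lm_eval_command` is a hypothetical helper name, and it only uses the flags shown in the command above.

```python
import shlex


def build_lm_eval_command(pretrained, tasks, device="cuda:0", batch_size=8):
    """Return the lm_eval argument list for a HuggingFace model.

    Hypothetical helper for illustration; it mirrors the flags used in
    the quick-start command and nothing else.
    """
    if not tasks:
        raise ValueError("at least one task is required")
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={pretrained}",
        "--tasks", ",".join(tasks),  # the CLI takes a comma-separated task list
        "--device", device,
        "--batch_size", str(batch_size),
    ]


cmd = build_lm_eval_command(
    "meta-llama/Llama-2-7b-hf",
    ["mmlu", "gsm8k", "hellaswag"],
)

# Print the command as a single shell-safe string for inspection.
print(shlex.join(cmd))

# To run it (assumes lm-eval is installed and the model is accessible):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list to `subprocess.run` avoids shell quoting issues when model names or task lists come from user input.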

View available tasks:

lm_eval --tasks list

Common workflows

Wo

[Description truncated. See the full README on GitHub.]
