Homework Grader
A course-agnostic, Rubric-driven evaluation engine for grading student homework with Claude. All course-specific knowledge lives in user-defined Rubric YAML files; this Skill provides the scoring methodology, quality control framework, and batch processing pipeline.
When to Activate
Activate this Skill when the user:
- Asks to grade, score, or evaluate student homework or assignments
- Wants to create a rubric or scoring criteria for coursework
- Needs to batch-process a set of student submissions
- Asks about calibrating AI scoring against teacher standards
- Wants to export grades to Excel or generate grade reports
- Mentions PDCA, quality control, or bias checking in a grading context
- Refers to homework, assignments, submissions, or coursework evaluation
Keywords: grade homework, score assignments, rubric, evaluate student work, batch grading, calibrate scoring, export grades, feedback comments, PDCA cycle
Core Concepts
Rubric-Driven Design
Every scoring decision traces back to a Rubric YAML file that defines:
- Criteria with weights, 1-5 anchors, and evidence types
- Gates for pre-scoring validation (keyword, structure, length, custom)
- Thresholds for accept/review/reject classification
- Comment guidelines for feedback language, tone, and structure
The Skill never invents criteria. If the Rubric doesn't define it, it doesn't get scored.
Direct Scoring Method
Each submission is scored independently against absolute standards (not compared to peers). This is the correct method when objective criteria exist — which Rubrics provide by definition.
- Scale: 1-5 Likert (integer scores per dimension)
- Process: Evidence → Reasoning → Score (never reversed)
- Aggregation: Weighted sum across dimensions
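The aggregation step above can be sketched as follows. This is a minimal illustration, not the Skill's actual implementation: the function name `weighted_score` and the criteria `accuracy`, `clarity`, and `depth` are hypothetical, and the weights are assumed to come from a validated Rubric.

```python
def weighted_score(dimension_scores, weights):
    """Aggregate integer 1-5 dimension scores into one weighted total."""
    if set(dimension_scores) != set(weights):
        raise ValueError("scores and weights must cover the same criteria")
    for name, score in dimension_scores.items():
        if not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{name}: score must be an integer from 1 to 5")
    # Weighted sum across dimensions; weights are assumed to sum to 1.0.
    return sum(dimension_scores[name] * weights[name] for name in weights)


# Hypothetical criteria and weights, for illustration only.
weights = {"accuracy": 0.5, "clarity": 0.3, "depth": 0.2}
scores = {"accuracy": 4, "clarity": 3, "depth": 5}
total = weighted_score(scores, weights)  # 4*0.5 + 3*0.3 + 5*0.2
```

Because the weights sum to 1.0, the total stays on the same 1-5 scale as the per-dimension scores, which is what lets the Rubric thresholds be applied to it directly.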
PDCA Quality Cycle
Every grading batch follows Plan → Do → Check → Act:
- Plan: Define/validate Rubric, prepare calibration samples
- Do: Preprocess submissions, run AI scoring, generate comments
- Check: Calibrate against teacher scores, check distributions, detect bias
- Act: Human review of flagged items, refine Rubric for next round
Multimodal Support
Submissions are preprocessed into a unified Intermediate Representation (IR) before scoring. Supported modalities:
- Text (P0): docx, pdf → Markdown text
- Image (P1): jpg, png → Claude Vision structured descriptions
- Video (V2): mp4 → keyframes + transcript (future)
- Mixed: Combination of above
PDCA Workflow
Phase 1: Plan
Goal: Establish scoring standards and validation baseline.
| Step | Action | Output | Exit Criterion |
|---|---|---|---|
| 1.1 | Define or load Rubric YAML | rubric.yaml | Passes schema validation |
| 1.2 | Validate Rubric | Validation report | Weights sum to 1.0, anchors complete, gates well-formed |
| 1.3 | Prepare calibration samples | 3-5 teacher-scored samples | Cover good/medium/poor range |
| 1.4 | Configure batch parameters | Processing config | Submission format, batch size, mode |
Exit: Rubric validated + calibration samples ready + teacher confirms.
Failure: Invalid Rubric → fix and re-validate. No calibration samples → teacher must provide at least 3 before proceeding to the Do phase.
Phase 2: Do
Goal: Process all submissions and produce AI scores.
| Step | Action | Output | Exit Criterion |
|---|---|---|---|
| 2.1 | Collect submissions | workspace/raw/ | All files present and readable |
| 2.2 | Preprocess → IR | workspace/ir/ | Each submission has valid IR JSON |
| 2.3 | Run gate checks | Gate results in IR | All gates executed, failures recorded |
| 2.4 | Score each submission | workspace/scores/ | Each has dimension scores + comment |
| 2.5 | Generate comments | Comments in score records | 200-400 chars, three sections |
Exit: All submissions scored (or failed items logged).
Failure: API errors → retry with exponential backoff (max 3). File corruption → log and skip. Parse errors → retry up to 2 times, then flag for manual review.
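The retry rule for API errors could look something like the sketch below. Several details are assumptions rather than anything the text specifies: "max 3" is read as three retries after the first attempt, the base delay is one second and doubles each time, and the helper name `call_with_backoff` is hypothetical. Real code would catch the specific API error type instead of `Exception`.

```python
import time


def call_with_backoff(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt, then try again.

    max_retries counts retries after the first attempt, so fn runs at most
    max_retries + 1 times. The last exception is re-raised when retries run out.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:  # assumption: narrow this to the real API error
            if attempt == max_retries:
                raise
            sleep(base_delay * (2 ** attempt))


# Illustration: a call that fails twice, then succeeds.
calls = []
delays = []


def flaky_scoring_call():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient API error")
    return "scored"


# Passing delays.append as sleep records the waits instead of sleeping.
result = call_with_backoff(flaky_scoring_call, sleep=delays.append)
```

Injecting the `sleep` function keeps the example fast to run and makes the delay schedule easy to inspect.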
Phase 3: Check
Goal: Validate AI scoring quality.
| Step | Action | Threshold | On Failure |
|---|---|---|---|
| 3.1 | Calibration: AI vs teacher on samples | κ ≥ 0.70, ρ ≥ 0.80 per dimension | → Back to Plan: adjust anchors |
| 3.2 | Distribution check | \|skewness\| < 1.0, no >40% concentration | → Spot-check extreme scores |
| 3.3 | Bias detection | Length-score \|ρ\| < 0.3, position-score \|ρ\| < 0.2 | → Adjust prompts, re-score |
| 3.4 | Confidence filtering | ≤20% of submissions in mandatory review (confidence < 0.6) | → Review flagged items |
Exit: All checks pass, or teacher accepts results after reviewing issues.
Failure: κ < 0.70 → return to Plan phase, revise Rubric anchors. Significant bias → adjust scoring prompts and re-run Do phase.
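Two of the Check-phase metrics can be sketched in plain Python. The text only writes κ, so treating it as unweighted Cohen's kappa is an assumption; a weighted variant may suit ordinal 1-5 scores better. Reading "concentration" as the share of submissions on the single most common score is also an assumption, and both function names are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty lists of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def max_concentration(scores):
    """Share of submissions that fall on the single most common score."""
    return max(Counter(scores).values()) / len(scores)


# Illustrative data only: five teacher-scored samples for one dimension.
teacher = [5, 4, 3, 2, 1]
ai = [5, 4, 3, 2, 2]
kappa = cohen_kappa(teacher, ai)                 # about 0.75
calibrated = kappa >= 0.70
too_concentrated = max_concentration([3, 3, 3, 4, 2]) > 0.40
```

With only 3-5 calibration samples a single disagreement moves κ substantially, so a result close to the threshold is worth confirming with the teacher.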
Phase 4: Act
Goal: Finalize grades and capture lessons.
| Step | Action | Output |
|---|---|---|
| 4.1 | Human review of flagged items | Corrected scores |
| 4.2 | Export to Excel | Grade spreadsheet |
| 4.3 | Record Rubric adjustments (if any) | Updated Rubric version |
| 4.4 | Log lessons learned | Improvement log for next cycle |
Exit: Final grades exported + Rubric version updated if changed.
Rubric Schema
A Rubric is a YAML file with the following structure. See templates/rubric.yaml.tmpl for a copy-paste template.
Required Fields
rubric:
  id: "course-assignment-v1.0"  # Unique identifier
  name: "Human-readable name"
  version: 1.0
criteria:
  criterion_id:
    name: "Dimension Name"
    weight: 0.30  # All weights MUST sum to 1.0
    scale: [1, 2, 3, 4, 5]
    description: "What this measures"
    scoring_guidance: "How to evaluate"
    anchors:
      5: "Excellent — observable criteria"
      4: "Good — observable criteria"
      3: "Adequate — observable criteria"
      2: "Below average — observable criteria"
      1: "Poor — observable criteria"
    evidence_type: quote  # quote | observation | metric
thresholds:
  accept: 3.0
  reject: 1.5
  review: [1.5, 3.0]  # Must equal [reject, accept]
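One way to turn a weighted total into the accept/review/reject classification is sketched below. The boundary handling is an assumption, since the schema does not say which side the endpoints fall on: here a total at or above `accept` is accepted, a total below `reject` is rejected, and everything in between (including exactly `reject`) goes to review. The function name `classify` is hypothetical.

```python
def classify(total, accept=3.0, reject=1.5):
    """Map a weighted total to accept / review / reject."""
    if accept <= reject:
        raise ValueError("accept threshold must be greater than reject")
    if total >= accept:
        return "accept"
    if total < reject:
        return "reject"
    return "review"  # assumed band: reject <= total < accept


labels = [classify(t) for t in (3.9, 3.0, 2.2, 1.5, 1.2)]
```

If the intended boundaries differ, only the two comparison operators need to change.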
Optional Fields
Tip: The templates/rubric.yaml.tmpl template includes additional optional fields (created, updated, author, course.code, course.semester, gate.description, notes) not listed here. They are informational metadata: the scoring engine ignores them, but they help with Rubric management.
course:  # Remove entirely if not needed
  name: "Course Name"
  submission_type: text  # text | image | video | mixed
  expected_formats: [docx, pdf]
  student_count: 100
gates:  # Pre-scoring checks
  - id: "G-001"
    name: "Gate Name"
    check_method: keyword  # keyword | structure | length | custom
    parameters: { keywords: [...], min_count: 1 }
    on_fail: flag  # fail | flag | warn
comment_guidelines:
  tone: "constructive, specific"
  language: "zh-CN"
  length_range: [200, 400]
  required_sections: [strengths, weaknesses, suggestions]
  prohibited_patterns: [...]
history:
  - version: 1.0
    date: "2026-01-01"
    changes: ["Initial version"]
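A `keyword` gate might be evaluated roughly as in the sketch below. The semantics are assumed rather than taken from the source: matching is a case-insensitive substring test, and `min_count` is read as the minimum number of distinct keywords that must appear. The function name, the result fields, and the sample keywords are all hypothetical.

```python
def run_keyword_gate(text, gate):
    """Evaluate one keyword gate against a submission's extracted text."""
    params = gate["parameters"]
    lowered = text.lower()
    # Assumption: a keyword counts once if it appears anywhere in the text.
    matched = [kw for kw in params["keywords"] if kw.lower() in lowered]
    passed = len(matched) >= params.get("min_count", 1)
    return {
        "gate_id": gate["id"],
        "passed": passed,
        "matched": matched,
        # on_fail (fail | flag | warn) only applies when the gate fails.
        "action": None if passed else gate["on_fail"],
    }


# Hypothetical gate, shaped like the YAML entry above.
gate = {
    "id": "G-001",
    "name": "Mentions method",
    "check_method": "keyword",
    "parameters": {"keywords": ["hypothesis", "method"], "min_count": 1},
    "on_fail": "flag",
}
ok = run_keyword_gate("Our method compares two groups.", gate)
missing = run_keyword_gate("No relevant terms here.", gate)
```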
Validation Rules
| Rule | Check |
|---|---|
| Weights | sum(criteria.*.weight) = 1.0 (±0.001) |
| Anchors | Every value in scale has an anchor description |
| Thresholds | accept > reject; review = [reject, accept] |
| Gate IDs | Unique within the Rubric |
| Gate on_fail | One of: fail, flag, warn |
| evidence_type | One of: quote, observation, metric |
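The rules in the table can be checked with a short function like the one below. It is a sketch under assumptions: the Rubric is already loaded into a plain dict, `criteria`, `thresholds`, and `gates` sit at the top level of that dict, and the function name and error messages are hypothetical.

```python
import math

ALLOWED_ON_FAIL = {"fail", "flag", "warn"}
ALLOWED_EVIDENCE = {"quote", "observation", "metric"}


def validate_rubric(rubric):
    """Return a list of problems found; an empty list means the Rubric passed."""
    errors = []
    criteria = rubric.get("criteria", {})

    # Weights: sum(criteria.*.weight) must equal 1.0 within +/-0.001.
    total = sum(c.get("weight", 0) for c in criteria.values())
    if not math.isclose(total, 1.0, rel_tol=0.0, abs_tol=0.001):
        errors.append(f"weights sum to {total:.3f}, expected 1.0")

    for cid, c in criteria.items():
        # Anchors: every value in scale needs a non-empty description.
        anchors = c.get("anchors", {})
        missing = [s for s in c.get("scale", []) if not anchors.get(s)]
        if missing:
            errors.append(f"{cid}: missing anchors for {missing}")
        if c.get("evidence_type") not in ALLOWED_EVIDENCE:
            errors.append(f"{cid}: invalid evidence_type")

    # Thresholds: accept > reject and review == [reject, accept].
    th = rubric.get("thresholds", {})
    accept, reject = th.get("accept"), th.get("reject")
    if accept is None or reject is None or accept <= reject:
        errors.append("thresholds: accept must be greater than reject")
    elif list(th.get("review", [])) != [reject, accept]:
        errors.append("thresholds: review must equal [reject, accept]")

    # Gates: unique IDs and a known on_fail action.
    gates = rubric.get("gates", [])
    ids = [g.get("id") for g in gates]
    if len(ids) != len(set(ids)):
        errors.append("gates: duplicate gate id")
    for g in gates:
        if g.get("on_fail") not in ALLOWED_ON_FAIL:
            errors.append(f"{g.get('id')}: invalid on_fail")
    return errors


# Hypothetical two-criterion Rubric used only to exercise the checks.
full_anchors = {level: f"level {level}" for level in range(1, 6)}
example = {
    "criteria": {
        "accuracy": {"weight": 0.6, "scale": [1, 2, 3, 4, 5],
                     "anchors": full_anchors, "evidence_type": "quote"},
        "clarity": {"weight": 0.4, "scale": [1, 2, 3, 4, 5],
                    "anchors": full_anchors, "evidence_type": "observation"},
    },
    "thresholds": {"accept": 3.0, "reject": 1.5, "review": [1.5, 3.0]},
    "gates": [{"id": "G-001", "on_fail": "flag"}],
}
problems = validate_rubric(example)
```

Returning every problem at once, rather than stopping at the first, lets a teacher fix the Rubric in a single pass before re-validating.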
Scoring Protocol
This is the complete protocol for scoring a single submission. Claude executes this directly — no external scrip