Autonomous Agent Nightshift
A playbook + reusable bash harness for running Claude Code agents overnight. The user writes a feature plan (a todo file with checkboxes), and the harness orchestrates implement → validate → fix loop → mark done. The user wakes up to a validated diff, or, in Bulletproof mode, a PR with review feedback already addressed.
This skill drives three core workflows:
- Setup — initialize a new nightshift run on the user's project
- Morning review — triage results after a completed run
- Bulletproof — production-hardening sweep with branch + PR + review-comment healing
All bundled assets live under this skill's directory:
scripts/
  run-agent-loop.sh             Classic feature-implementation runner
  nightshift-bulletproof.sh     Branch + commit + PR + review variant
  start-nightshift.sh           start/stop/status/tail launcher
  test-nightshift.sh            Recursive validation loop (no Claude)
templates/
  todo-template.md              Feature-plan skeleton
  codebase-context.md           Heredoc fill-in for the runner
  qa-checklist-template.md      Production-ship checklist
  runner-config.env             Tuning presets
docs/
  01-playbook.md                Full master guide (read first)
  02-bulletproof-mode.md        PR-loop variant
  03-chrome-testing.md          Live browser MCP testing
  04-qa-checklist.md            Writing production checklists
  05-failure-modes.md           Cheatsheet of seen failures
  06-test-loop.md               Recursive validation loop
examples/
  todo-simple-example.md        Synthetic 5-task dark-mode toggle (start here)
  bulletproof-steps-example.md  Synthetic 10-step production hardening sweep
  qa-checklist-saas.md          Real 22-section SaaS checklist (sanitized)
  bulletproof-summary.log       Real 100-step run timeline (sanitized)
Workflow 1 — Setup (new run)
Trigger: user says "set up nightshift", "I want to run an agent overnight to build X", "let's automate this feature build", or similar.
Step 1: Detect project stack
Read these files (if present) to detect framework, test runner, package manager, lint/format tools:
- package.json (look at scripts, dependencies)
- tsconfig.json
- requirements.txt / pyproject.toml / Cargo.toml / go.mod
- vitest.config.*, jest.config.*, playwright.config.*, bun.lockb
- .eslintrc.*, .prettierrc*, biome.json
Record: stack, package manager (npm / pnpm / bun / yarn), lint command, type-check command, test command, dev-server command + port.
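For the package-manager part of Step 1, lockfiles are usually the most reliable signal. A minimal sketch, assuming the conventional lockfile names (bun.lockb or bun.lock, pnpm-lock.yaml, yarn.lock, package-lock.json); detect_package_manager is a hypothetical helper, not part of the bundled scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: infer the package manager from lockfiles in a directory.
# Checks run in a fixed order, so a repo with several lockfiles reports the first match.
detect_package_manager() {
  local dir=${1:-.}
  if [ -f "$dir/bun.lockb" ] || [ -f "$dir/bun.lock" ]; then
    echo bun
  elif [ -f "$dir/pnpm-lock.yaml" ]; then
    echo pnpm
  elif [ -f "$dir/yarn.lock" ]; then
    echo yarn
  elif [ -f "$dir/package-lock.json" ]; then
    echo npm
  elif [ -f "$dir/package.json" ]; then
    echo "npm (no lockfile)"
  else
    echo unknown
  fi
}
```

Run `detect_package_manager .` from the project root. If more than one lockfile exists, treat the result as a prompt to ask the user rather than as a final answer.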
Step 2: Clarify the feature plan
Ask the user what they want built. Aim for:
- One sentence per task: a specific outcome, not a vague goal
- Each task small enough to fit in one Claude session (~5–15 min)
- No task depends on more than 2 prior tasks
- Number tasks sequentially across phases
If the user gives a vague request ("polish the dashboard"), help them split it into specific tasks. Bad: "improve the UI." Good: "Add a stop button that aborts the SSE stream and shows during generation."
Step 3: Write the todo file
Use templates/todo-template.md as the structure. Create todo-{YYYY_MM_DD}_{name}.md at the project root. For each task, fill both Implementation and Validation with specifics:
- Implementation must name the exact files, functions, props, classes to touch — and reference any existing hook/pattern the agent should reuse
- Validation must be a testable assertion ("Assert that X returns Y when given Z"), not "make sure it works"
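The file naming in Step 3 is easy to get wrong by hand. A minimal sketch, assuming a hypothetical SKILL_DIR variable that points at this skill's directory and a plan name without spaces; it copies the template and refuses to overwrite an existing plan.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create todo-{YYYY_MM_DD}_{name}.md in the current
# directory from the bundled template. SKILL_DIR is an assumed variable.
new_todo() {
  local name=$1 file
  if [ -z "${SKILL_DIR:-}" ]; then
    echo "set SKILL_DIR to this skill's directory" >&2
    return 1
  fi
  file="todo-$(date +%Y_%m_%d)_${name}.md"
  if [ -e "$file" ]; then
    echo "refusing to overwrite $file" >&2
    return 1
  fi
  cp "$SKILL_DIR/templates/todo-template.md" "$file" || return 1
  echo "$file"   # print the created file name for the next step
}
```

Run it from the project root, e.g. `new_todo dark_mode`, then fill in Implementation and Validation for each task in the printed file.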
Step 4: Generate the codebase context
Explore the user's repo (Glob + key file reads) and produce a CODEBASE_CONTEXT heredoc following templates/codebase-context.md. Include:
- File paths the agent might touch, one-liner per file
- Architecture patterns (state management, data flow)
- Conventions (formatting, naming, testing imports)
- Forbidden patterns (no `any`, no `@ts-ignore`, etc.)
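One way the CODEBASE_CONTEXT heredoc could look inside the runner. Every path, rule, and the build_prompt helper below are hypothetical placeholders, and prepending the context to each task prompt is an assumption about how the runner uses it. The quoted 'EOF' delimiter matters: it keeps backticks such as `any` from being executed as command substitutions.

```shell
#!/usr/bin/env bash
# Placeholder context: replace every entry with what exploring the repo found.
CODEBASE_CONTEXT=$(cat <<'EOF'
## Files
- src/components/Header.tsx: top nav, hosts the theme toggle
- src/hooks/useTheme.ts: reads and writes the theme preference

## Patterns
- State: React context via ThemeProvider, no global stores

## Conventions
- Prettier defaults; tests import from "bun:test"

## Forbidden
- No `any`, no `@ts-ignore`
EOF
)

# Hypothetical helper: prepend the shared context to one task's instructions.
build_prompt() {
  printf '%s\n\n%s\n' "$CODEBASE_CONTEXT" "$1"
}
```

Keeping one file path per line with a one-liner description makes it cheap for the agent to scan before each task.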
Step 5: Copy + customize the runner
Copy scripts/run-agent-loop.sh and scripts/start-nightshift.sh into the project root. Edit run-agent-loop.sh:
- Set `TODO_FILE` to the file you just created
- Paste the codebase context into the `CODEBASE_CONTEXT` heredoc
- Update `run_full_validation()` for the user's toolchain. Per-stack patterns:
  - Node/TS: `npx prettier --write . && npx tsc --noEmit && npx eslint . --quiet && bun test`
  - Python: `black . && mypy . && ruff check . && pytest`
  - Go: `gofmt -w . && go vet ./... && golangci-lint run && go test ./...`
  - Rust: `cargo fmt && cargo clippy -- -D warnings && cargo test`
- Adjust `DEV_PORT` to match the user's dev server
- Tune limits using the formula `MAX_ITERATIONS >= NUM_TASKS × (1 + MAX_FIX_ATTEMPTS + 2)`. Default for overnight: `MAX_ITERATIONS=200`, `MAX_FIX_ATTEMPTS=7`.
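The sizing formula can be checked before launch. A minimal sketch; NUM_TASKS is entered by hand here and min_iterations is a hypothetical helper, with the other values mirroring the overnight defaults above.

```shell
#!/usr/bin/env bash
# Values mirror the overnight defaults; NUM_TASKS is counted by hand here.
NUM_TASKS=20
MAX_FIX_ATTEMPTS=7
MAX_ITERATIONS=200

# Formula from the playbook: NUM_TASKS × (1 + MAX_FIX_ATTEMPTS + 2)
min_iterations() {
  local tasks=$1 fixes=$2
  echo $(( tasks * (1 + fixes + 2) ))
}

needed=$(min_iterations "$NUM_TASKS" "$MAX_FIX_ATTEMPTS")
if (( MAX_ITERATIONS < needed )); then
  echo "MAX_ITERATIONS=$MAX_ITERATIONS is below the $needed this plan needs" >&2
else
  echo "ok: $MAX_ITERATIONS >= $needed"
fi
```

With the defaults, each task can use up to 1 + 7 + 2 = 10 iterations, so `MAX_ITERATIONS=200` covers a plan of at most 20 tasks; a 25-task plan needs 250. If every task in the todo file is a single `- [ ]` checkbox line (an assumption about the template), `grep -c '^- \[ \]' todo-*.md` gives a quick count.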
Step 6: Pre-flight check
Verify (and report each as ✓ or ✗):
- `claude` CLI installed (`claude --version`) and authenticated
- Test runner installed and a clean baseline (e.g. `bun test && npx tsc --noEmit` returns 0)
- No uncommitted changes the user cares about (the agent will modify files)
- `.agent-logs/` and `.claude_iterations` added to `.gitignore`
- (If Chrome testing) Chrome open with the Claude-in-Chrome extension showing "Connected"
- (If Chrome testing) User logged into the app for any auth-walled routes
- (If Bulletproof) `gh auth status` green; `GITHUB_REPO` config set
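The scriptable items in this list can be checked in one pass; the Chrome items stay manual. A minimal sketch: BASELINE_CMD and BULLETPROOF are hypothetical knobs for this example, not variables read by the bundled scripts, and the clean-tree check is stricter than "changes the user cares about".

```shell
#!/usr/bin/env bash
# Print "✓ label" or "✗ label" depending on whether the command succeeds.
check() {
  local label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "✓ $label"
  else
    echo "✗ $label"
    return 1
  fi
}

# True when the pattern appears as an exact line in .gitignore.
ignored() { grep -qxF "$1" .gitignore 2>/dev/null; }

# True inside a git work tree with no uncommitted or untracked changes.
clean_tree() {
  git rev-parse --is-inside-work-tree >/dev/null 2>&1 &&
    [ -z "$(git status --porcelain)" ]
}

preflight() {
  local failed=0
  check "claude CLI installed" claude --version || failed=1
  check "baseline validation passes" bash -c "${BASELINE_CMD:-bun test && npx tsc --noEmit}" || failed=1
  check "working tree clean" clean_tree || failed=1
  check ".agent-logs/ in .gitignore" ignored ".agent-logs/" || failed=1
  check ".claude_iterations in .gitignore" ignored ".claude_iterations" || failed=1
  if [ "${BULLETPROOF:-0}" = 1 ]; then
    check "gh authenticated" gh auth status || failed=1
  fi
  return "$failed"
}
```

Run `preflight` from the project root and paste its ✓/✗ lines into the report. The exact-line `.gitignore` match will miss equivalent spellings such as `.agent-logs` without the slash.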
Step 7: Launch instructions
Don't launch for the user. Tell them:
chmod +x run-agent-loop.sh start-nightshift.sh
./start-nightshift.sh start # detached, survives terminal close
./start-nightshift.sh tail # follow the summary log
Then ./start-nightshift.sh stop to halt.
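The "detached, survives terminal close" behavior typically comes from nohup plus a PID file. The sketch below is a hypothetical stand-in to show the mechanism, not the bundled start-nightshift.sh; RUNNER, LOG_DIR, the nightshift.pid file name, and the nightshift function are assumptions.

```shell
#!/usr/bin/env bash
LOG_DIR=${LOG_DIR:-.agent-logs}
PID_FILE="$LOG_DIR/nightshift.pid"
RUNNER=${RUNNER:-./run-agent-loop.sh}

# True when the PID file exists and that process is still alive.
is_running() {
  [ -f "$PID_FILE" ] && kill -0 "$(cat "$PID_FILE")" 2>/dev/null
}

nightshift() {
  case "$1" in
    start)
      if is_running; then echo "already running"; return 1; fi
      mkdir -p "$LOG_DIR"
      # nohup ignores SIGHUP, so closing the terminal does not end the run.
      nohup "$RUNNER" >"$LOG_DIR/nohup.out" 2>&1 &
      echo $! >"$PID_FILE"
      echo "started (pid $(cat "$PID_FILE"))" ;;
    stop)
      if is_running; then kill "$(cat "$PID_FILE")"; fi
      rm -f "$PID_FILE"
      echo "stopped" ;;
    status)
      if is_running; then echo "running"; else echo "not running"; fi ;;
    tail)
      tail -f "$LOG_DIR/nightshift-summary.log" ;;
    *)
      echo "usage: nightshift {start|stop|status|tail}" >&2
      return 2 ;;
  esac
}
```

A real runner spawns child processes (the claude CLI, test runners), so a production `stop` may need to signal the whole process group rather than a single PID.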
Workflow 2 — Morning review
Trigger: user says "review my nightshift", "what happened overnight", "check the agent's work", or runs ./start-nightshift.sh status and asks for help interpreting the output.
Step 1: Read the summary log
cat .agent-logs/nightshift-summary.log
Identify:
- Tasks completed on first try
- Tasks completed after N fix attempts
- Tasks marked `FAILED` (exhausted retries)
- Tasks with `CHROME REVIEW NEEDED` (code passed, visual unverified)
- Total runtime, total iterations used
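A quick tally of the status markers gives the headline numbers before reading the log in detail. A minimal sketch, assuming each marker is written on one line per affected task; tally_markers is a hypothetical helper, and the real log format may need a different pattern.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count status markers in the nightshift summary log.
tally_markers() {
  local log=${1:-.agent-logs/nightshift-summary.log}
  local marker
  if [ ! -f "$log" ]; then
    echo "no summary log at $log" >&2
    return 1
  fi
  for marker in "FAILED" "CHROME REVIEW NEEDED"; do
    # grep -c prints 0 (and exits 1) when nothing matches; the count is still usable.
    echo "$marker: $(grep -cF "$marker" "$log")"
  done
}
```

Fixed-string matching counts any line containing the marker, so a line such as "previously FAILED, now passing" would inflate the count; confirm against the log itself.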
Step 2: Find the manual-review tasks
grep 'NEEDS MANUAL REVIEW' todo-*.md
grep 'CHROME REVIEW NEEDED' todo-*.md
For each, read the task's body + .agent-logs/task-{N}.log to understand why it failed or was flagged.
Step 3: Surface the diff
git diff --stat
git diff --stat ${BASE_BRANCH:-main}..HEAD # if a branch was created
Summarize what changed by area (routes, components, tests, configs).
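Grouping changed files by top-level directory is a rough but quick proxy for "area", and `git diff --numstat` yields the "+X / -Y across Z files" line for the report. A minimal sketch; diff_by_area and diff_totals are hypothetical helpers, and any extra arguments are passed straight to `git diff` (e.g. `"${BASE_BRANCH:-main}..HEAD"`).

```shell
#!/usr/bin/env bash
# Count changed files per top-level directory; root-level files go under "(root)".
diff_by_area() {
  git diff --name-only "$@" |
    awk -F/ '{ print (NF > 1 ? $1 "/" : "(root)") }' |
    sort | uniq -c | sort -rn
}

# Sum insertions and deletions. Binary files report "-" for both counts,
# so they add to the file count but not to the line totals.
diff_totals() {
  git diff --numstat "$@" |
    awk '{ files++; if ($1 != "-") add += $1; if ($2 != "-") del += $2 }
         END { printf "+%d / -%d across %d files\n", add, del, files }'
}
```

New files the agent created but never staged are untracked and do not appear in `git diff`; check `git status --short`, or run `git add -N .` first so they are included.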
Step 4: Spot-check visual evidence
If Chrome testing ran:
ls .agent-logs/screenshots/
For each REVIEW-flagged UI task, open the screenshot/GIF and verify visually (or instruct the user to).
Step 5: Report
Produce a structured summary:
Completed: N tasks (M on first try, K with fixes)
Failed: N tasks — [list with file:line of the issue]
Chrome review needed: N tasks — [list]
Diff: +X / -Y across Z files
Suggested next steps:
1. Verify [task N] manually — [why]
2. Run `bun test` to confirm clean baseline
3. Commit accepted work: git add -p && git commit
Do not auto-commit. The human stages and commits.
Workflow 3 — Bulletproof (production hardening sweep)
Trigger: user says "bulletproof my codebase", "production-ship sweep", "harden before launch", "run an overnight PR loop", or similar.
Step 1: Audit and propose steps
Read the codebase. Produce a BULLETPROOF-STEPS.md organized by category:
## Category 1 — Styling & Design System
### Step 1 — {specific hardening action}
**Implementation:** ...
**Validation:** ...
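Before a Bulletproof run, a quick structural check keeps every step in the template shape above. A minimal sketch, assuming the exact heading and label spellings shown (`## Category`, `### Step`, `**Implementation:**`, `**Validation:**`); lint_steps is a hypothetical helper, not part of nightshift-bulletproof.sh.

```shell
#!/usr/bin/env bash
# Hypothetical lint: every "### Step" heading must be followed by both an
# **Implementation:** and a **Validation:** line before the next step or
# category heading. Prints each incomplete step and exits non-zero if any.
lint_steps() {
  awk '
    function flush() {
      if (step != "" && (!impl || !val)) { print "incomplete: " step; bad = 1 }
    }
    /^### Step/ { flush(); step = $0; impl = 0; val = 0; next }
    /^## /      { flush(); step = ""; next }
    /^\*\*Implementation:\*\*/ { impl = 1 }
    /^\*\*Validation:\*\*/     { val = 1 }
    END { flush(); exit bad }
  ' "$1"
}
```

Run `lint_steps BULLETPROOF-STEPS.md`; a step missing its Validation line is exactly the kind of task the agent cannot verify overnight.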
Common categories (pick what applies to the user's codebase):
- Styling & Design System
- Performance
- Accessibility
- Security headers + CSP
- Auth + session hardening
- Rate limiting + quota
- Error boundaries + observability
- SEO + meta
- Mobile + responsiv