Testing: Write Tests That Catch Real Bugs
Write, structure, and maintain tests across unit, integration, E2E, accessibility, and performance layers. The goal is tests that catch regressions, document behavior, and run fast in CI - not tests that exist to inflate coverage numbers.
Target versions (May 2026):
- Vitest 4.1.2, Jest 30.3.0
- Playwright 1.59.0, Cypress 15.13.0
- pytest 9.0.2, pytest-cov 7.1.0
- Go 1.26.1 (testing stdlib, testing/synctest GA)
- Rust 1.94.1 (cargo test, cargo-nextest 0.9.132)
- Testing Library 16.3.2 (@testing-library/react)
- axe-core 4.11.2 (@axe-core/playwright)
- Grafana k6 1.7.1 (load testing)
When to use
- Writing new tests (unit, integration, E2E, accessibility, performance)
- Debugging flaky or failing tests
- Designing test architecture for a project (fixture strategies, factory patterns, test data)
- Setting up test infrastructure in CI (parallelization, sharding, coverage gates)
- Choosing testing tools or migrating between test frameworks
- Implementing TDD workflow
- Adding accessibility or visual regression tests to an existing suite
When NOT to use
- Reviewing existing test quality or correctness as part of a code review - use code-review
- Security-specific testing (penetration testing, OWASP checks) - use security-audit
- Cleaning up verbose/sloppy test code - use anti-slop
- Ad-hoc web browsing, scraping, or page interaction outside of tests - use browse
- CI/CD pipeline architecture (test jobs run inside pipelines, but pipeline design is ci-cd's domain) - use ci-cd
- Database testing patterns at the engine level - use databases
- Writing or refining LLM prompts - use prompt-generator
- Infrastructure or configuration validation outside tests - use terraform, ansible, or kubernetes
- AI/ML model evaluation or LLM output scoring - use ai-ml
- Infrastructure-level load or chaos testing beyond application tests - use kubernetes for cluster-level chaos, or ci-cd for pipeline-integrated load test orchestration
AI Self-Check
AI tools consistently produce the same testing mistakes. Before returning any generated test code, verify against this list:
- Tests assert behavior, not implementation - no testing private methods or internal state
- Each test has exactly one reason to fail (single assertion concept, not single assert call)
- Test names describe the scenario and expected outcome, not the method name
- Mocks/stubs are scoped to the test - no shared mutable mock state across tests
- No hardcoded ports, paths, or timestamps that break on other machines or in CI
- Async tests properly await all promises/futures - no fire-and-forget assertions
- Test data is isolated - each test creates its own state, no dependency on test execution order
- Cleanup happens even when assertions fail (use afterEach/teardown/t.Cleanup/Drop)
- No sleep() or fixed delays for async waits - use polling, retries, or event-based waits
- Coverage threshold is realistic (80% line coverage is a good default; 100% is a lie)
- Snapshot tests have been reviewed manually before committing (blind --update is a bug factory)
- E2E selectors use data-testid, role, or accessible names - not CSS classes or DOM structure
- Current source checked: dated versions, CLI flags, API names, and support windows are verified against primary docs before repeating them
- Hidden state identified: local config, credentials, caches, contexts, branches, cluster targets, or previous runs are made explicit before acting
- Verification is real: final checks exercise the actual runtime, parser, service, or integration point instead of only linting prose or happy paths
- Routing overlap checked: overlapping skills, trigger terms, and "When NOT to use" boundaries are checked before returning guidance
- Spec claims verified: claims about tool behavior, output contracts, or repo conventions are checked against current docs, scripts, or skill files
- Runner APIs current: pytest, Vitest, Jest, Playwright, and Testing Library examples match current runner behavior
- Flake source identified: retries are not used to hide nondeterminism without diagnosis
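As an illustration of the "no sleep() or fixed delays" item above, here is a minimal sketch of a polling wait in Python. The helper name wait_until, its defaults, and the background job are assumptions made for this example, not part of any test runner's API; most runners and E2E tools ship their own equivalents, which should be preferred when available.

```python
import threading
import time


def wait_until(condition, timeout=2.0, interval=0.01):
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)  # short poll step, not a fixed wait for the result
    return bool(condition())  # one last check at the deadline


def test_background_job_marks_order_as_shipped():
    # Arrange: a hypothetical job that finishes after a short, unknown delay
    order = {"status": "pending"}
    job = threading.Timer(0.05, lambda: order.update(status="shipped"))

    # Act
    job.start()
    try:
        # Assert: returns as soon as the state changes, fails after the timeout
        assert wait_until(lambda: order["status"] == "shipped")
    finally:
        job.join()  # cleanup runs even if the assertion fails
```

The test finishes as soon as the condition holds, so it stays fast on a quick machine and still tolerates a slow CI runner up to the timeout.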
Performance
- Split fast unit tests from integration, browser, and performance suites.
- Use fixtures and test data builders to avoid repeated expensive setup.
- Shard or parallelize only after isolating shared state, ports, databases, and clocks.
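The "test data builders" point above could look like the following sketch. The User type, the make_user factory, and can_delete_posts are hypothetical names invented for this example; the idea is that each test states only the fields it cares about and gets valid, unique defaults for everything else.

```python
from dataclasses import dataclass
from itertools import count

_ids = count(1)  # only used to hand out unique ids and emails


@dataclass(frozen=True)
class User:
    id: int
    email: str
    role: str = "member"
    active: bool = True


def make_user(**overrides):
    """Build a valid User; every call gets a fresh id and email."""
    n = next(_ids)
    defaults = {"id": n, "email": f"user{n}@example.test"}
    return User(**{**defaults, **overrides})


def can_delete_posts(user):
    # Hypothetical function under test
    return user.active and user.role == "admin"


def test_inactive_admin_cannot_delete_posts():
    user = make_user(role="admin", active=False)
    assert can_delete_posts(user) is False


def test_active_admin_can_delete_posts():
    user = make_user(role="admin")
    assert can_delete_posts(user) is True
```

Because every test builds its own user, the tests do not depend on execution order and can be parallelized without sharing records.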
Best Practices
- Test behavior through stable public interfaces, not implementation details.
- Use stable roles/test IDs for UI tests; do not select generated CSS classes.
- Every regression fix gets a failing test that would have caught the bug.
Workflow
Step 1: Determine scope
Based on context:
- New feature -> write tests alongside or before the code (TDD when appropriate)
- Bug fix -> write a failing test first that reproduces the bug, then fix
- Existing untested code -> prioritize critical paths, not 100% coverage
- Test infrastructure -> set up runners, CI config, coverage gates
Identify the project's existing test framework from config files (vitest.config.ts, jest.config.*, pyproject.toml, Cargo.toml, *_test.go, playwright.config.ts). Match it. Don't introduce a second test runner without a reason.
Step 2: Choose the test layer
| Layer | Tests what | Speed | When to use |
|---|---|---|---|
| Unit | Single function/module in isolation | ms | Pure logic, utilities, data transforms, state machines |
| Integration | Multiple modules, real dependencies | seconds | API handlers, database queries, service boundaries |
| E2E | Full user flows through the UI | seconds-minutes | Critical paths, checkout flows, auth, onboarding |
| Accessibility | WCAG compliance, screen reader compat | seconds | Every user-facing component/page |
| Visual | Screenshot comparison | seconds | UI components after style changes |
| Performance | Load, latency, throughput | minutes | Before releases, after arch changes |
The testing pyramid still holds: many unit tests, fewer integration tests, fewest E2E tests. Invert it and your CI takes 45 minutes and everyone ignores test failures.
Step 3: Write the test
Follow the language-specific patterns below. Universal principles:
Arrange-Act-Assert (or Given-When-Then):
// Arrange: set up test data and dependencies
// Act: call the thing being tested
// Assert: verify the outcome
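A concrete instance of the three phases, sketched in Python. The apply_discount function, its threshold, and its percentage are hypothetical and exist only to give the test something to exercise.

```python
def apply_discount(subtotal_cents, threshold_cents=10_000, percent=10):
    """Hypothetical function under test: percent off above the threshold."""
    if subtotal_cents > threshold_cents:
        return subtotal_cents - subtotal_cents * percent // 100
    return subtotal_cents


def test_total_is_discounted_when_cart_exceeds_100():
    # Arrange: a cart worth $120.00
    subtotal_cents = 12_000

    # Act: call the thing being tested
    total = apply_discount(subtotal_cents)

    # Assert: 10% off
    assert total == 10_800


def test_total_is_unchanged_when_cart_is_exactly_100():
    # Arrange
    subtotal_cents = 10_000

    # Act
    total = apply_discount(subtotal_cents)

    # Assert: the threshold itself does not qualify
    assert total == 10_000
```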
Test naming: describe the scenario, not the function.
# Bad: test_calculate_total
# Good: test_calculate_total_applies_discount_when_cart_exceeds_100
# Good: it("returns 401 when token is expired")
Step 4: Validate
- Run the full test suite: failures in other tests may indicate your change broke something
- Check coverage delta: new code should be covered, but don't chase vanity numbers
- Run in CI if possible - tests that pass locally but fail in CI are the worst kind
TDD Workflow
Use TDD when the behavior is well-defined upfront. Skip it when exploring or prototyping.
- Red: write a test that fails (confirm it fails for the right reason)
- Green: write the minimum code to make the test pass (ugly is fine)
- Refactor: clean up without changing behavior (tests still pass)
TDD works best for: pure functions, data transformations, state machines, API contracts, bug reproduction.
TDD works poorly for: UI layout, exploratory prototyping, integration with undocumented APIs.
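One pass through the red-green-refactor loop might look like the sketch below. The slugify function and its expected behavior are assumptions chosen for illustration; the comments mark which step each piece belongs to.

```python
import re


# Red: written before slugify() exists, so the first run fails.
# Starting from a stub that returns "" makes it fail on the assertion
# rather than on a missing name, which is the failure worth confirming.
def test_slugify_lowercases_and_joins_words_with_hyphens():
    assert slugify("Hello World") == "hello-world"


# Green: the least code that passes the test above could be
#     return title.lower().replace(" ", "-")

# Red again: a second test exposes a case the minimal version misses.
def test_slugify_drops_punctuation_and_extra_spaces():
    assert slugify("  Hello,   World! ") == "hello-world"


# Green, then refactor: one implementation that satisfies both tests.
def slugify(title):
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words)
```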
Mocking Strategy
Mock at boundaries, not everywhere. Over-mocking produces tests that pass while the real code is broken.
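A minimal sketch of mocking only at the boundary, using Python's unittest.mock. The rates_client object and its get_rate method are hypothetical stand-ins for a wrapper around an external HTTP API; the pure conversion function stays real.

```python
from unittest.mock import Mock


def total_in_currency(amount_usd, rate):
    """Pure function: exercised for real, never mocked."""
    return round(amount_usd * rate, 2)


def quote_price(amount_usd, currency, rates_client):
    """Hypothetical service code; rates_client wraps an external API."""
    rate = rates_client.get_rate("USD", currency)
    return total_in_currency(amount_usd, rate)


def test_quote_price_converts_using_rate_from_provider():
    # Arrange: the mock is created inside the test, so no state is shared
    rates_client = Mock()
    rates_client.get_rate.return_value = 0.5

    # Act
    price = quote_price(10.0, "EUR", rates_client)

    # Assert: the real conversion logic ran against the stubbed rate
    assert price == 5.0
```

Passing the client in as an argument keeps the external call replaceable without patching module internals.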
| What to mock | What NOT to mock |
|---|---|
| External APIs (HTTP, gRPC) | Your own pure functions |
| Database (when unit |