Testing Discipline
Overview
A good test pins the contract, uses concrete examples, names the scenario in domain language, and uses test data safe to show a customer. Run the checklist in order when writing or reviewing a test.
When to invoke
Invoke when you're about to:
- Write a new test (unit, integration, end-to-end, characterization)
- Name or rename a test method, test class, or test file
- Design a fixture, mock, stub, or shared test helper
- Choose test data — names, IDs, sample strings, sample numbers
- Decide what to assert in a test (exact value? structural property? both?)
- Schedule a long-running suite (soak, perf, cross-platform matrix)
- Hand off requirements between programmers and testers, or work on acceptance tests
- Review existing tests for quality, coverage gaps, or test smells
- Evaluate whether a test suite adequately covers a contract
Non-triggers — do NOT invoke for
- Running the existing test suite (
npm test,pytest) - Fixing a one-line broken assertion where the intent is unchanged
- Adding a docstring or comment to an existing test
- Reordering imports in a test file
- Updating a snapshot file when the underlying intentional change is already understood
If you're not sure whether a change counts as "writing a test," invoke anyway — the checklist is short and skipping it produces tests that pass forever while protecting nothing.
Test quality checklist
Run every step in order when writing a new test. When reviewing, use the same checks to evaluate quality.
- Tests are part of the change, not optional polish. Software's only practical pre-deployment validation is execution under realistic conditions. A change without tests is an incomplete change. (Ford, 97/83.)
- Assert the required behavior, not an incidental of the current implementation. Ask: what does the contract actually promise? For a 3-way comparator the contract is "negative if less, positive if greater, zero if equal" — not "exactly -1 or +1." Asserting ±1 nails an incidental and will go red the day someone returns -2. Tests that mirror code structure end up asserting that the code does what the code does. (Henney, 97/80.)
- Be precise and accurate. Use concrete examples. "Result is sorted and same length" passes for
[3,3,3,3,3,3]against an input of[3,1,4,1,5,9]. The full postcondition is sorted and a permutation of the input — but expressing that as a generic checker is often more code than the function under test. Prefer a concrete example pair: input[3,1,4,1,5,9], expected[1,1,3,4,5,9]. "Adding to an empty collection" is not "now non-empty" — it's "now contains exactly one item, and that item is X." (Henney, 97/81.) - Write the test for the next person who has to read it. A good test is documentation. Make three parts visible in this order: the context/preconditions, the call into the system, the expected result. Hide trivia behind named helpers (Extract Method) so the reader sees the scenario, not the scaffolding. Give the test a name describing the scenario and the entry point —
Stack_pop_on_empty_throwsreads better thantest17. Then test the test: introduce a deliberate bug into the code under test on a private branch and verify the failure message tells you what went wrong. (Meszaros, 97/95.) - Choose test data that is safe in front of a customer. Placeholder names, fake company names, sample log strings, mock error dialogs, and seeded database rows have a habit of surfacing in screenshots, demos, leaked source, and production logs. Do not use real people's names you mean to mock, do not use band names or song titles as a private joke, do not write
"don't click that again, you moron"as a placeholder dialog, do not use four-letter words as fake stock tickers. Use boring, obviously-fake, professional data. (Begbie, 97/25.) - Accept acceptance tests as requirements input. When acceptance tests exist for a behavior (from QA, from the customer, from a spec), use them as input to the work — they surface ambiguities cheaper than defect tickets. (Hufnagel, 97/60; Gregory, 97/92.)
Red Flags
These thoughts mean STOP — restart the checklist:
| Thought | Reality |
|---|---|
| "I'll just assert the function returns exactly -1." | The contract says "negative" — ±1 is an implementation incidental. The test will go red on a valid refactor and tell you nothing about the requirement. (97/80) |
| "Same length, all elements in range — that's enough for a sort test." | [3,3,3,3,3,3] satisfies that and is wrong. Use a concrete input/output pair so the only correct answer is the one in the assertion. (97/81) |
"I'll seed the database with band members and song titles — funnier than User1." | Cute test data ends up in customer screenshots, demos, and leaked source. Use boring, obviously-fake, professional data. (97/25) |
| "No time to write the test — we'll add it later." | "Later" rarely arrives, and shipping without verification is professionally irresponsible. The test is part of the change. (97/83) |
| "I know what this test means — I just wrote it." | You will not, in six months. Tests are read more than they are written. Name the scenario, structure as context/act/assert, hide scaffolding. (97/95) |
"test17 is a fine name — the body explains it." | Test names are scanned to verify coverage and to read failure reports. Encode the scenario and the entry point in the name. (97/95) |
| "The test setup is fifty lines, then a one-line assert — but it works." | Test pain is design pressure. If the setup dwarfs the body, reshape the production code. Do not mock harder. (GOOS/ListenToTestPain) |
"I'll assert that repo.save was called exactly three times." | Over-specifying mock interactions makes the test red on innocent refactors. Assert on the observable contract, not the call shape. (xUnit/FragileTest) |
"The test reads from /tmp/fixtures/users.json set up in conftest.py." | Mystery Guest. The test cannot be read in isolation. Build the fixture in the test or in a function named for what it returns. (xUnit/MysteryGuest) |
| "I'll branch on the return value and assert different things in each branch." | One test, two scenarios fighting for one name. Split into two tests, or use named parameterized cases. (xUnit/ConditionalTestLogic) |
What "done" looks like
A single well-written test is done when all of the following are true:
- The assertion targets the contract, not an incidental of the current implementation.
- The expected result is concrete enough that a reader can check it by eye in under thirty seconds.
- The test name describes the scenario and the entry point in domain language.
- Context, action, and expected result are visible as three readable sections; scaffolding is behind named helpers.
- Test data (names, IDs, strings, log lines, error messages) is safe to appear in a customer screenshot.
- You verified the test fails for the right reason — by introducing a deliberate bug on a private branch and reading the failure message.
If any box is unchecked, the test is not done. Either finish, or delete it and start over.
Principles in this skill
| # | Principle | Author |
|---|---|---|
| 97/25 | Don't Be Cute with Your Test Data | Rod Begbie |
| 97/60 | News of the Weird: Testers Are Your Friends | Burk Hufnagel |
| 97/80 | Test for Required Behavior, Not Incidental Behavior | Kevlin Henney |
| 97/81 | Test Precisely and Concretely | Kevlin Henney |
| 97/82 | Test While You Sleep (and over Weekends) | Rajith Attapattu |
| 97/83 | Testing Is the Engineering Rigor of Software Development | Neal Ford |
| 97/92 | When Programmers and Testers Collaborate | Janet Gregory |
| 97/95 | Write Tests for People | Gerard Meszaros |
GOOS/ListenToTestPain | Listen to Test Pain | Steve Freeman & Nat Pryce |
xUnit/ObscureTest | Obscure Test | Gerard Meszaros |
xUnit/FragileTest | Fragile Test | Gerar |