Error and Correctness Traps
Overview
Common bugs grouped by domain: floats that won't compare equal, retries that hammer a downed service, singletons that wreck testability, and others. When you write code in one of these domains, stop and run the matching checks before you commit.
This is a rigid skill. Jump to the sub-section that matches what you're writing and run that sub-section's checks.
These checks matter most when code will reach real users in production. In MVPs, prototypes, internal dev tools, and one-off scripts where the architecture is still in flux, prefer the simplest thing that works.
When to invoke
Invoke when you're about to:
- Add or change error-handling around a call that can fail
- Compare, sum, or accumulate floating-point numbers
- Write concurrent, parallel, or threaded code, or share mutable state between threads
- Call a remote process, web service, database, or another machine
- Introduce a singleton or any globally-shared mutable state
- Choose a data structure or algorithm on a path that runs often or on large inputs
- Add or change log statements that may fire at high volume
- Review code that handles errors, floats, concurrency, remote calls, singletons, or hot-path data structures
Non-triggers — do NOT invoke for
- Renaming a local variable inside one function
- Adding a docstring to an existing function
- Fixing a typo in a comment
- Formatting-only changes handled by a formatter
- Adjusting a config value in a config file with no logic change
- Skimming code for context without producing findings or edits
- An early-stage MVP or prototype where the architecture is still in flux
- An internal dev tool, debugging endpoint, or one-off script
- Throwaway code expected to be replaced before reaching users
If the change touches one of these domains even slightly, invoke anyway — the per-domain check is short and the bugs are not.
Checks by domain
Errors (97/21, 97/26, 97/29)
- Distinguish business exceptions from technical ones. A technical exception means the system can't proceed — bad arguments, broken DB connection, programming error. Let it bubble to a top-level handler that puts the system in a safe state (rollback, log, alert, friendly user message); the caller can't fix it. A business exception is part of the contract — withdrawing from an empty account, booking an unavailable slot — and is an alternative return path the caller is expected to handle. Give them separate types or hierarchies; mixing them blurs the contract. (Bergh Johnsson, 97/21.)
- Never write the empty catch. try { ... } catch (...) {} silently swallows everything. The same goes for ignoring return codes (printf's return value, write()'s short-write count) and pretending errno doesn't exist. Example: a service-call wrapper swallows every exception and returns null, so every downstream caller has to invent their own theory of what null means. Expose erroneous conditions in your interfaces; if handling errors feels onerous, the interface is wrong. (Goodliffe, 97/26.)
- Don't rely on unexplained magic. If your change depends on behavior nobody can explain (build picks a DLL by load order, deployment reads an undocumented env var, a job runs because of a side effect in a config file), surface it in your summary to the user before shipping; don't bury the dependency. (Griffiths, 97/29.)
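A minimal Python sketch of the business/technical split and of a handler that swallows nothing. The names `BusinessError`, `InsufficientFunds`, `withdraw`, and `handle_request` are hypothetical, chosen only for illustration; amounts are assumed to be integer cents.

```python
class BusinessError(Exception):
    """Part of the contract: the caller is expected to handle it."""


class InsufficientFunds(BusinessError):
    """Expected alternative outcome of a withdrawal."""


def withdraw(balance: int, amount: int) -> int:
    """Return the new balance (hypothetical example, integer cents)."""
    if amount <= 0:
        # Technical: a programming error in the caller; let it bubble up.
        raise ValueError("amount must be positive")
    if amount > balance:
        # Business: an alternative return path the caller should handle.
        raise InsufficientFunds(f"balance {balance} < requested {amount}")
    return balance - amount


def handle_request(balance: int, amount: int) -> str:
    """Handle only the business case; technical errors are not caught here."""
    try:
        return f"ok: new balance {withdraw(balance, amount)}"
    except InsufficientFunds as exc:
        return f"declined: {exc}"
    # No bare `except:` and no `return None` on failure. ValueError and
    # other technical failures propagate to a top-level handler.
```

Because the two kinds of failure have separate types, the caller can catch exactly the case it knows how to handle and leave the rest to the top-level handler.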
Numerics (97/33)
- Never compare floats with ==. 0.1 + 0.2 != 0.3 in IEEE 754 is the canonical demonstration. Compare with a tolerance appropriate to the magnitude of the values involved (≈ ε|x|, where ε is machine epsilon: ~1e-7 for float, ~1e-16 for double).
- Watch for catastrophic cancellation. Subtracting nearly-equal floats promotes roundoff to the most significant digits. Example: solving x² - 100000x + 1 = 0 directly via the quadratic formula loses most of the significant digits of the small root because, with b = -100000, the numerator -b - sqrt(b² - 4) subtracts two nearly equal values; compute the large root and derive the other from r1 * r2 = c/a. The same shape of error appears in any series with alternating signs of similar magnitude.
- Don't use float for money. Use a fixed-point or decimal type. Floats are for scientific calculation where you accept ε-level error; financial code does not accept it. (Allison, 97/33.)
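The three numeric checks above can be sketched in Python, where `float` is an IEEE 754 double. The helper `stable_roots` and the variable names are assumptions made for this example; the tolerance `1e-9` is an illustrative choice, not a recommendation for every magnitude.

```python
import math
from decimal import Decimal


def stable_roots(a: float, b: float, c: float) -> tuple:
    """Real roots of a*x**2 + b*x + c = 0.

    Assumes a != 0, a non-negative discriminant, and that b and c are
    not both zero (otherwise q below would be 0).
    """
    disc = math.sqrt(b * b - 4.0 * a * c)
    # Give the square root the same sign as b so the two terms add in
    # magnitude instead of cancelling.
    q = -0.5 * (b + math.copysign(disc, b))
    return q / a, c / q  # second root follows from r1 * r2 = c / a


# Tolerance-based comparison instead of ==.
total = 0.1 + 0.2
exactly_equal = total == 0.3                                   # False
approximately_equal = math.isclose(total, 0.3, rel_tol=1e-9)   # True

# Large root first, small root derived from it.
large_root, small_root = stable_roots(1.0, -100000.0, 1.0)

# Decimal arithmetic for money: construct from strings, not floats.
invoice_total = Decimal("0.10") + Decimal("0.20")  # Decimal('0.30')
```

`math.isclose` applies a relative tolerance, which scales with the operands; pass `abs_tol` as well when values may be near zero.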
Concurrency & IPC (97/41, 97/57)
- Default to message passing over shared mutable state. When you reach for a lock around shared data, ask first whether the data could be owned by one process/actor that others message. CSP-style designs (Erlang, Go channels, actor frameworks in mainstream languages) sidestep most race / deadlock / livelock bugs by construction. Reserve shared-memory + locks for cases you have measured and understood. (Winder, 97/57.)
- Count IPCs per user stimulus, not lines of code. Each remote call is non-trivial latency; sequential calls add. Example: ORM lazy-loading produces 1,000 sequential 10ms DB calls for one page render — minimum 10s response time before any rendering work. Ratios in the thousands appear routinely in slow apps. Apply parsimony (one round-trip carrying the right data), parallelism (overall latency = longest call, not sum), or caching. (Stafford, 97/41.)
- Retry with backoff and a cap, never in a tight loop. Example: while (!call()) call(); against a downed service hammers it the moment it comes back. Exponential backoff, jitter, and a max-retries ceiling are the minimum; idempotency on the server side is what makes retry safe at all.
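One possible shape for the retry rule, as a Python sketch. The function name `retry_with_backoff`, its parameters, and the default values are assumptions for illustration, and treating only `ConnectionError` as retryable is a simplification; the call being retried is assumed to be idempotent.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call` until it succeeds or `max_attempts` tries have failed."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        try:
            return call()
        except ConnectionError:
            if attempt >= max_attempts:
                raise  # ceiling reached: surface the failure to the caller
        # Exponential backoff capped at max_delay, with full jitter so
        # many clients do not retry in lockstep.
        ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
        sleep(random.uniform(0.0, ceiling))
        attempt += 1
```

The `sleep` parameter exists so tests can inject a fake and run without waiting; production callers would leave the default.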
Limits & Performance (97/46, 97/89)
- Know the complexity of the data structure you picked. Linked list vs. hash vs. balanced tree on a million items is the difference between snappy and unusable. Pick by access pattern (lookup-heavy → hash; ordered iteration → tree; tiny + cache-friendly → array), not by what's familiar. (van Winkel, 97/89.)
- Don't recompute invariants inside loops. Example: in for (i = 0; i < strlen(s); ++i), strlen runs every iteration, scanning the whole string each time, turning O(n) work into O(n²). Hoist the length out. The same shape applies to repeated DB lookups, repeated config parses, and repeated regex compilations inside hot loops. (van Winkel, 97/89.)
- Respect the cache hierarchy when it dominates. Register and L1 are nanoseconds; RAM is ~20ns; disk is ~10ms; network is ~20–100ms: orders of magnitude apart. A "worse" big-O algorithm with a predictable access pattern can beat a "better" one that thrashes cache. When perf matters, measure rather than reason from complexity alone. (Colvin, 97/46.)
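A small Python analogue of the first two checks, since Python's `len` is already O(1) and the `strlen` example does not translate directly. The function names are hypothetical; the point is the access pattern, not these particular helpers.

```python
from typing import Iterable, Sequence


def count_known_slow(words: Iterable[str], known: Sequence[str]) -> int:
    """Membership test against a sequence: each `in` scans `known`."""
    # Roughly O(len(words) * len(known)) comparisons.
    return sum(1 for w in words if w in known)


def count_known_fast(words: Iterable[str], known: Sequence[str]) -> int:
    """Build the hash set once, outside the loop, then look up in it."""
    known_set = set(known)  # invariant hoisted out of the loop
    # Each lookup is O(1) on average, so the loop is roughly O(len(words)).
    return sum(1 for w in words if w in known_set)
```

Both functions return the same count; whether the difference matters depends on input size, so measure on realistic data before changing a cold path.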
Globals & Singletons (97/73)
- Resist the singleton. Most singletons encode a single-instance assumption that turns out to be premature, broadcast across the design as hidden coupling. They wreck unit-test independence (you can't substitute a mock), introduce subtle multi-threading bugs (naive locking is slow, double-checked locking is famously broken in several languages), and have no defined cleanup order at shutdown. Example: a Logger.getInstance() called from every layer means tests can't intercept output, can't run in parallel, and inherit log state from previous tests.
- If you genuinely need one instance, hide it behind an interface. Restrict the global access to a few well-defined construction sites; everywhere else, accept the dependency through a parameter typed by interface. Callers don't know whether a singleton or a fresh object satisfies the interface, and tests can substitute either. (Saariste, 97/73.)
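A sketch of the "hide it behind an interface" advice in Python, using `typing.Protocol` as the interface. `Logger`, `StderrLogger`, `ListLogger`, `OrderService`, and `main` are all hypothetical names for illustration.

```python
import sys
from typing import List, Protocol


class Logger(Protocol):
    """The interface callers depend on; no global access point."""

    def log(self, message: str) -> None: ...


class StderrLogger:
    """Production implementation; may well be the only instance."""

    def log(self, message: str) -> None:
        print(message, file=sys.stderr)


class ListLogger:
    """Test double that records messages instead of printing them."""

    def __init__(self) -> None:
        self.messages: List[str] = []

    def log(self, message: str) -> None:
        self.messages.append(message)


class OrderService:
    """Accepts its logger as a parameter typed by the interface."""

    def __init__(self, logger: Logger) -> None:
        self._logger = logger

    def place(self, order_id: str) -> None:
        self._logger.log(f"order placed: {order_id}")


def main() -> None:
    # The one construction site that knows a single shared logger exists.
    logger = StderrLogger()
    OrderService(logger).place("A-1")
```

In a test, `OrderService(ListLogger())` gives each test its own log state, so tests can run in parallel and inspect output directly.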
Production resilience (RI/*)
When the call will run under load against a downstream that can fail, write the per-call hardening first rather than bolting it on later. These checks matter most in production code.
- Set an explicit timeout on every remote call. Library defaults are wrong (None, "infinity", "many minutes"). Pick a per-call budget based on the downstream's realistic latency plus margin, and cap retries inside that budget. *(