Performance is a contract with reality. Intuition about hot paths is wrong more often than right. Capture, locate, hypothesize, optimize, re-measure, prove the regression with a benchmark — then defend the win with an invariant.

When to Apply / NOT

Apply: latency or throughput SLO violation; memory pressure (RSS growth, GC churn); benchmark regression; pre-optimization scoping; cold-start vs steady-state cost split; cache locality / branch-prediction concerns.

NOT apply: defect with wrong outputs; architectural redesign; micro-optimization without budget pressure; untested code (write tests first).

Anti-patterns

Optimize without profile: intuition-driven changes.
Single-run benchmarks: variance dominates. Use hyperfine --warmup 3 --min-runs 10.
Profile in debug build: optimizer-disabled binaries lie.
Confuse flat profile with call-graph: self-time vs total-time tell different stories.
Ignoring tail latency: p50 stays flat while p99 explodes.
Cherry-picking the win: re-measure end-to-end.
Allocation blind spot: CPU profiler hides GC.
Forgetting the regression guard.

Workflow (language-neutral)

Define budget — restate target metric: latency p95 < X ms, throughput > Y rps, RSS < Z MB.
Establish baseline — run unoptimized workload under hyperfine plus profiler. Save raw artifacts.
Capture profile — sampled CPU profile → flamegraph; allocation profile if memory-bound; latency histogram for tail.
Locate hotspot — top self-time function or widest plateau. Cross-check with allocation profile.
Hypothesize — one falsifiable claim with predicted delta.
Optimize minimally — smallest change targeting the hypothesis.
Re-profile — capture same metric; differential flamegraph.
Prove the win — hyperfine 'baseline' 'optimized' --warmup 3 --min-runs 10.
Guard the win — add CI benchmark with regression bound.

Reading Flamegraphs

Wide plateau on top: hot self-time function — primary target.
Narrow towers: deep call chains — examine for over-abstraction.
Repeated motifs: same callee under many parents — candidate for inlining or caching.
Missing frames: rebuild with -fno-omit-frame-pointer / RUSTFLAGS=-C force-frame-pointers=yes.
Differential flamegraph (hotspot --diff): red = added cost, blue = removed.

Parallel Tooling

Family	CPU sampling	Memory / alloc	Differential / benchmark
Systems (C/C++/Rust)	`perf record` + `flamegraph`, `samply`, `valgrind --tool=callgrind`	`valgrind --tool=massif`, `heaptrack`, `dhat`	`hyperfine`, `criterion`, `iai-callgrind`
Python	`py-spy record`, `scalene`, `cProfile` + `snakeviz`	`scalene` (memory mode), `tracemalloc`, `memray`	`pytest-benchmark`, `hyperfine`
Go	`go tool pprof` (cpu profile), `runtime/pprof`	`go tool pprof` (heap), `runtime.MemStats`	`go test -bench -benchmem`, `benchstat`
Java/Kotlin	`async-profiler`, JFR (`jcmd JFR.start`)	JFR allocation events, Eclipse MAT on heap dump	JMH, `hyperfine`
JavaScript/TypeScript	Chrome DevTools, `node --prof`, `clinic flame`	Chrome heap snapshot, `clinic doctor`	`tinybench`, `mitata`, `hyperfine`
OCaml	`landmarks`, `magic-trace`	`memtrace` + `memtrace-viewer`, `Statmemprof`	`bechamel`, `core_bench`, `hyperfine`

Use procs (not ps). Use bat -P -p -n (not cat). Use difft (not diff). Use hyperfine (not time).

Constitutional Rules

No optimization without a profile.
One hypothesis per iteration.
Re-measure end-to-end.
Variance-aware comparison with hyperfine --warmup --min-runs.
Guard every win in CI bench.
Correctness first — keep TDD green.

perf-profile

How to add

Drop this on your repo README

Related skills

learn-codebase

remove-deadcode

apify-competitor-intelligence

ad-creative

Get new Marketing skills every Monday