Observability Architecture — Signal Production
What an application emits about itself: logs, metrics, traces. Pairs with grafana-architect which owns the consumption side (dashboards, alerts). This skill enforces what gets emitted, how it's named, and how the three pillars correlate. Wiring code in RECIPES.md; pinned libraries in STACK.md.
1. Three pillars + correlation
Each pillar answers a different question:
| Pillar | Question | Cardinality | Cost |
|---|---|---|---|
| Logs | "What happened on this request?" | High (per event) | High storage |
| Metrics | "What is the rate / aggregate?" | Low (aggregated) | Low storage |
| Traces | "What was the path?" | One per request | Medium |
The design goal is correlation: from any signal you can pivot to the others. A trace shows a slow request → click → see the logs from that request → see the metric spike around the same time. This is what makes observability worth the cost.
Correlation mechanism: `trace_id` everywhere.
- Every log line carries the current `trace_id` and `span_id`.
- Every metric exposes the current `trace_id` as an exemplar (Prometheus exemplars).
- Every trace span carries the operation name, attributes, and status.
If you skip the correlation, you have three independent data lakes — useful in isolation, painful to cross-reference. Concrete wiring in RECIPES §1.
2. Metrics: Prometheus
Why Prometheus over OTel metrics in 2026: Prom client libs are mature in every language, the exposition format is universal, ops engineers already know rate() / histogram_quantile(). OTel metrics are catching up but not yet at parity for ergonomics.
Naming
Follow the Prometheus / OpenMetrics convention:
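- Pattern: `<namespace>_<subsystem>_<name>_<unit>_<type>`
- `<namespace>` = application name (`orders`, `payments`).
- `<unit>` is present whenever the sample value has one: `_seconds`, `_bytes`, `_ratio`, `_celsius`. Use base units: never `_ms`, never bare units. Plain counts (requests, connections) carry no unit.
- `<type>` suffix for cumulative counters: `_total`. Gauges take no type suffix; histograms are declared without one, and the client library appends `_bucket`, `_sum`, and `_count` to the exposed series.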
orders_http_requests_total{method="POST", route="/v1/orders", status="201"}
orders_http_request_duration_seconds_bucket{le="0.5", route="/v1/orders"}
orders_db_connections_active # gauge, no suffix
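The naming rules above lend themselves to a mechanical check, for example in a unit test that walks every registered metric. The following is a minimal sketch under stated assumptions: `check_metric_name`, the banned-suffix list, and the message strings are hypothetical and encode one reading of this section, not an official Prometheus linter.

```python
import re

# Hypothetical rule set: one reading of the naming section above.
BANNED_UNIT_SUFFIXES = ("_ms", "_milliseconds", "_kb", "_mb")
NAME_RE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+$")


def check_metric_name(name: str, kind: str) -> list[str]:
    """Return a list of naming problems; an empty list means the name passes."""
    problems = []
    if not NAME_RE.match(name):
        problems.append("use lowercase snake_case with a namespace prefix")
    # Catch scaled units both at the end and in the middle of the name.
    if any(name.endswith(s) or f"{s}_" in name for s in BANNED_UNIT_SUFFIXES):
        problems.append("use base units (_seconds, _bytes), not scaled ones")
    if kind == "counter" and not name.endswith("_total"):
        problems.append("counters end in _total")
    if kind != "counter" and name.endswith("_total"):
        problems.append("_total is reserved for counters")
    return problems
```

Calling `check_metric_name("orders_http_request_duration_ms", "histogram")` returns a single problem about scaled units, while the three example names above pass with an empty list.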
Types
- Counter: monotonically increasing; `_total` suffix; resets on process restart. Query with `rate()`.
- Gauge: point-in-time value, up or down. Active connections, queue depth, in-flight requests.
- Histogram: distribution of values; choose buckets explicitly. Default latency buckets in RECIPES §4.
- Summary: quantiles pre-computed in the client, which cannot be aggregated across instances. Avoid unless you never need cross-instance aggregation (you usually do, so use a histogram).
Cardinality
The single most expensive observability mistake: high-cardinality labels.
- Bad: `user_id`, `email`, `request_id`, `correlation_id`, anything unique per request. One time series per unique value means millions of series and ruined retention.
- Good: `method`, `route` (templatized, not the raw path), `status_class` (`2xx`, `4xx`, `5xx`), `tenant`, `region`. Bounded sets.
- Templatize routes to bounded patterns: `/v1/users/{id}` not `/v1/users/01J9X...`.
- `< 100` distinct values per label as a soft cap; review labels with more.
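The label rules above can be sketched as two small helpers. This is an illustration, not the recipe from RECIPES.md: `templatize_route` and `status_class` are hypothetical names, and the ID shapes matched here (integers, UUIDs, 26-character ULIDs) are assumptions. When the web framework exposes the matched route template, prefer that over pattern matching.

```python
import re

# Assumed ID shapes: integers, UUIDs, and 26-character ULIDs. Adjust to the
# identifiers your routes actually carry.
_ID_SEGMENT = re.compile(
    r"^(\d+"
    r"|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
    r"|[0-9A-HJKMNP-TV-Z]{26})$"
)


def templatize_route(path: str) -> str:
    """Replace ID-like path segments with {id} so the label set stays bounded."""
    segments = path.split("?", 1)[0].split("/")  # drop the query string first
    return "/".join("{id}" if _ID_SEGMENT.match(s) else s for s in segments)


def status_class(status: int) -> str:
    """Collapse an HTTP status code into its class, e.g. 201 -> "2xx"."""
    return f"{status // 100}xx"
```

With these, `/v1/users/42` and every other user path collapse into the single label value `/v1/users/{id}`, and the status label holds at most five values.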
3. Logging: structured, leveled, correlated
- Structured JSON to stderr (per cli-tool-architect §6). One log line = one JSON object.
- Stdlib first:
  - Go: `log/slog` per go-architect §5. `slog.JSONHandler` in production.
  - Python: `structlog` configured to emit JSON.
- Mandatory fields: `timestamp` (RFC 3339 UTC), `level`, `msg` (fixed string), `trace_id` / `span_id` (when in a traced request), `service`.
- Don't log full payloads. A request body field can be PII; the bytes are dead weight even when it isn't. Log the shape: `request_size_bytes`, `field_count`, `customer_id`.
- `msg` is a fixed string for grep-ability: `msg="user created"` with `user_id` and `email_domain` as separate fields, not `msg=f"created user {email}"`.
- One level per environment: prod `info`, staging `info`/`debug`, dev `debug`.
- Errors include the operation that failed + the inputs that mattered. No stack traces in `msg`; stack traces are a separate `error.stack` field.
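To make the field rules above concrete, here is a minimal sketch of a JSON log line built with Python's standard `logging` module. It deliberately swaps in the stdlib for `structlog` so the example has no dependencies. `JsonFormatter`, `build_logger`, the `fields` key passed through `extra`, and the service name `orders` are all assumptions made for illustration.

```python
import json
import logging
from datetime import datetime, timezone

SERVICE = "orders"  # assumed service name, for illustration only


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the mandatory fields."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
            "service": SERVICE,
        }
        # Structured fields arrive through `extra={"fields": {...}}`.
        entry.update(getattr(record, "fields", {}))
        if record.exc_info:
            # Stack traces go in their own field, never in msg.
            entry["error.stack"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(stream) -> logging.Logger:
    """Return a logger writing JSON lines to `stream` (sys.stderr in production)."""
    logger = logging.getLogger("orders.sketch")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)  # the prod level from the list above
    logger.propagate = False
    return logger
```

A call such as `build_logger(sys.stderr).info("user created", extra={"fields": {"user_id": "u_123", "email_domain": "acme.com"}})` keeps `msg` fixed and grep-able while the variable data travels as separate fields.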
4. Tracing: OpenTelemetry
OTel for traces is unambiguously the right choice — vendor-neutral, well-instrumented per language, supports all major backends (Jaeger, Tempo, Honeycomb, Datadog).
- Auto-instrumentation first. OTel libs for net/http, gin, FastAPI, requests, psycopg, grpc cover 80% of useful spans for free. Add manual spans only where business boundaries deserve them.
- Span naming: verb + resource. `POST /v1/orders`, `db.query`, `kafka.publish`. Templatized, low cardinality.
- Attribute keys follow OTel semantic conventions: `http.request.method`, `http.response.status_code`, `db.system.name`, `messaging.system`. Older instrumentation still emits the pre-stable names `http.method`, `http.status_code`, and `db.system`. Don't invent your own.
- Set span status on errors: `span.SetStatus(codes.Error, msg)`. Backends colorize error spans.
- Don't trace everything. Per-call spans for hot loops kill performance. Trace the request boundary, major sub-operations (DB call, external API call, queue publish), and failure paths.
5. Correlation rules
Three rules to make signals jump between each other:
- `trace_id` in every log line. Pull from the OTel SDK's current context; every modern logging integration supports this.
- Prometheus exemplars on histograms (especially latency). When a slow bucket increments, the exemplar records the `trace_id`. Grafana lets you click from a histogram bucket directly to the trace.
- `service.name` and `service.version` as resource attributes on traces + as labels on metrics + as fields on logs. Ties signals across deploys and versions.
Wiring snippets in RECIPES §1.
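As an illustration of the first rule, the sketch below stamps `trace_id` and `span_id` onto log records from ambient context. It is not the RECIPES wiring: a `contextvars.ContextVar` stands in for the OTel SDK's current span context, and `current_trace`, `TraceContextFilter`, and `handle_request` are hypothetical names.

```python
import contextvars
import logging

# Stand-in for the OTel SDK's current span context (an assumption of this
# sketch): a real integration reads trace and span IDs from the active span.
current_trace = contextvars.ContextVar("current_trace", default=None)


class TraceContextFilter(logging.Filter):
    """Stamp trace_id and span_id onto every record emitted inside a trace."""

    def filter(self, record: logging.LogRecord) -> bool:
        ctx = current_trace.get()
        if ctx is not None:
            record.trace_id, record.span_id = ctx
        return True  # never drop the record, only annotate it


def handle_request(logger: logging.Logger, trace_id: str, span_id: str) -> None:
    """Hypothetical request handler: sets the context, logs, then restores it."""
    token = current_trace.set((trace_id, span_id))
    try:
        logger.info("order accepted")
    finally:
        current_trace.reset(token)
```

Because the filter reads from context rather than from call arguments, application code never passes IDs around by hand, and lines logged outside a request simply carry no trace fields.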
6. Sampling
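- Head sampling at 10% in production. Deterministic per `trace_id` so a request is either fully sampled or fully dropped, with no half-traces.
- 100% sampling on errors: always keep the trace when span status is error. Decisive ROI, but a head sampler decides before the status is known, so enforcing this rule needs the tail sampling described below.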
- 100% sampling in non-prod (dev, staging) — volume is low; you want full visibility.
- Tail sampling via the OTel Collector is the upgrade when 10% head misses interesting low-volume endpoints. Adds Collector infrastructure. Defer until needed.
- Configure via env vars — see RECIPES §5.
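The "deterministic per `trace_id`" property can be sketched in a few lines. This is a simplified illustration in the spirit of a ratio-based sampler, not the SDK's implementation; `should_sample` is a hypothetical name, and the choice of the low 64 bits is an assumption of this sketch.

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic head-sampling decision derived only from the trace ID."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the 128-bit trace ID as a uniform random number.
    low_bits = int(trace_id_hex, 16) & 0xFFFFFFFFFFFFFFFF
    # Keep the trace when that number falls below ratio * 2^64.
    return low_bits < int(ratio * (1 << 64))
```

Every service that sees the same trace ID reaches the same verdict without coordination, which is what prevents half-traces. In a real deployment the OTel SDKs take the equivalent settings from the `OTEL_TRACES_SAMPLER` and `OTEL_TRACES_SAMPLER_ARG` environment variables.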
7. What NOT to emit
PII and secrets never appear in any signal:
- No email addresses, names, addresses, phone numbers in attributes or fields. Use `email_domain` (`acme.com`) to slice by tenant.
- No passwords, tokens, API keys, session cookies: not in headers, not in payloads. Use redaction filters at the SDK level as defense in depth, since one careless `log.Info("creating user", user)` can leak it all.
- No full payloads. Log size and shape, never contents.
- No internal IDs that could be re-identified. UUID v7 IDs are fine in log fields and span attributes (they don't leak meaning per sql-architect §1), though as unique values they remain too high-cardinality for metric labels (§2); raw `auth_token_id` is not fine anywhere.
Configure your SDK redaction list at startup; review it on every deploy.
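A redaction pass of the kind described above might look like the following sketch. The deny-lists, the `[REDACTED]` marker, and the function name `redact` are assumptions for illustration; a real list comes out of your own review and lives wherever your SDK hooks field processing.

```python
# Assumed deny-lists for illustration; a real list comes from your own review.
SECRET_KEYS = {"password", "token", "api_key", "authorization", "cookie", "session"}
PII_KEYS = {"name", "address", "phone"}
REDACTED = "[REDACTED]"


def redact(fields: dict) -> dict:
    """Return a copy of `fields` that is safe to attach to a log line or span."""
    safe = {}
    for key, value in fields.items():
        lowered = key.lower()
        if lowered == "email" and isinstance(value, str) and "@" in value:
            # Keep only the domain so signals can still be sliced by tenant.
            safe["email_domain"] = value.rsplit("@", 1)[1].lower()
        elif lowered in SECRET_KEYS or lowered in PII_KEYS or lowered == "email":
            safe[key] = REDACTED
        elif isinstance(value, dict):
            safe[key] = redact(value)  # recurse into nested objects
        else:
            safe[key] = value
    return safe
```

Running every outgoing field map through one function like this gives a single place to review on each deploy. An exact-match key list is only a starting point: it will not catch secrets embedded in free-text values.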
8. SLOs and golden signals
What to measure for every service.
RED — for request-driven services
| Signal | Question | Metric |
|---|---|---|
| Rate | How many requests/sec? | `<svc>_http_requests_total` over time |
| Errors | How many fail? | `<svc>_http_reque |