Event-Driven Architecture
Patterns for services that communicate via asynchronous events. Broker-agnostic at the conceptual level; defaults inline per use case (NATS for lightweight, Kafka for log-based replay, RabbitMQ for command queues). Protobuf schemas for events (protobuf-architect) so contracts get the same buf breaking discipline as gRPC. Outbox pattern is mandatory for any write that emits an event. Schemas, SQL, and tooling shapes in RECIPES.md; pinned brokers and client libs in STACK.md.
1. Event vs message vs command — the three shapes
| Shape | Direction | Semantics | Example |
|---|---|---|---|
| Event | Past-tense, broadcast | "Something happened" — fact about the past, anyone can listen | OrderPlaced, PaymentCaptured |
| Command | Imperative, point-to-point | "Do this" — request to one specific handler | CancelOrder, SendEmail |
| Message | Generic envelope | Container for either — used when the distinction doesn't matter | (mostly an implementation detail) |
- Events are immutable past-tense facts.
OrderPlacedwas placed; nothing changes that. Subscribers react however they want. - Commands have one intended handler. Multiple handlers reacting to a command is almost always wrong — it's an event in disguise. Rename it.
- Naming: events are
<Noun><PastVerb>(OrderPlaced,ShipmentDispatched); commands are<Verb><Noun>(PlaceOrder,SendShipment). - Choose the shape per use case, not per technology. Both Kafka and RabbitMQ can carry either; the discipline is in the schema and contract.
2. Schema — Protobuf
Per protobuf-architect: events are .proto messages, code-generated, validated by protovalidate, and protected from breaking changes by buf breaking in CI.
Envelope contract (every event):
event_id— UUID v7, sortable + unique. Subscribers dedupe on it.occurred_at— RFC 3339 timestamp. Replay tools sort by this.aggregate_id— the entity the event is about. Drives partitioning.schema_version— integer; bump on additive changes inside a topic version.
Payload discipline:
- Minimal. IDs and the few facts subscribers need — not the full aggregate state. Subscribers fetch via sql-architect repositories. Big payloads make schema evolution and replay expensive.
- Field numbers reserved on delete per protobuf-architect §3. Never reuse.
- One file per resource's events —
orders/v1/events.protoholds every event the orders context emits.
Canonical schema in RECIPES §1.
3. Topic / subject naming
Hierarchical, snake_case, versioned. Pick a convention and enforce it.
<org>.<context>.<resource>.<version>.<event_name>
- Matches the Buf-style package path from protobuf-architect.
- Version is part of the topic name, not just the schema. New major version → new topic. Run in parallel until consumers migrate.
- Lowercase + dots (NATS, Pub/Sub) or lowercase + underscores (Kafka). Pick one for your broker and stick to it.
- Document the catalog somewhere queryable — Confluent Schema Registry, BSR, or a simple
events.mdin the repo.
Examples in RECIPES §5.
4. Outbox pattern — mandatory for "DB write + event emit"
The dual-write problem: an HTTP handler writes a row and publishes an event. If the DB commits but the broker rejects, the event is lost — silent inconsistency. If the broker accepts but the DB rolls back, subscribers process a phantom event.
Outbox fixes this with a single transactional write.
- Handler
BEGIN TX→INSERT INTO aggregate→INSERT INTO outbox→COMMIT. - Separate publisher reads unpublished rows, sends them to the broker, marks them published.
- Outbox is a regular table in the same DB as the aggregate. The write is atomic with the business write.
- Publisher is separate — a goroutine, a sidecar, a cron, or CDC (Debezium reading Postgres WAL). CDC is the most robust; goroutine is fine for small services.
- At-least-once delivery — the same event can be republished if the publisher crashes between PUBLISH and UPDATE. Consumers must be idempotent (§6).
- Outbox table grows — partition or purge published rows older than 7–30 days.
- Why mandatory: there is no working pattern that avoids both the lost-event and phantom-event failure modes without the outbox. Anything else (publish-before-commit, publish-after-commit) is broken under failure.
Schema + publisher shapes in RECIPES §2.
5. Ordering and partitioning
Event ordering is the single hardest part of event-driven systems. Order is per-key, not global.
- Order is preserved within a partition / subject, but not across partitions. Kafka partitions by message key; NATS via subject hierarchy; RabbitMQ via consistent-hash exchanges.
- Partition key is the aggregate ID. All events for
order_id=abcland on the same partition, processed in order by one consumer. Different orders process in parallel. - Globally-ordered events are a smell. If you "need" global order, you actually need a single consumer (and you've lost scaling), or you're modeling the domain wrong.
- Consumers process one partition at a time per instance. Concurrent processing within a partition breaks ordering. Most client libs handle this; verify your config.
6. Idempotency — consumer must dedupe
Brokers deliver at least once. Consumers see the same event more than once under network failure, restart, or rebalance.
- Dedupe by
event_id. Each consumer keeps a small store (Redis with TTL, or aprocessed_eventstable) of recently-seen IDs. Reject duplicates. - Idempotent side effects — design the handler so a duplicate is harmless:
INSERT ... ON CONFLICT DO NOTHING,UPDATE ... WHERE version = ?(with optimistic concurrency). - TTL on dedupe store — events older than the broker's retention can't be replayed anyway.
- Exactly-once illusion: idempotent consumer + at-least-once delivery = "effectively exactly-once" from the business perspective. Don't chase true exactly-once at the protocol level — far more expensive than just making consumers idempotent.
Concrete handler + dedupe table in RECIPES §3.
7. Dead-letter queues (DLQs)
Some events can't be processed — schema mismatch, downstream service down too long, business invariant violation. Don't let them block the partition.
- Every consumer has a DLQ. A topic / subject named
<original>.dlqreceives messages the consumer gave up on. - Retry policy first, DLQ second. N attempts with exponential backoff (typical: 3 attempts), then DLQ.
- DLQs are monitored. Per observability-architect: a Prometheus counter
<svc>_dlq_messages_totalwith an alert on any non-zero value. A DLQ that quietly fills is a silent outage. - DLQ tooling — operator scripts to inspect, replay, or discard. Both audited.
Retry policy + tool CLI shape in RECIPES §4.
8. Backpressure
When the consumer can't keep up with the producer, the system needs to slow down — gracefully.
- Prefetch / consumer concurrency limits. Don't let one consumer instance buffer 10,000 in-flight messages.
- Lag-based autoscaling. Watch consumer-group lag; scale out the consumer pool when lag grows.
- Reject upstream when persistently overloaded — return
503 Service UnavailablewithRetry-Afterper rest-api-architect §3. Better than building a backlog you can't drain. - **No