Event-Driven Architecture

Patterns for services that communicate via asynchronous events. Broker-agnostic at the conceptual level; defaults inline per use case (NATS for lightweight, Kafka for log-based replay, RabbitMQ for command queues). Protobuf schemas for events (protobuf-architect) so contracts get the same buf breaking discipline as gRPC. Outbox pattern is mandatory for any write that emits an event. Schemas, SQL, and tooling shapes in RECIPES.md; pinned brokers and client libs in STACK.md.

1. Event vs message vs command — the three shapes

Shape	Direction	Semantics	Example
Event	Past-tense, broadcast	"Something happened" — fact about the past, anyone can listen	`OrderPlaced`, `PaymentCaptured`
Command	Imperative, point-to-point	"Do this" — request to one specific handler	`CancelOrder`, `SendEmail`
Message	Generic envelope	Container for either — used when the distinction doesn't matter	(mostly an implementation detail)

Events are immutable past-tense facts. OrderPlaced was placed; nothing changes that. Subscribers react however they want.
Commands have one intended handler. Multiple handlers reacting to a command is almost always wrong — it's an event in disguise. Rename it.
Naming: events are <Noun><PastVerb> (OrderPlaced, ShipmentDispatched); commands are <Verb><Noun> (PlaceOrder, SendShipment).
Choose the shape per use case, not per technology. Both Kafka and RabbitMQ can carry either; the discipline is in the schema and contract.

2. Schema — Protobuf

Per protobuf-architect: events are .proto messages, code-generated, validated by protovalidate, and protected from breaking changes by buf breaking in CI.

Envelope contract (every event):

event_id — UUID v7, sortable + unique. Subscribers dedupe on it.
occurred_at — RFC 3339 timestamp. Replay tools sort by this.
aggregate_id — the entity the event is about. Drives partitioning.
schema_version — integer; bump on additive changes inside a topic version.

Payload discipline:

Minimal. IDs and the few facts subscribers need — not the full aggregate state. Subscribers fetch via sql-architect repositories. Big payloads make schema evolution and replay expensive.
Field numbers reserved on delete per protobuf-architect §3. Never reuse.
One file per resource's events — orders/v1/events.proto holds every event the orders context emits.

Canonical schema in RECIPES §1.

3. Topic / subject naming

Hierarchical, snake_case, versioned. Pick a convention and enforce it.

<org>.<context>.<resource>.<version>.<event_name>

Matches the Buf-style package path from protobuf-architect.
Version is part of the topic name, not just the schema. New major version → new topic. Run in parallel until consumers migrate.
Lowercase + dots (NATS, Pub/Sub) or lowercase + underscores (Kafka). Pick one for your broker and stick to it.
Document the catalog somewhere queryable — Confluent Schema Registry, BSR, or a simple events.md in the repo.

Examples in RECIPES §5.

4. Outbox pattern — mandatory for "DB write + event emit"

The dual-write problem: an HTTP handler writes a row and publishes an event. If the DB commits but the broker rejects, the event is lost — silent inconsistency. If the broker accepts but the DB rolls back, subscribers process a phantom event.

Outbox fixes this with a single transactional write.

Handler BEGIN TX → INSERT INTO aggregate → INSERT INTO outbox → COMMIT.
Separate publisher reads unpublished rows, sends them to the broker, marks them published.

Outbox is a regular table in the same DB as the aggregate. The write is atomic with the business write.
Publisher is separate — a goroutine, a sidecar, a cron, or CDC (Debezium reading Postgres WAL). CDC is the most robust; goroutine is fine for small services.
At-least-once delivery — the same event can be republished if the publisher crashes between PUBLISH and UPDATE. Consumers must be idempotent (§6).
Outbox table grows — partition or purge published rows older than 7–30 days.
Why mandatory: there is no working pattern that avoids both the lost-event and phantom-event failure modes without the outbox. Anything else (publish-before-commit, publish-after-commit) is broken under failure.

Schema + publisher shapes in RECIPES §2.

5. Ordering and partitioning

Event ordering is the single hardest part of event-driven systems. Order is per-key, not global.

Order is preserved within a partition / subject, but not across partitions. Kafka partitions by message key; NATS via subject hierarchy; RabbitMQ via consistent-hash exchanges.
Partition key is the aggregate ID. All events for order_id=abc land on the same partition, processed in order by one consumer. Different orders process in parallel.
Globally-ordered events are a smell. If you "need" global order, you actually need a single consumer (and you've lost scaling), or you're modeling the domain wrong.
Consumers process one partition at a time per instance. Concurrent processing within a partition breaks ordering. Most client libs handle this; verify your config.

6. Idempotency — consumer must dedupe

Brokers deliver at least once. Consumers see the same event more than once under network failure, restart, or rebalance.

Dedupe by event_id. Each consumer keeps a small store (Redis with TTL, or a processed_events table) of recently-seen IDs. Reject duplicates.
Idempotent side effects — design the handler so a duplicate is harmless: INSERT ... ON CONFLICT DO NOTHING, UPDATE ... WHERE version = ? (with optimistic concurrency).
TTL on dedupe store — events older than the broker's retention can't be replayed anyway.
Exactly-once illusion: idempotent consumer + at-least-once delivery = "effectively exactly-once" from the business perspective. Don't chase true exactly-once at the protocol level — far more expensive than just making consumers idempotent.

Concrete handler + dedupe table in RECIPES §3.

7. Dead-letter queues (DLQs)

Some events can't be processed — schema mismatch, downstream service down too long, business invariant violation. Don't let them block the partition.

Every consumer has a DLQ. A topic / subject named <original>.dlq receives messages the consumer gave up on.
Retry policy first, DLQ second. N attempts with exponential backoff (typical: 3 attempts), then DLQ.
DLQs are monitored. Per observability-architect: a Prometheus counter <svc>_dlq_messages_total with an alert on any non-zero value. A DLQ that quietly fills is a silent outage.
DLQ tooling — operator scripts to inspect, replay, or discard. Both audited.

Retry policy + tool CLI shape in RECIPES §4.

8. Backpressure

When the consumer can't keep up with the producer, the system needs to slow down — gracefully.

Prefetch / consumer concurrency limits. Don't let one consumer instance buffer 10,000 in-flight messages.
Lag-based autoscaling. Watch consumer-group lag; scale out the consumer pool when lag grows.
Reject upstream when persistently overloaded — return 503 Service Unavailable with Retry-After per rest-api-architect §3. Better than building a backlog you can't drain.
**No

event-driven-architect

Como adicionar

Cole no README do seu repo

Skills relacionadas

webapp-testing

brand-guidelines

frontend-design

mcp-builder

Receba novas skills de Design e Frontend toda segunda