ADR-0003: Transactional outbox pattern for guaranteed event delivery

Status

Accepted

Tags

messaging, nats, events, reliability, distributed-systems, outbox

Decision

Use the transactional outbox pattern for all critical event publishing. Events are written to an outbox_events table in the same database transaction as the business write, then a background relay worker publishes them to NATS JetStream.

Why

Without the outbox pattern, publishing to NATS after a DB write creates a dual-write risk: if NATS is down or the app crashes between the two operations, the event is permanently lost while the DB record exists. The outbox pattern eliminates this by making the event write part of the same transaction. If the transaction rolls back, the event is never created. If NATS is down, the event stays in the outbox table and the relay worker retries it, up to the retry limit described below. An event that exhausts its retries is dead-lettered and remains in the database for manual replay rather than being silently dropped.

How it works

  1. Business write + outbox_events INSERT (same transaction, atomic)
  2. Relay worker polls outbox_events every 1s
  3. Publishes to NATS JetStream
  4. Marks event as published (or increments retry counter)

Cleanup: published events are deleted after 7 days.

Retry limit: 3 attempts. After 3 failures the event is logged as a dead letter and is no longer retried; the ops team investigates.
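
The relay loop described above can be sketched roughly as follows. This is a minimal, stdlib-only illustration, not the actual worker: the OutboxEvent fields, the Publisher interface, relayOnce, runRelay, and failingPublisher are all hypothetical names, and the events live in memory instead of the outbox_events table. It shows one polling pass, the retry counter, and the dead-letter cutoff after 3 failed attempts.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"time"
)

// maxAttempts mirrors the retry limit of 3 stated in this ADR.
const maxAttempts = 3

// OutboxEvent is a hypothetical in-memory stand-in for an outbox_events row.
type OutboxEvent struct {
	ID          int64
	EventType   string
	AggregateID string
	Payload     []byte
	Attempts    int
	Published   bool
	DeadLetter  bool
}

// Publisher is an assumed abstraction over the JetStream publish call.
type Publisher interface {
	Publish(ctx context.Context, subject string, data []byte) error
}

// relayOnce makes one pass over the events: it publishes each pending event,
// marks it published on success, and increments the retry counter on failure.
// An event that reaches maxAttempts is flagged as a dead letter and logged
// with event_id, event_type, and aggregate_id.
func relayOnce(ctx context.Context, events []*OutboxEvent, pub Publisher) {
	for _, ev := range events {
		if ev.Published || ev.DeadLetter {
			continue
		}
		if err := pub.Publish(ctx, ev.EventType, ev.Payload); err != nil {
			ev.Attempts++
			if ev.Attempts >= maxAttempts {
				ev.DeadLetter = true
				log.Printf("dead letter: event_id=%d event_type=%s aggregate_id=%s err=%v",
					ev.ID, ev.EventType, ev.AggregateID, err)
			}
			continue
		}
		ev.Published = true
	}
}

// runRelay calls relayOnce on every tick (1s in this ADR) until ctx is cancelled.
func runRelay(ctx context.Context, interval time.Duration, events []*OutboxEvent, pub Publisher) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			relayOnce(ctx, events, pub)
		}
	}
}

// failingPublisher fails for one subject to simulate a broken publish path.
type failingPublisher struct{ failSubject string }

func (p failingPublisher) Publish(_ context.Context, subject string, _ []byte) error {
	if subject == p.failSubject {
		return errors.New("publish failed")
	}
	return nil
}

// demo runs maxAttempts relay passes over two events and returns them.
func demo() []*OutboxEvent {
	events := []*OutboxEvent{
		{ID: 1, EventType: "user.created", AggregateID: "u-1"},
		{ID: 2, EventType: "order.created", AggregateID: "o-1"},
	}
	pub := failingPublisher{failSubject: "order.created"}
	for i := 0; i < maxAttempts; i++ {
		relayOnce(context.Background(), events, pub)
	}
	return events
}

func main() {
	for _, ev := range demo() {
		fmt.Printf("id=%d published=%t dead_letter=%t attempts=%d\n",
			ev.ID, ev.Published, ev.DeadLetter, ev.Attempts)
	}
}
```

In a real worker the events slice would be replaced by a query for unpublished rows and the flag updates by UPDATE statements; runRelay would then be started with a 1-second interval.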

Trade-offs

  • Zero data loss: events survive NATS outages
  • Automatic retry: no manual recovery
  • Audit trail: full event history queryable in DB
  • Eventual consistency: 1–2 second publish delay
  • Storage overhead: outbox table grows; needs cleanup job

Rules for agents

  • Always write the outbox event inside the same db.Transaction as the business entity write — never after the transaction commits
  • Use outboxRepo.Save() inside the transaction, never nats.Publish() directly in business use cases
  • The relay worker is the only code that calls nats.Publish()
  • Dead letter events must be logged with event_id, event_type, and aggregate_id so ops can replay them manually
  • Cleanup: call outboxRepo.DeletePublished(ctx, 7) daily
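
The retention rule behind the daily cleanup call can be illustrated with a small stdlib-only sketch. The outboxRow type, its PublishedAt field, and the deletePublished function are assumptions made for illustration; the real outboxRepo.DeletePublished presumably issues a DELETE against outbox_events, and its actual signature and column names are not shown in this ADR. The point of the sketch is the selection logic: only rows that are published and older than the retention window are removed, so pending and dead-lettered events are left in place for the relay or for ops.

```go
package main

import (
	"fmt"
	"time"
)

// outboxRow is a hypothetical, minimal view of an outbox_events row.
type outboxRow struct {
	ID          int64
	PublishedAt *time.Time // nil while the event has not been published
}

// deletePublished returns the rows that survive cleanup. A row is dropped only
// if it was published more than retentionDays before now.
func deletePublished(rows []outboxRow, retentionDays int, now time.Time) []outboxRow {
	cutoff := now.AddDate(0, 0, -retentionDays)
	kept := make([]outboxRow, 0, len(rows))
	for _, r := range rows {
		if r.PublishedAt != nil && r.PublishedAt.Before(cutoff) {
			continue // published and past retention: eligible for deletion
		}
		kept = append(kept, r)
	}
	return kept
}

// demoCleanup applies a 7-day retention to three sample rows and returns the
// IDs of the rows that remain.
func demoCleanup() []int64 {
	now := time.Date(2024, 1, 31, 0, 0, 0, 0, time.UTC)
	old := now.AddDate(0, 0, -10)   // published 10 days ago
	recent := now.AddDate(0, 0, -2) // published 2 days ago
	rows := []outboxRow{
		{ID: 1, PublishedAt: &old},
		{ID: 2, PublishedAt: &recent},
		{ID: 3, PublishedAt: nil}, // never published
	}
	var ids []int64
	for _, r := range deletePublished(rows, 7, now) {
		ids = append(ids, r.ID)
	}
	return ids
}

func main() {
	fmt.Println(demoCleanup())
}
```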

Bad pattern (do not generate)

// Writing to DB, then publishing directly — dual-write risk
db.Save(&user)
nats.Publish("user.created", event) // lost if NATS is down

Good pattern

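err := db.Transaction(func(tx *gorm.DB) error {
    // Business write; returning the error rolls back the whole transaction.
    if err := tx.Create(&user).Error; err != nil {
        return err
    }
    // Outbox write in the same transaction: it commits or rolls back with the user row.
    return outboxRepo.Save(tx, OutboxEvent{
        EventType:   "user.created",
        AggregateID: user.ID,
        Payload:     mustMarshal(event),
    })
})
if err != nil {
    return err // neither the user row nor the outbox event was persisted
}
// Relay worker publishes asynchronously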

When to skip

Non-critical events (page views, analytics) where occasional loss is acceptable may publish directly to NATS without the outbox overhead.
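
One way to keep this exception explicit is a single routing function that decides the delivery path per event type. The sketch below is only an illustration: the deliveryPath type, the pathFor function, and the event type names in the allowlist are hypothetical. It defaults to the outbox, so an event type that nobody has classified is treated as critical.

```go
package main

import "fmt"

// deliveryPath names the two ways an event can reach NATS.
type deliveryPath string

const (
	viaOutbox deliveryPath = "outbox" // transactional outbox + relay worker
	direct    deliveryPath = "direct" // direct publish, occasional loss accepted
)

// nonCritical is an assumed allowlist of event types that may skip the outbox.
// The names are placeholders, not types defined by this ADR.
var nonCritical = map[string]bool{
	"page.viewed":       true,
	"analytics.tracked": true,
}

// pathFor returns the delivery path for an event type. Anything not explicitly
// listed as non-critical goes through the outbox.
func pathFor(eventType string) deliveryPath {
	if nonCritical[eventType] {
		return direct
	}
	return viaOutbox
}

func main() {
	for _, t := range []string{"user.created", "page.viewed"} {
		fmt.Printf("%s -> %s\n", t, pathFor(t))
	}
}
```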