ADR-0023: Lease-then-LLM ordering for cron-driven AI generation
Status
AcceptedTags
cron, ai, idempotency, multi-replica, fieldforce, briefingDecision
When a cron-triggered pipeline performs expensive non-idempotent work (LLM calls, AccessGate usage records, third-party API calls) before persisting a uniqueness-guarded result, the pipeline MUST claim a database-side lease as its first step:- INSERT a placeholder row with
generation_mode='pending', empty payload, andgenerated_at=now(). AUNIQUEconstraint on the natural key (e.g.(org_id, briefing_date)) settles races — only one replica wins per logical unit of work. - Replicas that lose the INSERT (UNIQUE violation) skip the org entirely. No LLM call. No AccessGate row. No wasted spend.
- The winning replica proceeds to gather → score → LLM → finalize. Finalize is an
UPDATEagainst the leased row, not a second INSERT. - If the winner crashes between claim and finalize, the row is left in
pending. A recovery clause on later ticks (WHERE generation_mode='pending' AND generated_at < now() - 5min) re-claims and retries.
Why
A naive “do the work, then INSERT with UNIQUE” pattern relies on the constraint to discard losing replicas — but only after every replica has paid for the LLM call and recorded usage against the org’s plan. At N replicas this multiplies cost N×, distorts AccessGate metrics, and silently exceeds plan caps. The lease-first pattern moves the race-resolution point to a cheap INSERT, so only one replica pays for the expensive work. The cost is one extra UPDATE per successful generation and one column (pending mode) that must be handled in the read path and recovery clause.
A Postgres advisory lock keyed on (org_id, date) would solve the same problem, but introduces a lock lifecycle to manage on crashes, and the lock is invisible in the table itself — harder to debug and observe than a pending row you can SELECT.
Leader election (one replica designated as the cron runner) would also solve it, but adds coordination infrastructure (etcd / advisory locks at the process level) for a feature that doesn’t otherwise need it, and concentrates cron failure on one node.
Rejected alternatives:
- Single-replica deployment. Forces an operational constraint to plug an architectural hole. Rejected — production should scale freely.
- Advisory locks. Adds invisible lifecycle state. Rejected.
- Leader election. Overkill for a once-per-day job. Rejected.
- LLM-first, then INSERT. The naive baseline. Rejected for the reasons above.
How it works
For a cron-triggered generator producing one row per(scope_id, period):
Known limitations
- A
pendingrow is briefly visible to read queries between claim and finalize. Read paths must filterWHERE generation_mode <> 'pending'(or treatpendingas “not yet ready”). - Recovery timeout (5 min) is a heuristic. Too short → racy re-claims under slow LLM responses. Too long → user-visible delay after a worker crash. 5 min comfortably exceeds expected LLM latency and remains within “user notices something is off” tolerance.
- The recovery clause assumes the LLM call is safe to retry — i.e., idempotent or already-paid-for via AccessGate denial. AccessGate.RecordUsage MUST NOT be called twice for the same lease retry; this is enforced by recording usage only on a successful UPDATE-to-
llmfinalize.
Rules for agents
- Any new cron + LLM pipeline MUST follow lease-then-LLM.
- The lease row’s
generation_modeMUST start as'pending'. - The read path MUST filter
pendingrows (or treat them as in-flight). - Crash recovery MUST use a stamp-based recovery clause, not a separate “abandoned leases” table.