ADR-0023: Lease-then-LLM ordering for cron-driven AI generation

Status

Accepted

Tags

cron, ai, idempotency, multi-replica, fieldforce, briefing

Decision

When a cron-triggered pipeline performs expensive non-idempotent work (LLM calls, AccessGate usage records, third-party API calls) before persisting a uniqueness-guarded result, the pipeline MUST claim a database-side lease as its first step:
  1. INSERT a placeholder row with generation_mode='pending', empty payload, and generated_at=now(). A UNIQUE constraint on the natural key (e.g. (org_id, briefing_date)) settles races — only one replica wins per logical unit of work.
  2. Replicas that lose the INSERT (UNIQUE violation) skip the org entirely. No LLM call. No AccessGate row. No wasted spend.
  3. The winning replica proceeds to gather → score → LLM → finalize. Finalize is an UPDATE against the leased row, not a second INSERT.
  4. If the winner crashes between claim and finalize, the row is left in pending. A recovery clause on later ticks (WHERE generation_mode='pending' AND generated_at < now() - 5min) re-claims and retries.
This applies to any future cron + AI pipeline. The first consumer is Phase 4 fieldforce briefings.

Why

A naive “do the work, then INSERT with UNIQUE” pattern relies on the constraint to discard losing replicas — but only after every replica has paid for the LLM call and recorded usage against the org’s plan. At N replicas this multiplies cost N×, distorts AccessGate metrics, and silently exceeds plan caps. The lease-first pattern moves the race-resolution point to a cheap INSERT, so only one replica pays for the expensive work. The cost is one extra UPDATE per successful generation and one column (pending mode) that must be handled in the read path and recovery clause. A Postgres advisory lock keyed on (org_id, date) would solve the same problem, but introduces a lock lifecycle to manage on crashes, and the lock is invisible in the table itself — harder to debug and observe than a pending row you can SELECT. Leader election (one replica designated as the cron runner) would also solve it, but adds coordination infrastructure (etcd / advisory locks at the process level) for a feature that doesn’t otherwise need it, and concentrates cron failure on one node. Rejected alternatives:
  • Single-replica deployment. Forces an operational constraint to plug an architectural hole. Rejected — production should scale freely.
  • Advisory locks. Adds invisible lifecycle state. Rejected.
  • Leader election. Overkill for a once-per-day job. Rejected.
  • LLM-first, then INSERT. The naive baseline. Rejected for the reasons above.

How it works

For a cron-triggered generator producing one row per (scope_id, period):
// Step 1: claim the lease
res, err := db.Exec(`
    INSERT INTO ff_briefings (id, org_id, briefing_date, generation_mode, summary_md, at_risk_count, ...)
    VALUES ($1, $2, $3, 'pending', '', 0, ...)
    ON CONFLICT (org_id, briefing_date) DO NOTHING
`, newID, orgID, localDate)

rows, _ := res.RowsAffected()
if rows == 0 {
    return // another replica owns this org/day or it's already done
}

// Step 2: do the expensive work (LLM, gate, etc.)
result, err := generate(ctx, orgID, localDate)
if err != nil {
    // row stays 'pending'; recovery clause picks it up later
    return
}

// Step 3: finalize via UPDATE
_, err = db.Exec(`
    UPDATE ff_briefings
    SET generation_mode = $1, summary_md = $2, ...,  generated_at = now()
    WHERE id = $3
`, result.Mode, result.Summary, ..., newID)
The recovery clause runs at the start of step 1 on later ticks:
UPDATE ff_briefings
SET generated_at = now()  -- re-stamp so a current replica is the new lease holder
WHERE generation_mode = 'pending'
  AND generated_at < now() - INTERVAL '5 minutes'
RETURNING id, org_id, briefing_date;

Known limitations

  • A pending row is briefly visible to read queries between claim and finalize. Read paths must filter WHERE generation_mode <> 'pending' (or treat pending as “not yet ready”).
  • Recovery timeout (5 min) is a heuristic. Too short → racy re-claims under slow LLM responses. Too long → user-visible delay after a worker crash. 5 min comfortably exceeds expected LLM latency and remains within “user notices something is off” tolerance.
  • The recovery clause assumes the LLM call is safe to retry — i.e., idempotent or already-paid-for via AccessGate denial. AccessGate.RecordUsage MUST NOT be called twice for the same lease retry; this is enforced by recording usage only on a successful UPDATE-to-llm finalize.

Rules for agents

  • Any new cron + LLM pipeline MUST follow lease-then-LLM.
  • The lease row’s generation_mode MUST start as 'pending'.
  • The read path MUST filter pending rows (or treat them as in-flight).
  • Crash recovery MUST use a stamp-based recovery clause, not a separate “abandoned leases” table.

Bad pattern (do not generate)

// LLM-first: every replica pays before UNIQUE rejects all but one
result := callLLM(ctx, ...)
accessGate.RecordUsage(ctx, ...)
_, err := db.Exec(`INSERT INTO ff_briefings (...) VALUES (...) ON CONFLICT DO NOTHING`, ...)
// 2 of 3 replicas just wasted an LLM call

Good pattern

See “How it works” above. The shape is: cheap INSERT → expensive work → UPDATE.