ADR-0025: DB-managed LLM prompts as canonical, no code fallback
Status
Accepted
Tags
ai, prompts, llm, fieldforce, briefing
Decision
For LLM features that use the ai_prompt_templates table, the database is the sole source of truth for the active prompt. There is no code-shipped LLM prompt fallback.
- Code-shipped seed files under backend/go/internal/modules/ai/seed_prompts/<feature_key>/<locale>.tmpl exist only to seed v1 rows during the initial migration. They are not consulted at runtime.
- If the DB read fails for any reason (empty active row, missing table, connection error, malformed template), the pipeline MUST log loudly, record an AccessGate error, skip the LLM call, and fall back to the heuristic-only generation path for that tick.
- “Heuristic-only” is the same well-tested branch that fires when AccessGate denies an LLM call (trial exhausted, plan cap, kill-switch). It is not LLM-related; it produces a deterministic templated summary from already-computed metrics.
This decision applies to every feature backed by ai_prompt_templates. The first consumer is Phase 4 fieldforce briefings.
Why
The instinctive design is to ship a code constant as an “emergency fallback”: if the DB is unreachable, the system can still call the LLM with the constant. This is wrong for two compounding reasons:
- A stale constant is more dangerous than no LLM call. Once a prompt has been tuned through 3-4 DB versions, the code constant is still the original v1 from months ago. When the emergency fallback fires, the LLM produces output the org has never seen, at exactly the moment something else is already broken. Users perceive this as “the AI got worse” rather than “the database is down.”
- The fallback path is not the rare path. A code constant only catches catastrophic DB failure (table missing, connection lost). The far more common failure modes — empty active row after a botched activation, malformed template after an edit, transient query timeout — all need to be handled anyway, and they should all converge on the same well-tested branch. Having two failure paths (code-fallback for catastrophic, heuristic for ordinary) doubles the test surface and the failure-mode reasoning.
Alternatives considered
- Code constants as emergency LLM fallback. The original design. Rejected for the reasons above.
- DB row as canonical + code constant as identical mirror, kept in sync by PR review. Requires a manual copy-back step after every prompt tuning. Skipped under time pressure → drift → stale fallback. Rejected.
- No seeding at all (admin must create v1 manually). Adds a deploy-blocking manual step. Rejected — bootstrap convenience is real.
How it works
Read path (every LLM call):
Known limitations
- A catastrophic DB-prompts outage means all LLM calls for that feature fall to heuristic-only across the platform. This is the intended behavior, but it means observability matters: fieldforce_briefings_total{mode="heuristic_only"} should trigger an alarm if it crosses a threshold relative to flag-enabled orgs.
- Recovering a corrupted active row requires a platform-admin action (re-activate a prior version), not an automatic code rollback. This is intentional: automatic rollback to code would defeat the whole “DB is canonical” guarantee.
- Disaster recovery from a complete data loss requires re-running the migration’s seed step. Seed files MUST stay current enough that a v1 restore is acceptable as a temporary state until the platform admin restores known-good versions from backup.
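The alarm condition in the first limitation can be expressed as a simple ratio check. The sketch below is only an illustration: the function name, the idea of counting over a fixed window, and the 0.25 threshold are assumptions, and a real deployment would more likely express this as an alerting rule in the monitoring system.

```go
package main

import "fmt"

// shouldAlarm reports whether the number of heuristic-only briefings in a
// window exceeds maxRatio of the orgs that have the feature flag enabled.
// Both counts and the threshold are hypothetical inputs for this sketch.
func shouldAlarm(heuristicOnly, flagEnabledOrgs int, maxRatio float64) bool {
	if flagEnabledOrgs <= 0 {
		return false // no enabled orgs, so there is nothing to compare against
	}
	return float64(heuristicOnly)/float64(flagEnabledOrgs) > maxRatio
}

func main() {
	// Some heuristic-only traffic is expected from ordinary gate denials
	// (trial exhausted, plan cap), so the threshold should sit above that baseline.
	fmt.Println(shouldAlarm(10, 100, 0.25))
	fmt.Println(shouldAlarm(30, 100, 0.25))
}
```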
Rules for agents
- Features backed by ai_prompt_templates MUST NOT define an in-code LLM prompt constant as a runtime fallback.
- Features MAY ship seed_prompts/<feature_key>/<locale>.tmpl files for migration bootstrap. They MUST NOT be referenced from any runtime code path.
- DB prompt read failures MUST fall through to the existing heuristic-only / gate-denied path, never to a code constant.
- Heuristic-only templates (deterministic locale-specific text built from metrics) ARE allowed in code — they are not LLM prompts.