ADR-0026: LLM-driven HTML enrichment with agent file-writing contract
Status
Proposed
Tags
ai, llm, docs, astro, plans, generation, agentic-cli
Decision
For the apps/plans spec review system, HTML enrichment is delegated to an agentic LLM CLI (claude / codex / gemini) invoked by an orchestration script. The script does not transform markdown into HTML itself.
Specifically:
- The script (frontend/astro/apps/plans/scripts/generate-plan-html.ts) builds a prompt that inlines the source markdown, the component catalog, and the Zod schema (auto-rendered as a TypeScript type definition), then shells out to the chosen CLI.
- The agent writes the output files directly to known paths using its native Read/Write tools: apps/plans/src/content/specs/<slug>.json and the sibling <slug>.html. The script does not parse agent stdout for structured output.
- After the agent returns, the script reads the files, validates the JSON against the Zod schema, stamps script-produced fields (source_hash, generated_at, generated_by, etc.), and atomically promotes the temp artifacts to the final location. If validation fails, no files in the final location are touched.
- The CLI choice is pluggable via the --with=<cli> flag (default claude), so the workflow does not depend on any single CLI being available.
- A SHA-256 of the source markdown is stored on the JSON; by default, generation is skipped when the hash matches, to avoid burning tokens on unchanged specs. --force overrides.
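The hash-based skip in the last item can be sketched as follows. This is a minimal illustration, not the script's actual code: the helper names sha256Hex and shouldGenerate are hypothetical, and it assumes source_hash is stored as a lowercase hex digest.

```typescript
import { createHash } from "node:crypto";

// Hex-encoded SHA-256 of the source markdown.
function sha256Hex(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Decide whether a spec needs a (token-costing) regeneration.
// storedHash is the source_hash read from an existing <slug>.json,
// or undefined when no artifact exists yet.
function shouldGenerate(
  markdown: string,
  storedHash: string | undefined,
  force: boolean,
): boolean {
  if (force) return true; // --force always regenerates
  return storedHash !== sha256Hex(markdown); // skip only on an exact match
}
```

Comparing against a hash of the raw markdown means any edit, including whitespace-only changes, triggers a regeneration; normalizing the text before hashing would be a separate design choice.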
Why
The instinctive design is a deterministic markdown-to-HTML transformer (remark + custom renderer + templates). This is wrong for the spec-review use case for three compounding reasons:
- Deterministic templates cannot produce the visual restructuring this workflow requires. The reference page at docs/superpowers/specs/2026-05-22-fieldforce-phase4-briefings-design.html reshapes the source markdown: it foregrounds decisions, summarizes prose into key-value blocks, picks which sections deserve callouts, and infers which content benefits from a comparison grid vs. a timeline vs. a flow diagram. A template can only render the structures the markdown already exposes. An agentic LLM picks the right structure per section.
- The work fits agentic CLIs better than completion APIs. claude/codex/gemini are designed to read inputs, reason about them, and call tools, including writing files. Asking them to emit a 50KB JSON blob to stdout fights their grain: stdout interleaves thinking, tool calls, and prose, and disciplined JSON output requires brittle prompting. Letting the agent write files directly matches the tools’ shape.
- Validation is clearer post-hoc than mid-stream. With agent-writes-file, the script’s verification is a tight loop: existsSync(path) && schema.parse(readFileSync(path)). A failure points at exactly one artifact. Stdout-parsing requires extracting fenced blocks from agent output that may include planning text, recoverable errors, and mid-process tool calls, which makes it much harder to fail cleanly.
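The post-hoc verification loop can be written out slightly more fully than the one-liner above, which omits the JSON.parse step. The sketch below is an assumption-laden illustration: SchemaLike is a hypothetical structural stand-in for the shared Zod schema (any object with a throwing parse method fits), and validateJsonText and validateArtifact are made-up names.

```typescript
import { existsSync, readFileSync } from "node:fs";

// Hypothetical stand-in for the shared Zod schema: anything with a
// parse() that returns the typed value or throws.
interface SchemaLike<T> {
  parse(input: unknown): T;
}

type Validation<T> =
  | { ok: true; value: T }
  | { ok: false; reason: string };

// Parse the JSON text, then validate its shape. Both failure modes
// (malformed JSON, schema mismatch) collapse into one result type.
function validateJsonText<T>(text: string, schema: SchemaLike<T>): Validation<T> {
  try {
    return { ok: true, value: schema.parse(JSON.parse(text)) };
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err);
    return { ok: false, reason: `invalid artifact: ${detail}` };
  }
}

// The check the script runs after the agent returns: the file must
// exist and its contents must validate. A failure names one path.
function validateArtifact<T>(path: string, schema: SchemaLike<T>): Validation<T> {
  if (!existsSync(path)) {
    return { ok: false, reason: `missing artifact: ${path}` };
  }
  return validateJsonText(readFileSync(path, "utf8"), schema);
}
```

Returning a result object rather than throwing lets the caller abort promotion in one place and report exactly which artifact failed.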
Alternatives considered
- Deterministic Node/TS script (remark + custom renderer + templates). Cheap, reproducible, no token cost, no API key. Rejected: it cannot produce the visual restructuring the §1 goal requires; it only renders patterns the markdown already exposes.
- Hybrid: deterministic shell + LLM body enrichment. The shell handles archive/search/TOC/layout deterministically; the LLM enriches the body. This is essentially the chosen design (the layout supplies chrome; the agent authors only body_html), but with the LLM call inside a Node script that calls an SDK rather than shelling out to an agentic CLI. Rejected because the agentic CLIs already handle file writing, retries, and tool orchestration; replicating that scaffolding in-script is duplication.
- Agent emits JSON to stdout; script parses. Rejected for the stdout-discipline problem above. JSON in tool output is reliable; JSON in agentic CLI stdout is not.
- Single hardcoded CLI. Rejected because we don’t yet know which CLI produces the best output for this content. A pluggable CLI lets us A/B without code changes.
- Build-time generation as an Astro integration. Rejected because (a) specs are created a few times per week, not on every build, and burning tokens on every astro build is wasteful; (b) build-time non-determinism is a footgun for CI.
How it works
The orchestration script lives at frontend/astro/apps/plans/scripts/generate-plan-html.ts and is invoked via pnpm --filter @astro/plans generate .... The prompt template lives at frontend/astro/apps/plans/prompts/generate-plan.md.
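As a rough illustration of how a prompt template can inline its three inputs, the sketch below fills named placeholders in a template string. The placeholder syntax ({{SOURCE_MARKDOWN}}, {{COMPONENT_CATALOG}}, {{SCHEMA_TYPE}}) and the buildPrompt helper are assumptions for this example; the real template at prompts/generate-plan.md may use a different convention.

```typescript
// Hypothetical inputs the prompt must carry inline.
interface PromptInputs {
  sourceMarkdown: string;
  componentCatalog: string;
  schemaType: string; // the Zod schema rendered as a TypeScript type
}

// Fill {{NAME}} placeholders in a single pass over the template.
// A replacer function is used so "$" sequences inside the markdown
// are inserted literally, and inserted text is never re-scanned.
function buildPrompt(template: string, inputs: PromptInputs): string {
  const values = new Map<string, string>([
    ["SOURCE_MARKDOWN", inputs.sourceMarkdown],
    ["COMPONENT_CATALOG", inputs.componentCatalog],
    ["SCHEMA_TYPE", inputs.schemaType],
  ]);
  return template.replace(/\{\{(\w+)\}\}/g, (match: string, key: string) => {
    const value = values.get(key);
    if (value === undefined) {
      // Fail loudly instead of sending a half-filled prompt to the agent.
      throw new Error(`unknown placeholder ${match}`);
    }
    return value;
  });
}
```

Because everything the agent needs is in the returned string, the same string can be hashed or logged to reproduce a run.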
Known limitations
- Non-determinism across runs. Two runs over the same markdown produce different HTML. This is acceptable because the workflow is human-review-then-commit: the human sees the result before it ships. For specs where the output diverges in undesirable ways, the remedy is to tighten the prompt, not to make generation deterministic.
- Token cost. Each generation re-renders the entire body. A 46KB markdown spec like Phase 4 briefings produces a multi-thousand-token agent run. Hash-based skip mitigates re-runs; --force --all is the expensive operation and exists for deliberate use only.
- CLI quirks. claude -p, codex exec, and gemini have different prompt invocation conventions, different auth flows, and different rate-limit behavior. The orchestration script isolates these in one CLI-dispatch function so they don’t leak into the prompt template.
- Validation failures are workflow blocks. If the agent produces invalid JSON, the script aborts and no files are promoted. The human must re-run (potentially with a different CLI, or with --force) to recover. This is intentional: a half-broken artifact in src/content/specs/ is worse than no artifact.
- No streaming output to the reviewer. The reviewer sees the page only after generation completes. For very large specs this can be 30–60 seconds of silence. Acceptable for a few-per-week workflow.
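The single CLI-dispatch function mentioned under CLI quirks might look roughly like the sketch below. It only builds the command line and does not spawn anything. The names parseWithFlag and buildInvocation are hypothetical, the gemini -p form is an assumption (the text above names only claude -p and codex exec), and any permission or auth flags a CLI needs in order to write files are deliberately left out.

```typescript
type Cli = "claude" | "codex" | "gemini";

const SUPPORTED_CLIS: readonly string[] = ["claude", "codex", "gemini"];

// Read --with=<cli> from argv, defaulting to claude.
function parseWithFlag(argv: readonly string[]): Cli {
  const prefix = "--with=";
  const flag = argv.find((arg) => arg.startsWith(prefix));
  if (flag === undefined) return "claude";
  const value = flag.slice(prefix.length);
  if (!SUPPORTED_CLIS.includes(value)) {
    throw new Error(`unsupported CLI "${value}"; expected one of ${SUPPORTED_CLIS.join(", ")}`);
  }
  return value as Cli;
}

interface Invocation {
  command: string;
  args: string[];
}

// Keep per-CLI invocation differences in one place so the prompt
// template stays CLI-agnostic.
function buildInvocation(cli: Cli, prompt: string): Invocation {
  switch (cli) {
    case "claude":
      return { command: "claude", args: ["-p", prompt] };
    case "codex":
      return { command: "codex", args: ["exec", prompt] };
    case "gemini":
      // Assumption: non-interactive prompt flag; verify against the installed CLI.
      return { command: "gemini", args: ["-p", prompt] };
  }
}
```

Passing the result to a process spawner as an argument array, rather than building a shell string, avoids quoting problems with markdown content in the prompt.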
Rules for agents
- The orchestration script MUST NOT call LLM provider SDKs directly. It only shells out to agentic CLIs.
- The script MUST validate agent output with the shared Zod schema before promotion. No “trust the agent” path.
- The script MUST use atomic rename (or equivalent) for promotion. Partial writes never reach the content collection.
- The prompt template MUST inline source markdown, component catalog, and the Zod schema as a TypeScript type. The agent MUST NOT be expected to read auxiliary files itself — keep its inputs in the prompt for reproducibility and hashability.
- The agent MUST NOT invent CSS classes, write <style> tags, or use inline style="" attributes. See ADR-0027 for the closed component vocabulary that constrains this.
- The agent’s tool calls during generation are not audit-logged. The artifact is the artifact; provenance is generated_by + generated_at on the JSON.
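The stamping and atomic-promotion rules can be sketched as below. This is an illustration under assumptions, not the script's code: stampAndPromote and the ArtifactPaths shape are hypothetical, the temp JSON is assumed to have already passed schema validation, and the temp and final paths are assumed to sit on the same filesystem (a rename across filesystems fails instead of being atomic).

```typescript
import { mkdirSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { dirname } from "node:path";

interface ArtifactPaths {
  tempJson: string;
  tempHtml: string;
  finalJson: string;
  finalHtml: string;
}

// Script-produced provenance fields named in this ADR.
interface Stamps {
  source_hash: string;
  generated_at: string;
  generated_by: string;
}

// Stamp the validated temp JSON, then move both artifacts into place.
// All writes happen on temp paths; the final location only ever sees
// complete files because each rename replaces the target in one step.
function stampAndPromote(paths: ArtifactPaths, stamps: Stamps): void {
  const data = JSON.parse(readFileSync(paths.tempJson, "utf8")) as Record<string, unknown>;
  // Script-owned fields win over anything the agent wrote under the same keys.
  const stamped = { ...data, ...stamps };
  writeFileSync(paths.tempJson, JSON.stringify(stamped, null, 2) + "\n", "utf8");

  mkdirSync(dirname(paths.finalHtml), { recursive: true });
  mkdirSync(dirname(paths.finalJson), { recursive: true });
  // Design choice in this sketch: HTML first, JSON last, so a JSON
  // entry never appears before its sibling HTML exists.
  renameSync(paths.tempHtml, paths.finalHtml);
  renameSync(paths.tempJson, paths.finalJson);
}
```

Each rename is atomic on its own, but the pair is not a single transaction; a crash between the two renames would leave an HTML file without its JSON, which a rerun overwrites.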