ADR-0026: LLM-driven HTML enrichment with agent file-writing contract

Status

Proposed

Tags

ai, llm, docs, astro, plans, generation, agentic-cli

Decision

For the apps/plans spec review system, HTML enrichment is delegated to an agentic LLM CLI (claude / codex / gemini) invoked by an orchestration script. The script does not transform markdown into HTML itself. Specifically:
  • The script (frontend/astro/apps/plans/scripts/generate-plan-html.ts) builds a prompt that inlines the source markdown, the component catalog, and the Zod schema (auto-rendered as a TypeScript type definition), then shells out to the chosen CLI.
  • The agent writes the output files directly to paths the script names in the prompt, using its native Read/Write tools: <slug>.json and the sibling <slug>.html in a temp workspace, which the script later promotes to apps/plans/src/content/specs/. The script does not parse agent stdout for structured output.
  • After the agent returns, the script reads the files, validates the JSON against the Zod schema, stamps script-produced fields (source_hash, generated_at, generated_by, etc.), and atomically promotes the temp artifacts to the final location. Validation failure → no files in the final location are touched.
  • The CLI choice is pluggable via a --with=<cli> flag (default claude) so the workflow does not depend on any single CLI being available.
  • A SHA-256 of the source markdown is stored in the JSON as source_hash. By default the script skips generation when the stored hash matches the current source, to avoid burning tokens on unchanged specs; --force overrides the skip.
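As a rough illustration of the hash-based skip in the last bullet, the check could look like the sketch below. The names hashSource, shouldSkip, and StoredSpec are hypothetical; only the source_hash field and the --force override come from this ADR.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape: only the field the skip check needs.
interface StoredSpec {
  source_hash?: string;
}

/** Hex-encoded SHA-256 of the source markdown. */
function hashSource(source: string | Uint8Array): string {
  return createHash("sha256").update(source).digest("hex");
}

/**
 * True when generation can be skipped: a previously generated JSON exists,
 * its stored hash matches the current source, and --force was not passed.
 */
function shouldSkip(
  source: string | Uint8Array,
  stored: StoredSpec | null,
  force: boolean,
): boolean {
  if (force || stored === null || stored.source_hash === undefined) {
    return false;
  }
  return stored.source_hash === hashSource(source);
}
```

Hashing the source rather than comparing timestamps means a touched but unchanged file still skips, while any edit to the markdown triggers regeneration.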

Why

The instinctive design is a deterministic markdown-to-HTML transformer (remark + custom renderer + templates). This is wrong for the spec-review use case for three compounding reasons:
  1. Deterministic templates cannot produce the visual restructuring this workflow requires. The reference page at docs/superpowers/specs/2026-05-22-fieldforce-phase4-briefings-design.html reshapes the source markdown: foregrounds decisions, summarizes prose into key-value blocks, picks which sections deserve callouts, infers which content benefits from a comparison grid vs. a timeline vs. a flow diagram. A template can only render the structures the markdown already exposes. An agentic LLM picks the right structure per section.
  2. The work fits agentic CLIs better than completion APIs. claude / codex / gemini are designed to read inputs, reason about them, and call tools — including writing files. Asking them to emit a 50KB JSON blob to stdout fights their grain: stdout interleaves thinking, tool calls, and prose, and disciplined JSON output requires brittle prompting. Letting the agent write files directly matches the tools’ shape.
  3. Validation is clearer post-hoc than mid-stream. With agent-writes-file, the script’s verification is a tight loop: existsSync(path) && schema.parse(JSON.parse(readFileSync(path, 'utf8'))). A failure points at exactly one artifact. Stdout-parsing requires extracting fenced blocks from agent output that may include planning text, recoverable errors, and mid-process tool calls, which makes it much harder to fail cleanly.

Rejected alternatives:
  • Deterministic Node/TS script (remark + custom renderer + templates). Cheap, reproducible, no token cost, no API key. Rejected: it cannot produce the visual restructuring described in reason 1 above; it only renders patterns the markdown already exposes.
  • Hybrid: deterministic shell + LLM body enrichment. The shell handles archive/search/TOC/layout deterministically; the LLM enriches the body. This is essentially the chosen design (the layout supplies chrome; the agent authors only body_html), but with the LLM call inside a Node script that calls an SDK rather than shelling out to an agentic CLI. Rejected because the agentic CLIs already handle file writing, retries, and tool orchestration — replicating that scaffolding in-script is duplication.
  • Agent emits JSON to stdout; script parses. Rejected for the stdout-discipline problem above. JSON in tool output is reliable; JSON in agentic CLI stdout is not.
  • Single hardcoded CLI. Rejected because we don’t yet know which CLI produces the best output for this content. Pluggable lets us A/B without code changes.
  • Build-time generation as an Astro integration. Rejected because (a) specs are created a few times per week, not on every build, and burning tokens on every astro build is wasteful; (b) build-time non-determinism is a footgun for CI.

How it works

human edits docs/superpowers/specs/2026-05-25-foo.md
   │
   ▼
pnpm --filter @astro/plans generate docs/superpowers/specs/2026-05-25-foo.md
   │
   ▼ orchestration script
   │   1. Read source markdown
   │   2. Compute SHA-256 of source bytes
   │   3. If <slug>.json exists and source_hash matches → skip (unless --force)
   │   4. Render Zod schema → TypeScript type definition string
   │   5. Compose prompt from prompts/generate-plan.md with {{VAR}} substitution
   │   6. spawn CLI: `claude -p "<prompt>"` (or codex/gemini equivalent)

   ▼ agent
   │   - Reads no extra files (everything inlined in prompt)
   │   - Writes <tmp>/specs/<slug>.json + <tmp>/specs/<slug>.html
   │   - Writes diagram images to <tmp>/spec-images/<slug>/ if applicable
   │   - Exits

   ▼ script post-validation
   │   1. readFileSync(<tmp>/specs/<slug>.json) → JSON.parse → Zod.parse
   │   2. Verify sibling .html file exists
   │   3. Stamp script-produced fields onto the JSON
   │   4. Atomic rename: <tmp>/* → apps/plans/src/content/specs/*
   │   5. Rebuild search-index.json from full collection

   ▼ skill (in conversation) surfaces:
       - "Generated <slug>"
       - "N suggested components" (if any)
       - Path to view in browser

The script lives at frontend/astro/apps/plans/scripts/generate-plan-html.ts and is invoked via pnpm --filter @astro/plans generate .... The prompt template lives at frontend/astro/apps/plans/prompts/generate-plan.md.
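Step 5 of the orchestration script ({{VAR}} substitution) can be sketched as below. This is a minimal illustration, not the script's actual code: substituteVars is a hypothetical helper, and the placeholder names in the usage note are assumptions rather than the template's real variables.

```typescript
/**
 * Replace every {{NAME}} placeholder in the template with its value.
 * Throws if the template references a placeholder that has no value,
 * so a typo in the template fails loudly instead of reaching the agent.
 */
function substituteVars(
  template: string,
  vars: Record<string, string>,
): string {
  return template.replace(
    /\{\{([A-Z0-9_]+)\}\}/g,
    (_match: string, name: string): string => {
      if (!Object.prototype.hasOwnProperty.call(vars, name)) {
        throw new Error(`No value for placeholder {{${name}}}`);
      }
      return vars[name];
    },
  );
}
```

For example, substituteVars("Source:\n{{SOURCE_MARKDOWN}}", { SOURCE_MARKDOWN: md }) inlines the markdown. A replacer function is used instead of a replacement string so that sequences such as $& inside the markdown are inserted literally, and substituted values are not rescanned for further placeholders.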

Known limitations

  • Non-determinism across runs. Two runs over the same markdown can produce different HTML. This is acceptable because the workflow is human-review-then-commit: the human sees the result before it ships. For specs where the output diverges in undesirable ways, the remedy is to tighten the prompt, not to make generation deterministic.
  • Token cost. Each generation re-renders the entire body. A 46KB markdown spec like Phase 4 briefings produces a multi-thousand-token agent run. Hash-based skip mitigates re-runs; --force --all is the expensive operation and exists for deliberate use only.
  • CLI quirks. claude -p, codex exec, and gemini have different prompt invocation conventions, different auth flows, and different rate-limit behavior. The orchestration script isolates these in one CLI-dispatch function so they don’t leak into the prompt template.
  • Validation failures are workflow blocks. If the agent produces invalid JSON, the script aborts and no files are promoted. The human must re-run (potentially with a different CLI, or with --force) to recover. This is intentional — a half-broken artifact in src/content/specs/ is worse than no artifact.
  • No streaming output to the reviewer. The reviewer sees the page only after generation completes. For very large specs this can be 30–60 seconds of silence. Acceptable for a few-per-week workflow.
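The single CLI-dispatch function mentioned under "CLI quirks" might be shaped like the sketch below. The names Cli, Invocation, parseCliFlag, and buildInvocation are hypothetical. The claude -p and codex exec forms are the ones named in this ADR; the gemini flag is an assumption and should be checked against the installed CLI.

```typescript
type Cli = "claude" | "codex" | "gemini";

interface Invocation {
  command: string;
  args: string[];
}

// One argv builder per CLI, so invocation quirks stay out of the prompt template.
const ARGS: Record<Cli, (prompt: string) => string[]> = {
  claude: (prompt) => ["-p", prompt], // `claude -p`, as named in this ADR
  codex: (prompt) => ["exec", prompt], // `codex exec`, as named in this ADR
  gemini: (prompt) => ["-p", prompt], // ASSUMPTION: verify the flag for your gemini version
};

function isCli(value: string): value is Cli {
  return Object.prototype.hasOwnProperty.call(ARGS, value);
}

/** Resolve --with=<cli> from argv, defaulting to claude. */
function parseCliFlag(argv: string[]): Cli {
  const flag = argv.find((arg) => arg.startsWith("--with="));
  if (flag === undefined) return "claude";
  const value = flag.slice("--with=".length);
  if (!isCli(value)) throw new Error(`Unknown CLI: ${value}`);
  return value;
}

/** Build the command and argument vector for the chosen CLI. */
function buildInvocation(cli: Cli, prompt: string): Invocation {
  return { command: cli, args: ARGS[cli](prompt) };
}
```

The result can be handed to child_process.spawn(command, args), which passes the prompt as a single argument without going through a shell, so quotes and backticks in the markdown need no escaping.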

Rules for agents

  • The orchestration script MUST NOT call LLM provider SDKs directly. It only shells out to agentic CLIs.
  • The script MUST validate agent output with the shared Zod schema before promotion. No “trust the agent” path.
  • The script MUST use atomic rename (or equivalent) for promotion. Partial writes never reach the content collection.
  • The prompt template MUST inline source markdown, component catalog, and the Zod schema as a TypeScript type. The agent MUST NOT be expected to read auxiliary files itself — keep its inputs in the prompt for reproducibility and hashability.
  • The agent MUST NOT invent CSS classes, write <style> tags, or use inline style="" attributes. See ADR-0027 for the closed component vocabulary that constrains this.
  • The agent’s tool calls during generation are not audit-logged. The artifact is the artifact; provenance is generated_by + generated_at on the JSON.
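One way to satisfy the atomic-rename rule is sketched below, under two assumptions that this ADR does not state: the temp workspace is created next to the final directory (a rename is only atomic within one filesystem), and the HTML is promoted before the JSON so a JSON entry never appears without its sibling. prepareTempWorkspace and promote are illustrative names, not the script's actual helpers.

```typescript
import { existsSync, mkdirSync, mkdtempSync, renameSync } from "node:fs";
import { dirname, join } from "node:path";

/**
 * Create the temp workspace as a sibling of finalDir, so the later
 * rename never has to cross a filesystem boundary.
 */
function prepareTempWorkspace(finalDir: string): string {
  const parent = dirname(finalDir);
  mkdirSync(parent, { recursive: true });
  return mkdtempSync(join(parent, ".plan-tmp-"));
}

/**
 * Move <slug>.html and <slug>.json from tmpDir into finalDir.
 * Throws before touching finalDir if either artifact is missing.
 */
function promote(tmpDir: string, finalDir: string, slug: string): void {
  const names = [`${slug}.html`, `${slug}.json`]; // HTML first, JSON last
  for (const name of names) {
    if (!existsSync(join(tmpDir, name))) {
      throw new Error(`Missing artifact: ${name}`);
    }
  }
  mkdirSync(finalDir, { recursive: true });
  for (const name of names) {
    renameSync(join(tmpDir, name), join(finalDir, name));
  }
}
```

Each rename is atomic for its own file, but the pair is not promoted as a unit; a stricter variant could rename a whole directory instead.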

Bad pattern (do not generate)

// In-script deterministic enrichment:
const file = unified()
  .use(remarkParse)
  .use(remarkRehype)
  .use(rehypeStringify)
  .processSync(markdown);
fs.writeFileSync(outputPath, String(file)); // wrong: no visual restructuring, no decision foregrounding

// Or: parse agent stdout for JSON:
const out = execSync('claude -p "...generate..."').toString();
const jsonBlock = out.match(/```json\n([\s\S]+?)\n```/)?.[1]; // wrong: stdout discipline is fragile
const data = JSON.parse(jsonBlock!);

Good pattern

const { tmpDir } = await prepareTempWorkspace(slug);
const jsonPath = path.join(tmpDir, `${slug}.json`);
const htmlPath = path.join(tmpDir, `${slug}.html`);
const prompt = composePrompt({
  template: readFileSync(PROMPT_PATH, 'utf8'),
  sourceMarkdown: readFileSync(specPath, 'utf8'),
  catalog: readFileSync(CATALOG_PATH, 'utf8'),
  schemaAsTypescript: renderZodAsType(SpecSchema),
  outputJsonPath: jsonPath,
  outputHtmlPath: htmlPath,
  diagramMode,
  imagesEnabled,
});
await runCli(cli, prompt); // spawns claude/codex/gemini; agent writes files directly

const raw = JSON.parse(readFileSync(jsonPath, 'utf8'));
const validated = SpecSchema.parse(raw); // throws on schema violation → no promotion
if (!existsSync(htmlPath)) throw new Error(`Missing sibling HTML: ${htmlPath}`);
const finalJson = stampScriptFields(validated, { sourceHash, generatedBy: cli, /* … */ });
writeFileSync(jsonPath, JSON.stringify(finalJson, null, 2)); // persist stamped fields before promotion

await atomicPromote(tmpDir, FINAL_DIR); // rename, never partial write
await rebuildSearchIndex();
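
The stampScriptFields helper used above is not defined in this ADR. A minimal sketch of what it might do follows, assuming the three fields named in the Decision section (source_hash, generated_at, generated_by) and an ISO 8601 timestamp; the options shape and the now parameter are hypothetical.

```typescript
interface ScriptFields {
  source_hash: string;
  generated_at: string;
  generated_by: string;
}

interface StampOptions {
  sourceHash: string;
  generatedBy: string;
  now?: Date; // injectable clock, an assumption added to keep the sketch testable
}

/**
 * Return a copy of the validated agent JSON with script-owned fields set.
 * The script's values are spread last, so they override anything the
 * agent wrote under the same keys.
 */
function stampScriptFields<T extends object>(
  agentJson: T,
  opts: StampOptions,
): T & ScriptFields {
  return {
    ...agentJson,
    source_hash: opts.sourceHash,
    generated_at: (opts.now ?? new Date()).toISOString(),
    generated_by: opts.generatedBy,
  };
}
```

Keeping these fields script-owned means provenance does not depend on the agent reporting its own identity or timestamp correctly.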