ADR-0026: LLM-driven HTML enrichment with agent file-writing contract

Status

Proposed

Decision

For the apps/plans spec review system, HTML enrichment is delegated to an agentic LLM CLI (claude / codex / gemini) invoked by an orchestration script. The script does not transform markdown into HTML itself. Specifically:

The script (frontend/astro/apps/plans/scripts/generate-plan-html.ts) builds a prompt that inlines the source markdown, the component catalog, and the Zod schema (auto-rendered as a TypeScript type definition), then shells out to the chosen CLI.
The agent writes the output files directly to known paths in a temp workspace using its native Read/Write tools: <slug>.json and the sibling <slug>.html, which the script later promotes to apps/plans/src/content/specs/. The script does not parse agent stdout for structured output.
After the agent returns, the script reads the files, validates the JSON against the Zod schema, stamps script-produced fields (source_hash, generated_at, generated_by, etc.), and atomically promotes the temp artifacts to the final location. Validation failure → no files in the final location are touched.
The CLI choice is pluggable via --with=<cli> flag (default claude) so the workflow does not depend on any single CLI being available.
A SHA-256 hash of the source markdown is stored in the JSON as source_hash. By default the script skips generation when the stored hash matches the current source, to avoid burning tokens on unchanged specs; --force overrides the skip.
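
The hash-based skip in the last point could look roughly like the sketch below. It is an illustration under assumptions, not the script's actual code: hashSource and shouldSkip are hypothetical names, and only the source_hash field and the --force flag come from this ADR.

```typescript
import { createHash } from "node:crypto";
import { existsSync, readFileSync } from "node:fs";

/** Hex-encoded SHA-256 of the raw source bytes (hypothetical helper). */
export function hashSource(bytes: Buffer | string): string {
  return createHash("sha256").update(bytes).digest("hex");
}

/**
 * Returns true when generation can be skipped: a previously generated JSON
 * exists, its stored source_hash equals the current hash, and --force is off.
 */
export function shouldSkip(jsonPath: string, sourceHash: string, force: boolean): boolean {
  if (force || !existsSync(jsonPath)) return false;
  try {
    const previous = JSON.parse(readFileSync(jsonPath, "utf8")) as { source_hash?: unknown };
    return previous.source_hash === sourceHash;
  } catch {
    // Unreadable or corrupt JSON: regenerate rather than trust it.
    return false;
  }
}
```

Treating a corrupt JSON file as "do not skip" is a design choice of this sketch; it errs toward spending tokens rather than keeping a broken artifact.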

Why

The instinctive design is a deterministic markdown-to-HTML transformer (remark + custom renderer + templates). This is wrong for the spec-review use case for three compounding reasons:

Deterministic templates cannot produce the visual restructuring this workflow requires. The reference page at docs/superpowers/specs/2026-05-22-fieldforce-phase4-briefings-design.html reshapes the source markdown: foregrounds decisions, summarizes prose into key-value blocks, picks which sections deserve callouts, infers which content benefits from a comparison grid vs. a timeline vs. a flow diagram. A template can only render the structures the markdown already exposes. An agentic LLM picks the right structure per section.
The work fits agentic CLIs better than completion APIs. claude / codex / gemini are designed to read inputs, reason about them, and call tools — including writing files. Asking them to emit a 50KB JSON blob to stdout fights their grain: stdout interleaves thinking, tool calls, and prose, and disciplined JSON output requires brittle prompting. Letting the agent write files directly matches the tools’ shape.
Validation is clearer post-hoc than mid-stream. With agent-writes-file, the script’s verification is a tight loop: existsSync(path) && schema.parse(JSON.parse(readFileSync(path, 'utf8'))). A failure points at exactly one artifact. Stdout-parsing requires extracting fenced blocks from agent output that may include planning text, recoverable errors, and mid-process tool calls, which makes it much harder to fail cleanly.
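
The post-hoc check in the third point can be sketched as a single function. This is a minimal sketch under assumptions: loadValidated and the Parser interface are hypothetical names. Parser only captures the parse() method that Zod schemas expose, so the sketch does not import Zod itself.

```typescript
import { existsSync, readFileSync } from "node:fs";

/** Minimal shape shared by Zod schemas: parse() returns the typed value or throws. */
interface Parser<T> {
  parse(input: unknown): T;
}

/**
 * Reads the file the agent was told to write and validates it.
 * Every failure mode (missing file, invalid JSON, schema violation)
 * surfaces as a thrown error tied to this one path.
 */
export function loadValidated<T>(jsonPath: string, schema: Parser<T>): T {
  if (!existsSync(jsonPath)) {
    throw new Error(`Agent did not write expected file: ${jsonPath}`);
  }
  const raw: unknown = JSON.parse(readFileSync(jsonPath, "utf8"));
  return schema.parse(raw);
}
```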

Rejected alternatives:

Deterministic Node/TS script (remark + custom renderer + templates). Cheap, reproducible, no token cost, no API key. Rejected: cannot produce the visual restructuring the §1 goal requires — only renders patterns the markdown already exposes.
Hybrid: deterministic shell + LLM body enrichment. The shell handles archive/search/TOC/layout deterministically; the LLM enriches the body. This is essentially the chosen design (the layout supplies chrome; the agent authors only body_html), but with the LLM call inside a Node script that calls an SDK rather than shelling out to an agentic CLI. Rejected because the agentic CLIs already handle file writing, retries, and tool orchestration — replicating that scaffolding in-script is duplication.
Agent emits JSON to stdout; script parses. Rejected for the stdout-discipline problem above. JSON in tool output is reliable; JSON in agentic CLI stdout is not.
Single hardcoded CLI. Rejected because we don’t yet know which CLI produces the best output for this content. Pluggable lets us A/B without code changes.
Build-time generation as an Astro integration. Rejected because (a) specs are created a few times per week, not on every build, and burning tokens on every astro build is wasteful; (b) build-time non-determinism is a footgun for CI.

How it works

human edits docs/superpowers/specs/2026-05-25-foo.md
   │
   ▼
pnpm --filter @astro/plans generate docs/superpowers/specs/2026-05-25-foo.md
   │
   ▼ orchestration script
   │   1. Read source markdown
   │   2. Compute SHA-256 of source bytes
   │   3. If <slug>.json exists and source_hash matches → skip (unless --force)
   │   4. Render Zod schema → TypeScript type definition string
   │   5. Compose prompt from prompts/generate-plan.md with {{VAR}} substitution
   │   6. spawn CLI: `claude -p "<prompt>"` (or codex/gemini equivalent)
   │
   ▼ agent
   │   - Reads no extra files (everything inlined in prompt)
   │   - Writes <tmp>/specs/<slug>.json + <tmp>/specs/<slug>.html
   │   - Writes diagram images to <tmp>/spec-images/<slug>/ if applicable
   │   - Exits
   │
   ▼ script post-validation
   │   1. readFileSync(<tmp>/specs/<slug>.json) → JSON.parse → Zod.parse
   │   2. Verify sibling .html file exists
   │   3. Stamp script-produced fields onto the JSON
   │   4. Atomic rename: <tmp>/* → apps/plans/src/content/specs/*
   │   5. Rebuild search-index.json from full collection
   │
   ▼ skill (in conversation) surfaces:
       - "Generated <slug>"
       - "N suggested components" (if any)
       - Path to view in browser

The script lives at frontend/astro/apps/plans/scripts/generate-plan-html.ts and is invoked via pnpm --filter @astro/plans generate .... The prompt template lives at frontend/astro/apps/plans/prompts/generate-plan.md.
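
Step 5 of the flow, {{VAR}} substitution, might be implemented along these lines. This is a sketch, not the real composePrompt: the signature is an assumption, and placeholder names such as SOURCE_MARKDOWN are hypothetical because the ADR only states the {{VAR}} convention.

```typescript
/**
 * Replaces every {{NAME}} placeholder in the template with vars[NAME].
 * Throws on placeholders without a value so a typo in the template
 * cannot silently reach the agent.
 */
export function composePrompt(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{([A-Z0-9_]+)\}\}/g, (_match, name: string) => {
    const value = vars[name];
    if (value === undefined) {
      throw new Error(`No value for prompt placeholder {{${name}}}`);
    }
    return value;
  });
}
```

Using a replacer function rather than a replacement string keeps sequences like $& in the source markdown literal, and the single pass means placeholder-like text inside an inlined spec is not expanded a second time.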

Known limitations

Non-determinism across runs. Two runs over the same markdown can produce different HTML. This is acceptable because the workflow is human-review-then-commit: the human sees the result before it ships. For specs where the output diverges in undesirable ways, the remedy is to tighten the prompt, not to make generation deterministic.
Token cost. Each generation re-renders the entire body. A 46KB markdown spec like Phase 4 briefings produces a multi-thousand-token agent run. Hash-based skip mitigates re-runs; --force --all is the expensive operation and exists for deliberate use only.
CLI quirks. claude -p, codex exec, and gemini have different prompt invocation conventions, different auth flows, and different rate-limit behavior. The orchestration script isolates these in one CLI-dispatch function so they don’t leak into the prompt template.
Validation failures are workflow blocks. If the agent produces invalid JSON, the script aborts and no files are promoted. The human must re-run (potentially with a different CLI, or with --force) to recover. This is intentional — a half-broken artifact in src/content/specs/ is worse than no artifact.
No streaming output to the reviewer. The reviewer sees the page only after generation completes. For very large specs this can be 30–60 seconds of silence. Acceptable for a few-per-week workflow.
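
The single CLI-dispatch function mentioned under "CLI quirks" could be shaped like the sketch below. Names and signatures are hypothetical. The claude -p and codex exec forms appear in this ADR; the gemini arguments are an assumption that should be checked against that CLI's own help output.

```typescript
import { spawn } from "node:child_process";

export type Cli = "claude" | "codex" | "gemini";

/** Maps a CLI name to the argv for a one-shot, non-interactive run. */
export function buildCommand(cli: Cli, prompt: string): { cmd: string; args: string[] } {
  switch (cli) {
    case "claude":
      return { cmd: "claude", args: ["-p", prompt] };
    case "codex":
      return { cmd: "codex", args: ["exec", prompt] };
    case "gemini":
      // Assumption: the ADR does not state gemini's flag; verify before relying on it.
      return { cmd: "gemini", args: ["-p", prompt] };
  }
}

/** Spawns the chosen CLI in cwd and resolves only on a zero exit code. */
export function runCli(cli: Cli, prompt: string, cwd: string): Promise<void> {
  const { cmd, args } = buildCommand(cli, prompt);
  return new Promise((resolve, reject) => {
    // No shell is involved, so the prompt needs no quoting or escaping.
    const child = spawn(cmd, args, { cwd, stdio: "inherit" });
    child.on("error", reject);
    child.on("close", (code) =>
      code === 0 ? resolve() : reject(new Error(`${cmd} exited with code ${code}`)),
    );
  });
}
```

Passing the prompt as one argv entry mirrors the claude -p "<prompt>" form shown in the flow. Very large prompts can exceed operating-system argument limits, in which case feeding the prompt over stdin would be the alternative, if the chosen CLI supports it.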

Rules for agents

The orchestration script MUST NOT call LLM provider SDKs directly. It only shells out to agentic CLIs.
The script MUST validate agent output with the shared Zod schema before promotion. No “trust the agent” path.
The script MUST use atomic rename (or equivalent) for promotion. Partial writes never reach the content collection.
The prompt template MUST inline source markdown, component catalog, and the Zod schema as a TypeScript type. The agent MUST NOT be expected to read auxiliary files itself — keep its inputs in the prompt for reproducibility and hashability.
The agent MUST NOT invent CSS classes, write <style> tags, or use inline style="" attributes. See ADR-0027 for the closed component vocabulary that constrains this.
The agent’s tool calls during generation are not audit-logged. The artifact is the artifact; provenance is generated_by + generated_at on the JSON.
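
The atomic-rename rule can be illustrated with Node's fs module. This is a sketch under assumptions: createStagingDir, atomicPromote, and demo are hypothetical names, and the staging directory is created next to the final directory so that both sit on the same filesystem, which rename needs in order to be atomic.

```typescript
import { mkdirSync, mkdtempSync, readdirSync, renameSync, rmdirSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

/** Creates a staging directory beside finalDir so renames stay on one filesystem. */
export function createStagingDir(finalDir: string): string {
  mkdirSync(finalDir, { recursive: true });
  return mkdtempSync(join(dirname(finalDir), ".staging-"));
}

/**
 * Moves every entry in stagingDir into finalDir via rename.
 * Each rename is atomic on its own; the set of files is not promoted
 * as one transaction. Across devices renameSync throws EXDEV.
 */
export function atomicPromote(stagingDir: string, finalDir: string): string[] {
  const promoted: string[] = [];
  for (const name of readdirSync(stagingDir)) {
    renameSync(join(stagingDir, name), join(finalDir, name));
    promoted.push(name);
  }
  rmdirSync(stagingDir); // empty after the renames
  return promoted;
}

/** Usage sketch against a throwaway directory; returns the promoted file names. */
export function demo(): string[] {
  const finalDir = join(mkdtempSync(join(tmpdir(), "plans-")), "specs");
  const staging = createStagingDir(finalDir);
  writeFileSync(join(staging, "foo.json"), "{}");
  writeFileSync(join(staging, "foo.html"), "<p>hi</p>");
  return atomicPromote(staging, finalDir).sort();
}
```

Because the JSON and HTML are renamed one after the other, a crash between the two renames could still leave a mismatched pair. Promoting a whole directory with a single rename would close that gap, at the cost of a different content layout.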

Bad pattern (do not generate)

// In-script deterministic enrichment:
const html = remark()
  .use(remarkParse)
  .use(remarkRehype)
  .use(rehypeStringify)
  .processSync(markdown);
fs.writeFileSync(outputPath, html); // wrong: no visual restructuring, no decision foregrounding

// Or: parse agent stdout for JSON:
const out = execSync('claude -p "...generate..."').toString();
const jsonBlock = out.match(/```json\n([\s\S]+?)\n```/)?.[1]; // wrong: stdout discipline is fragile
const data = JSON.parse(jsonBlock!);

Good pattern

import { existsSync, readFileSync, writeFileSync } from 'node:fs';
import path from 'node:path';

const { tmpDir } = await prepareTempWorkspace(slug);
const outputJsonPath = path.join(tmpDir, `${slug}.json`);
const outputHtmlPath = path.join(tmpDir, `${slug}.html`);
const prompt = composePrompt({
  template: readFileSync(PROMPT_PATH, 'utf8'),
  sourceMarkdown: readFileSync(specPath, 'utf8'),
  catalog: readFileSync(CATALOG_PATH, 'utf8'),
  schemaAsTypescript: renderZodAsType(SpecSchema),
  outputJsonPath,
  outputHtmlPath,
  diagramMode,
  imagesEnabled,
});
await runCli(cli, prompt); // spawns claude/codex/gemini; agent writes files directly

const raw = JSON.parse(readFileSync(outputJsonPath, 'utf8')); // throws if the file is missing or not JSON
const validated = SpecSchema.parse(raw); // throws on schema violation → no promotion
if (!existsSync(outputHtmlPath)) {
  throw new Error(`Missing sibling HTML: ${outputHtmlPath}`); // abort before anything is promoted
}

const finalJson = stampScriptFields(validated, { sourceHash, generatedBy: cli, /* … */ });
writeFileSync(outputJsonPath, JSON.stringify(finalJson, null, 2)); // persist the stamped fields before promotion

await atomicPromote(tmpDir, FINAL_DIR); // rename, never partial write
await rebuildSearchIndex();
