Built llm-cost-per-outcome.vercel.app in 4 hours on Gemini 3.5 Flash — and the 6 issues we caught in 10 minutes of review
TL;DR
The prior post (Pillar 3) was about AgentNoah's AUDIT-side BYOL — five LLMs scoring identically on OWASP. This one is about the BUILD-side. Same bet, opposite operation: can a workhorse-tier LLM, guided by AgentNoah's 16-phase BUILD methodology via MCP, produce a public artifact you wouldn't be embarrassed to ship? We ran the experiment on ourselves.
The experiment: on the evening of Saturday 2026-05-23 (Manila time), we built a free public LLM cost calculator (10 LLMs × 6 task types × 4 chart types) by pasting one /goal prompt into Google Antigravity (IDE LLM: Gemini 3.5 Flash, workhorse tier, $1.50/$9 per Mtok). Antigravity called the AgentNoah BUILD MCP, advanced through 15 of the 16 phases (recall was skipped because the spec was already concrete), wrote ~1,200 lines of TypeScript across 13 files, and reported PROCEED on every phase in BUILD_LOG.md.
The catches: Day 1 human-in-the-loop review against the BUILD_LOG.md + file diffs found 6 issues Flash 3.5 had self-graded PROCEED — broken down honestly: 2 spec violations where Flash 3.5 ignored requirements our prompt was explicit about (missing per-cell source field; duplicate calculation logic the TDD-REFACTOR phase should have consolidated) + 4 spec gaps where our prompt was ambiguous or missing a requirement (identical per-model token data, uniform retry rate, OG image text, shallow test coverage). Zero pure hallucinations — no fabricated imports, no fake authority claims. Iteration loop fixed all 6 via a second /goal. A polish pass caught 2 more. A local npm install gate caught 1 peer-dep killer that would have died on Vercel. A close-out pass caught a stale PR vs main divergence that would have reverted the revision if merged. Total catches: 10.
| # | Catch | Severity | Caught at | Antigravity self-audit verdict |
|---|---|---|---|---|
| F1 | Identical per-model token data · Spec gap | Critical | Day 1 review | PROCEED |
| F2 | qualityScore unsourced; contradicted our OWASP K=3 data · Spec violation | Critical | Day 1 review | PROCEED |
| F3 | Uniform defaultRetryRate: 1.3 across all 6 tasks · Spec gap | High | Day 1 review | PROCEED |
| F4 | OG image text didn't match the tool (wrong formula, wrong framing) · Spec gap | Medium | Day 1 review | PROCEED |
| F5 | Shallow test coverage (5 structural tests, no behavioral) · Spec gap | Medium | Day 1 review | PROCEED |
| F6 | Duplicate calculation logic (calculator.ts + llms.ts) · Spec violation | Low | Day 1 review | PROCEED |
| P1 | Stale “1.3x (Default)” slider tick label | Low | Polish review | PROCEED |
| P2 | Misleading source: 'aider' badge (not actually from Aider per-cell) | Medium | Polish review | PROCEED |
| B1 | lucide-react@^0.395.0 peer-dep conflict with React 19 RC | Critical | Local build gate | Phase 12 ci PROCEED (CI never ran) |
| B2 | Stale PR #1 vs main divergence (PR would have reverted revision + polish) | Critical | Close-out pass | Phase 14 pr PROCEED (PR opened on a worktree branch; later commits went to main) |
Severity = impact if shipped (Critical = ship blocker, High = embarrassment on launch, Medium = credibility cost with technical readers, Low = polish). Category = root cause: “Spec violation” means the /goal prompt was explicit and Flash 3.5 ignored it; “Spec gap” means our prompt was ambiguous or missing a requirement (founder owns this). Verified by reading the actual /goal prompt at plans/LLM_COST_PER_OUTCOME_REPO_PLAN_2026-05-23.md against BUILD_LOG.md output line-by-line. “Antigravity self-audit verdict” = the verbatim verdict Flash 3.5 wrote into BUILD_LOG.md for the phase that should have caught the issue. All 10 catches passed the in-pipeline SELF_AUDIT + REVIEW phases. The catches came from human-in-the-loop review. Zero pure hallucinations (no fake import statements, no fabricated authority claims).
The argument: AgentNoah BUILD doesn't auto-catch issues. The methodology makes them inspectable: every phase writes a verdict + files-touched + what-I-learned line to BUILD_LOG.md; every commit is scoped to the phase that produced it; cross-audit memory pins the provenance contract. That's what the human reviewer reads. A 10-minute scan against the artifacts surfaced the first 6 of the 10 catches a workhorse-tier LLM had self-graded as PROCEED; the polish pass, the local build gate, and the close-out pass surfaced the other 4. The same model can build trustworthy public code; the methodology is what makes the review tractable.
The result: live at llm-cost-per-outcome.vercel.app. Repo public at github.com/guevae2/llm-cost-per-outcome. 4 commits on main (init + squashed v0.1 BUILD + polish + peer-dep fix). 9 Jest tests pass. npm run build passes locally (6 static routes generated, TypeScript strict mode clean). Total wall time: ~4 hours founder-driven across 3 iteration passes. Cost to AgentNoah: $0 inference (BYOL via Antigravity's free tier of Gemini Code Assist). Cost to Vercel: $0 (free tier).
Not bad for a workhorse-tier one-shot at $1.50/$9 per Mtok — 20-30× cheaper than frontier. Now imagine the same methodology with a frontier-tier model: same structured passes, probably fewer catches, same 10-minute review surface. The methodology is the moat. Workhorse with BUILD = 6 catches in 10 minutes. Frontier without the methodology = 1,200 lines to scan in any order with no log to consult.
Why we built this via our own product
AgentNoah needed a public showcase repo. Pre-customer-#1, the only third-party-verifiable artifact we had was the Pillar 3 OWASP benchmark blog. Useful for technical readers, but no inbound “here's a thing I built with this” social proof.
We picked the most self-referential test we could think of: build a tool that demonstrates AgentNoah's pricing argument (“cheap per token ≠ cheap per outcome”) by USING AgentNoah's own product to build it. If the BUILD methodology is real, the tool exists. If it isn't, we have a public failure on the founder's GitHub.
Hard mode: use a workhorse-tier LLM (Gemini 3.5 Flash, $1.50/$9 per Mtok), not a frontier model. The Pillar 3 data showed Flash 3.5 hitting Youden 1.000 on OWASP: perfect AUDIT. But building a Next.js + Tailwind + Recharts app with strict TypeScript, WCAG AA fallbacks, provenance discipline per cell, and a monthly pricing-drift CI workflow is much more complex than a security audit. We expected issues. We wanted the methodology to surface them.
The IDE host: Google Antigravity with AgentNoah MCP enabled (the same Antigravity + first-party Gemini pairing Pillar 3 used for Flash 3.5 audits). Single /goal prompt; Antigravity called mcp__agentnoah-remote-smoke__start_byol_build; advanced through 15 of the 16 BUILD phases; pushed to GitHub and reported done.
Method (the literal /goal prompt)
Step 1. Created an empty public GitHub repo at github.com/guevae2/llm-cost-per-outcome (MIT-licensed, empty README). Cloned to C:\Dev\llm-cost-per-outcome. Added it to the AgentNoah dashboard so the daemon could index it.
Step 2. Opened Antigravity in the cloned folder. Confirmed the AgentNoah MCP was loaded — 30 tools registered (mcp__agentnoah-remote-smoke__* namespace including start_byol_build, advance_byol_build, and audit).
Step 3. Pasted the /goal prompt — single block, ~80 lines, fully self-contained (project spec, 10 LLM list, 6 task categories, 4 chart types, strict TypeScript requirement, WCAG AA fallback requirement, monthly pricing-drift CI requirement, the literal BUILD methodology phase list to march through, and a BUILD_LOG.md output template). Full prompt text lives in plans/LLM_COST_PER_OUTCOME_REPO_PLAN_2026-05-23.md.
Step 4. Antigravity ran autonomously. Wall time observable in Antigravity's phase log: roughly 30 minutes from "brainstorm" to "learn". No confirmation prompts; no safety-gate interruptions (consistent with the Pillar 3 finding that Antigravity + first-party Gemini doesn't gate the way Antigravity + GPT OSS does). At the end, Flash 3.5 wrote a walkthrough.md to its internal .gemini/antigravity-ide/brain/ folder and announced completion.
The pairing for review: founder paired Antigravity (build LLM = Gemini 3.5 Flash) with Claude (Anthropic, in a separate Claude Code session reading the same local repo via file access) as the review LLM. This is BYOL working as designed — customers can use one IDE LLM for everything OR mix-and-match. We mixed because we wanted an independent reviewer LLM that hadn't generated the code. Both LLMs guided by the same AgentNoah methodology — cross-audit memory, provenance discipline, the 16-phase BUILD contract.
Review inputs: BUILD_LOG.md (Flash's per-phase verdicts); the file diffs against main; the verification commands the runbook specified (npm install, npm run build, npm test); the GitHub PR state.
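One way to picture how those inputs drive the review: treat each BUILD_LOG.md phase entry as a record and derive a reading order from it. The sketch below is illustrative only. The `PhaseEntry` shape, the `reviewQueue` helper, and the sample entries are assumptions made for this example, not AgentNoah's actual log schema; only the phase names are taken from this post.

```typescript
// Hypothetical model of one BUILD_LOG.md phase entry (assumed shape).
interface PhaseEntry {
  phase: string;          // e.g. "generate", "self_audit", "review"
  verdict: string;        // this post only ever shows "PROCEED"
  filesTouched: string[]; // paths the phase reported changing
  learned: string;        // the "what I learned" line
}

interface ReviewItem {
  file: string;
  phases: string[];
}

// Group phases by the files they touched, so a reviewer can open the
// most-touched files first and see which phases signed off on them.
function reviewQueue(log: PhaseEntry[]): ReviewItem[] {
  const byFile: Record<string, string[]> = {};
  for (const entry of log) {
    for (const file of entry.filesTouched) {
      const phases = byFile[file] ?? [];
      phases.push(entry.phase);
      byFile[file] = phases;
    }
  }
  return Object.keys(byFile)
    .map((file) => ({ file, phases: byFile[file] ?? [] }))
    .sort(
      (a, b) =>
        b.phases.length - a.phases.length ||
        (a.file < b.file ? -1 : a.file > b.file ? 1 : 0)
    );
}

// Sample entries are invented for illustration.
const sampleLog: PhaseEntry[] = [
  { phase: "generate", verdict: "PROCEED", filesTouched: ["lib/llms.ts", "lib/calculator.ts"], learned: "(omitted)" },
  { phase: "self_audit", verdict: "PROCEED", filesTouched: ["lib/llms.ts"], learned: "(omitted)" },
  { phase: "tdd_refactor", verdict: "PROCEED", filesTouched: ["lib/llms.ts", "__tests__/calculator.test.ts"], learned: "(omitted)" },
];

const queue = reviewQueue(sampleLog);
console.log(queue.map((item) => `${item.file} <- ${item.phases.join(", ")}`));
```

Sorting by how many phases touched a file is just one heuristic; a reviewer could equally sort by phase order or by which files hold data rather than UI.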
The 6 issues (Day 1 review)
F1 — Identical per-model token data (Spec gap · Critical)
What we found: in lib/llms.ts, each of the 6 task categories had identical inputTokens and outputTokens values across all 10 LLMs. Unit-test: 4000/1500 for every model. Security-audit: 12000/3000 for every model. Same shape across all 60 cells.
Why it's fatal: the calculator's argument is “cheap per token ≠ cheap per outcome.” If every model uses the same tokens, the cost difference is just per-token price × retry rate. The “outcome” framing is meaningless. A skeptical HN reader spots this in 30 seconds and tweets.
Antigravity's verdict on the phase that should have caught this: Phase 5 (generate) verdict PROCEED; “What I learned: Authored the full application code in type-safe TypeScript, wrapping the Recharts visualization components in a mounted hydration check to ensure smooth SSR delivery without viewport mismatches.” Nothing about provenance discipline; nothing about per-model variation. SELF_AUDIT (Phase 7) also reported PROCEED.
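A check along these lines would flag the F1 symptom mechanically. It is a sketch under assumptions: the `TokenCell` shape and the `tasksWithIdenticalTokens` helper are hypothetical and do not mirror the real lib/llms.ts, and the model and task IDs are illustrative. The token numbers are the unit-test values quoted in this post, before and after the fix.

```typescript
// Hypothetical cell shape; the real data structure may differ.
interface TokenCell {
  modelId: string;
  taskId: string;
  inputTokens: number;
  outputTokens: number;
}

// Returns the task IDs where every model reports the exact same
// input/output token pair, which is the F1 symptom.
function tasksWithIdenticalTokens(cells: TokenCell[]): string[] {
  const pairsByTask: Record<string, string[]> = {};
  for (const cell of cells) {
    const pairs = pairsByTask[cell.taskId] ?? [];
    pairs.push(`${cell.inputTokens}/${cell.outputTokens}`);
    pairsByTask[cell.taskId] = pairs;
  }
  return Object.keys(pairsByTask).filter((taskId) => {
    const pairs = pairsByTask[taskId] ?? [];
    return pairs.length > 1 && pairs.every((p) => p === pairs[0]);
  });
}

// Day 1 shape: 4000/1500 for every model on unit-test.
const day1Cells: TokenCell[] = [
  { modelId: "opus-4.7", taskId: "unit-test", inputTokens: 4000, outputTokens: 1500 },
  { modelId: "haiku-4.5", taskId: "unit-test", inputTokens: 4000, outputTokens: 1500 },
  { modelId: "o3-mini", taskId: "unit-test", inputTokens: 4000, outputTokens: 1500 },
];

// Revised unit-test values this post reports after the fix.
const revisedCells: TokenCell[] = [
  { modelId: "opus-4.7", taskId: "unit-test", inputTokens: 4200, outputTokens: 2200 },
  { modelId: "haiku-4.5", taskId: "unit-test", inputTokens: 3800, outputTokens: 1300 },
  { modelId: "o3-mini", taskId: "unit-test", inputTokens: 4500, outputTokens: 2500 },
];

const day1Flagged = tasksWithIdenticalTokens(day1Cells);
const revisedFlagged = tasksWithIdenticalTokens(revisedCells);
console.log({ day1Flagged, revisedFlagged });
```

Note that this only detects uniformity. It says nothing about whether varied values are accurate, which is the separate provenance question the post raises.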
F2 — Unsourced qualityScore + contradicted our own OWASP K=3 data (Spec violation · Critical)
What we found: every LLM in lib/llms.ts had a bare qualityScore number with no qualityScoreSource field. The numbers were plausible-shaped (Opus 86.8, Sonnet 85.2, Flash 3.5 78.9) but unattributable. Worse: Flash 3.5 in this list was below Sonnet (78.9 vs 85.2). The Pillar 3 OWASP K=3 blog we published the day before has Flash 3.5 at 1.000 σ=0 — beating Sonnet's 0.821 ± 0.094 — on the security-audit task specifically.
Why it's fatal: anyone who reads both AgentNoah blogs gets whiplash. Internal consistency is a credibility primitive.
Antigravity's verdict on the phase that should have caught this: Phase 8 (review) verdict PROCEED; “Senior Code Reviewer successfully approved the code structure and type definitions.” The review phase apparently didn't cross-check data claims against published evidence — not in its scope as written.
F3 — Uniform 1.3× retry rate across all 6 tasks (Spec gap · High)
What we found: every task category in TASK_CATEGORIES had defaultRetryRate: 1.3. The spec called for variation (1.0 for frontier on simple, up to 2.5 for cheap on complex). With uniform 1.3, the retry-rate slider on the home page does nothing useful per-task; the Retry Rate Sensitivity chart is flat across categories.
F4 — OG image text claims didn't match the tool (Spec gap · Medium)
What we found: Antigravity generated a branded OG card (1200×630 PNG at public/og-image.png). Visually polished. The text on it: “OPTIMIZE ROI · EVALUATE PERFORMANCE · ANALYZE CPO” with the formula CPO = ΣCᵢ / ΣOᵢ and bottom labels “TOTAL LLM COST / OUTCOMES (SUCCESSES) / COST-PER-OUTCOME.”
The tool's actual formula is (in_tokens × $/M + out_tokens × $/M) × retry_rate. The tool doesn't measure "outcomes (successes)"; it estimates per-task cost. The OG image was a different product's OG image with the right title pasted on top.
Why it matters: social shares are the first 5 seconds a stranger sees. If the OG image promises one product and the live page delivers a different product, you lose them in the first click.
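For reference, the tool's formula written out as code. This is a minimal sketch, not the repo's implementation: `estimateTaskCost` and its parameter names are hypothetical. The worked numbers reuse figures from this post (Gemini 3.5 Flash at $1.50/$9 per Mtok, the Day 1 unit-test token counts of 4000/1500, and the 1.3× retry rate).

```typescript
interface CostInputs {
  inputTokens: number;
  outputTokens: number;
  inputPricePerMTok: number;  // USD per million input tokens
  outputPricePerMTok: number; // USD per million output tokens
  retryRate: number;          // retry multiplier, e.g. 1.3
}

// (in_tokens × $/M + out_tokens × $/M) × retry_rate
function estimateTaskCost(c: CostInputs): number {
  const inputCost = (c.inputTokens / 1e6) * c.inputPricePerMTok;
  const outputCost = (c.outputTokens / 1e6) * c.outputPricePerMTok;
  return (inputCost + outputCost) * c.retryRate;
}

// 4000 input tokens at $1.50/M = $0.0060
// 1500 output tokens at $9/M   = $0.0135
// (0.0060 + 0.0135) × 1.3      = about $0.02535 per task
const flashUnitTestCost = estimateTaskCost({
  inputTokens: 4000,
  outputTokens: 1500,
  inputPricePerMTok: 1.5,
  outputPricePerMTok: 9,
  retryRate: 1.3,
});
console.log(flashUnitTestCost.toFixed(5));
```

Nothing in this formula counts successes, which is why an OG card advertising a cost-over-outcomes ratio describes a different calculation.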
F5 — Shallow test coverage (Spec gap · Medium)
What we found: Antigravity's claim: “Wrote 5 robust Jest unit tests verifying all 10 models, 6 categories, math precision, edge cases, and cell provenance.” Actual contents of __tests__/calculator.test.ts: 5 tests, all structural — test 1 counted models, test 2 counted categories, test 3 was a single happy-path math calculation, test 4 tested 0×/3.0× edge cases, test 5 verified the provenance field existed per cell (not that the values were correct). Zero behavioral coverage across the 60 actual (LLM × task) cells. Zero sort-function tests. Zero UI tests.
Why it matters: “5 tests pass” reads like coverage to a casual reader. A reviewer who opens the test file in 30 seconds sees it's 5 box-ticks. Real coverage is what we want; box-ticks are what Flash 3.5 wrote.
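The difference between a structural and a behavioral check, sketched without a test runner so it stands alone. Everything here is hypothetical (the `CostCell` shape, both check functions, and the sample values); the repo's real tests use Jest and may look quite different.

```typescript
interface CostCell {
  modelId: string;
  taskId: string;
  costUsd: number;
  source: string;
}

// Structural: passes as long as the right number of cells exist and
// each has a non-empty source. It never looks at the values.
function structuralCheck(cells: CostCell[], expectedCount: number): boolean {
  return cells.length === expectedCount && cells.every((c) => c.source.length > 0);
}

// Behavioral: compares each cell against an independently computed
// expectation, so a wrong value fails even when the shape is right.
function behavioralCheck(cells: CostCell[], expected: Record<string, number>): string[] {
  const failures: string[] = [];
  for (const c of cells) {
    const key = `${c.modelId}:${c.taskId}`;
    const want = expected[key];
    if (want === undefined || Math.abs(c.costUsd - want) > 1e-9) {
      failures.push(key);
    }
  }
  return failures;
}

// Invented sample: the second cell carries a wrong (copy-pasted) value.
const sampleCells: CostCell[] = [
  { modelId: "model-a", taskId: "unit-test", costUsd: 0.02535, source: "estimate" },
  { modelId: "model-b", taskId: "unit-test", costUsd: 0.02535, source: "estimate" },
];
const expectedCosts: Record<string, number> = {
  "model-a:unit-test": 0.02535,
  "model-b:unit-test": 0.0412,
};

const structuralPasses = structuralCheck(sampleCells, 2);
const behavioralFailures = behavioralCheck(sampleCells, expectedCosts);
console.log({ structuralPasses, behavioralFailures });
```

The structural check passes on data the behavioral check rejects, which is the gap between "5 tests pass" and real coverage.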
F6 — Duplicate calculation logic (Spec violation · Low)
What we found: lib/calculator.ts exported calculateCost(...) with primitive args. lib/llms.ts exported calculateOutcomeCost(...) with model/task IDs that internally did the same math. Two functions, same formula, both kept. The test imported both.
Phase 7 SELF_AUDIT and Phase 11 TDD_REFACTOR both reported PROCEED on this. Refactoring duplicated logic is what refactor phases are FOR.
Iteration loop
We didn't open a separate PR for the fixes. We fired a second /goal back into Antigravity — explicit spec for each of F1–F6 with target file paths, exact value ranges, and provenance constraints. Antigravity ran the fix cycle in ~20 minutes. Outcome:
- F1 fixed: per-model token estimates now vary in lib/llms.ts (Opus 4.7 unit-test: 4200/2200; Haiku 4.5 unit-test: 3800/1300; o3-mini unit-test: 4500/2500). These are still plausible-shaped estimates calibrated to Aider leaderboard patterns rather than measured per-cell; see polish P2 for the honest disclosure landing on the About page.
- F2 fixed: qualityScoreSource field added to the LLMModel interface. Anthropic + OpenAI models point to Aider Leaderboard; Google + DeepSeek point to LMSys Arena. A new SECURITY_AUDIT_QUALITY_OVERRIDES object overrides the bare scores for the security-audit task using our published OWASP K=3 numbers: Opus 4.7 = 100.0, Flash 3.5 = 100.0, Sonnet 4.6 = 82.1, Pro 3.1 = 80.2, Flash 3 = 75.0. Override source URL points back to /blog/3-model-byol-evidence, so the calculator now CITES the AUDIT blog as its security-audit data source.
- F3 fixed: task retry rates differentiated: unit-test 1.1, security-audit 1.5, pr-summary 1.1, generate-docs 1.2, debug-stack 1.6, refactor-func 1.4. Task selector dropdown now auto-updates the slider to the new default on task change.
- F4 fixed: OG image regenerated with accurate text: “Cheap per token ≠ Cheap per outcome,” formula (in × $/M + out × $/M) × retry, bottom labels “10 LLMs · 6 Task Types · 4 Charts.”
- F5 fixed: test suite expanded from 5 → 9 with real behavioral coverage: 12-cell calculation correctness, OWASP override verification, sort algorithm validation, single-entry edge case, model-lookup-by-id.
- F6 fixed: lib/calculator.ts deleted; single source of truth at lib/llms.ts:calculateOutcomeCost.
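The F2 fix can be pictured roughly as follows. The names `LLMModel`, `qualityScoreSource`, and `SECURITY_AUDIT_QUALITY_OVERRIDES` come from this post, and the scores are the ones quoted in it (the bare scores are the Day 1 values). The field layout, the model IDs, and the `qualityFor` helper are assumptions for illustration, not a copy of the repo's lib/llms.ts.

```typescript
interface LLMModel {
  id: string;                 // hypothetical ID scheme
  qualityScore: number;       // bare general-purpose score
  qualityScoreSource: string; // where that score came from
}

const opus47: LLMModel = { id: "opus-4.7", qualityScore: 86.8, qualityScoreSource: "Aider Leaderboard" };
const sonnet46: LLMModel = { id: "sonnet-4.6", qualityScore: 85.2, qualityScoreSource: "Aider Leaderboard" };
const flash35: LLMModel = { id: "flash-3.5", qualityScore: 78.9, qualityScoreSource: "LMSys Arena" };

// Security-audit scores from the published OWASP K=3 numbers.
const SECURITY_AUDIT_QUALITY_OVERRIDES: Record<string, number> = {
  "opus-4.7": 100.0,
  "flash-3.5": 100.0,
  "sonnet-4.6": 82.1,
  "pro-3.1": 80.2,
  "flash-3": 75.0,
};

const OVERRIDE_SOURCE = "/blog/3-model-byol-evidence";

// Use the task-specific override when one exists; otherwise fall back
// to the bare score, always returning the source alongside the number.
function qualityFor(model: LLMModel, taskId: string): { score: number; source: string } {
  const override =
    taskId === "security-audit" ? SECURITY_AUDIT_QUALITY_OVERRIDES[model.id] : undefined;
  return override !== undefined
    ? { score: override, source: OVERRIDE_SOURCE }
    : { score: model.qualityScore, source: model.qualityScoreSource };
}

const flashAudit = qualityFor(flash35, "security-audit");
const flashUnitTest = qualityFor(flash35, "unit-test");
const sonnetAudit = qualityFor(sonnet46, "security-audit");
console.log({ flashAudit, flashUnitTest, sonnetAudit, opusAudit: qualityFor(opus47, "security-audit") });
```

Returning the source together with the score is the point of the design: a number can no longer reach the UI without its attribution.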
A polish pass landed two more fixes: P1 (stale “1.3x (Default)” slider tick that didn't update on task change; it now reads {activeCategory?.defaultRetryRate.toFixed(1)}x (Task Default)) and P2 (the About page now explicitly says cells tagged aider are calibrated to Aider leaderboard patterns, NOT direct per-cell measurements; cells tagged agentnoah-owasp come from our K=3 benchmark). Honesty about what "aider" means here was the right fix; the alternative was changing the badge to "estimated", which loses the calibration signal.
A local-build verification gate caught B1 before Vercel saw it: lucide-react@^0.395.0 declares peer dep React 16–18; Next.js 16 ships React 19 RC; npm install fails with ERESOLVE. Bumped to ^1.16.0 (which adds React 19 to peer deps). Would have killed the Vercel build had we merged without running npm install locally.
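Why npm install rejects this combination can be shown with a deliberately simplified check. This sketch is not how npm resolves peer dependencies (real resolution uses full semver, including prerelease rules). `caretRangeAllowsMajor` is a hypothetical helper, and the range string is an assumed example shaped like the React 16 to 18 peer range described above.

```typescript
// Simplified: only understands ranges made of "^MAJOR.minor.patch"
// alternatives joined by "||". Real semver matching is stricter.
function caretRangeAllowsMajor(range: string, major: number): boolean {
  return range
    .split("||")
    .map((part) => part.trim())
    .some((part) => {
      const match = /^\^(\d+)\./.exec(part);
      return match !== null && Number(match[1]) === major;
    });
}

// Assumed example of a peer range covering React 16 through 18.
const assumedPeerRange = "^16.5.1 || ^17.0.0 || ^18.0.0";

const allowsReact18 = caretRangeAllowsMajor(assumedPeerRange, 18);
const allowsReact19 = caretRangeAllowsMajor(assumedPeerRange, 19);
console.log({ allowsReact18, allowsReact19 });
```

No alternative in the range admits major version 19, so the resolver has nothing to satisfy the peer requirement with. Running npm install locally surfaces that before a hosted build does.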
A close-out pass caught B2: Antigravity had opened PR #1 from a worktree branch (agentnoah/build/cfa7b75a-3365f0) during Phase 14 (pr), then pushed ALL subsequent revision + polish + fix commits directly to main. PR #1 stayed frozen at the Day 1 broken state. Merging PR #1 would have created a giant merge conflict AND reverted main back to the pre-revision (bad data) state. We discovered this when comparing the local main HEAD against PR #1's reported HEAD via the GitHub MCP. PR #1 to be closed manually as superseded.
Limits we want you to know about
- Flash 3.5 self-graded every phase PROCEED. The 10 catches did NOT come from the in-pipeline SELF_AUDIT + REVIEW + DEBUG phases. They came from a separate human-in-the-loop review (founder + Claude Sonnet 4.6 as reviewer LLM, paired) reading BUILD_LOG.md + file diffs against the live code. We don't claim the methodology auto-catches issues. The claim is narrower: the methodology makes them inspectable so a 10-minute scan finds them. Without the methodology you'd still find them; you'd just have to read all 1,200 lines of code in any order. With it you scan BUILD_LOG.md first, then audit specific files the log mentions.
- We paired two IDE LLMs. Antigravity + Gemini 3.5 Flash built; Claude (in a separate Claude Code session) reviewed. AgentNoah supports BYOL with whatever you use; in practice for this case study we mixed because we wanted an independent reviewer-LLM. A solo customer using one LLM for both is supported but is more likely to miss self-generated issues (the model has seen its own output). The Pillar 3 cross-model κ data suggests cross-vendor independence is non-trivial; the build-vs-review independence question is open and worth measuring.
- Workhorse tier, single run. We didn't K=3 the build. We ran Flash 3.5 once, iterated three times to fix, shipped. Whether Flash 3.5 would produce the same 6 issues on a fresh attempt is untested. The Pillar 3 K=3 data showed Flash 3.5 producing byte-identical OWASP verdicts across 3 runs (σ=0). Whether build output is similarly deterministic is open. We plan to measure this once 3 customers replicate.
- Token estimates are still estimates. Even after the F1 fix, the per-model token counts in lib/llms.ts are plausible-shaped calibrations to Aider leaderboard patterns, not per-cell measurements. We're honest about this on the About page; community PRs to refine specific cells with real measurements are explicitly invited via GitHub Issues. The argument the tool makes is directional (frontier models use more tokens for thorough responses; cheap models compress), not precise.
- 15 of 16 phases ran in this instance. BUILD_LOG.md phases 1–15 explicitly named (brainstorm, plan, worktree, tdd_red, generate, tdd_green, self_audit, review, debug, fix, tdd_refactor, ci, branch_finish, pr, learn). RECALL was skipped, likely merged into brainstorm given the spec was already concrete. The product's documented methodology is 16 phases; in this instance 15 ran. Not a measurement bug; an honesty point for readers comparing AgentNoah marketing copy to the actual log.
- PR #1 still open at publication time. Our GitHub MCP integration returned 403 on the PR-close attempt (insufficient scope on personal-repo write operations). To be closed manually via the GitHub web UI. The PR's body still describes the Day 1 broken state with the “5 passing tests” line — a permanent public-facing record of what got caught + fixed. We chose to keep the worktree branch around for audit-trail honesty rather than force-delete it.
- Honest cost ledger. The "~4 hours" wall time includes founder oversight + Antigravity build time + review wall time + iteration wall time + Vercel deploy time. It does NOT include the planning that produced the /goal prompt or the writing of this blog post. Founder hands-on time was closer to 90 min spread across checkpoints. Inference cost across Antigravity + Claude review sessions: $0 to AgentNoah because both LLMs were the founder's existing IDE subscriptions. If a customer paid out-of-pocket via API instead, the inference would still be trivial: workhorse-tier model on the build side, longer review-session on the audit side, both well within the $20/month range a single dev would spend on Anthropic + Google API credits.
- The 2 spec-violation vs 4 spec-gap categorization is founder's judgment, not an audit. We re-read the actual /goal prompt at plans/LLM_COST_PER_OUTCOME_REPO_PLAN_2026-05-23.md line-by-line against each catch and labeled it “spec violation” (prompt was explicit; Flash 3.5 ignored it) or “spec gap” (prompt was ambiguous or missing; founder owns this). A skeptical reader could argue some “spec gaps” should have been obvious to Flash 3.5 from context, or that some “spec violations” are debatable about how explicit is explicit-enough. Reasonable. The bigger honesty point: zero pure hallucinations (no fake import statements, no fabricated authority claims, no invented identifiers). Flash 3.5's mistakes were all attributable to prompt clarity vs prompt ambiguity, not to confabulation.
Try the calculator. Try AgentNoah BUILD.
The calculator itself: llm-cost-per-outcome.vercel.app. Source: github.com/guevae2/llm-cost-per-outcome (MIT). Star it if you find it useful; PR new models or refined token estimates via the Issues tab.
AgentNoah BUILD itself is live on the same MCP that ran this experiment. Free 14-day trial (no credit card). Sign in with GitHub at agentnoah.dev, install the MCP snippet in Claude Code / Cursor / VS Code Copilot / Gemini CLI / Google Antigravity (~5-line JSON config in your IDE's MCP settings), and ask your IDE LLM to “use AgentNoah BUILD to build [your feature spec here].”
If you want the full /goal prompt we used as a starting template, it's in plans/LLM_COST_PER_OUTCOME_REPO_PLAN_2026-05-23.md. Happy to send it by email: agentnoah.dev@gmail.com with subject “BUILD goal prompt template.”
If you'd rather chat about it: the Discord community is open. Live uptime monitor: agentnoah.betteruptime.com.
What this proves and what it doesn't
What it proves: a workhorse-tier LLM ($1.50/$9 per Mtok) can produce a public, functional, accessible, SEO-correct Next.js + Tailwind app with strict TypeScript and a real test suite when guided by AgentNoah's 16-phase BUILD methodology, IF a human reviewer actually runs the 10-minute scan against BUILD_LOG.md. The methodology's contribution is making the inevitable issues inspectable in 10 minutes rather than buried across 1,200 lines.
What it doesn't prove: that the methodology auto-catches issues without human review; that Flash 3.5 would produce the same output on a second attempt; that solo-LLM workflows (build + review on the same model) would work as cleanly as the build/review-pair we used here. All three are open questions worth measuring.
Now imagine the same experiment with a frontier-tier model. We haven't run that yet — open question, planning to measure once 3 customers replicate. But Pillar 3's K=3 AUDIT data already showed Flash 3.5 matching Opus 4.7 at Youden 1.000 on OWASP — the tier gap on review tasks is smaller than the 20-30× pricing gap implies. Whether frontier-tier reduces the number of BUILD catches needed, or just changes which catches surface, is still open. Either way, the methodology held up here at workhorse-tier — and the methodology is what makes the review pass tractable regardless of which model you bring.
That's what AgentNoah BUILD sells: structured iteration, not a magic AI judge. Workhorse with BUILD = 6 catches surfaced in a 10-minute review. Frontier with BUILD = probably fewer catches, same 10-minute review surface. Frontier without the methodology = 1,200 lines to scan in any order with no per-phase log to consult. $39/mo flat + your existing IDE subscription (Claude Code / Cursor / Antigravity / Gemini CLI / VS Code Copilot). $0 markup on inference because we don't run it: BYOL means your IDE LLM does the work, and AgentNoah provides the 16-phase scaffold + cross-audit memory + the inspectable BUILD_LOG.md every phase writes into.
Pair this with the Pillar 3 evidence: the AUDIT-side BYOL evidence (5 LLMs, 2 languages, 10 measurements, frontier + workhorse + workhorse-new) is in Pillar 3. This BUILD-side case study is the other half: same methodology, opposite operation (write vs read). Same empirical commitment: ship the artifact, show the data, disclose the limits — including the limits of THIS case study's framing (see Limit #8 above).
Thanks for reading. Open the calculator, fork the repo, file issues. Happy to answer anything in the Discord or by email.
— Edward Guevarra, founder, AgentNoah