I Spent a Day Rebuilding My Own Diagnostic. Here's the Build Log.

On a long enough timeline, every marketer becomes their own narrator.

I shipped a Diagnostic on the stoopidproof variant of the site and showed it around. The verdict screen quoted 93% precision and a 0% false positive rate, sitting next to the four-tier output. Neither number existed. I had not run a single labeled case through the scorer. I had not reserved a holdout. The fifth axis read a free-text answer with a regex match and assigned 0 or 100 with no middle, and the regex was tuned to the same examples that filled the placeholder marketing copy.

You can ship a Diagnostic like that. I did. The site looks confident. The stat carries the verdict. The work the buyer is reading was never done.

The Problem

This is the part nobody admits. The tool's confidence is the writer's confidence dressed in a font. I wrote a Diagnostic the way you write a deck: I picked the axes that sounded right, wrote the rule the way the demo case wanted to read, and put a stat on the verdict screen because the verdict screen needed a stat. The same way an agency picks the case study before they pick the framework. The demo got tuned in the same room the rule got tuned, which is exactly the conflict the calibration is supposed to remove.

The 93% and the 0% were not estimates. They were not even rounded. They came from a comment I had written six weeks earlier next to a stub function that returned the constant 1. The function got deleted. The comment got pasted into the marketing copy. The number stayed up next to the verdict for three weeks while I worked on other parts of the site, and I read the verdict screen at least a dozen times before I noticed.

That is the part to sit with. I was reading the screen and not seeing the lie. The lie was the kind I would catch on someone else's site in fifteen seconds. The site was mine, so the screen rendered as a verified claim and the verified claim got the trust the rest of the page borrowed from it.

The Guide

A diagnostic that posts a number is making a claim under the structure-function logic of FD&C Act §201(g), which the FDA enforces, under the FTC's Endorsement Guides at 16 CFR Part 255, and under the older common-law rule about a number stated as a fact. The number does not get a discount because it's about marketing instead of medicine. If the number is wrong, the page is wrong. If the number is fabricated, the page is fraudulent.

So the rebuild has to fix three things in this order. The numbers come from labeled cases. The cases are reserved before the weights are tuned. The meta-prompt that governs the rebuild is locked before the working agent reads any data.

The third one is the one nobody writes. Without it, the agent rebuilding the Diagnostic will reach into the holdout the same way the agency reaches into the case study, and the holdout will stop being a holdout the first time the agent scores below the acceptance bar. Calibration without a process lock is just tuning the rule until it likes the verdict. The 93% goes back up. The work is the same work.

The Plan: 7 Phases, One Meta-Prompt That Cannot Be Edited Mid-Run

I locked a 7-phase shape. Phase 0 is the meta-prompt that governs it. Phases 1 through 7 are the rebuild itself.

Research (Phase 1). Seven parallel research subagents read primary sources. Trout/Ries and Lochhead on positioning and king-category structure. Geoffrey Moore on chasm-stage. Sharp and Romaniuk on category entry points and brand-coding. Hormozi on offer-mechanic. Miller on internal-problem narrative. A regulated-DTC subagent reads FDA §201(g) and 21 CFR 700 and FTC 16 CFR 255. A seventh subagent tags the case database with each axis. One independent reviewer reads the same primary sources via web search and grades the seven without trusting their summaries.

Architecture (Phase 2). A six-section spec: axis list, weights, thresholds, the verdict-tier shape, the explain-don't-assume gloss list, and the refusal-conditions. One reviewer.

Wording audit (Phase 3). Each Diagnostic question gets drafted in four voices and read by five personas. The persona reads happen against the question, not the verdict. The voice that wins is the voice the five personas read the same way.

Calibration (Phase 4). The case database holds 43 cases. Eleven are reserved as a holdout: case_ids 3, 7, 11, 15, 19, 23, 27, 31, 35, 39, and 43. They were drawn with seed=42 against the §0.2 method written into the case database itself. The 32 training cases tune the weights. The holdout validates the tuned weights. Re-tuning against the holdout is not allowed. A compile-time check enforces it: a line that tries to use a holdout case_id as a training row sits under a @ts-expect-error directive, so the build fails the moment the types stop rejecting that line.
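One way a compile-time holdout guard of this kind could look, as a minimal sketch and not the project's actual calibration code. The eleven ids are the ones listed above. The names HOLDOUT_IDS, trainingRow, holdoutLeakCanary, and trainingIds are hypothetical, and so is the assumption that cases are numbered 1 through 43.

```typescript
// Holdout ids as listed in the post. Every name below is hypothetical.
const HOLDOUT_IDS = [3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43] as const;
type HoldoutId = (typeof HOLDOUT_IDS)[number];

interface TrainingRow {
  caseId: number;
}

function isHoldout(id: number): boolean {
  return HOLDOUT_IDS.some((h) => h === id);
}

// Compile-time guard: a literal holdout id resolves the parameter type to
// `never`, so the call does not type-check. The runtime check is a backstop
// for ids that arrive as plain `number` values (for example, parsed from a file).
function trainingRow<N extends number>(
  caseId: N extends HoldoutId ? never : N
): TrainingRow {
  const id = Number(caseId);
  if (isHoldout(id)) {
    throw new Error(`case_id ${id} is reserved for the holdout`);
  }
  return { caseId: id };
}

// Canary, never called at runtime. The directive below suppresses the expected
// type error. If the guard is ever loosened and the call starts compiling,
// the compiler reports an unused directive and the build fails.
function holdoutLeakCanary(): void {
  // @ts-expect-error case_id 7 is in the holdout and must not compile as a training row
  trainingRow(7);
}

// Training ids for a database numbered 1..total (the numbering is an assumption).
function trainingIds(total: number): number[] {
  const ids: number[] = [];
  for (let id = 1; id <= total; id++) {
    if (!isHoldout(id)) ids.push(id);
  }
  return ids;
}
```

The type-level guard only sees literal ids. An id read from disk is just a number to the compiler, which is why the sketch keeps the runtime check as well.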

Blue/red team (Phase 5). Six named red-team attacks. Regulatory false-positive, multi-axis conflict, demo-bias leakage, base-rate inversion, holdout contamination, and the version of the attack I ran on myself when I noticed the 93% number. The patch lands in the same phase. Convergence requires a clean run against every attack.
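The convergence rule can be sketched as a small check, assuming each attack reports a boolean where true means a clean run. The attack labels are shorthand for the six named above; the last one is a placeholder for the self-audit that caught the 93% number. RED_TEAM_ATTACKS, failingAttacks, and hasConverged are hypothetical names.

```typescript
// Labels are shorthand for the six attacks in the post; all names are hypothetical.
const RED_TEAM_ATTACKS = [
  "regulatory-false-positive",
  "multi-axis-conflict",
  "demo-bias-leakage",
  "base-rate-inversion",
  "holdout-contamination",
  "fabricated-stat-self-audit",
] as const;
type AttackName = (typeof RED_TEAM_ATTACKS)[number];

// true means the attack ran clean. A missing entry counts as a failure,
// so an attack that was never run cannot slip through.
type AttackResults = Partial<Record<AttackName, boolean>>;

function failingAttacks(results: AttackResults): AttackName[] {
  return RED_TEAM_ATTACKS.filter((attack) => results[attack] !== true);
}

function hasConverged(results: AttackResults): boolean {
  return failingAttacks(results).length === 0;
}
```

Treating a missing result as a failure is a design choice in this sketch: the gate should block when an attack was skipped, not only when it was lost.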

Build (Phase 6). Replace Diagnostic.tsx with the calibrated scorer. The scorer is pure TypeScript with no LLM at runtime. Phase 6 touches only Diagnostic.tsx plus the new scorer.ts in _lib/. A print-diff gate fails the phase if any other file moved.
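A print-diff gate of this kind can be sketched as a pure function over the list of changed paths, for instance the output of git diff --name-only split into lines. This is an illustration under assumptions, not the gate in the repo: the allowed entries are matched as path suffixes because the post does not give full paths, and PHASE6_ALLOWED, matchesAllowed, and unexpectedChanges are hypothetical names.

```typescript
// Files Phase 6 may touch, per the post. Full repo paths are not given,
// so this sketch matches on whole trailing path segments (an assumption).
const PHASE6_ALLOWED = ["Diagnostic.tsx", "_lib/scorer.ts"];

// True when `path` is exactly `allowed` or ends with "/" + allowed.
function matchesAllowed(path: string, allowed: string): boolean {
  const tail = "/" + allowed;
  return path === allowed || path.slice(-tail.length) === tail;
}

// Returns every changed path outside the allowed set.
// An empty result means the gate passes.
function unexpectedChanges(
  changedPaths: string[],
  allowed: string[] = PHASE6_ALLOWED
): string[] {
  return changedPaths.filter(
    (path) => !allowed.some((entry) => matchesAllowed(path, entry))
  );
}
```

Matching on whole path segments keeps a file like NotDiagnostic.tsx from passing just because its name ends with an allowed one.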

Present (Phase 7). A four-tier verdict page that names the axis weights, links the primary sources, and shows the holdout score. The Principal Advisor reviewer reads the verdict against the brief one more time and asks the six questions.

What Phase 0 Actually Produced

Phase 0 wrote one document: the meta-prompt that governs every subsequent phase. Two parallel reviewers read the v1 meta-prompt. The frameworks reviewer flagged seven amendments: the Trout/Ries citation split, the Moore citation split, the Play Bigger cohort caveat (the 2000-2015 US VC-tech window, which I had been quoting without the window), an explicit reference to FDA §201(g) and 21 CFR 700, a regulatory false-positive attack added to Phase 5, a multi-axis-conflict attack added to Phase 5, and a tightening of the Phase 1 reviewer brief. The prompt-engineering reviewer flagged eight: per-phase absolute paths in Step 0, the save-point format referenced per phase, Sinek and explain-don't-assume hardwired into Phase 3 and Phase 6, a Phase 6 print-diff gate, the Phase 3 reviewer reading Miller and Trout/Ries primary sources, the refusal-conditions across the phase dependency gates, em-dashes stripped from the meta-prompt prose, and the Phase 7 Principal Advisor stance.

Fifteen amendments. They landed in the v2 meta-prompt. The stack reviewer read the seven phase files and added three polish amendments: Phase 1's reviewer sequencing clarified, Phase 7's Star-folder mirror file list corrected, and Phase 4's @ts-expect-error named as the compile-time mechanism for holdout-leakage.

The locked artifacts: META-PROMPT.md v2, the seven phase files 01-research.md through 07-present.md, three reviewer outputs, a NOTES.md entry, and the save point. The 93%/0% stat is permanently dead across the program folder. The phrase "naming underwriting" appears with the verbatim SSOT §2A first-use definition pinned to it. The case_id holdout list appears in eight places across the stack, never in a tuning-context location.

What Could Still Break

The honest list.

The case database is mine. I labeled the 43 cases. The labels could be wrong in the same way the regex was wrong. The Phase 1 reviewer reads the cases against primary sources, but a reviewer that does not have the field experience that produced the original labels cannot grade them perfectly. The fix is to publish the case-DB with the labels and let outside readers find the mislabeled ones. That is a Phase 8 problem.

The 11-case holdout is small. A holdout of 11 against a 43-case database gives wide confidence intervals on the precision and recall numbers Phase 4 will produce. The number that goes on the verdict screen will say "11-case holdout, seed=42, precision X ±Y" with the interval shown. The interval will not look as clean as a fabricated 93%. That is the point.
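To make "wide" concrete, here is a minimal sketch of an interval for a proportion measured on a handful of cases. It uses the Wilson score interval, which is my choice for illustration and not something the Phase 4 spec names. It is asymmetric, so it does not reduce to a single ±Y. The function name wilsonInterval is hypothetical. For precision, the number of trials is the count of holdout cases the scorer flags positive, which is at most 11.

```typescript
// Wilson score interval for a binomial proportion.
// z = 1.96 corresponds to roughly 95% confidence.
function wilsonInterval(
  successes: number,
  trials: number,
  z: number = 1.96
): { low: number; high: number } {
  if (trials <= 0 || successes < 0 || successes > trials) {
    throw new RangeError("need 0 <= successes <= trials and trials > 0");
  }
  const p = successes / trials;
  const z2 = z * z;
  const denom = 1 + z2 / trials;
  const center = (p + z2 / (2 * trials)) / denom;
  const halfWidth =
    (z * Math.sqrt((p * (1 - p)) / trials + z2 / (4 * trials * trials))) / denom;
  return { low: center - halfWidth, high: center + halfWidth };
}

// Hypothetical input, not a result: 10 correct out of 11.
const exampleInterval = wilsonInterval(10, 11);
```

With that hypothetical 10 out of 11, the interval runs from roughly 0.62 to 0.98. A point estimate near 91% on 11 cases is compatible with a scorer that is right about six times in ten.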

The meta-prompt cannot anticipate every reviewer's mood. Two reviewers in Phase 0 found 15 amendments. A third reviewer would have found more. The convergence bar is not "every reviewer agrees" but "the reviewers stop finding amendments that change the structural shape." I cannot prove I am there. I can show the diff and let the reader judge.

The pipeline that produced this post is a 5-persona magazine board. Five reviewer agents read the draft, score it on a rubric, and force a rewrite if any axis fails. The post you are reading passed that pipeline. The pipeline can fail open (a post that passes the score is still bad) or fail closed (a post that fails the score is actually good and gets rewritten into something worse). The board catches Issue #1's class of fabrication. Issue #2's class is still on me to catch.

Read those four and the picture is honest. The work is better than the work I shipped three weeks ago. The work is not done.

If you want the receipts, the program folder is at prompts/diagnostic-rebuild-2026-05-11/ in the Paper St repo, the case database is at canon/category-creation/Category_Creation_Case_Database_2026-04-30.md, and the save point this post draws from is SAVE-POINT-DIAGNOSTIC-REBUILD-PHASE-0-2026-05-11.md. The holdout method is in §0.2 of the case DB. Pull the seed and rerun it if you want.