I Had AI Stamp Out Templates in Bulk — But Couldn’t Judge Which Was Good
Auto-generation worked, but self-verification didn’t — the prerequisite for autonomous loops
The plan was ambitious.
The idea: have Claude freely generate topics, stamp out multiple template versions from those topics, and then run a loop in which Claude verifies and improves the results itself. If all three stages ran automatically, all I’d need to do would be to stand at the gate at the beginning and the end.
Generation worked. Stamping out multiple versions worked.
I built a skill pipeline: template-spec → template-gen → template-lint. template-spec pins down the category and style specification, template-gen generates a single HTML file from that spec, and template-lint scores the rendered result. The first template came out: “Aurora,” a dark, aurora-mood SaaS landing page.
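To make the shape of that pipeline concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not the actual skills: the `TemplateSpec` fields, the function signatures, and the stubbed bodies are all hypothetical, and in the real pipeline a model does the work of each stage.

```python
from dataclasses import dataclass
from html import escape


@dataclass(frozen=True)
class TemplateSpec:
    """Hypothetical output of the spec stage: what to build."""
    name: str
    category: str
    style: str


def template_spec(name: str, category: str, style: str) -> TemplateSpec:
    # Stage 1: pin down the category and style before any HTML exists.
    return TemplateSpec(name=name, category=category, style=style)


def template_gen(spec: TemplateSpec) -> str:
    # Stage 2: produce one self-contained HTML file from the spec.
    # A model would write this in practice; here it is a fixed stub.
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{escape(spec.name)}</title></head>\n"
        f"<body data-category=\"{escape(spec.category)}\">\n"
        f"<h1>{escape(spec.name)}</h1><p>{escape(spec.style)}</p>\n"
        "</body></html>"
    )


def template_lint(html: str) -> dict:
    # Stage 3: score the result. Only one placeholder check is shown.
    return {"has_doctype": html.lstrip().lower().startswith("<!doctype html>")}


def run_pipeline(name: str, category: str, style: str):
    # Each stage consumes only the previous stage's output.
    spec = template_spec(name, category, style)
    html = template_gen(spec)
    return spec, html, template_lint(html)


spec, html, report = run_pipeline("Aurora", "SaaS landing page", "dark aurora mood")
```

The point of the split is that each stage has a narrow contract: the spec is data, the generator turns data into one file, and the linter only ever sees that file.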
Then I looked at it — and my hand stopped.
“It feels like I’ve seen this somewhere. But the real problem is I can’t evaluate it.”
The reason was simple: I don’t know design. The core issue was that I didn’t know what “improved” meant. Versions came flooding out, but I couldn’t say which one was good design. “Did this get better, or just different?” I couldn’t tell.
I told it to “improve.” Results came back. Still couldn’t tell if it was different or better. The loop was spinning without going anywhere.
This made one thing clear. Tasks whose evaluation criteria can be broken into a checklist behave completely differently from tasks whose criteria can’t.
When designing template-lint, I split evaluation into two layers. The first was everything a program can verify: does the page have render errors, does it scroll horizontally, does it have external dependencies? This is a checklist. Pass/fail is clear, and Claude can run it itself. In practice, the check for critical flaws in Aurora ran fully automatically.
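As an illustration of what a checklist item looks like in code, here is a sketch of one of those checks, the external-dependency check, using only Python’s standard library. The class and function names, the list of tags, and the report format are assumptions made for this example, not the actual template-lint implementation. The render-error and horizontal-scroll checks need a real browser to evaluate and are left out.

```python
from html.parser import HTMLParser

# Attributes that can point at a resource on another origin.
URL_ATTRS = {"src", "href"}
# Tags whose src/href actually load something (an <a href> is only a link).
LOADING_TAGS = {"script", "link", "img", "iframe", "video", "audio", "source"}


class ExternalDependencyFinder(HTMLParser):
    """Collects (tag, url) pairs for resources loaded from a remote URL."""

    def __init__(self):
        super().__init__()
        self.external = []

    def handle_starttag(self, tag, attrs):
        if tag not in LOADING_TAGS:
            return
        for name, value in attrs:
            # value is None for attributes written without a value.
            if name in URL_ATTRS and value and value.startswith(
                ("http://", "https://", "//")
            ):
                self.external.append((tag, value))


def lint_external_dependencies(html: str) -> dict:
    # Pass/fail is unambiguous: any remote resource fails the check.
    finder = ExternalDependencyFinder()
    finder.feed(html)
    finder.close()
    return {
        "check": "no_external_dependencies",
        "passed": not finder.external,
        "found": finder.external,
    }


clean = '<html><body><img src="data:image/png;base64,AAAA"><a href="https://example.com">x</a></body></html>'
dirty = '<html><head><link rel="stylesheet" href="https://fonts.example.com/a.css"></head></html>'

clean_report = lint_external_dependencies(clean)
dirty_report = lint_external_dependencies(dirty)
```

Here `clean_report` passes, because a data URI and an ordinary link load nothing remote, while `dirty_report` fails and names the offending stylesheet. Nobody has to know design to read that result.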
The other layer was “is it visually good?” Claude looked at a screenshot directly and gave it 90 points against a rubric. A number came out. But for “fix this to make it better” to work, I’d need to be able to tell Claude which direction to change things so the score goes up. I don’t know design, so I can’t give that direction. And Claude, as the evaluator, loses its bearings when it scores something it made itself.
That’s where the prerequisite for autonomous loops revealed itself.
The conclusion came in one line.
The prerequisite for autonomy is verifiable criteria. If the criteria are subjective, the autonomous loop won’t run.
If you want AI to do something autonomously, you first need to be able to write “what good looks like” as a checklist. If you can’t write that down, as with design aesthetics, you can have the AI generate, but you can’t have it verify and improve. Autonomy comes from evaluation criteria, not generation power.
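A toy sketch shows why the checklist is the precondition. Everything here is hypothetical: the three checks, the `improve_loop` function, and `toy_propose`, which stands in for a model call by applying one fixed repair per round. What matters is the acceptance rule: a candidate replaces the current best only if it passes strictly more checks, so “better” has a definition and “merely different” gets rejected.

```python
from typing import Callable, Dict, Tuple

Check = Callable[[str], bool]

SCRIPT_REMOTE = '<script src="https://cdn.example.com/app.js"></script>'
SCRIPT_INLINE = "<script>/* inlined */</script>"


def count_passed(candidate: str, checks: Dict[str, Check]) -> int:
    # The score is just the number of checklist items that pass.
    return sum(1 for check in checks.values() if check(candidate))


def improve_loop(
    initial: str,
    propose: Callable[[str], str],
    checks: Dict[str, Check],
    max_rounds: int = 5,
) -> Tuple[str, int]:
    best = initial
    best_score = count_passed(best, checks)
    for _ in range(max_rounds):
        if best_score == len(checks):
            break  # every item passes, nothing left to improve
        candidate = propose(best)
        score = count_passed(candidate, checks)
        if score > best_score:  # accept only a verifiable improvement
            best, best_score = candidate, score
    return best, best_score


checks = {
    "has_doctype": lambda h: h.lower().startswith("<!doctype html>"),
    "has_viewport": lambda h: 'name="viewport"' in h,
    "no_remote_urls": lambda h: "https://" not in h and "http://" not in h,
}


def toy_propose(html: str) -> str:
    # Stand-in for a model: fixes one failing item per call.
    if not html.lower().startswith("<!doctype html>"):
        return "<!DOCTYPE html>" + html
    if 'name="viewport"' not in html:
        viewport = '<head><meta name="viewport" content="width=device-width">'
        return html.replace("<head>", viewport, 1)
    return html.replace(SCRIPT_REMOTE, SCRIPT_INLINE)


start = "<html><head>" + SCRIPT_REMOTE + "</head></html>"
final, score = improve_loop(start, toy_propose, checks)
```

With these checks the loop climbs from zero passing items to all three and then stops by itself. Swap the checklist for “is it visually good?” and there is nothing to put in the `score > best_score` comparison that the loop can trust.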
So I changed my next move. If I can’t make the judgment myself, I’ll bring the judgment in from somewhere else.