Five prompting styles were evaluated on an identical task: fixing eight planted bugs in a small Express API. Each style was run three times. Outputs were scored on correctness (40%), security (15%), restraint (15%), code quality (20%, via two independent blind LLM judges), and cost (10%).
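To make the rubric concrete, here is a minimal sketch of how the composite score falls out of those weights. Only the weights come from the study; the field names and the assumption that each sub-score is normalized to 0–100 are mine.

```typescript
// Composite scoring sketch. Weights are from the rubric above;
// the 0-100 normalization of each sub-score is an assumption.
interface SubScores {
  correctness: number; // hidden-test pass rate, scaled to 0-100
  security: number;
  restraint: number;
  quality: number;     // mean of the two blind LLM judges
  cost: number;        // cheaper runs score higher
}

const WEIGHTS: Record<keyof SubScores, number> = {
  correctness: 0.40,
  security: 0.15,
  restraint: 0.15,
  quality: 0.20,
  cost: 0.10,
};

function compositeScore(s: SubScores): number {
  return (Object.keys(WEIGHTS) as (keyof SubScores)[])
    .reduce((total, k) => total + WEIGHTS[k] * s[k], 0);
}

// One way a 99.94 can arise: perfect everywhere except cost.
// compositeScore({ correctness: 100, security: 100, restraint: 100,
//                  quality: 100, cost: 99.4 }) ≈ 99.94
```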
The spec-driven prompt won decisively. It was the only style to achieve a perfect correctness score (11/11 hidden tests) on every run, the most consistent (score std dev 0.02 vs 4.47 for terse), and the cheapest in absolute tokens (278k average vs 821k for TDD). All three spec-driven runs scored ≥99.94/100.
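For concreteness, "spec-driven" here means a prompt written like a miniature requirements doc. The sketch below is illustrative, not the study's actual prompt; the specific constraints and deliverable format are assumptions about what such a spec might contain.

```
Fix the bugs in the attached Express API.

Requirements:
- All hidden integration tests must pass.
- Do not change route paths, method signatures, or response shapes.
- Touch only the lines each fix requires; preserve the existing style.
- Mark each fix with a one-line comment naming the bug it addresses.

Deliverable: the corrected files plus a bug -> fix summary.
```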
Chain-of-thought took second place at 95.16/100, a noticeable jump over its previous result. Asking the agent to "list every issue ranked by severity, then fix in order" produced a consistent 10/11 on the hidden tests at the same token cost as a plain detailed brief.
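Expanded into a full instruction, that chain-of-thought framing might read something like this (an illustrative sketch built around the quoted line, not the verbatim study prompt):

```
Read the whole codebase before editing anything.
List every issue you find, ranked by severity.
Then fix the issues in that order, one at a time,
verifying each fix against the surrounding code before moving on.
```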
TDD-framed remained the weakest of the structured styles. It cost more than twice as much as any other structured style while passing fewer hidden tests (mean 8.33/11 vs 10/11 for CoT and the detailed brief). Process scaffolding has real overhead, and at this task size it did not pay off, even though the agent had full ability to run the test suite.
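For comparison, a TDD framing loops the agent through the suite on every fix, roughly like this (again a sketch of the style, assuming a conventional red-green loop rather than the study's exact wording):

```
For each bug you suspect:
1. Write a failing test that reproduces it.
2. Run the suite and confirm the new test fails.
3. Make the smallest change that makes it pass.
4. Re-run the full suite before moving to the next bug.
```

Every pass through steps 2 and 4 puts another full test run into the context window, which is consistent with the more-than-2× token bill observed above.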
Terse prompts again produced clean code but missed bugs. Every terse run scored 5/5 on quality from the blind judges, yet only 5–7 of the 11 hidden tests passed. Brevity shifted all responsibility for bug discovery onto the model, and the model didn't always rise to it.
Headline takeaway: for fixing bugs in a known codebase with Claude Opus 4.7, write a real spec. It delivered the best results, the most consistent results, and the lowest token cost.
Full details are in the attached study report.



