HN Debate: LLM Coding Works Better When Acceptance Criteria Come First

Original: LLMs work best when the user defines their acceptance criteria first

LLM · Mar 7, 2026 · By Insights AI (HN) · 2 min read

Why Hacker News amplified this post

On March 7, 2026, the Hacker News thread around Katana Quant's post drew strong attention because it turned a vague complaint about "AI code quality" into a measurable engineering failure. The source article, Your LLM Doesn't Write Correct Code. It Writes Plausible Code., benchmarked a ground-up LLM-generated Rust rewrite of SQLite against SQLite itself. For one of the simplest operations, a 100-row primary-key lookup, SQLite took 0.09 ms while the rewrite took 1,815.43 ms, or 20,171x slower.

That number mattered because the rewritten project was not obviously broken. It compiled, passed tests, claimed file-format compatibility, and looked architecturally complete. The point of the article was that these surface signals are not enough. A system can look correct, and even return correct outputs on small checks, while still violating the performance and algorithmic invariants that make the software usable in production.

Where the failure came from

The technical diagnosis is precise. In SQLite, INTEGER PRIMARY KEY acts as an alias to the internal rowid, so a lookup such as WHERE id = 5 should hit a direct B-tree search. Katana Quant showed that the Rust reimplementation's planner only recognized literal names such as rowid, _rowid_, and oid. As a result, named primary-key lookups fell back to full table scans instead of the expected logarithmic path.
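The aliasing behavior described above is easy to verify against real SQLite. The sketch below (using Python's standard sqlite3 bindings; table and column names are illustrative) asks the planner how it resolves a lookup on a named INTEGER PRIMARY KEY column, and the plan should report a rowid search rather than a scan:

```python
import sqlite3

# In SQLite, INTEGER PRIMARY KEY makes `id` an alias for the internal rowid.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO t (id, val) VALUES (?, ?)",
                 [(i, f"row{i}") for i in range(1, 101)])

# Ask the planner how it resolves a lookup on the named primary-key column.
plan = conn.execute("EXPLAIN QUERY PLAN SELECT val FROM t WHERE id = 5").fetchall()
detail = plan[0][-1]
print(detail)  # e.g. "SEARCH t USING INTEGER PRIMARY KEY (rowid=?)"
```

A reimplementation that only special-cases the literal names rowid, _rowid_, and oid would answer the same question with a full-table SCAN, which is exactly the linear-time fallback the article measured.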

The article also highlighted compounding overheads: repeated schema reloads, recompilation, page copies, and expensive sync behavior outside batched transactions. None of those choices look dramatic in isolation. Together, they turn a plausible reimplementation into a database that misses the core behavior users actually depend on.
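The sync overhead in particular is a matter of transaction boundaries: committing after every row forces a durability sync per statement, while one explicit transaction pays that cost once at COMMIT. A minimal sketch of the batched pattern, assuming Python's sqlite3 module:

```python
import os
import sqlite3
import tempfile

# A file-backed database, where per-commit sync costs actually apply.
path = os.path.join(tempfile.mkdtemp(), "bench.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE kv (k INTEGER PRIMARY KEY, v TEXT)")

# One explicit transaction: a single commit (and sync) for all 1000 rows,
# instead of one per INSERT.
with conn:  # the connection context manager commits on success
    conn.executemany("INSERT INTO kv VALUES (?, ?)",
                     [(i, "x") for i in range(1000)])

count = conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0]
print(count)  # 1000
```

Each of the compounding overheads the article lists has this shape: individually defensible, collectively fatal to throughput.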

What the HN discussion adds

The HN comments broadened the lesson beyond databases. Several readers noted that frontier coding models are most fragile on underspecified or unfamiliar tasks, especially when the user has not stated what "correct" means in measurable terms. That maps directly to the post's central recommendation: define acceptance criteria before generation. In practice, that means specifying latency budgets, algorithmic expectations, correctness checks, regression tests, and the tooling that will verify them.
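Those criteria can be made executable before any code is generated. The sketch below is a hypothetical acceptance check, not the article's code; the table, the helper name, and the 1 ms budget are all illustrative assumptions:

```python
import sqlite3
import time

def assert_pk_lookup_is_indexed(conn, table, key_col, key_val, budget_ms=1.0):
    """Acceptance criterion (hypothetical): a primary-key lookup must use a
    B-tree search, never a full scan, and must fit the latency budget."""
    plan = conn.execute(
        f"EXPLAIN QUERY PLAN SELECT * FROM {table} WHERE {key_col} = ?",
        (key_val,)).fetchall()
    detail = plan[0][-1]
    assert "SEARCH" in detail and "SCAN" not in detail, detail

    start = time.perf_counter()
    conn.execute(f"SELECT * FROM {table} WHERE {key_col} = ?",
                 (key_val,)).fetchone()
    elapsed_ms = (time.perf_counter() - start) * 1000
    assert elapsed_ms < budget_ms, f"{elapsed_ms:.3f} ms exceeds budget"
    return elapsed_ms

# Run the check against a small fixture database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, f"u{i}") for i in range(100)])
ms = assert_pk_lookup_is_indexed(conn, "users", "id", 42)
```

Written against the real SQLite, this test passes trivially; run against the LLM rewrite the article benchmarked, the plan assertion would have failed immediately, long before anyone measured 1,815 ms.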

For teams adopting coding agents, this is the operational takeaway. Prompt quality matters, but verification design matters more. If a developer cannot explain why a query planner should choose a B-tree seek instead of a full scan, the model's confident output will not close that gap. Community interest on HN suggests the discussion is shifting from demo-friendly generation toward process discipline: benchmark first, state invariants early, and treat LLM output as draft engineering work until it survives explicit tests.

Original source: Your LLM Doesn't Write Correct Code. It Writes Plausible Code.


© 2026 Insights. All rights reserved.