HN Debate: LLM Coding Works Better When Acceptance Criteria Come First
Original: LLMs work best when the user defines their acceptance criteria first
Why Hacker News amplified this post
On March 7, 2026, the Hacker News thread around Katana Quant's post drew strong attention because it turned a vague complaint about "AI code quality" into a measurable engineering failure. The source article, Your LLM Doesn't Write Correct Code. It Writes Plausible Code., benchmarked a ground-up LLM-generated Rust rewrite of SQLite against SQLite itself. For one of the simplest operations, a 100-row primary-key lookup, SQLite took 0.09 ms while the rewrite took 1,815.43 ms, or 20,171x slower.
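The baseline side of that comparison is easy to reproduce locally. The sketch below times a primary-key lookup on a 100-row SQLite table using Python's sqlite3 module; the table name, payload, and loop count are illustrative, and absolute numbers will differ from the article's 0.09 ms depending on hardware and SQLite version.

```python
import sqlite3
import time

# 100-row table with an INTEGER PRIMARY KEY, mirroring the benchmark shape.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(i, "x" * 64) for i in range(100)])

# Warm up once, then time many lookups and report the mean per lookup.
conn.execute("SELECT payload FROM t WHERE id = 50").fetchone()
n = 1000
t0 = time.perf_counter()
for _ in range(n):
    conn.execute("SELECT payload FROM t WHERE id = 50").fetchone()
mean_ms = (time.perf_counter() - t0) / n * 1000
print(f"mean lookup: {mean_ms:.4f} ms")
```

Measuring a mean over many iterations, rather than a single call, is what makes sub-millisecond differences like 0.09 ms visible at all.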
That number mattered because the rewritten project was not obviously broken. It compiled, passed tests, claimed file-format compatibility, and looked architecturally complete. The point of the article was that these surface signals are not enough. A system can look correct, and even return correct outputs on small checks, while still violating the performance and algorithmic invariants that make the software usable in production.
Where the failure came from
The technical diagnosis is precise. In SQLite, INTEGER PRIMARY KEY acts as an alias for the internal rowid, so a lookup such as WHERE id = 5 should hit a direct B-tree search. Katana Quant showed that the Rust reimplementation's planner recognized only the literal names rowid, _rowid_, and oid, and never mapped a declared INTEGER PRIMARY KEY column onto the rowid. As a result, named primary-key lookups fell back to O(n) full table scans instead of the expected O(log n) B-tree path.
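The correct behavior is easy to observe in stock SQLite. In the sketch below (table and column names are illustrative), EXPLAIN QUERY PLAN reports a SEARCH using the integer primary key for a named-column lookup, which is exactly the rowid B-tree seek the reimplementation missed:

```python
import sqlite3

# Table with a named INTEGER PRIMARY KEY column; SQLite treats "id"
# as an alias for the internal rowid.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(100)])

# Ask the planner how it will execute a named primary-key lookup.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM users WHERE id = 5"
).fetchall()
for row in plan:
    print(row)
# The detail column reports a SEARCH on the integer primary key
# (a rowid B-tree seek), not a SCAN of the whole table.
```

A planner that recognizes only the spellings rowid, _rowid_, and oid would report SCAN here, which is the 20,000x difference in one line of output.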
The article also highlighted compounding overheads: repeated schema reloads, recompilation, page copies, and expensive sync behavior outside batched transactions. None of those choices look dramatic in isolation. Together, they turn a plausible reimplementation into a database that misses the core behavior users actually depend on.
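The sync-overhead point can also be demonstrated against stock SQLite. The sketch below (row count and on-disk path are illustrative; exact ratios depend on disk speed and SQLite's journal and synchronous settings) compares one commit per INSERT with a single batched transaction:

```python
import os
import sqlite3
import tempfile
import time

# On-disk database so each commit pays real journal/sync cost.
path = os.path.join(tempfile.mkdtemp(), "bench.db")
conn = sqlite3.connect(path, isolation_level=None)  # autocommit mode
conn.execute("CREATE TABLE kv (k INTEGER PRIMARY KEY, v TEXT)")

rows = [(i, "x" * 32) for i in range(300)]

# One commit per statement: every INSERT is its own transaction.
t0 = time.perf_counter()
for r in rows:
    conn.execute("INSERT INTO kv VALUES (?, ?)", r)
per_stmt = time.perf_counter() - t0

conn.execute("DELETE FROM kv")

# One batched transaction: a single commit at the end.
t0 = time.perf_counter()
conn.execute("BEGIN")
conn.executemany("INSERT INTO kv VALUES (?, ?)", rows)
conn.execute("COMMIT")
batched = time.perf_counter() - t0

print(f"per-statement: {per_stmt:.3f}s, batched: {batched:.3f}s")
```

Each autocommit statement pays its own journal and sync cycle, so the per-statement loop is dominated by exactly the kind of overhead the article describes; the batched transaction pays it once.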
What the HN discussion adds
The HN comments broadened the lesson beyond databases. Several readers noted that frontier coding models are most fragile on underspecified or unfamiliar tasks, especially when the user has not stated what "correct" means in measurable terms. That maps directly to the post's central recommendation: define acceptance criteria before generation. In practice, that means specifying latency budgets, algorithmic expectations, correctness checks, regression tests, and the tooling that will verify them.
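As a concrete illustration of what "acceptance criteria first" might look like, here is a hypothetical sketch. The table, check names, and the 10 ms budget are assumptions for illustration, not from the article; the point is that the planner invariant, latency budget, and correctness check are all stated as executable assertions before any generated code is trusted.

```python
import sqlite3
import time

# Hypothetical acceptance criteria for a 100-row primary-key lookup.
LATENCY_BUDGET_S = 0.01  # illustrative 10 ms budget

def check_plan_uses_pk_seek(conn):
    """Algorithmic invariant: the planner must seek, not scan."""
    detail = conn.execute(
        "EXPLAIN QUERY PLAN SELECT v FROM t WHERE id = 5"
    ).fetchall()[0][-1]
    assert "SCAN" not in detail, f"full table scan chosen: {detail}"

def check_latency_budget(conn):
    """Performance and correctness invariants in one check."""
    t0 = time.perf_counter()
    row = conn.execute("SELECT v FROM t WHERE id = 5").fetchone()
    elapsed = time.perf_counter() - t0
    assert row == ("v5",), f"wrong result: {row}"       # correctness
    assert elapsed < LATENCY_BUDGET_S, f"{elapsed:.4f}s over budget"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(i, f"v{i}") for i in range(100)])
check_plan_uses_pk_seek(conn)
check_latency_budget(conn)
print("acceptance criteria passed")
```

Run against stock SQLite, both checks pass; run against a reimplementation with the planner bug described above, the first assertion fails immediately instead of surfacing as a 20,000x regression in production.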
For teams adopting coding agents, this is the operational takeaway. Prompt quality matters, but verification design matters more. If a developer cannot explain why a query planner should choose a B-tree seek instead of a full scan, the model's confident output will not close that gap. Community interest on HN suggests the discussion is shifting from demo-friendly generation toward process discipline: benchmark first, state invariants early, and treat LLM output as draft engineering work until it survives explicit tests.
Original source: Your LLM Doesn't Write Correct Code. It Writes Plausible Code.
Related Articles
HN read Kimi K2.6 as a test of whether open-weight coding agents can last through real engineering work. The 12-hour and 13-hour coding cases drew attention, while commenters immediately pressed on speed, provider accuracy, and benchmark realism.
HN did not latch onto DeepSeek V4 because of a polished launch page. The thread took off when commenters realized the front-page link was just updated docs while the weights and base models were already live for inspection.
Hacker News focused less on the Copilot plan mechanics and more on what the change reveals: long-running coding agents are turning flat AI subscriptions into a compute-cost problem.