HN Debate: LLM Coding Works Better When Acceptance Criteria Come First
Original: LLMs work best when the user defines their acceptance criteria first
Why Hacker News amplified this post
On March 7, 2026, the Hacker News thread around Katana Quant's post drew strong attention because it turned a vague complaint about "AI code quality" into a measurable engineering failure. The source article, Your LLM Doesn't Write Correct Code. It Writes Plausible Code., benchmarked a ground-up LLM-generated Rust rewrite of SQLite against SQLite itself. For one of the simplest operations, a 100-row primary-key lookup, SQLite took 0.09 ms while the rewrite took 1,815.43 ms, or 20,171x slower.
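The shape of that benchmark, though not its absolute numbers, can be reproduced against real SQLite with Python's built-in sqlite3 module. This is an illustrative sketch, not the article's harness; the table name t and the 100-row workload are assumptions:

```python
import sqlite3
import time

# Build a small table keyed by an INTEGER PRIMARY KEY, mirroring the
# 100-row primary-key-lookup workload described in the article.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(i, f"v{i}") for i in range(100)])

# Time 100 point lookups by primary key.
start = time.perf_counter()
for i in range(100):
    conn.execute("SELECT val FROM t WHERE id = ?", (i,)).fetchone()
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"{elapsed_ms:.2f} ms for 100 primary-key lookups")
```

On real SQLite each lookup is a direct B-tree seek, so the whole loop completes in well under a millisecond-per-query budget; the rewrite's 1,815 ms figure is what the same workload looks like when every lookup degrades to a full scan with extra per-statement overhead.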
That number mattered because the rewritten project was not obviously broken. It compiled, passed tests, claimed file-format compatibility, and looked architecturally complete. The point of the article was that these surface signals are not enough. A system can look correct, and even return correct outputs on small checks, while still violating the performance and algorithmic invariants that make the software usable in production.
Where the failure came from
The technical diagnosis is precise. In SQLite, INTEGER PRIMARY KEY acts as an alias for the internal rowid, so a lookup such as WHERE id = 5 should hit a direct B-tree search. Katana Quant showed that the Rust reimplementation's planner only recognized the literal names rowid, _rowid_, and oid. As a result, lookups on a named primary key fell back to full table scans instead of the expected logarithmic path.
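The aliasing behavior is easy to verify against real SQLite: EXPLAIN QUERY PLAN shows whether the planner resolves a named primary-key lookup to a rowid seek or falls back to a scan. A minimal check, with table and column names chosen for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "id" is an INTEGER PRIMARY KEY, so SQLite treats it as an alias for rowid.
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(i, f"v{i}") for i in range(100)])

# Ask the planner how it will execute a named primary-key lookup.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT val FROM t WHERE id = 5"
).fetchone()
print(plan[-1])  # detail column, e.g. a SEARCH using the integer primary key
```

Real SQLite reports a SEARCH using the integer primary key here; the failure mode the article describes is a planner that would instead emit a SCAN for this exact query because it matches only the literal spellings rowid, _rowid_, and oid.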
The article also highlighted compounding overheads: repeated schema reloads, recompilation, page copies, and expensive sync behavior outside batched transactions. None of those choices look dramatic in isolation. Together, they turn a plausible reimplementation into a database that misses the core behavior users actually depend on.
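The transaction-batching overhead in particular is observable with stock SQLite. In autocommit mode every INSERT is its own transaction and pays the commit cost; wrapping the batch in one transaction amortizes it. A hypothetical micro-benchmark sketch (file path and row counts are arbitrary):

```python
import os
import sqlite3
import tempfile
import time

# File-backed database so commits actually touch durable storage.
path = os.path.join(tempfile.mkdtemp(), "bench.db")
conn = sqlite3.connect(path, isolation_level=None)  # autocommit; we manage txns
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")

def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {(time.perf_counter() - start) * 1000:.1f} ms")

def per_row_commits():
    # Each INSERT is its own implicit transaction: one commit per row.
    for i in range(200):
        conn.execute("INSERT INTO t VALUES (?, ?)", (i, "x"))

def single_transaction():
    # One enclosing transaction: the commit cost is paid once.
    conn.execute("BEGIN")
    for i in range(200, 400):
        conn.execute("INSERT INTO t VALUES (?, ?)", (i, "x"))
    conn.execute("COMMIT")

timed("per-row commits", per_row_commits)
timed("single transaction", single_transaction)
```

The gap between the two timings is exactly the kind of "expensive sync behavior outside batched transactions" the article flags: unremarkable per statement, dominant in aggregate.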
What the HN discussion adds
The HN comments broadened the lesson beyond databases. Several readers noted that frontier coding models are most fragile on underspecified or unfamiliar tasks, especially when the user has not stated what "correct" means in measurable terms. That maps directly to the post's central recommendation: define acceptance criteria before generation. In practice, that means specifying latency budgets, algorithmic expectations, correctness checks, regression tests, and the tooling that will verify them.
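Acceptance criteria of that kind can be written down as executable checks before any code is generated. A sketch of what such a check might look like for the primary-key case, with the function name, table schema, and latency budget all chosen as illustrative assumptions:

```python
import sqlite3
import time

def check_pk_lookup(conn, budget_ms=500.0):
    """Hypothetical acceptance check: a named-PK lookup must plan as a
    B-tree SEARCH, and 1,000 lookups must finish within a latency budget."""
    # Algorithmic criterion: the planner must seek, not scan.
    detail = conn.execute(
        "EXPLAIN QUERY PLAN SELECT val FROM t WHERE id = ?", (1,)
    ).fetchone()[-1]
    assert "SEARCH" in detail, f"expected an index seek, got: {detail}"

    # Latency criterion: stated up front, verified mechanically.
    start = time.perf_counter()
    for i in range(1000):
        conn.execute("SELECT val FROM t WHERE id = ?", (i % 100,)).fetchone()
    elapsed = (time.perf_counter() - start) * 1000
    assert elapsed < budget_ms, f"{elapsed:.1f} ms exceeds {budget_ms} ms budget"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(i, "v") for i in range(100)])
check_pk_lookup(conn)
print("acceptance checks passed")
```

A rewrite that scans instead of seeking fails the first assertion immediately, which is the point: the criterion catches the regression even when the output rows are correct.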
For teams adopting coding agents, this is the operational takeaway. Prompt quality matters, but verification design matters more. If a developer cannot explain why a query planner should choose a B-tree seek instead of a full scan, the model's confident output will not close that gap. Community interest on HN suggests the discussion is shifting from demo-friendly generation toward process discipline: benchmark first, state invariants early, and treat LLM output as draft engineering work until it survives explicit tests.
Original source: Your LLM Doesn't Write Correct Code. It Writes Plausible Code.
Related Articles
A high-traction Hacker News thread highlighted Simon Willison’s "Agentic Engineering Patterns" guide, which organizes practical workflows for coding agents. The focus is operational discipline: testing-first loops, readable change flow, and reusable prompts.
A user created a fully playable space exploration game using only natural language instructions to Gemini 3.1 Pro over a few hours. The AI handled performance optimization, soundtrack generation, and UI design entirely from plain language requests, producing around 1,800 lines of HTML code.
A LocalLLaMA thread spotlights FlashAttention-4, which reports up to 1,605 TFLOP/s on B200 BF16 and introduces pipeline and memory-layout changes tuned for Blackwell constraints.