Dirac’s 65.2% TerminalBench run turned HN toward the harness, not just the model
Original post: Show HN: OSS Agent I built topped the TerminalBench on Gemini-3-flash-preview
Hacker News did not treat this as a simple brag post. The thread immediately turned into a sharper question: did Dirac win because the model got better, or because the harness wasted less context? The Show HN post said Dirac hit 65.2% on TerminalBench 2 with gemini-3-flash-preview, ahead of Google’s own baseline at 47.6% and Junie CLI at 64.3%. It also stressed that no benchmark-specific AGENTS.md files or other leaderboard tricks were inserted, which is exactly why the discussion got traction.
The Dirac repo frames the project around context discipline. Its README highlights hash-anchored edits, AST-guided scoping, batched file operations, and opportunistic context updates that try to fetch the next needed material before the model asks. The pitch is blunt: if coding agents degrade as context grows, then better curation is not a nice extra. It is the product.
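To make the "hash-anchored edits" idea concrete, here is a minimal sketch of one plausible design, assuming (the README does not spell out the mechanism) that each edit carries a short hash of the exact text it targets, so a stale edit fails loudly instead of clobbering a file that changed since the agent last read it. The function names are hypothetical, not Dirac's API.

```python
import hashlib


def anchor(text: str) -> str:
    """Short content hash that pins an edit to the exact text it targets."""
    return hashlib.sha256(text.encode()).hexdigest()[:12]


def apply_edit(source: str, old: str, new: str, expected_anchor: str) -> str:
    """Apply a replacement only if the target text still matches its anchor.

    If the file drifted since the agent read it, the hash check fails and
    the agent must re-read rather than overwrite fresh content.
    """
    if anchor(old) != expected_anchor:
        raise ValueError("stale edit: target text changed since it was read")
    if old not in source:
        raise ValueError("target text not found in source")
    return source.replace(old, new, 1)


# The agent records the anchor when it reads the file...
src = "def add(a, b):\n    return a + b\n"
a = anchor("return a + b")
# ...and later applies the edit, which is rejected if the anchor no longer matches.
patched = apply_edit(src, "return a + b", "return a + b  # reviewed", a)
```

The payoff for context discipline: a failed anchor check is a one-line error, far cheaper in tokens than the model silently producing a wrong diff and debugging it afterward.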
That matched the HN discussion almost perfectly. Early commenters asked whether this was really a new model story or just a new wrapper. The author answered that the model was still the default Gemini 3 Flash Preview and that the gains came from the tool chain. Other commenters dug into why AST-based search might beat plain grep on large repositories, especially when common symbol names and bundled files pollute search results; once code search gets noisy, they noted, the agent can burn context long before it makes a useful change.
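The grep-versus-AST point can be sketched in a few lines. This is an illustrative toy, not Dirac's implementation: a plain text search surfaces every line that mentions a name, comments and call sites included, while an AST walk returns only the lines that actually define it. The sample source and helper names are made up for the demonstration.

```python
import ast

SOURCE = '''
class Config:
    pass

def config():
    return Config()

value = config()  # a mere usage: grep surfaces this too
# config also shows up in comments
'''


def grep_hits(source: str, name: str) -> list[int]:
    """Plain text search: every line mentioning the name, noise included."""
    return [i for i, line in enumerate(source.splitlines(), 1) if name in line]


def ast_definitions(source: str, name: str) -> list[int]:
    """AST walk: only the lines that actually define the symbol."""
    defs = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    return [
        node.lineno
        for node in ast.walk(ast.parse(source))
        if isinstance(node, defs) and node.name == name
    ]


print(grep_hits(SOURCE, "config"))       # the def line plus usage and comment lines
print(ast_definitions(SOURCE, "config")) # just the def line
```

On a large repository the gap widens: grep on a common name like `config` can return hundreds of lines that all get stuffed into the prompt, while the AST-scoped lookup hands the model only the definitions it can actually edit.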
The interesting part is not only that Dirac posted a high number on TerminalBench. It is that HN treated the number as evidence in a larger argument about coding agents. The thread reads like a reminder that model progress and harness design are now entangled. Same base model, different search strategy, different edit strategy, different outcome. That is exactly the kind of argument Hacker News likes to keep alive.
Related Articles
HN did not just react to a leaderboard bump. The thread locked onto Dirac's claim that tighter context, hash-anchored edits, and AST-guided retrieval can beat heavier coding agents while spending less.
Cursor has published the Composer 2 technical report, outlining its code-focused continued pretraining, large-scale reinforcement learning pipeline, and CursorBench-led evaluation strategy. The report offers an unusually detailed first-party look at how a production coding agent is trained and measured.
GitHub has paused new Copilot Pro, Pro+, and Student sign-ups after agentic workflows pushed compute demand beyond the old plan structure. The sharper signal is economic: token-based session and weekly limits now matter separately from premium request counts.