Dirac’s 65.2% TerminalBench run turned HN toward the harness, not just the model
Original post: Show HN: OSS Agent I built topped the TerminalBench on Gemini-3-flash-preview
Hacker News did not treat this as a simple brag post. The thread immediately turned into a sharper question: did Dirac win because the model got better, or because the harness wasted less context? The Show HN post said Dirac hit 65.2% on TerminalBench 2 with gemini-3-flash-preview, ahead of Google’s own baseline at 47.6% and Junie CLI at 64.3%. It also stressed that no benchmark-specific AGENTS.md files or other leaderboard tricks were inserted, which is exactly why the discussion got traction.
The Dirac repo frames the project around context discipline. Its README highlights hash-anchored edits, AST-guided scoping, batched file operations, and opportunistic context updates that try to fetch the next needed material before the model asks. The pitch is blunt: if coding agents degrade as context grows, then better curation is not a nice extra. It is the product.
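To make the "hash-anchored edits" idea concrete, here is a minimal sketch of one plausible design, assuming (the README does not spell out the mechanism) that each edit carries a short hash of the exact text it targets, so a stale edit fails loudly instead of clobbering a file that changed since the agent last read it. The function names are hypothetical, not Dirac's API.

```python
import hashlib


def anchor(text: str) -> str:
    """Short content hash that pins an edit to the exact text it targets."""
    return hashlib.sha256(text.encode()).hexdigest()[:12]


def apply_edit(source: str, old: str, new: str, expected_anchor: str) -> str:
    """Apply a replacement only if the target text still matches its anchor.

    If the file drifted since the agent read it, the hash check fails and
    the agent must re-read rather than overwrite fresh content.
    """
    if anchor(old) != expected_anchor:
        raise ValueError("stale edit: target text changed since it was read")
    if old not in source:
        raise ValueError("target text not found in source")
    return source.replace(old, new, 1)


# The agent records the anchor when it reads the file...
src = "def add(a, b):\n    return a + b\n"
a = anchor("return a + b")
# ...and later applies the edit, which is rejected if the anchor no longer matches.
patched = apply_edit(src, "return a + b", "return a + b  # reviewed", a)
```

The payoff for context discipline: a failed anchor check is a one-line error, far cheaper in tokens than the model silently producing a wrong diff and debugging it afterward.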
That matched the HN discussion almost perfectly. Early commenters asked whether this was really a new model story or just a new wrapper. The author answered that the model was still the default Gemini 3 Flash Preview and that the gains came from the tool chain. Other commenters dug into why AST-based search might beat plain grep on large repositories, especially when common symbol names and bundled files pollute search results; once code search gets noisy, they noted, the agent can burn context long before it makes a useful change.
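The grep-versus-AST point can be sketched in a few lines. This is an illustrative toy, not Dirac's implementation: a plain text search surfaces every line that mentions a name, comments and call sites included, while an AST walk returns only the lines that actually define it. The sample source and helper names are made up for the demonstration.

```python
import ast

SOURCE = '''
class Config:
    pass

def config():
    return Config()

value = config()  # a mere usage: grep surfaces this too
# config also shows up in comments
'''


def grep_hits(source: str, name: str) -> list[int]:
    """Plain text search: every line mentioning the name, noise included."""
    return [i for i, line in enumerate(source.splitlines(), 1) if name in line]


def ast_definitions(source: str, name: str) -> list[int]:
    """AST walk: only the lines that actually define the symbol."""
    defs = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    return [
        node.lineno
        for node in ast.walk(ast.parse(source))
        if isinstance(node, defs) and node.name == name
    ]


print(grep_hits(SOURCE, "config"))       # the def line plus usage and comment lines
print(ast_definitions(SOURCE, "config")) # just the def line
```

On a large repository the gap widens: grep on a common name like `config` can return hundreds of lines that all get stuffed into the prompt, while the AST-scoped lookup hands the model only the definitions it can actually edit.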
The interesting part is not only that Dirac posted a high number on TerminalBench. It is that HN treated the number as evidence in a larger argument about coding agents. The thread reads like a reminder that model progress and harness design are now entangled. Same base model, different search strategy, different edit strategy, different outcome. That is exactly the kind of argument Hacker News likes to keep alive.
Related Articles
HN did not just react to a leaderboard bump. The thread locked onto Dirac's claim that tighter context, hash-anchored edits, and AST-guided retrieval can beat heavier coding agents while spending less.
Cursor has published the Composer 2 technical report, outlining its code-focused continued pretraining, large-scale reinforcement learning pipeline, and CursorBench-led evaluation strategy. The report offers an unusually detailed first-party look at how a production coding agent is trained and measured.
GitHub has paused new Copilot Pro, Pro+, and Student sign-ups after agentic workflows pushed compute demand beyond the old plan structure. The sharper signal is economic: token-based session and weekly limits now matter separately from premium request counts.