Dirac’s 65.2% TerminalBench run turned HN toward the harness, not just the model

Original: Show HN: OSS Agent I built topped the TerminalBench on Gemini-3-flash-preview

LLM · Apr 29, 2026 · By Insights AI (HN) · 1 min read

Hacker News did not treat this as a simple brag post. The thread immediately turned into a sharper question: did Dirac win because the model got better, or because the harness wasted less context? The Show HN post said Dirac hit 65.2% on TerminalBench 2 with gemini-3-flash-preview, ahead of Google’s own baseline at 47.6% and Junie CLI at 64.3%. It also stressed that no benchmark-specific AGENTS.md files or other leaderboard tricks were inserted, which is exactly why the discussion got traction.

The Dirac repo frames the project around context discipline. Its README highlights hash-anchored edits, AST-guided scoping, batched file operations, and opportunistic context updates that try to fetch the next needed material before the model asks. The pitch is blunt: if coding agents degrade as context grows, then better curation is not a nice extra. It is the product.
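Dirac's actual implementation is not shown in the post, but the "hash-anchored edits" idea can be illustrated with a minimal sketch: the agent records a content hash of the exact span it intends to rewrite, and the edit is applied only if that span still hashes the same at apply time. The function names here (`hash_anchor`, `apply_anchored_edit`) are hypothetical, not Dirac's API.

```python
import hashlib

def hash_anchor(text: str) -> str:
    """Short content hash used to anchor an edit to an exact span."""
    return hashlib.sha256(text.encode()).hexdigest()[:12]

def apply_anchored_edit(source: str, old: str, new: str, anchor: str) -> str:
    """Replace `old` with `new` only if `old` still hashes to `anchor`.

    If the file drifted since the agent last read it, the hash no longer
    matches and the edit is rejected instead of silently landing on a
    stale or ambiguous target.
    """
    if old not in source:
        raise ValueError("anchor span not found; file changed since read")
    if hash_anchor(old) != anchor:
        raise ValueError("hash mismatch; refusing stale edit")
    return source.replace(old, new, 1)

src = "def add(a, b):\n    return a + b\n"
anchor = hash_anchor("return a + b")
patched = apply_anchored_edit(src, "return a + b", "return a * b", anchor)
```

The payoff for context discipline: a rejected edit is a cheap, explicit failure the agent can recover from, instead of a silent mis-edit that forces it to re-read the whole file.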

That matched the HN discussion almost perfectly. Early commenters asked whether this was really a new model story or just a new wrapper. The author answered that the model was still the default Gemini 3 Flash Preview and that the gains came from the tool chain. Other commenters dug into why AST-based search might beat plain grep on large repositories, especially when common symbol names and bundled files pollute search results. Several pointed out that once code search gets noisy, the agent can burn context long before it makes a useful change.
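The grep-vs-AST point is easy to see concretely. A grep for a common symbol name returns every call site, comment, and bundled-file mention; an AST walk can return only the lines where the symbol is actually defined. This sketch uses Python's standard `ast` module and is illustrative only, not how Dirac implements its scoping:

```python
import ast

def find_definitions(source: str, symbol: str) -> list[int]:
    """Return line numbers where `symbol` is *defined* as a function or
    class, skipping the call sites and comments a plain grep would match."""
    tree = ast.parse(source)
    return [
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        and node.name == symbol
    ]

code = '''\
def run():
    run_task()

def run_task():
    pass

# "run" appears here too, but only as grep noise
'''
find_definitions(code, "run")  # -> [1], the one real definition
```

On a large repository the same principle scales: the agent spends context on the handful of definition sites rather than hundreds of textual hits.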

The interesting part is not only that Dirac posted a high number on TerminalBench. It is that HN treated the number as evidence in a larger argument about coding agents. The thread reads like a reminder that model progress and harness design are now entangled. Same base model, different search strategy, different edit strategy, different outcome. That is exactly the kind of argument Hacker News likes to keep alive.




© 2026 Insights. All rights reserved.