r/singularity Debates Meta-Harness After Its TerminalBench 2 Lead Over Claude Code

Original: Stanford Researchers Autonomously Improved A Harness And SIGNIFICANTLY Beat Claude Code on TerminalBench 2

AI · Mar 31, 2026 · By Insights AI (Reddit) · 2 min read

A post on r/singularity drew 286 upvotes and 57 comments by framing the linked project as Stanford researchers autonomously improving a harness and significantly beating Claude Code on TerminalBench 2. Because the post body pointed readers to the Meta-Harness page, the thread quickly moved past the headline and into a discussion of what was actually optimized and why that might matter for agent performance.

The Meta-Harness page describes the work as an end-to-end optimization method for model harnesses. That means the focus is not only the model itself, but the surrounding scaffold that determines how an agent inspects files, calls tools, and responds to execution feedback. This became a central point in the Reddit comments. Several users wanted to pin down what a harness is in practical terms, and whether the gains here say more about orchestration than about a change in the base model.
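To make the term concrete, the loop below is a minimal, hypothetical sketch of what a harness does: it routes a model's tool calls to the environment and feeds execution output back into the transcript. The `stub_model` function stands in for an LLM API call; none of these names come from the Meta-Harness project itself.

```python
# Illustrative harness sketch: the scaffold around a model that executes
# tool calls and feeds results back. The model is a stub, not a real LLM.
import subprocess

def stub_model(transcript):
    """Hypothetical model: pick the next action based on the transcript."""
    if not any(step["role"] == "tool" for step in transcript):
        return {"tool": "shell", "args": "echo hello from the harness"}
    return {"tool": "done", "args": ""}

def run_harness(model, max_steps=5):
    transcript = [{"role": "user", "content": "demo task"}]
    for _ in range(max_steps):
        action = model(transcript)
        if action["tool"] == "done":
            break
        # Execute the tool call and append the observation. This routing
        # layer, not the model weights, is what harness optimization targets.
        result = subprocess.run(action["args"], shell=True,
                                capture_output=True, text=True)
        transcript.append({"role": "tool", "content": result.stdout.strip()})
    return transcript

transcript = run_harness(stub_model)
print(transcript[-1]["content"])
```

Swapping in a different file-inspection strategy, tool set, or feedback format changes the harness without touching the model, which is the optimization surface the commenters were circling.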

The page also includes a smaller illustrative result. In a 19-task search, performance improved from 28.5% with the Terminus-KIRA baseline to 46.5% by iteration 7. For the full TerminalBench-2 evaluation, the page says the benchmark contains 89 Dockerized tasks spanning code translation, distributed ML setup, systems programming, bioinformatics, and cryptanalysis. The proposer driving the harness search is described as a coding agent that can inspect the full source code, scores, and execution traces in a filesystem; the page explicitly says this proposer is Claude Code, using tools such as `grep` and `cat`.

  • On Claude Opus 4.6, Meta-Harness reached 76.4%, above Terminus-KIRA at 74.7% and Claude Code at 58.0%, ranking #2 among Opus 4.6 agents on the cited leaderboard page.
  • On Claude Haiku 4.5, Meta-Harness reached 37.6%, above Goose at 35.5% and Claude Code at 27.5%, ranking #1 among Haiku 4.5 agents.
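The propose-evaluate-keep loop behind those numbers can be sketched as a toy greedy search. Everything here is a stand-in: the real proposer is Claude Code reading source and traces, and the real evaluator is the 19-task benchmark, not the scoring stub below.

```python
# Toy sketch of an iterative harness search: propose a config variant,
# evaluate it, keep it if the score improves. Knob names, the scoring
# function, and the mutation rule are all illustrative assumptions.
import random

def evaluate(config):
    """Stand-in for running the benchmark on one harness config."""
    rng = random.Random(config["context_lines"] * 1000 + config["retries"])
    base = 0.1 * config["context_lines"] / 100 + 0.05 * config["retries"]
    return min(1.0, base + rng.uniform(0, 0.1))

def propose(config, rng):
    """Stand-in proposer: mutate one harness knob at a time."""
    new = dict(config)
    knob = rng.choice(sorted(new))
    step = 10 if knob == "context_lines" else 1
    new[knob] = max(1, new[knob] + rng.choice([-1, 1]) * step)
    return new

def search(iterations=7, seed=0):
    rng = random.Random(seed)
    best = {"context_lines": 50, "retries": 1}
    best_score = evaluate(best)
    for _ in range(iterations):
        candidate = propose(best, rng)
        score = evaluate(candidate)
        if score > best_score:  # greedy accept across iterations
            best, best_score = candidate, score
    return best, best_score

best, score = search()
print(best, round(score, 3))
```

The greedy accept rule means the score can only go up across iterations, which mirrors the 28.5% → 46.5% trajectory the page reports for its 19-task search, though the real system searches over far richer harness changes than two numeric knobs.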

The Reddit discussion focused less on celebrating one leaderboard result and more on the implications. Commenters debated whether AI-designed harnesses can outpace manual development loops, and whether similar treatment will eventually be applied to open models. That gave the thread a broader community angle: people were reading the benchmark result as evidence that system design around a model may still offer a large optimization surface.

That is why the post resonated beyond a single project page. For r/singularity, the interesting question was not just whether Meta-Harness beat Claude Code on one benchmark, but whether automated improvement of tooling and orchestration will become a repeatable path for agent progress. The thread reflected growing interest in the idea that the harness layer, not just the base model, may be one of the main levers for future coding-agent gains.


Related Articles

AI Hacker News 1d ago 2 min read

A March 29 Hacker News post pushed a GitHub issue alleging that Claude Code was running `git fetch origin` plus `git reset --hard origin/main` every 600 seconds against a user repo. The root cause is still unresolved, but the report sharply reopens the repo-safety question for agentic coding tools.


© 2026 Insights. All rights reserved.