LLM Coding Performance: Harness Design, Not Models, Is the Key

Overview

Can Bölük argues that the design of the edit tool (the harness), not the choice of model, is the primary bottleneck in LLM coding performance. In a benchmark of 16 models on 180 React codebase tasks, changing only the edit approach substantially raised success rates.

Problems with Existing Edit Approaches

Patch format (OpenAI/Codex): Uses diff-style strings, which fail frequently with non-GPT models. Grok 4's failure rate reached 50.7%.

String replacement (Claude Code): Requires exact character matching including whitespace, generating frequent "String to replace not found" errors.
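The exact-match requirement can be illustrated with a minimal sketch. The function replace_exact below is hypothetical and is not Claude Code's actual implementation; it only assumes that the tool looks for a verbatim substring and fails when none is found.

```python
def replace_exact(source: str, old: str, new: str) -> str:
    """Hypothetical strict edit tool: `old` must appear verbatim in `source`."""
    if old not in source:
        raise ValueError("String to replace not found")
    # Replace only the first occurrence.
    return source.replace(old, new, 1)


# The file indents with a tab character.
source = "function App() {\n\treturn <div />;\n}\n"

# The model reproduces the same line with four spaces instead of the tab,
# so the match fails even though the intent is clear.
try:
    replace_exact(source, "    return <div />;", "    return <main />;")
except ValueError as err:
    print(err)  # prints "String to replace not found"
```

A single whitespace difference is enough to reject an otherwise correct edit, which is the failure mode described above.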

Neural merging (Cursor): Cursor fine-tuned a separate model solely to fix edit failures, which shows how severe the problem is.

The Hashline Solution

The author proposes tagging each line with content hashes. Models reference hash tags rather than reproducing text. This approach:

Prevents corruption if files change between reads
Eliminates whitespace reproduction requirements
Suggests that models struggle less with understanding a task than with expressing the edit
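The idea of tagging lines with content hashes can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: the "number:tag|text" layout, the 8-character SHA-256 prefix, and the function names line_tag, annotate, and apply_edit are all hypothetical.

```python
import hashlib


def line_tag(text: str) -> str:
    """Short content hash for one line (assumed: first 8 hex chars of SHA-256)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]


def annotate(source: str) -> str:
    """Render the file as the model would see it: 'number:tag|text' per line."""
    return "\n".join(
        f"{number}:{line_tag(line)}|{line}"
        for number, line in enumerate(source.splitlines(), start=1)
    )


def apply_edit(source: str, line_no: int, tag: str, new_text: str) -> str:
    """Replace one line, but only if its current hash still matches the tag."""
    lines = source.splitlines()  # note: a trailing newline is not preserved
    if not 1 <= line_no <= len(lines):
        raise ValueError(f"line {line_no} is out of range")
    if line_tag(lines[line_no - 1]) != tag:
        raise ValueError("stale tag: the line changed since it was read")
    lines[line_no - 1] = new_text
    return "\n".join(lines)


source = "def greet():\n\treturn 'hi'"
print(annotate(source))

# The model cites the line number and tag instead of retyping the old text.
tag = line_tag(source.splitlines()[1])
updated = apply_edit(source, 2, tag, "\treturn 'hello'")
```

In this sketch the model never has to reproduce the original whitespace, and an edit that carries an outdated tag is rejected rather than applied to changed content.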

Benchmark Results

Grok Code Fast improved from a 6.7% to a 68.3% success rate, roughly a tenfold gain, with only the edit approach changed. In the author's words, "the model isn't flaky at understanding the task. It's flaky at expressing itself."

Key Takeaway

Open-source harness development benefits all models, while vendor-specific optimization creates isolated silos that hinder ecosystem progress. The highest-leverage place to innovate right now is harness design, not model improvement.
