HN Examines llm-circuit-finder: Layer Duplication as Capability Steering, Not a Free LLM Upgrade
Original: "Show HN: Duplicate 3 layers in a 24B LLM, logical deduction .22→.76. No training"
What the source material claims
llm-circuit-finder argues that some transformer capabilities live in small contiguous reasoning circuits. Instead of changing weights or training adapters, the project duplicates selected layers in the forward path so hidden states traverse the same block twice. The Show HN post says the author replicated David Ng's RYS method on consumer AMD GPUs, specifically RX 7900 XT + RX 6950 XT, and found strong effects in Devstral-24B and Qwen2.5-Coder-32B.
- Devstral-24B with layers 12-14 duplicated once: BBH Logical Deduction 0.22 → 0.76, GSM8K strict 0.48 → 0.64, MBPP 0.72 → 0.78, per the HN summary.
- Qwen2.5-Coder-32B with layers 7-9 duplicated once: reasoning probe 76% → 94%.
- The repo ships `sweep.py`, `layer_path.py`, `gguf_surgery.py`, `compare_eval.py`, and `visualize.py` for layer search, GGUF editing, evaluation comparison, and visualization.
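The core routing idea is simple enough to sketch. The snippet below is a hypothetical illustration, not the repo's `gguf_surgery.py` (which rewrites GGUF files on disk): it treats a model's decoder blocks as an ordered sequence and re-inserts a chosen span immediately after itself, so hidden states traverse the same block twice without any new weights.

```python
def duplicate_block(layers, start, end):
    """Return a layer sequence where blocks [start, end] (inclusive) run twice.

    Hypothetical helper for illustration only. The duplicated entries are the
    *same* objects as the originals, so no weights are changed or added --
    the forward path simply revisits the block a second time.
    """
    layers = list(layers)
    return layers[: end + 1] + layers[start : end + 1] + layers[end + 1 :]

# A nominal 40-layer model with layers 12-14 duplicated, as in the
# Devstral-24B result quoted above, yields a 43-layer forward path.
routed = duplicate_block(list(range(40)), 12, 14)
```

Because the duplicated entries alias the originals, this is routing control rather than fine-tuning; the on-disk GGUF surgery, by contrast, materializes the copies, which is where the memory cost discussed below comes from.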
On its face, this is a compelling idea. The README says different duplication patterns can create different cognitive modes from the same weights, and that circuit boundaries are sharp enough that moving the duplicated block by one layer can erase or invert the effect. That frames the project less as ordinary fine-tuning and more as explicit routing control over a fixed model.
Where the skepticism starts
The important nuance is that the README is more careful than the HN headline. What the project appears to show is capability steering, not a universal across-the-board improvement. Some reasoning-heavy tasks improve, but other capabilities can weaken. That distinction matters because the HN submission says "Nothing degraded," while the repo's broader evidence does not support reading the result as a free win.
The clearest example is the full benchmark table for Devstral surgery. The highlighted HN metrics emphasize reasoning gains, but the README's broader comparison shows weaker IFEval/MBPP and a lower average across all listed metrics, moving from 0.7610 to 0.7488. In other words, the project may make one capability profile better while making another worse. For practitioners, that is a meaningful tradeoff, not a detail to bury below the headline.
There is also a practical cost. The conceptual pitch is same weights, no training, different routing, but the current implementation physically duplicates layers inside GGUF files. The README says 3 extra layers on a 24B model cost about 1.5 GiB extra VRAM and about 7.5% slower inference. So even if the mechanism avoids weight updates, it is not operationally free in memory or latency.
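The README's ~1.5 GiB figure is consistent with a quick back-of-envelope estimate. All inputs below are rough assumptions rather than numbers from the repo: the layer count, the share of parameters in decoder blocks, and the bytes-per-parameter of the GGUF quantization all vary by model and quant level.

```python
def extra_vram_gib(total_params, n_layers, dup_layers,
                   bytes_per_param, non_embedding_frac=0.9):
    """Rough extra VRAM from physically duplicating decoder layers in a GGUF.

    Back-of-envelope only: assumes decoder blocks hold `non_embedding_frac`
    of all parameters, spread evenly across `n_layers`.
    """
    per_layer_params = total_params * non_embedding_frac / n_layers
    return dup_layers * per_layer_params * bytes_per_param / 2**30

# A 24B model with ~40 layers at ~1 byte/param (roughly Q8) puts 3 extra
# layers in the neighborhood of the README's ~1.5 GiB figure.
estimate = extra_vram_gib(24e9, 40, 3, 1.0)
```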
Why the HN thread mattered
This was a community-sourced story in the best sense: the GitHub repo provided the raw claim, and the Hacker News thread pressure-tested it. The post drew 257 points and 82 comments, which turned the discussion into more than a link share. Commenters challenged the novelty of layer duplication, pointed to prior art in layer replay, and asked what was genuinely new beyond David Ng's earlier work. The author's answer was that the new contribution is a sweep-and-validation toolkit plus benchmark evidence that exact 3-layer boundaries can matter for specific models.
That exchange is what makes the story useful for practitioners. If the real takeaway is not all models get better, but certain routes steer certain behaviors, then evaluation discipline becomes the core issue. Teams would need to ask whether the same effect holds across seeds, prompts, quantizations, runtimes, and downstream tuning, and whether the capability gained is worth the average performance lost elsewhere.
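That evaluation discipline can be made concrete with a small comparison harness in the spirit of the repo's `compare_eval.py` (the actual script's interface is not documented here, so this is an independent sketch). It reports per-benchmark deltas, the change in the unweighted average, and any regressions, so a reasoning gain can be weighed against losses elsewhere. The scores in the usage example are illustrative placeholders, not the repo's full table.

```python
def summarize_surgery(base, patched):
    """Compare per-benchmark scores before and after layer surgery.

    `base` and `patched` map benchmark names to scores over the same keys.
    Returns (per-task deltas, change in unweighted average, regressed tasks).
    """
    deltas = {k: round(patched[k] - base[k], 4) for k in base}
    avg_delta = round(sum(patched.values()) / len(patched)
                      - sum(base.values()) / len(base), 4)
    regressions = [k for k, d in deltas.items() if d < 0]
    return deltas, avg_delta, regressions

# Illustrative scores only (the ifeval regression here is hypothetical):
before = {"bbh_logical": 0.22, "gsm8k": 0.48, "mbpp": 0.72, "ifeval": 0.80}
after = {"bbh_logical": 0.76, "gsm8k": 0.64, "mbpp": 0.78, "ifeval": 0.70}
deltas, avg_delta, regressions = summarize_surgery(before, after)
```

A report like this surfaces exactly the tradeoff the README's full table shows for Devstral: headline reasoning metrics can rise while the overall average falls.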
The project is therefore best read as an interesting model-surgery experiment with reproducible scripts and concrete deltas, not as proof that duplicated layers universally upgrade LLMs. The evidence so far points to capability steering via layer routing, with measurable tradeoffs in benchmark coverage and runtime cost.
Source: the original repository is https://github.com/alainnothere/llm-circuit-finder and the Hacker News discussion is https://news.ycombinator.com/item?id=47431671.