LocalLLaMA Revisits a Layer-Duplication Route to Better Open LLM Scores
Original: How I topped the Open LLM Leaderboard using 2x 4090 GPUs — no weights modified.
What the LocalLLaMA post is arguing
A widely upvoted LocalLLaMA post brought fresh attention to David Noel Ng's long technical write-up on a peculiar way to improve open LLM scores without any fine-tuning. The core claim is easy to state and hard to ignore: duplicate a specific seven-layer block in the middle of Qwen2-72B, leave every weight unchanged, and benchmark performance can improve. No gradient updates, no merged checkpoints, and no RLHF loop are involved. The technique is presented as an inference-time architectural rearrangement rather than a training recipe.
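To make the mechanics concrete, here is a minimal sketch of what block duplication can look like at inference time, assuming a Qwen2-style model loaded through Hugging Face transformers, where the decoder layers are exposed as model.model.layers. The post itself works through ExLlamaV2, so the checkpoint name and the duplicate_block helper below are illustrative assumptions rather than the author's implementation.

```python
# Minimal sketch of inference-time block duplication (an illustration of the
# idea, not the author's ExLlamaV2 setup). Assumes a Qwen2-style model from
# transformers whose decoder layers live in model.model.layers as a ModuleList.
import torch
from transformers import AutoModelForCausalLM

MODEL_NAME = "Qwen/Qwen2-72B-Instruct"  # illustrative checkpoint name

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

def duplicate_block(model, i: int, j: int):
    """Run decoder layers [i, j) twice by repeating them in the layer list.

    No tensors are copied or modified: both occurrences of the block refer to
    the same parameter objects, so the weights stay byte-for-byte identical.
    """
    layers = model.model.layers
    new_layers = list(layers[:j]) + list(layers[i:j]) + list(layers[j:])
    model.model.layers = torch.nn.ModuleList(new_layers)
    model.config.num_hidden_layers = len(new_layers)
    # Caveat: transformers indexes the KV cache by each layer's layer_idx, so
    # a naive duplication like this needs extra care before cached generation.
    return model

model = duplicate_block(model, 45, 52)  # the (45, 52) configuration from the post
```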
According to the article, the experiments were run on quantized models through ExLlamaV2 using 2x RTX 4090 GPUs. Ng describes scanning all valid (i, j) duplication pairs for an 80-layer model, which yields 3,240 candidate configurations. Instead of optimizing directly on the public leaderboard, he used proxy tasks built around hard math problems and EQ-Bench social reasoning questions. The reported best configuration is (45, 52), which repeats layers 45 through 51 and effectively expands the model from 72B to 78B parameters without changing the underlying weights.
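As a back-of-the-envelope check on those numbers, the sketch below enumerates the candidate blocks and stubs out the proxy scoring step. The indexing convention (a pair (i, j) repeats layers i through j-1) and the score_on_proxy_tasks name are assumptions made for illustration, not the author's code.

```python
# Enumerate candidate duplication blocks for an 80-layer stack, assuming the
# pair (i, j) means "repeat layers i through j-1".
NUM_LAYERS = 80

candidates = [(i, j) for i in range(NUM_LAYERS) for j in range(i + 1, NUM_LAYERS + 1)]
assert len(candidates) == 3240  # matches the count quoted in the write-up

def score_on_proxy_tasks(i: int, j: int) -> float:
    """Hypothetical stand-in for the post's proxy evaluation (hard math
    problems plus EQ-Bench-style questions) on the (i, j)-duplicated model."""
    raise NotImplementedError

# A scan would then keep the best-scoring pair; the write-up reports (45, 52),
# i.e. layers 45..51 get a second pass.
# best = max(candidates, key=lambda ij: score_on_proxy_tasks(*ij))
```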
Why the idea resonates
- It aims for better results through inference-time architecture changes rather than training.
- The write-up argues that single-layer repetition usually fails, while circuit-sized block duplication can help.
- Reported gains include +17.72% on MuSR and +8.16% on MATH, with improvement on five of six leaderboard benchmarks.
- The full method is framed as something a determined individual can probe on consumer GPUs.
The most interesting part is arguably the interpretation rather than the leaderboard delta. Ng argues that middle Transformer layers behave less like interchangeable depth and more like functional circuits. Under that view, repeating one layer does little because it duplicates a single step in a reasoning routine, while repeating the right block gives the model a second pass through a coherent internal subroutine. That is a testable claim with obvious overlap with mechanistic interpretability.
It is still important to keep the status of the evidence straight. This is a blog post, not a peer-reviewed paper, and the “functional circuit” framing remains the author's hypothesis. Even so, the post stands out because it turns a speculative idea into a concrete experimental procedure with explicit configurations, measurable deltas, and a hardware budget far below that of large training labs. That combination explains why LocalLLaMA picked it up so quickly.
Source: David Noel Ng's technical write-up. Community discussion: r/LocalLLaMA thread.