LocalLLaMA Revisits a Layer-Duplication Route to Better Open LLM Scores
Original: How I topped the Open LLM Leaderboard using 2x 4090 GPUs — no weights modified.
What the LocalLLaMA post is arguing
A widely upvoted LocalLLaMA post brought fresh attention to David Noel Ng's long technical write-up on a peculiar way to improve open LLM scores without any fine-tuning. The core claim is easy to state and hard to ignore: duplicate a specific seven-layer block in the middle of Qwen2-72B, leave every weight unchanged, and benchmark performance can improve. No gradient updates, no merged checkpoints, and no RLHF loop are involved. The technique is presented as an inference-time architectural rearrangement rather than a training recipe.
According to the article, the experiments ran on quantized models through ExLlamaV2 on 2x RTX 4090 GPUs. Ng describes scanning all valid (i, j) duplication pairs for an 80-layer model, which yields 3,240 candidate configurations. Instead of optimizing directly against the public leaderboard, he used proxy tasks built around hard math questions and EQ-Bench social-reasoning items. The reported best configuration is (45, 52), which repeats layers 45 through 51 and effectively expands the model from 72B to 78B parameters at inference time without changing the underlying weights.
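The search-space arithmetic is easy to check with a few lines of code. A minimal sketch, assuming a config (i, j) means "replay the block of layers i through j-1 immediately after layer j-1" (consistent with the post's (45, 52) example repeating layers 45 through 51):

```python
NUM_LAYERS = 80  # Qwen2-72B decoder depth

def candidate_configs(num_layers):
    """All block-duplication configs (i, j) with 0 <= i < j <= num_layers."""
    return [(i, j) for i in range(num_layers)
                   for j in range(i + 1, num_layers + 1)]

def layer_order(num_layers, i, j):
    """Inference-time call sequence: one extra pass over layers [i, j)."""
    base = list(range(num_layers))
    return base[:j] + base[i:j] + base[j:]

configs = candidate_configs(NUM_LAYERS)
print(len(configs))                      # 3240 candidate configurations
order = layer_order(NUM_LAYERS, 45, 52)
print(len(order))                        # 87 calls: 80 layers + 7 replayed
```

Enumerating pairs with 0 <= i < j <= 80 gives 81 choose 2 = 3,240, matching the count in the write-up, and the (45, 52) ordering runs 87 layer calls over the original 80 layers' worth of weights.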
Why the idea resonates
- It aims for better results through inference-time architecture changes rather than training.
- The write-up argues that single-layer repetition usually fails, while circuit-sized block duplication can help.
- Reported gains include +17.72% on MuSR and +8.16% on MATH, with improvement on five of six leaderboard benchmarks.
- The full method is framed as something a determined individual can probe on consumer GPUs.
The most interesting part is arguably the interpretation rather than the leaderboard delta. Ng argues that middle Transformer layers behave less like interchangeable depth and more like functional circuits. Under that view, repeating one layer does little because it duplicates a single step in a reasoning routine, while repeating the right block gives the model a second pass through a coherent internal subroutine. That is a testable claim with obvious overlap with mechanistic interpretability.
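The "second pass through a subroutine" idea is cheap to express in code. A minimal sketch, assuming a HuggingFace-style PyTorch model whose decoder layers sit in `model.model.layers` as an `nn.ModuleList` (the attribute path and the toy model are assumptions for illustration; the post's ExLlamaV2 setup would need the equivalent change in its own layer loop):

```python
import torch.nn as nn
from types import SimpleNamespace

def duplicate_block(model, i, j):
    """Rewire the forward pass so decoder layers [i, j) run a second time.
    The list entries are the *same* module objects, so no weights are
    copied or modified -- only the call sequence changes."""
    layers = list(model.model.layers)
    model.model.layers = nn.ModuleList(layers[:j] + layers[i:j] + layers[j:])
    return model

# Toy stand-in for a loaded model (assumption: real models expose .model.layers).
toy = SimpleNamespace(model=SimpleNamespace(
    layers=nn.ModuleList(nn.Identity() for _ in range(10))))
toy = duplicate_block(toy, 4, 7)
print(len(toy.model.layers))                       # 13 calls per forward
print(toy.model.layers[7] is toy.model.layers[4])  # True: shared weights
```

The identity check is the point: the duplicated entries reference the same parameters, so memory grows only by activations and KV cache, not by a second copy of the block's weights. Implementations that store a per-layer index (e.g. for KV-cache bookkeeping) would also need those indices patched for the replayed entries.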
It is still important to keep the status of the evidence straight. This is a blog post, not a peer-reviewed paper, and the “functional circuit” framing remains the author's hypothesis. Even so, the post stands out because it turns a speculative idea into a concrete experimental procedure with explicit configurations, measurable deltas, and a hardware budget far below that of large training labs. That combination explains why LocalLLaMA picked it up so quickly.
Source: David Noel Ng's technical write-up. Community discussion: r/LocalLLaMA thread.
Related Articles
OpenAI announced GPT-5.4 on March 5, 2026, adding a new general-purpose model and GPT-5.4 Pro with stronger computer use, tool search efficiency, and benchmark improvements over GPT-5.2.
NVIDIA AI Developer introduced Nemotron 3 Super on March 11, 2026 as an open 120B-parameter hybrid MoE model with 12B active parameters and a native 1M-token context window. NVIDIA says the model targets agentic workloads with up to 5x higher throughput than the previous Nemotron Super model.