LocalLLaMA Revisits a Layer-Duplication Route to Better Open LLM Scores
Original: How I topped the Open LLM Leaderboard using 2x 4090 GPUs — no weights modified.
What the LocalLLaMA post is arguing
A widely upvoted LocalLLaMA post brought fresh attention to David Noel Ng's long technical write-up on a peculiar way to improve open LLM scores without any fine-tuning. The core claim is easy to state and hard to ignore: duplicate a specific seven-layer block in the middle of Qwen2-72B, leave every weight unchanged, and benchmark performance can improve. No gradient updates, no merged checkpoints, and no RLHF loop are involved. The technique is presented as an inference-time architectural rearrangement rather than a training recipe.
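To make the mechanics concrete, here is a minimal sketch of what block duplication can look like at inference time, assuming a Qwen2-style model loaded through Hugging Face transformers, where the decoder layers are exposed as model.model.layers. The post itself works through ExLlamaV2, so the checkpoint name and the duplicate_block helper below are illustrative assumptions rather than the author's implementation.

```python
# Minimal sketch of inference-time block duplication (an illustration of the
# idea, not the author's ExLlamaV2 setup). Assumes a Qwen2-style model from
# transformers whose decoder layers live in model.model.layers as a ModuleList.
import torch
from transformers import AutoModelForCausalLM

MODEL_NAME = "Qwen/Qwen2-72B-Instruct"  # illustrative checkpoint name

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

def duplicate_block(model, i: int, j: int):
    """Run decoder layers [i, j) twice by repeating them in the layer list.

    No tensors are copied or modified: both occurrences of the block refer to
    the same parameter objects, so the weights stay byte-for-byte identical.
    """
    layers = model.model.layers
    new_layers = list(layers[:j]) + list(layers[i:j]) + list(layers[j:])
    model.model.layers = torch.nn.ModuleList(new_layers)
    model.config.num_hidden_layers = len(new_layers)
    # Caveat: transformers indexes the KV cache by each layer's layer_idx, so
    # a naive duplication like this needs extra care before cached generation.
    return model

model = duplicate_block(model, 45, 52)  # the (45, 52) configuration from the post
```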
According to the article, the experiments were run on quantized models through ExLlamaV2 using 2x RTX 4090 GPUs. Ng describes scanning all valid (i, j) duplication pairs for an 80-layer model, which yields 3,240 candidate configurations. Instead of optimizing directly on the public leaderboard, he used proxy tasks built around hard math problems and EQ-Bench social reasoning questions. The reported best configuration is (45, 52), which repeats layers 45 through 51 and effectively expands the model from 72B to 78B parameters without changing the underlying weights.
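As a back-of-the-envelope check on those numbers, the sketch below enumerates the candidate blocks and stubs out the proxy scoring step. The indexing convention (a pair (i, j) repeats layers i through j-1) and the score_on_proxy_tasks name are assumptions made for illustration, not the author's code.

```python
# Enumerate candidate duplication blocks for an 80-layer stack, assuming the
# pair (i, j) means "repeat layers i through j-1".
NUM_LAYERS = 80

candidates = [(i, j) for i in range(NUM_LAYERS) for j in range(i + 1, NUM_LAYERS + 1)]
assert len(candidates) == 3240  # matches the count quoted in the write-up

def score_on_proxy_tasks(i: int, j: int) -> float:
    """Hypothetical stand-in for the post's proxy evaluation (hard math
    problems plus EQ-Bench-style questions) on the (i, j)-duplicated model."""
    raise NotImplementedError

# A scan would then keep the best-scoring pair; the write-up reports (45, 52),
# i.e. layers 45..51 get a second pass.
# best = max(candidates, key=lambda ij: score_on_proxy_tasks(*ij))
```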
Why the idea resonates
- It aims for better results through inference-time architecture changes rather than training.
- The write-up argues that single-layer repetition usually fails, while circuit-sized block duplication can help.
- Reported gains include +17.72% on MuSR and +8.16% on MATH, with improvement on five of six leaderboard benchmarks.
- The full method is framed as something a determined individual can probe on consumer GPUs.
The most interesting part is arguably the interpretation rather than the leaderboard delta. Ng argues that middle Transformer layers behave less like interchangeable depth and more like functional circuits. Under that view, repeating one layer does little because it duplicates a single step in a reasoning routine, while repeating the right block gives the model a second pass through a coherent internal subroutine. That is a testable claim with obvious overlap with mechanistic interpretability.
It is still important to keep the status of the evidence straight. This is a blog post, not a peer-reviewed paper, and the “functional circuit” framing remains the author's hypothesis. Even so, the post stands out because it turns a speculative idea into a concrete experimental procedure with explicit configurations, measurable deltas, and a hardware budget far below that of large training labs. That combination explains why LocalLLaMA picked it up so quickly.
Source: David Noel Ng's technical write-up. Community discussion: r/LocalLLaMA thread.