r/MachineLearning Elevates a 2x 4090 LLM Layer-Duplication Experiment
Original: How I topped the Open LLM Leaderboard using 2x 4090 GPUs - Research notes in Blog form
Why Reddit pushed this upward
The r/MachineLearning post sends readers to David Noel Ng's detailed blog entry on what he calls LLM Neuroanatomy. The headline claim is unusual enough to stand out immediately: he says he reached the top of the Open LLM Leaderboard by duplicating a specific seven-layer middle block inside Qwen2-72B, without changing a single weight and without running gradient descent. That makes the story less about ordinary fine-tuning and more about structural intervention inside an already-trained model.
The most interesting part is the claimed granularity of the effect. According to the post, duplicating one layer did nothing, too few layers did nothing, and too many layers made performance worse. Only a circuit-sized block of roughly seven layers seemed to help. Ng interprets that as evidence that pretraining may carve out discrete functional circuits within the transformer stack. That is not a settled result, and the post does not present a peer-reviewed paper. But it is exactly the sort of strong, testable hypothesis that gets researchers and practitioners arguing in a useful way.
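The intervention described above is simple to state in code: repeat a contiguous block of decoder layers, sharing their weights, with no retraining. The minimal sketch below uses a toy residual stack rather than Qwen2-72B so it runs anywhere; the layer count, block indices, and `duplicate_block` helper are illustrative assumptions, not the exact configuration Ng reported.

```python
import torch
import torch.nn as nn

class ToyStack(nn.Module):
    """Toy stand-in for a transformer decoder stack: each 'layer' is a
    tiny MLP with a residual connection, so the sketch runs anywhere."""
    def __init__(self, n_layers=32, dim=16):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU())
            for _ in range(n_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = x + layer(x)  # residual, as in a transformer block
        return x

def duplicate_block(model, start, end):
    """Repeat layers [start, end) once, in place in the stack.
    The duplicated entries are the same module objects (shared
    weights): no copies, no gradient descent."""
    expanded = list(model.layers[:end]) + list(model.layers[start:])
    model.layers = nn.ModuleList(expanded)
    return model

model = ToyStack(n_layers=32)
duplicate_block(model, start=12, end=19)  # a 7-layer middle block

assert len(model.layers) == 32 + 7
# Position 19 now holds the same object as position 12:
assert model.layers[12] is model.layers[19]
out = model(torch.randn(2, 16))  # the deepened stack still runs
```

Because the repeated layers alias the originals, the edited model adds zero new parameters; only the depth of the forward pass changes, which is what makes the claimed benchmark effect a structural rather than a training result.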
Why practitioners are interested
Reddit also responded to the compute story. The work is framed as something that started on two RTX 4090 GPUs rather than a hyperscale cluster. That matters because it suggests architecture-level experimentation is not reserved for large labs. If the effect replicates across newer model families, it could influence how people think about depth scaling, model editing, and benchmark-oriented open-model research.
- The intervention is layer-block duplication, not weight merging or finetuning.
- The proposed lesson is that useful capability may live in reusable middle-layer circuits.
- The biggest open issue is replication across models, tasks, and evaluation setups.
That is why the thread landed well on r/MachineLearning. It combines an audacious empirical claim with a mechanism people can actually probe, challenge, and reproduce.
Related Articles
A fast-rising LocalLLaMA post resurfaced David Noel Ng's write-up on duplicating a seven-layer block inside Qwen2-72B, a no-training architecture tweak that reportedly lifted multiple Open LLM Leaderboard benchmarks.
A high-scoring LocalLLaMA post says Qwen 3.5 9B on a 16GB M1 Pro handled memory recall and basic tool calling well enough for real agent work, even though creative reasoning still trailed frontier models.
A LocalLLaMA thread reported a large prompt-processing speedup on Qwen3.5-27B by lowering llama.cpp `--ubatch-size` to 64 on an RX 9070 XT. The interesting part is not a universal magic number, but the reminder that prompt ingestion and token generation can respond very differently to `n_ubatch` tuning.