LocalLLaMA dissects RYS II and repeated-layer gains in Qwen3.5-27B
Original: RYS II - Repeated layers with Qwen3.5 27B and some hints at a 'Universal Language'
A March 23, 2026 post in r/LocalLLaMA, with 376 upvotes and 61 comments, turned David Noel Ng’s new RYS II write-up into one of the community’s busiest architecture threads of the day. The post revisits the idea that repeating carefully chosen middle transformer layers can improve capability without changing model weights, this time on Qwen3.5-27B.
The blog has two hooks. The first is scientific: hidden-state comparisons across English and Chinese inputs suggest that the middle layers align around content more than surface language, supporting a “universal language” or format-agnostic reasoning space. The second is practical: after a full scan, 3,024 beam-search candidates, and a surrogate model that ranked 2 million configurations, the clean winners were still contiguous mid-stack repeats. On the final shared validation sets, repeating layer 33 alone gave most of the EQ gain at only 1.5625% overhead, while larger blocks such as 31-33, 30-34, and 26-33 pushed performance further with diminishing returns.
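The cross-lingual comparison described above can be pictured as measuring how similar hidden states are for the same content expressed in two languages. The sketch below is a toy illustration only: the vectors are made-up stand-ins, and the post's actual extraction and comparison method is not shown; it simply demonstrates the cosine-similarity measurement that such a "universal language" claim typically rests on (mid-layer states aligning across languages more than early-layer states).

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy hidden states for an English/Chinese sentence pair (illustrative values,
# not real activations): the early-layer pair diverges, the mid-layer pair aligns.
early_en, early_zh = [1.0, 0.1, 0.0], [0.0, 0.2, 1.0]
mid_en, mid_zh = [0.9, 0.5, 0.1], [0.8, 0.6, 0.2]

early_sim = cosine(early_en, early_zh)
mid_sim = cosine(mid_en, mid_zh)
print(f"early: {early_sim:.3f}, mid: {mid_sim:.3f}")
```

Under this kind of probe, a markedly higher mid-layer similarity across languages is what supports the format-agnostic reasoning-space reading.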
- Ng published four FP8 model variants on HuggingFace: S (+1 layer), M (+3), L (+5), and XL (+8).
- The write-up says the Pareto frontier stayed with contiguous blocks even after testing sparse repeats, multi-block beam search, and surrogate-ranked candidates.
- A future ExLlama v3 format could store duplicated layers as pointers to shared weights, so the remaining overhead would be mainly extra compute and KV cache, not parameter VRAM.
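The mechanics behind these variants can be sketched in a few lines. This is a minimal illustration, not the author's implementation: layer objects are stand-ins for transformer blocks, and the 64-layer count is an assumption implied by the 1.5625% (= 1/64) overhead figure for the single-layer "S" repeat. Because the repeated entries reference the same objects, the weights are shared, which is also the idea behind the pointer-based format mentioned in the last bullet.

```python
def repeat_block(layers, start, end):
    """Return a new forward order that repeats layers[start:end+1] in place,
    reusing the same layer objects (weight sharing, no new parameters)."""
    block = layers[start:end + 1]
    return layers[:end + 1] + block + layers[end + 1:]

NUM_LAYERS = 64  # assumed depth, consistent with 1/64 = 1.5625% overhead
base = [object() for _ in range(NUM_LAYERS)]  # stand-ins for transformer blocks

# "S" variant: repeat layer 33 alone (+1 layer in the forward pass)
s_variant = repeat_block(base, 33, 33)
assert s_variant[33] is s_variant[34]  # same weights, run twice

# Compute overhead: one extra forward pass through an existing layer
overhead = (len(s_variant) - NUM_LAYERS) / NUM_LAYERS
print(f"{overhead:.4%}")  # → 1.5625%

# Parameter memory is unchanged: the set of unique layer objects is the same
assert len({id(layer) for layer in s_variant}) == NUM_LAYERS
```

The same helper covers the larger contiguous blocks from the post, e.g. `repeat_block(base, 31, 33)` for the 31-33 repeat; the KV cache still grows with the longer forward order, which is why the pointer format only removes the weight-memory cost.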
LocalLLaMA cared because the work speaks directly to open-weight users. It suggests a path to measurable gains that does not start with expensive full fine-tuning and does not depend on closed APIs. At the same time, the post is careful not to oversell: composition helps, but gains are sublinear, and the efficient frontier matters more than the biggest raw score.
Primary source: RYS II blog post. Community discussion: LocalLLaMA.