HN upvoted the joke because it exposed a real discomfort: one vivid SVG prompt can make a small local model look better than a flagship model, but nobody agrees what that proves.
LocalLLaMA upvoted this post because it turns a messy GGUF choice into a measurable tradeoff. The post compares community Qwen3.5-9B quants against a BF16 baseline using mean KLD, then the comments push for better visual encoding, Gemma 4 runs, Thireus quants, and long-context testing.
HN latched onto the open-weight angle: a 35B MoE model with only 3B active parameters is interesting if it can actually carry coding-agent work. Qwen says Qwen3.6-35B-A3B improves sharply over Qwen3.5-35B-A3B, while commenters immediately moved to GGUF builds, Mac memory limits, and whether open-model-only benchmark tables are enough context.
LocalLLaMA reacted because the post attacks a very real pain point for running large MoE models on limited VRAM. The author tested a llama.cpp fork that tracks recently routed experts and keeps the hot ones in VRAM for Qwen3.5-122B-A10B, reporting 26.8% faster token generation than layer-based offload at a similar 22GB VRAM budget.
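The hot-expert idea can be sketched as a small LRU cache keyed by expert ID: recently routed experts stay in VRAM, cold ones get evicted to system RAM. This is a toy illustration of the policy, not the fork's actual code; the class and method names here are invented for the example.

```python
from collections import OrderedDict

class HotExpertCache:
    """Keep the most recently routed MoE experts resident in VRAM (LRU policy)."""

    def __init__(self, vram_slots: int):
        self.vram_slots = vram_slots   # how many experts fit in the VRAM budget
        self.resident = OrderedDict()  # expert_id -> True, ordered by recency

    def route(self, expert_id: int) -> bool:
        """Record that this expert was just routed to; return True on a VRAM hit."""
        hit = expert_id in self.resident
        if hit:
            self.resident.move_to_end(expert_id)   # refresh recency
        else:
            if len(self.resident) >= self.vram_slots:
                self.resident.popitem(last=False)  # evict the coldest expert to RAM
            self.resident[expert_id] = True        # "upload" the hot expert to VRAM
        return hit
```

Since MoE routers tend to reuse a small working set of experts across nearby tokens, a cache like this hits far more often than a static layer split, which is the intuition behind the reported speedup.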
LocalLLaMA reacted because the joke-like idea of an LLM tuning its own runtime came with concrete benchmark numbers. The author says llm-server v2 adds --ai-tune, which feeds llama-server's help output into a tuning loop that searches flag combinations and caches the fastest config; on their rig, Qwen3.5-27B Q4_K_M moved from 18.5 tok/s to 40.05 tok/s.
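Stripped of the LLM-in-the-loop part, the tuning loop reduces to a search over flag combinations with the winner cached to disk. A minimal grid-search sketch of that idea follows; the flag names, search space, and `benchmark` callback (a stand-in for launching llama-server and measuring tok/s) are assumptions for illustration, not llm-server's actual implementation.

```python
import itertools
import json
import pathlib

# Hypothetical search space over llama-server flags.
# None means "flag absent"; "" means "boolean flag present with no value".
SEARCH_SPACE = {
    "--n-gpu-layers": [20, 30, 40],
    "--threads": [8, 16],
    "--flash-attn": [None, ""],
}

def tune(benchmark, cache_path="best_config.json"):
    """Try every flag combination, keep the fastest, and cache it to disk."""
    best_cfg, best_tps = None, 0.0
    keys = list(SEARCH_SPACE)
    for combo in itertools.product(*SEARCH_SPACE.values()):
        cfg = {k: v for k, v in zip(keys, combo) if v is not None}
        tps = benchmark(cfg)  # tokens/sec measured for this config
        if tps > best_tps:
            best_cfg, best_tps = cfg, tps
    pathlib.Path(cache_path).write_text(json.dumps(best_cfg))
    return best_cfg, best_tps
```

The caching step matters: a full sweep is expensive (each point means restarting the server and running a benchmark), so the winning config is persisted and reused on later launches.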
LocalLLaMA paid attention to this post because it looked like real engineering cleanup instead of another inflated speed screenshot. On April 13, 2026, the author reported that Qwen3.5-9B at 2048 tokens improved from a stock-MLX baseline of 30.96 tok/s to 127.07 tok/s, with 89.36% acceptance and the full runtime released as open source.
r/LocalLLaMA liked this comparison because it replaces reputation and anecdote with a more explicit distribution-based yardstick. The post ranks community Qwen3.5-9B GGUF quants by mean KLD versus a BF16 baseline, with Q8_0 variants leading on fidelity and several IQ4/Q5 options standing out on size-to-drift trade-offs.
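The metric behind the ranking is straightforward: for each token position, compare the quantized model's output distribution against the BF16 reference with KL divergence, then average over the sequence. A sketch of that computation, assuming logits for both models over the same token positions (this mirrors the kind of number llama.cpp-style KLD tooling reports, but is not the exact implementation):

```python
import numpy as np

def mean_kld(baseline_logits: np.ndarray, quant_logits: np.ndarray) -> float:
    """Mean per-token KL(baseline || quant) over a sequence.

    Both arrays have shape (num_tokens, vocab_size). Returns 0.0 when the
    quantized model's distributions exactly match the BF16 reference.
    """
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)          # numerical stability
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    log_p = log_softmax(baseline_logits)  # BF16 reference distribution
    log_q = log_softmax(quant_logits)     # quantized model distribution
    kl_per_token = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl_per_token.mean())
```

Unlike perplexity, which only scores the probability of the correct token, mean KLD penalizes any drift across the whole distribution, which is why it separates quants that look identical on perplexity alone.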
Hacker News Zeroes In on I-DLM as a Diffusion LLM That Might Keep AR Quality Without Giving Up Speed
Hacker News readers are treating this less like another diffusion-text curiosity and more like a possible faster serving path that still stays close to autoregressive quality. The project page claims I-DLM-8B reaches 69.6 on AIME-24, 45.7 on LiveCodeBench-v6, and 2.9-4.1x higher throughput at high concurrency.
A detailed r/LocalLLaMA benchmark reports single- and dual-GPU numbers for Qwen3.5-27B int4 on Intel Arc Pro B70 32GB using Intel’s vLLM fork. The setup is still finicky, but the measurements outline a practical path for local serving on Intel hardware.
A LocalLLaMA implementation report says a native MLX DFlash runtime can speed up Qwen inference on Apple Silicon by more than 2x in several settings. The notable part is not only the throughput gain, but the claim that outputs remain bit-for-bit identical to the greedy baseline.
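The bit-for-bit claim is plausible under greedy decoding because draft-and-verify acceleration only keeps a drafted token if it matches the target model's own argmax; the first mismatch is replaced by the target's pick. A minimal sketch of that accept-or-correct rule (not the MLX runtime's code; the function name and inputs are illustrative):

```python
def greedy_verify(draft_tokens, target_argmax):
    """Accept draft tokens while they match the target model's greedy choice.

    draft_tokens:  tokens proposed by the fast draft step
    target_argmax: the target model's argmax at each of those positions
    Every accepted token equals what the target would have emitted on its own,
    so greedy outputs are unchanged; only wall-clock time improves.
    """
    accepted = []
    for drafted, target in zip(draft_tokens, target_argmax):
        if drafted != target:
            accepted.append(target)  # correct the first mismatch, stop here
            break
        accepted.append(drafted)
    return accepted
```

The acceptance rate (how often `drafted == target`) is what determines the speedup: verification batches several positions into one target-model pass, so high acceptance converts directly into fewer sequential decode steps.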
A high-engagement LocalLLaMA post shared reproducible benchmark data showing Qwen3.5-122B NVFP4 decoding around 198 tok/s on a dual RTX PRO 6000 Blackwell system using SGLang b12x+NEXTN and a PCIe switch topology.
A recent r/LocalLLaMA post presents Qwen3.5 27B as an unusually strong local inference sweet spot. The author reports about 19.7 tokens per second on an RTX A6000 48GB with llama.cpp and a 32K context, while the comments turn into a detailed debate about dense-versus-MoE VRAM economics.
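The dense-versus-MoE VRAM debate in the comments largely reduces to back-of-envelope arithmetic: a dense model must hold every weight in VRAM to run fast, while an MoE model computes like its active-parameter count but still has to store its total-parameter count somewhere. A rough sketch, where the bits-per-weight figure is an approximation for mid-size quants like Q4_K_M and KV cache and activations are deliberately ignored:

```python
def weight_vram_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Rough GB of memory needed just for model weights."""
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9

# Dense 27B at ~4.5 bits/weight: every weight is touched per token,
# so all of it wants to sit in VRAM.
dense_gb = weight_vram_gb(27, 4.5)   # about 15.2 GB

# A 122B-A10B MoE computes roughly like a 10B dense model per token,
# but all 122B weights still need to live in VRAM or offloaded RAM.
moe_gb = weight_vram_gb(122, 4.5)    # about 68.6 GB
```

That asymmetry is the crux of the thread: the dense 27B fits comfortably on a single 48GB card with room for a 32K context, while the larger MoE trades per-token compute for a much bigger storage footprint.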