A Hacker News discussion is resurfacing a Future Shock explainer that makes LLM memory costs concrete in GPU bytes instead of abstract architecture jargon. The piece traces how GPT-2, Llama 3, DeepSeek V3, Gemma 3, and Mamba-style models handle context retention differently.
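For scale, the kind of GPU-byte arithmetic the explainer favors reduces to a one-liner. The sketch below sizes a per-sequence KV cache from public config numbers; the layer/head figures are approximate illustrations, not taken from the piece:

```python
# Back-of-envelope KV-cache sizing in raw bytes. Configs below are
# approximate public numbers used for illustration, not from the piece.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Per-sequence K+V cache size: 2 tensors x layers x heads x dims."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

configs = {
    "GPT-2 (124M)":     dict(n_layers=12, n_kv_heads=12, head_dim=64),
    "Llama 3 8B (GQA)": dict(n_layers=32, n_kv_heads=8,  head_dim=128),
}

for name, cfg in configs.items():
    gib = kv_cache_bytes(seq_len=8192, **cfg) / 2**30
    print(f"{name}: {gib:.2f} GiB of KV cache at 8k context (fp16)")
```

Grouped-query attention shows up directly in the arithmetic: Llama 3 8B's 8 KV heads (versus 32 query heads) cut the cache to a quarter of what full multi-head attention would need, which is exactly the kind of difference the explainer makes concrete.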
Together Research announced Aurora on March 31, 2026: an open-source framework for adaptive speculative decoding that learns from live inference traces and updates the speculator asynchronously without interrupting serving. Together's blog and paper reframe the problem as asynchronous RL and report a 1.25x additional speedup over a strong static speculator as traffic shifts.
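The asynchronous framing is the interesting part: training never sits on the serving critical path. Below is a minimal sketch of that pattern, assuming a trace queue fed by the serving loop and a trainer that swaps speculator weights atomically; all names and interfaces are hypothetical, not Aurora's API:

```python
# Minimal sketch of the asynchronous pattern described: serving emits
# draft/acceptance traces, a trainer consumes them off the critical path
# and publishes new speculator weights. All names are hypothetical.
import queue

trace_q: "queue.Queue[dict]" = queue.Queue(maxsize=10_000)

def serving_hook(prompt, draft_tokens, accepted_mask):
    """Called per request; never blocks serving on a full queue."""
    try:
        trace_q.put_nowait({"prompt": prompt, "draft": draft_tokens,
                            "accepted": accepted_mask})
    except queue.Full:
        pass  # drop traces rather than stall inference

def trainer_loop(train_step, publish, batch_size=256):
    """Runs in its own thread or process, never on the serving path."""
    batch = []
    while True:
        batch.append(trace_q.get())
        if len(batch) >= batch_size:
            weights = train_step(batch)  # e.g. reward = draft acceptance rate
            publish(weights)             # atomic swap; serving never pauses
            batch.clear()
```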
A LocalLLaMA thread spotlighted ggerganov's attn-rot work for llama.cpp, a simple rotation-based approach to improve KV cache quantization without introducing new formats. The appeal is that quality appears to improve sharply at low precision while throughput stays in roughly the same band.
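The usual trick in rotation-based quantization work is to apply an orthogonal rotation so outlier channels get spread out before symmetric low-bit rounding. Here is a NumPy sketch of that general idea, offered as context; it is not ggerganov's actual attn-rot code:

```python
# Sketch of rotation-before-quantization: an orthogonal rotation spreads
# outlier channels so symmetric low-bit rounding loses less. Illustrative
# only; not the attn-rot implementation.
import numpy as np

def random_orthogonal(d, seed=0):
    q, _ = np.linalg.qr(np.random.default_rng(seed).standard_normal((d, d)))
    return q

def quantize_int4(x):
    scale = np.abs(x).max() / 7.0 + 1e-12
    return np.clip(np.round(x / scale), -8, 7), scale

d = 128
R = random_orthogonal(d)                        # fixed, shared by write and read
k = np.random.default_rng(1).standard_normal(d)
k[3] = 25.0                                     # one outlier channel

q, s = quantize_int4(k)                         # quantize directly
qr, sr = quantize_int4(k @ R)                   # rotate, then quantize

print("plain int4 error:  ", np.abs(k - q * s).mean())
print("rotated int4 error:", np.abs(k - (qr * sr) @ R.T).mean())  # undo rotation
```

Because the rotation is orthogonal, rotating query vectors the same way preserves attention dot products, which is consistent with the thread's point that no new quantization formats are needed.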
A notable Hacker News launch this week came from Prism ML, which is positioning 1-Bit Bonsai as the first commercially viable family of 1-bit LLMs. The pitch is less about bigger models and more about intelligence density, device fit, and the economics of edge inference.
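For context on why 1-bit weights change edge economics: each weight collapses to a sign plus a shared scale, so storage drops roughly 16x versus fp16 and the matmul needs only additions and subtractions. This is a generic BitNet-style sketch, not Prism ML's recipe:

```python
# Generic 1-bit weight scheme: sign matrix plus per-row fp scale.
# Storage drops ~16x vs fp16; the inner loop needs no multiplies.
# Illustrative only; not Prism ML's actual method.
import numpy as np

def binarize(W):
    alpha = np.abs(W).mean(axis=1, keepdims=True)  # per-row scale
    return np.sign(W), alpha

W = np.random.default_rng(0).standard_normal((4, 8))
x = np.random.default_rng(1).standard_normal(8)

S, alpha = binarize(W)
print("fp32 :", W @ x)
print("1-bit:", (S @ x) * alpha.ravel())  # signs only inside the matmul
```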
A well-received LocalLLaMA post spotlighted a llama.cpp experiment that prefetches weights while layers are offloaded to CPU memory, aiming to recover prompt-processing speed for dense and smaller MoE models at longer contexts.
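The target is overlap: while layer i computes, layer i+1's weights are copied from host RAM in the background. A thread-based Python illustration of that shape follows; llama.cpp itself would use backend async copies, and everything here is an assumption about the pattern rather than the patch:

```python
# Sketch of the overlap being targeted: while layer i computes, copy
# layer i+1's weights from host RAM in the background. Thread-based
# illustration only; llama.cpp would use backend async copies.
from concurrent.futures import ThreadPoolExecutor

def forward_with_prefetch(layers, x, fetch, run):
    """fetch(layer) -> device-resident weights; run(weights, x) -> x."""
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(fetch, layers[0])
        for i in range(len(layers)):
            weights = pending.result()  # block only if the copy lags compute
            if i + 1 < len(layers):
                pending = io.submit(fetch, layers[i + 1])  # start next copy early
            x = run(weights, x)         # compute overlaps the in-flight copy
    return x
```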
A new r/LocalLLaMA benchmark post says an M5 Max system pushed Qwen3.5-397B to 20.34 tok/s through SSD streaming, with I/O parallelism, temporal expert prediction, and Q3-GGUF experts doing most of the work.
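Temporal expert prediction typically exploits the observation that MoE routing is sticky across nearby tokens, so experts used by recent tokens are worth warming from SSD before the router fires. A hypothetical sketch of that idea, not the post's code:

```python
# Hypothetical sketch of temporal expert prediction: prefetch the experts
# recent tokens routed to, overlapping SSD I/O with the current token's
# compute. Not the benchmark post's implementation.
from collections import Counter, deque

class ExpertPrefetcher:
    def __init__(self, ssd_load, window=8, top_m=4):
        self.history = deque(maxlen=window)  # expert ids routed for recent tokens
        self.ssd_load = ssd_load             # async SSD -> RAM expert loader
        self.top_m = top_m

    def observe(self, expert_ids):
        self.history.append(tuple(expert_ids))

    def prefetch(self):
        counts = Counter(e for ids in self.history for e in ids)
        for expert_id, _ in counts.most_common(self.top_m):
            self.ssd_load(expert_id)  # warm cache before the router decides
```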
NVIDIA announced Dynamo 1.0 on March 16, 2026 as a production-grade open-source layer for generative and agentic inference. The release matters because it ties Blackwell performance gains, lower per-token costs, and native integration with major open-source frameworks into one operating model.
A March 1 r/MachineLearning post compared 94 LLM endpoints across 25 providers and argued that open models were closing to within a single-digit quality gap of top proprietary systems. The real takeaway is operational: model choice is now about intelligence, price, speed, and deployment freedom at the same time.
A new r/MachineLearning post pushes TurboQuant beyond KV-cache talk and into weight compression, with a GitHub implementation that targets drop-in low-bit LLM inference.
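As a reference point for what "drop-in low-bit" usually means at the weight level, here is a generic group-wise 4-bit quantize/dequantize sketch; the group size and rounding scheme are assumptions, not TurboQuant's algorithm:

```python
# Generic group-wise 4-bit weight quantization, to show the kind of
# drop-in compression such repos target. Group size and rounding scheme
# here are assumptions, not TurboQuant's algorithm.
import numpy as np

def quantize_groups(W, group=64):
    Wg = W.reshape(-1, group)
    scale = np.abs(Wg).max(axis=1, keepdims=True) / 7.0 + 1e-12
    q = np.clip(np.round(Wg / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale, shape):
    return (q * scale).reshape(shape)

W = np.random.default_rng(0).standard_normal((256, 256)).astype(np.float32)
q, s = quantize_groups(W)
print("mean abs error :", np.abs(W - dequantize(q, s, W.shape)).mean())
print("bits per weight:", (q.size * 4 + s.size * 16) / W.size)  # ~4.25 w/ fp16 scales
```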
Meta said its in-house MTIA roadmap now spans MTIA 300, 400, 450, and 500. The company said the 2026 and 2027 deployments are aimed at lowering the cost and latency of serving GenAI workloads at massive scale.
A March 26, 2026 r/LocalLLaMA post about serving Qwen 3.5 27B on Google Cloud B200 clusters reached 205 points and 52 comments at crawl time. The linked write-up reports 1,103,941 total tokens per second on 12 nodes after switching from tensor to data parallelism, shrinking context length, enabling FP8 KV cache, and using MTP-1 speculative decoding.
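The headline number breaks down to plausible per-device figures, assuming 8 GPUs per B200 node (a typical HGX configuration, not stated in the post):

```python
# Sanity arithmetic on the aggregate figure. 8 GPUs per node is an
# assumption (typical B200 HGX node), not stated in the post.
total_tps, nodes, gpus_per_node = 1_103_941, 12, 8
print(f"{total_tps / nodes:,.0f} tok/s per node")                   # ~92,000
print(f"{total_tps / (nodes * gpus_per_node):,.0f} tok/s per GPU")  # ~11,500
```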
A March 26, 2026 r/LocalLLaMA post linking NVIDIA's `gpt-oss-puzzle-88B` model card reached 284 points and 105 comments at crawl time. NVIDIA says the 88B MoE model uses its Puzzle post-training NAS pipeline to cut parameters and KV-cache costs while keeping reasoning accuracy near or above the parent model.