A Hacker News discussion is resurfacing a Future Shock explainer that makes LLM memory costs concrete in GPU bytes instead of abstract architecture jargon. The piece traces how GPT-2, Llama 3, DeepSeek V3, Gemma 3, and Mamba-style models handle context retention differently.
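For scale, the kind of GPU-byte arithmetic the explainer favors reduces to a one-liner. The sketch below sizes a per-sequence KV cache from public config numbers; the layer/head figures are approximate illustrations, not taken from the piece:

```python
# Back-of-envelope KV-cache sizing in raw bytes. Configs below are
# approximate public numbers used for illustration, not from the piece.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Per-sequence K+V cache size: 2 tensors x layers x heads x dims."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

configs = {
    "GPT-2 (124M)":     dict(n_layers=12, n_kv_heads=12, head_dim=64),
    "Llama 3 8B (GQA)": dict(n_layers=32, n_kv_heads=8,  head_dim=128),
}

for name, cfg in configs.items():
    gib = kv_cache_bytes(seq_len=8192, **cfg) / 2**30
    print(f"{name}: {gib:.2f} GiB of KV cache at 8k context (fp16)")
```

Grouped-query attention shows up directly in the arithmetic: Llama 3 8B's 8 KV heads (versus 32 query heads) cut the cache to a quarter of what full multi-head attention would need, which is exactly the kind of difference the explainer makes concrete.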
Together Research announced Aurora on March 31, 2026: an open-source framework for adaptive speculative decoding that learns from live inference traces and updates the speculator asynchronously without interrupting serving. Together's blog and paper reframe the problem as asynchronous RL and report a 1.25x additional speedup over a strong static speculator as traffic shifts.
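The asynchronous framing is the interesting part: training never sits on the serving critical path. Below is a minimal sketch of that pattern, assuming a trace queue fed by the serving loop and a trainer that swaps speculator weights atomically; all names and interfaces are hypothetical, not Aurora's API:

```python
# Minimal sketch of the asynchronous pattern described: serving emits
# draft/acceptance traces, a trainer consumes them off the critical path
# and publishes new speculator weights. All names are hypothetical.
import queue

trace_q: "queue.Queue[dict]" = queue.Queue(maxsize=10_000)

def serving_hook(prompt, draft_tokens, accepted_mask):
    """Called per request; never blocks serving on a full queue."""
    try:
        trace_q.put_nowait({"prompt": prompt, "draft": draft_tokens,
                            "accepted": accepted_mask})
    except queue.Full:
        pass  # drop traces rather than stall inference

def trainer_loop(train_step, publish, batch_size=256):
    """Runs in its own thread or process, never on the serving path."""
    batch = []
    while True:
        batch.append(trace_q.get())
        if len(batch) >= batch_size:
            weights = train_step(batch)  # e.g. reward = draft acceptance rate
            publish(weights)             # atomic swap; serving never pauses
            batch.clear()
```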
A LocalLLaMA thread spotlighted ggerganov's attn-rot work for llama.cpp, a simple rotation-based approach to improve KV cache quantization without introducing new formats. The appeal is that quality appears to improve sharply at low precision while throughput stays in roughly the same band.
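The usual trick in rotation-based quantization work is to apply an orthogonal rotation so outlier channels get spread out before symmetric low-bit rounding. Here is a NumPy sketch of that general idea, offered as context; it is not ggerganov's actual attn-rot code:

```python
# Sketch of rotation-before-quantization: an orthogonal rotation spreads
# outlier channels so symmetric low-bit rounding loses less. Illustrative
# only; not the attn-rot implementation.
import numpy as np

def random_orthogonal(d, seed=0):
    q, _ = np.linalg.qr(np.random.default_rng(seed).standard_normal((d, d)))
    return q

def quantize_int4(x):
    scale = np.abs(x).max() / 7.0 + 1e-12
    return np.clip(np.round(x / scale), -8, 7), scale

d = 128
R = random_orthogonal(d)                        # fixed, shared by write and read
k = np.random.default_rng(1).standard_normal(d)
k[3] = 25.0                                     # one outlier channel

q, s = quantize_int4(k)                         # quantize directly
qr, sr = quantize_int4(k @ R)                   # rotate, then quantize

print("plain int4 error:  ", np.abs(k - q * s).mean())
print("rotated int4 error:", np.abs(k - (qr * sr) @ R.T).mean())  # undo rotation
```

Because the rotation is orthogonal, rotating query vectors the same way preserves attention dot products, which is consistent with the thread's point that no new quantization formats are needed.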
A notable Hacker News launch this week came from Prism ML, which is positioning 1-Bit Bonsai as the first commercially viable family of 1-bit LLMs. The pitch is less about bigger models and more about intelligence density, device fit, and the economics of edge inference.
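For context on why 1-bit weights change edge economics: each weight collapses to a sign plus a shared scale, so storage drops roughly 16x versus fp16 and the matmul needs only additions and subtractions. This is a generic BitNet-style sketch, not Prism ML's recipe:

```python
# Generic 1-bit weight scheme: sign matrix plus per-row fp scale.
# Storage drops ~16x vs fp16; the inner loop needs no multiplies.
# Illustrative only; not Prism ML's actual method.
import numpy as np

def binarize(W):
    alpha = np.abs(W).mean(axis=1, keepdims=True)  # per-row scale
    return np.sign(W), alpha

W = np.random.default_rng(0).standard_normal((4, 8))
x = np.random.default_rng(1).standard_normal(8)

S, alpha = binarize(W)
print("fp32 :", W @ x)
print("1-bit:", (S @ x) * alpha.ravel())  # signs only inside the matmul
```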
A well-received LocalLLaMA post spotlighted a llama.cpp experiment that prefetches weights while layers are offloaded to CPU memory, aiming to recover prompt-processing speed for dense and smaller MoE models at longer contexts.
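The target is overlap: while layer i computes, layer i+1's weights are copied from host RAM in the background. A thread-based Python illustration of that shape follows; llama.cpp itself would use backend async copies, and everything here is an assumption about the pattern rather than the patch:

```python
# Sketch of the overlap being targeted: while layer i computes, copy
# layer i+1's weights from host RAM in the background. Thread-based
# illustration only; llama.cpp would use backend async copies.
from concurrent.futures import ThreadPoolExecutor

def forward_with_prefetch(layers, x, fetch, run):
    """fetch(layer) -> device-resident weights; run(weights, x) -> x."""
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(fetch, layers[0])
        for i in range(len(layers)):
            weights = pending.result()  # block only if the copy lags compute
            if i + 1 < len(layers):
                pending = io.submit(fetch, layers[i + 1])  # start next copy early
            x = run(weights, x)         # compute overlaps the in-flight copy
    return x
```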
A new r/LocalLLaMA benchmark post says an M5 Max system pushed Qwen3.5-397B to 20.34 tok/s through SSD streaming, with I/O parallelism, temporal expert prediction, and Q3-GGUF experts doing most of the work.
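Temporal expert prediction typically exploits the observation that MoE routing is sticky across nearby tokens, so experts used by recent tokens are worth warming from SSD before the router fires. A hypothetical sketch of that idea, not the post's code:

```python
# Hypothetical sketch of temporal expert prediction: prefetch the experts
# recent tokens routed to, overlapping SSD I/O with the current token's
# compute. Not the benchmark post's implementation.
from collections import Counter, deque

class ExpertPrefetcher:
    def __init__(self, ssd_load, window=8, top_m=4):
        self.history = deque(maxlen=window)  # expert ids routed for recent tokens
        self.ssd_load = ssd_load             # async SSD -> RAM expert loader
        self.top_m = top_m

    def observe(self, expert_ids):
        self.history.append(tuple(expert_ids))

    def prefetch(self):
        counts = Counter(e for ids in self.history for e in ids)
        for expert_id, _ in counts.most_common(self.top_m):
            self.ssd_load(expert_id)  # warm cache before the router decides
```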
NVIDIA announced Dynamo 1.0 on March 16, 2026 as a production-grade open-source layer for generative and agentic inference. The release matters because it ties Blackwell performance gains, lower per-token costs, and native integration with major open-source frameworks into one operating model.
A March 1 r/MachineLearning post compared 94 LLM endpoints across 25 providers and argued that open models were closing to within a single-digit quality gap of top proprietary systems. The real takeaway is operational: model choice is now about intelligence, price, speed, and deployment freedom at the same time.
A new r/MachineLearning post pushes TurboQuant beyond KV-cache talk and into weight compression, with a GitHub implementation that targets drop-in low-bit LLM inference.
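As a reference point for what "drop-in low-bit" usually means at the weight level, here is a generic group-wise 4-bit quantize/dequantize sketch; the group size and rounding scheme are assumptions, not TurboQuant's algorithm:

```python
# Generic group-wise 4-bit weight quantization, to show the kind of
# drop-in compression such repos target. Group size and rounding scheme
# here are assumptions, not TurboQuant's algorithm.
import numpy as np

def quantize_groups(W, group=64):
    Wg = W.reshape(-1, group)
    scale = np.abs(Wg).max(axis=1, keepdims=True) / 7.0 + 1e-12
    q = np.clip(np.round(Wg / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale, shape):
    return (q * scale).reshape(shape)

W = np.random.default_rng(0).standard_normal((256, 256)).astype(np.float32)
q, s = quantize_groups(W)
print("mean abs error :", np.abs(W - dequantize(q, s, W.shape)).mean())
print("bits per weight:", (q.size * 4 + s.size * 16) / W.size)  # ~4.25 w/ fp16 scales
```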
Meta said its in-house MTIA roadmap now spans MTIA 300, 400, 450, and 500. The company said the 2026 and 2027 deployments are aimed at lowering the cost and latency of serving GenAI workloads at massive scale.
A March 26, 2026 r/LocalLLaMA post about serving Qwen 3.5 27B on Google Cloud B200 clusters reached 205 points and 52 comments at crawl time. The linked write-up reports 1,103,941 total tokens per second on 12 nodes after switching from tensor to data parallelism, shrinking context length, enabling FP8 KV cache, and using MTP-1 speculative decoding.
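The headline number breaks down to plausible per-device figures, assuming 8 GPUs per B200 node (a typical HGX configuration, not stated in the post):

```python
# Sanity arithmetic on the aggregate figure. 8 GPUs per node is an
# assumption (typical B200 HGX node), not stated in the post.
total_tps, nodes, gpus_per_node = 1_103_941, 12, 8
print(f"{total_tps / nodes:,.0f} tok/s per node")                   # ~92,000
print(f"{total_tps / (nodes * gpus_per_node):,.0f} tok/s per GPU")  # ~11,500
```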
A March 26, 2026 r/LocalLLaMA post linking NVIDIA's `gpt-oss-puzzle-88B` model card reached 284 points and 105 comments at crawl time. NVIDIA says the 88B MoE model uses its Puzzle post-training NAS pipeline to cut parameters and KV-cache costs while keeping reasoning accuracy near or above the parent model.