#inference

LLM Hacker News May 31, 2026 1 min read

Tiny-vLLM teaches LLM inference by rebuilding the stack in C++ and CUDA

The HN reaction centered on the README as much as the code: a small engine that turns vLLM concepts into a guided implementation path.

#llm #cuda #inference

LLM Hacker News May 31, 2026 1 min read

OpenRouter’s $113M round turns model routing into an infrastructure bet

The HN discussion focused less on funding theater and more on whether a multi-model gateway can stay defensible as AI workloads move into production.

#openrouter #llm #inference

LLM X/Twitter May 31, 2026 1 min read

DynoSim replays 60.1 minutes of inference traffic in 2.41 seconds

NVIDIA is targeting the hidden cost of LLM serving experiments. Its DynoSim post says the Rust simulator can screen deployment choices before GPU validation, with a blog example replaying 23,608 requests about 1,500x faster than real time.

#nvidia #dynosim #inference

LLM May 30, 2026 2 min read

DynoSim makes LLM serving tuning a 1,500x faster simulation loop

The expensive part of LLM inference is often the experiment itself. NVIDIA says DynoSim replayed a 23,608-request trace on an Apple M4 MacBook Air in 2.41 seconds, about 1,500x faster than the 60.1-minute serving window it modeled.

#nvidia #dynosim #llm-serving

AI May 30, 2026 2 min read

Mistral ties a 10MW inference site to its industrial physics AI push

The interesting move is not another chatbot surface. Mistral is packaging physics AI for Airbus, BMW, and ASML alongside a 10MW inference facility in Les Ulis due in Q3 2026, shifting its enterprise pitch toward controlled industrial deployment.

#mistral #physics-ai #manufacturing

LLM Reddit May 28, 2026 1 min read

GLM-5.1 inference gains came from network topology, not new GPUs

LocalLLaMA readers noticed the infrastructure lesson: Zai claimed 15% more GPU inference throughput and 40.6% lower first-token P99 latency with the same GPUs, model, and software stack.

#inference #networking #gpu

LLM May 27, 2026 2 min read

OpenRouter hits 25T tokens a week as $113M backs model routing

The money is following the layer that decides which model gets each request. OpenRouter says weekly traffic rose 5x in six months to 25 trillion tokens, while its platform now spans 400+ models and more than 8 million users.

#openrouter #inference #routing

LLM Hacker News May 16, 2026 1 min read

Orthrus-Qwen3 Delivers 7.8× Faster Inference With Identical Output

The Orthrus framework achieves up to 7.8× tokens per forward pass on Qwen3 models while keeping an output distribution provably identical to the original model's. Its dual-view architecture shares a single KV cache between autoregressive and diffusion pathways.

#inference #qwen3 #speculative-decoding

LLM Reddit May 10, 2026 1 min read

Running Qwen3.6 35B A3B at 80+ tok/sec on 12GB VRAM With llama.cpp MTP

A LocalLLaMA user shares a config for running Qwen3.6 35B A3B at over 80 tok/sec with 128K context on a 12GB VRAM GPU, using llama.cpp's Multi-Token Prediction support and reaching an 80%+ draft acceptance rate.

#local-llm #qwen #llama-cpp

LLM Reddit May 6, 2026 1 min read

Qwen 3.6 27B Achieves 2.5x Faster Local Inference via MTP With 262k Context on 48GB

A LocalLLaMA user has shared a detailed guide for running Qwen 3.6 27B with Multi-Token Prediction support in llama.cpp, achieving a 2.5x inference speedup and 262k context on 48GB of memory.

#qwen #mtp #local-llm

LLM Reddit May 6, 2026 1 min read

Google Releases Multi-Token Prediction Drafters for Gemma 4: Up to 3x Speedup

Google has released Multi-Token Prediction (MTP) draft models for the Gemma 4 family, delivering an inference speedup of up to 3x through speculative decoding without any loss in output quality.

#gemma #google #mtp

LLM Reddit May 4, 2026 1 min read

Llama.cpp Multi-Token Prediction Support Enters Beta, Closing the vLLM Performance Gap

llama.cpp's Multi-Token Prediction (MTP) support has entered beta, currently covering Qwen3.5 MTP. Combined with maturing tensor-parallel support, most token generation speed gaps between llama.cpp and vLLM are expected to close.

#llama-cpp #mtp #local-llm