Articles

All AI LLM Humanoid Robots Sciences Gaming Finance

Source:

From To

LLM Hacker News 2d ago 1 min read

Tiny-vLLM teaches LLM inference by rebuilding the stack in C++ and CUDA

The HN reaction centered on the README as much as the code: a small engine that turns vLLM concepts into a guided implementation path.

#llm #cuda #inference

LLM Hacker News 2d ago 1 min read

OpenRouter’s $113M round turns model routing into an infrastructure bet

The HN discussion focused less on funding theater and more on whether a multi-model gateway can stay defensible as AI workloads move into production.

#openrouter #llm #inference

LLM X/Twitter 3d ago 1 min read

DynoSim replays 60.1 minutes of inference traffic in 2.41 seconds

NVIDIA is targeting the hidden cost of LLM serving experiments. Its DynoSim post says the Rust simulator can screen deployment choices before GPU validation, with a blog example replaying 23,608 requests about 1,500x faster than real time.

#nvidia #dynosim #inference

LLM 3d ago 2 min read

DynoSim makes LLM serving tuning a 1,500x faster simulation loop

The expensive part of LLM inference is often the experiment itself. NVIDIA says DynoSim replayed a 23,608-request trace on an Apple M4 MacBook Air in 2.41 seconds, about 1,500x faster than the 60.1-minute serving window it modeled.

#nvidia #dynosim #llm-serving

AI 3d ago 2 min read

Mistral ties a 10MW inference site to its industrial physics AI push

The interesting move is not another chatbot surface. Mistral is packaging physics AI for Airbus, BMW, and ASML with a Q3 2026 10MW inference facility in Les Ulis, shifting its enterprise pitch toward controlled industrial deployment.

#mistral #physics-ai #manufacturing

LLM Reddit 5d ago 1 min read

GLM-5.1 inference gains came from network topology, not new GPUs

LocalLLaMA readers noticed the infrastructure lesson: Zai claimed 15% more GPU inference throughput and 40.6% lower first-token P99 latency with the same GPUs, model, and software stack.

#inference #networking #gpu

LLM 6d ago 2 min read

OpenRouter hits 25T tokens a week as $113M backs model routing

The money is following the layer that decides which model gets each request. OpenRouter says weekly traffic rose 5x in six months to 25 trillion tokens, while its platform now spans 400+ models and more than 8 million users.

#openrouter #inference #routing

LLM Hacker News May 16, 2026 1 min read

Orthrus-Qwen3 Delivers 7.8× Faster Inference With Identical Output

The Orthrus framework achieves up to 7.8× tokens per forward pass on Qwen3 models while maintaining a provably identical output distribution to the original. Its dual-view architecture shares a single KV cache between autoregressive and diffusion pathways.

#inference #qwen3 #speculative-decoding

LLM Reddit May 16, 2026 1 min read

Multi-Token Prediction Support Lands in llama.cpp

PR #22673 merging Multi-Token Prediction support into llama.cpp has been accepted into master. The change brings the inference technique popularized by DeepSeek to the most widely used local LLM inference engine.

#llama-cpp #inference #open-source

LLM Reddit May 10, 2026 1 min read

Running Qwen3.6 35B A3B at 80+ tok/sec on 12GB VRAM With llama.cpp MTP

A LocalLLaMA user shares their config for running Qwen3.6 35B A3B at over 80 tok/sec with 128K context on a 12GB VRAM GPU, using llama.cpp's Multi-Token Prediction support and achieving 80%+ draft acceptance rate.

#local-llm #qwen #llama-cpp

LLM Reddit May 6, 2026 1 min read

Qwen 3.6 27B Achieves 2.5x Faster Local Inference via MTP With 262k Context on 48GB

A LocalLLaMA user has shared a detailed guide for running Qwen 3.6 27B with Multi-Token Prediction support in llama.cpp, achieving 2.5x inference speedup and 262k context on 48GB of memory.

#qwen #mtp #local-llm

LLM Reddit May 6, 2026 1 min read

Google Releases Multi-Token Prediction Drafters for Gemma 4: Up to 3x Speedup

Google has released Multi-Token Prediction (MTP) draft models for the Gemma 4 family, achieving up to 3x inference speedup through speculative decoding without any loss in output quality.

#gemma #google #mtp