#llm-serving

LLM May 30, 2026 2 min read

DynoSim makes LLM serving tuning a 1,500x faster simulation loop

The expensive part of LLM inference is often the experiment itself. NVIDIA says DynoSim replayed a 23,608-request trace on an Apple M4 MacBook Air in 2.41 seconds, about 1,500x faster than the 60.1-minute serving window it modeled.

#nvidia #dynosim #llm-serving

LLM Reddit Mar 1, 2026 2 min read

r/LocalLLaMA Benchmarks: <code>Krasis</code> reports 3,324 tok/s prefill for 80B MoE on one RTX 5080

An r/LocalLLaMA post (score 180, 53 comments) shared benchmark data for <code>Krasis</code>, a hybrid CPU/GPU runtime aimed at large MoE models. The key claim is that GPU-heavy prefill plus CPU decode can reduce long-context waiting time even when full models do not fit in consumer VRAM.

#moe #inference-runtime #llm-serving