DynoSim turns LLM serving tuning into a simulation loop that runs about 1,500x faster than real time

The hardest part of LLM serving optimization is often not the model. It is the cost of testing every serious configuration on real GPUs. NVIDIA’s DynoSim turns that problem into a simulate-first loop: screen tensor-parallel shapes, prefill/decode splits, worker counts, routing policies, KV cache behavior, and autoscaling choices before spending cluster time on the shortlist.

DynoSim is a workload-driven discrete-event simulation of the NVIDIA Dynamo serving stack. It places Router, Planner, scheduler behavior, KV cache effects, and workload traces on one virtual timeline. NVIDIA says a single-threaded Rust offline replay on an Apple M4 MacBook Air simulated the full 23,608-request Mooncake trace in 2.41 seconds. The modeled serving window was 60.1 minutes, making the replay roughly 1,500x faster than real time.
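To make the discrete-event idea concrete, here is a minimal sketch in Python. It is not DynoSim code: the `replay` function, the fixed service time, and the FIFO worker pool are all assumptions chosen for illustration. It shows only why a virtual timeline can run far faster than the traffic it models, since the clock jumps from event to event instead of waiting. The last lines redo the speedup arithmetic from the figures NVIDIA reported.

```python
import heapq
import itertools
from collections import deque


def replay(arrivals, service_time, num_workers):
    """Replay requests through FIFO workers on a virtual clock.

    arrivals: request arrival times in virtual seconds (hypothetical trace).
    Returns (makespan, mean_latency), both in virtual seconds.
    """
    # Each event is (virtual_time, seq, kind, arrival_time). The unique seq
    # breaks ties, so tuple comparison never reaches the later fields.
    events = [(t, i, "arrive", t) for i, t in enumerate(arrivals)]
    heapq.heapify(events)
    seq = itertools.count(len(events))
    free_workers = num_workers
    waiting = deque()  # arrival times of requests queued for a worker
    latencies = []
    now = 0.0

    def start(at, arrived):
        # Schedule the completion of a request that begins service at `at`.
        heapq.heappush(events, (at + service_time, next(seq), "finish", arrived))

    while events:
        # Jump straight to the next event; no wall-clock waiting happens.
        now, _, kind, arrived = heapq.heappop(events)
        if kind == "arrive":
            if free_workers > 0:
                free_workers -= 1
                start(now, arrived)
            else:
                waiting.append(arrived)
        else:  # "finish"
            latencies.append(now - arrived)
            if waiting:
                start(now, waiting.popleft())  # worker picks up the next request
            else:
                free_workers += 1

    mean_latency = sum(latencies) / len(latencies) if latencies else 0.0
    return now, mean_latency


# Speedup arithmetic using the figures quoted in the article.
simulated_seconds = 60.1 * 60   # 60.1-minute serving window
wall_seconds = 2.41             # reported replay time
speedup = simulated_seconds / wall_seconds
```

With the quoted figures, `speedup` comes out just under 1,500, which matches the rounded claim. A real simulator would replace the fixed `service_time` with models of prefill, decode, and cache behavior, but the event loop would keep the same shape.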

That speed matters because inference tuning is a coupled systems problem. A local improvement in routing can move pressure into decode. A cache policy can help TTFT while shifting the throughput curve. A cold-start delay can erase the benefit of autoscaling. DynoSim is designed to explore thousands of candidates cheaply, then validate only the Pareto shortlist on hardware. In NVIDIA’s MiniMax-M2.5 FP8 on HGX B200 experiment, KV-aware routing lifted prefix reuse from about 0.38 to 0.44-0.45 while lowering TTFT compared with round-robin placement.
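The routing effect can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not Dynamo's Router: each hypothetical `Worker` keeps a small LRU set of prefix IDs, round-robin ignores that state, and the `kv_aware` policy sends a request to a worker that already holds its prefix, falling back to the least-loaded worker. The trace and cache sizes are invented for illustration and say nothing about the reuse numbers NVIDIA measured.

```python
from collections import OrderedDict


class Worker:
    """Hypothetical worker with a tiny LRU cache of prefix IDs."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()
        self.served = 0

    def serve(self, prefix):
        """Serve one request; return True if its prefix was already cached."""
        self.served += 1
        hit = prefix in self.cache
        if hit:
            self.cache.move_to_end(prefix)      # mark as most recently used
        else:
            self.cache[prefix] = True
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
        return hit


def round_robin(workers, prefix, i):
    # Placement depends only on the request index, never on cache contents.
    return workers[i % len(workers)]


def kv_aware(workers, prefix, i):
    # Prefer a worker that already holds the prefix; otherwise pick the
    # least-loaded worker. Real routers weigh load and overlap together.
    for w in workers:
        if prefix in w.cache:
            return w
    return min(workers, key=lambda w: w.served)


def hit_rate(route, prefixes, num_workers=2, capacity=2):
    """Fraction of requests whose prefix was found in the chosen worker's cache."""
    workers = [Worker(capacity) for _ in range(num_workers)]
    hits = sum(route(workers, p, i).serve(p) for i, p in enumerate(prefixes))
    return hits / len(prefixes)


# Invented trace: two conversations, each sending requests in pairs.
trace = ["a", "a", "b", "b", "a", "a", "b", "b"]
rr_rate = hit_rate(round_robin, trace)
kv_rate = hit_rate(kv_aware, trace)
```

On this toy trace, round-robin scatters each conversation across both workers and hits the cache on half the requests, while the prefix-aware policy hits on three quarters. The gap in a real deployment depends on the workload, which is the reason to replay actual traces.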

The simulator also models cache tiers and scaling behavior. With the KVBM G2 host-memory tier enabled, NVIDIA reports less prefill recompute and a 19.3% mean TTFT improvement at concurrency 32. In a Planner experiment using Qwen3-32B at TP=2 on H200-SXM, dynamic deployment reached a better cost-latency point than static deployments. The best scaling interval landed around 5-10 seconds: shorter intervals created churn, while longer ones reacted too slowly.
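Comparing deployments on cost and latency at once is a Pareto problem: a configuration belongs on the shortlist only if no other configuration is at least as good on both axes and strictly better on one. The sketch below shows one way to filter simulated candidates. The `Candidate` type, the configuration names, and every number are made up for illustration and are not DynoSim output.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str        # hypothetical configuration label
    cost: float      # e.g. GPU-hours, lower is better
    latency: float   # e.g. mean TTFT in ms, lower is better


def dominates(a, b):
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.cost <= b.cost and a.latency <= b.latency
    strictly_better = a.cost < b.cost or a.latency < b.latency
    return no_worse and strictly_better


def pareto_front(candidates):
    """Keep only candidates that no other candidate dominates (O(n^2) scan)."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Invented simulation results for five configurations.
results = [
    Candidate("static-4gpu", 4, 100),
    Candidate("static-8gpu", 8, 90),
    Candidate("dynamic-8gpu", 8, 60),
    Candidate("static-16gpu", 16, 70),
    Candidate("dynamic-16gpu", 16, 55),
]
shortlist = [c.name for c in pareto_front(results)]
```

Here the shortlist keeps `static-4gpu`, `dynamic-8gpu`, and `dynamic-16gpu`; the other two are dominated by a configuration with the same cost and lower latency. The quadratic scan is fine for a few thousand candidates, and only the survivors would need validation on hardware.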

The broader implication is operational. Agent traffic produces multi-turn bursts, uneven prompt lengths, and changing cache reuse patterns. Those dynamics are hard to capture with small tests. If DynoSim becomes the inner loop for replaying recent production traces and recommending new configurations, LLM serving could move from one-time launch tuning to continuous adaptation against the current workload.
