vLLM Lands in the First MLPerf Vision-Language Benchmark Submission

Original: We're proud to share that @NVIDIA submitted the first-ever MLPerf Vision Language Model (VLM) performance benchmark using vLLM. This achievement showcases the strength of our ongoing collaboration with NVIDIA Engineering. Check out their MLPerf blog and watch our On Demand Talk at GTC to learn more about how we are delivering the best performance on NVIDIA hardware. 🔗 Blog: http://developer.nvidia.com/blog/nvidia-platform-delivers-lowest-token-cost-enabled-by-extreme-co-design/ 🔗 Talk: http://nvidia.com/en-us/on-demand/session/gtc26-s82059/

LLM · Apr 10, 2026 · By Insights AI · 1 min read

In an April 9 X post, the vLLM project said NVIDIA submitted the first-ever MLPerf Vision Language Model benchmark using vLLM. The linked NVIDIA Technical Blog says the Qwen3-VL-235B-A22B test is the first multimodal model added to the MLPerf Inference suite, with offline and server scenarios included in v6.0. NVIDIA reported 79 samples per second in offline mode and 68 queries per second in server mode for that benchmark entry.

The broader NVIDIA post is not a vLLM-only announcement. It positions the VLM result inside a larger Blackwell Ultra performance story, saying continuous co-optimization across hardware and open-source software produced up to 2.7x throughput gains and more than 60% lower cost per token on the same infrastructure for some workloads. But the ecosystem detail that matters is the attribution: NVIDIA says the Qwen3-VL submission used the vLLM framework, while other newly added benchmarks relied on separate tools such as TensorRT-LLM VisualGen.
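The two headline figures are roughly consistent with each other: at a fixed infrastructure cost per hour, cost per token scales inversely with token throughput, so a 2.7x throughput gain alone implies about 1 − 1/2.7 ≈ 63% lower cost per token, matching the "more than 60%" claim. A minimal sketch of that arithmetic (the dollar and throughput figures are illustrative assumptions, not numbers from NVIDIA's post):

```python
def cost_per_million_tokens(cluster_cost_per_hour: float,
                            tokens_per_second: float) -> float:
    """Cost to generate 1M tokens at a fixed hourly infrastructure cost."""
    tokens_per_hour = tokens_per_second * 3600
    return cluster_cost_per_hour / tokens_per_hour * 1_000_000

# Illustrative assumptions: a $100/hour cluster serving 10,000 tokens/s.
baseline = cost_per_million_tokens(100.0, 10_000)
optimized = cost_per_million_tokens(100.0, 10_000 * 2.7)  # 2.7x throughput

reduction = 1 - optimized / baseline
print(f"baseline:  ${baseline:.2f} per 1M tokens")
print(f"optimized: ${optimized:.2f} per 1M tokens")
print(f"reduction: {reduction:.0%}")  # ~63%, consistent with "more than 60%"
```

The reduction depends only on the throughput ratio, not on the assumed hourly cost, which is why NVIDIA can state it independently of any specific deployment's pricing.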

That matters because MLPerf still has outsized signaling power for operators and model-serving teams. If vLLM is now part of the first multimodal track in the suite, the project’s role is widening beyond text-only serving into image-heavy and mixed-modality inference. The result does not prove that one stack wins every deployment, but it does show that open-source serving frameworks are no longer peripheral to top-tier multimodal benchmarking. They are now part of the benchmark headline itself.


© 2026 Insights. All rights reserved.