A high-engagement LocalLLaMA post shared reproducible benchmark data showing Qwen3.5-122B NVFP4 decoding around 198 tok/s on a dual RTX PRO 6000 Blackwell system using SGLang b12x+NEXTN and a PCIe switch topology.
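For context on how decode numbers like this are usually measured, here is a minimal sketch that times streamed generation against an OpenAI-compatible endpoint such as the one SGLang's server exposes. The localhost URL, port, and model name are placeholders, and chunk count is only a rough proxy for token count, so this is an estimate rather than a benchmark-grade method:

```python
# Rough decode-throughput measurement against an OpenAI-compatible server
# (e.g., an SGLang launch_server instance). Endpoint and model name are
# placeholders; chunk count approximates token count, so treat the result
# as an estimate rather than a benchmark-grade number.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="qwen3.5-122b-nvfp4",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize PCIe switch topologies."}],
    max_tokens=512,
    stream=True,
)

t_first = None
n_chunks = 0
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        n_chunks += 1
        if t_first is None:
            t_first = time.perf_counter()  # time of first generated token
t_last = time.perf_counter()

if n_chunks > 1:
    # Decode rate excludes prefill by measuring from the first streamed token.
    print(f"approx. decode throughput: {(n_chunks - 1) / (t_last - t_first):.1f} tok/s")
```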
vLLM said NVIDIA used the framework for the first MLPerf vision-language benchmark submission built on Qwen3-VL. NVIDIA’s accompanying blog places that result inside a broader Blackwell Ultra push that claims up to 2.7x throughput gains and more than 60% lower token cost on the same infrastructure for some workloads.
PyTorch said on April 8 that MXFP8 and NVFP4 quantization with Diffusers and TorchAO can cut diffusion latency on NVIDIA B200 GPUs, with NVFP4 reaching up to 1.68x speedups. The accompanying blog frames selective quantization and regional compilation as the practical recipe for better latency-memory tradeoffs.
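As a rough illustration of that recipe, here is a minimal sketch of selective quantization plus regional compilation on a diffusion transformer. It is not the blog's code: it uses TorchAO's quantize_ entry point with an FP8 config as a stand-in for the MXFP8/NVFP4 recipes (config names vary across TorchAO versions), and the model id is an arbitrary example.

```python
# Sketch of "selective quantization + regional compilation" for a diffusion
# pipeline. Assumptions: a recent diffusers + torchao install; the FP8 config
# below stands in for the MXFP8/NVFP4 recipes described in the blog, and the
# model id is an arbitrary example.
import torch
from diffusers import DiffusionPipeline
from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight

pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Selective quantization: only the denoising transformer is quantized;
# the VAE and text encoders stay in bf16 to protect output quality.
quantize_(pipe.transformer, float8_dynamic_activation_float8_weight())

# Regional compilation: compile the repeated transformer blocks individually
# rather than the whole model, which keeps compile time low while still
# covering the layers that dominate latency.
for block in pipe.transformer.transformer_blocks:
    block.compile(fullgraph=True)

image = pipe("a photo of a red fox in snow", num_inference_steps=28).images[0]
image.save("fox.png")
```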
On April 6, 2026, Cursor said on X that it rebuilt how MoE models generate tokens on NVIDIA Blackwell GPUs. In a companion engineering post, the company said its "warp decode" approach improves throughput by 1.84x while producing outputs 1.4x closer to an FP32 reference.
NVIDIA announced Dynamo 1.0 on March 16, 2026 as a production-grade open-source layer for generative and agentic inference. The release matters because it ties Blackwell performance gains, lower token economics and native integration with major open-source frameworks into one operating model.
A technical LocalLLaMA thread distilled the FlashAttention-4 paper into practical deployment guidance, emphasizing large speedups on Blackwell GPUs, faster Python-based kernel development, and the fact that most A100 and consumer-GPU users cannot realize the full benefits yet.
NVIDIA and Oracle said on March 16, 2026 that they will build the U.S. Department of Energy's largest AI supercomputer at Argonne National Laboratory. The Solstice and Equinox systems combine 110,000 Blackwell GPUs and a stated 2,200 exaflops of AI performance for scientific discovery.
NVIDIA announced SOL-ExecBench on March 20, 2026, a benchmark for real-world GPU kernels that scores optimized CUDA and PyTorch code against Speed-of-Light hardware bounds on NVIDIA B200 systems. The release packages 235 kernel optimization problems drawn from 124 AI models across BF16, FP8, and NVFP4 workloads.
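The "Speed-of-Light" framing is essentially a roofline bound: a kernel cannot run faster than the slower of its memory-traffic and math-throughput limits, and the score measures how close an implementation gets to that bound. Here is a minimal sketch of that calculation; the peak figures are assumed B200-class numbers for illustration, not values taken from the benchmark:

```python
# Roofline-style "Speed-of-Light" scoring sketch. The peak bandwidth and
# FLOP figures below are assumptions for illustration, not SOL-ExecBench data.
PEAK_HBM_BW_BYTES_S = 8.0e12   # ~8 TB/s HBM3e (assumed)
PEAK_FP8_FLOPS = 4.5e15        # ~4.5 PFLOPS dense FP8 (assumed)

def speed_of_light_seconds(bytes_moved: float, flops: float) -> float:
    """Lower bound on kernel time: limited by whichever resource saturates first."""
    return max(bytes_moved / PEAK_HBM_BW_BYTES_S, flops / PEAK_FP8_FLOPS)

def sol_fraction(measured_seconds: float, bytes_moved: float, flops: float) -> float:
    """Fraction of the hardware bound achieved (1.0 = at the Speed of Light)."""
    return speed_of_light_seconds(bytes_moved, flops) / measured_seconds

# Example: an FP8 GEMM with M = N = K = 8192.
m = n = k = 8192
flops = 2 * m * n * k                  # each multiply-add counted as 2 flops
bytes_moved = (m * k + k * n + m * n)  # 1 byte per FP8 element, ideal reuse
print(f"SOL time: {speed_of_light_seconds(bytes_moved, flops) * 1e6:.1f} us")
print(f"SOL fraction at 300 us measured: {sol_fraction(300e-6, bytes_moved, flops):.2f}")
```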
At GTC on March 16, 2026, NVIDIA announced Dynamo 1.0 as a production-grade open-source inference stack for generative and agentic AI. The company says the stack can raise Blackwell inference performance by up to 7x and is already supported across major frameworks, cloud providers, inference platforms, and AI-native companies.
A March 12, 2026 LocalLLaMA benchmark post claims the best sustained decode for Qwen3.5-397B NVFP4 on 4x RTX PRO 6000 Blackwell GPUs is 50.5 tok/s using the Marlin kernel, because the native CUTLASS grouped-GEMM paths on SM120 either fail or fall back to slower code.
A March 14, 2026 LocalLLaMA post outlined a CUTLASS and FlashInfer patch for SM120 Blackwell workstations, claiming major gains for Qwen3.5-397B NVFP4 inference and linking the work to FlashInfer PR #2786.