PyTorch showcases faster diffusion inference on Blackwell with Diffusers and TorchAO quantization

Original: Improve latency up to 1.68x with NVFP4 and MXFP8 using Diffusers and TorchAO on Blackwell across a suite of different models 🔥. Squeeze out maximum performance with recipes involving selective quantization and regional compilation. 🔗 Read our latest blog from @vkuzo (@Meta) and @RisingSayak (@HuggingFace): https://pytorch.org/blog/faster-diffusion-on-blackwell-mxfp8-and-nvfp4-with-diffusers-and-torchao/ #PyTorch #TorchAO #MXFP8 #NVFP4 #OpenSourceAI


AI Apr 10, 2026 By Insights AI

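In an X post on April 8, 2026, PyTorch highlighted a new blog post on speeding up end-to-end inference for Flux.1-Dev, QwenImage, and LTX-2 on NVIDIA B200 using Diffusers and TorchAO. According to PyTorch, MXFP8 delivers a speedup of up to 1.26x and NVFP4 up to 1.68x, and peak memory also drops in several configurations. Blackwell-generation optimization is moving beyond generational-upgrade marketing toward concrete operational recipes.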
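To put the headline figures in more familiar terms, here is a minimal sketch that converts a speedup factor into the share of latency saved. It assumes the usual definition of speedup as baseline latency divided by optimized latency; the function names are illustrative and do not come from the blog.

```python
def latency_fraction(speedup: float) -> float:
    """Optimized latency as a fraction of baseline latency.

    Assumes speedup = baseline_latency / optimized_latency.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


def latency_reduction_pct(speedup: float) -> float:
    """Percentage of baseline latency removed by the given speedup."""
    return (1.0 - latency_fraction(speedup)) * 100.0


if __name__ == "__main__":
    # Peak speedups quoted in the post (best case, not typical case).
    for name, speedup in (("MXFP8", 1.26), ("NVFP4", 1.68)):
        print(f"{name}: {speedup:.2f}x -> about "
              f"{latency_reduction_pct(speedup):.1f}% lower latency")
```

Under that definition, the quoted peaks correspond to roughly 21% (MXFP8) and 40% (NVFP4) less time per generation, and only in the best-case configurations.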

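The notable point is less the quantization formats themselves than how they are combined. The blog pairs selective quantization with regional compilation via torch.compile(fullgraph=True) and CUDA Graphs, and measures quality drift with LPIPS against a bfloat16 baseline. It also states explicitly that QwenImage is more sensitive to quantization than Flux.1-Dev, acknowledging that the same low-precision recipe is hard to apply uniformly to every model. In other words, Blackwell optimization is not just a change of precision but also a matter of tuning the accuracy budget for each model.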
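Selective quantization comes down to deciding, layer by layer, what gets converted to low precision. The sketch below shows one plausible shape for such a rule, written as a `(module, fully_qualified_name)` predicate so that it resembles the `filter_fn` callback accepted by TorchAO's `quantize_`. The skip list, the size threshold, and the layer names are all hypothetical, and the stand-in `FakeLinear` class exists only so the example runs without PyTorch; none of this is the blog's actual recipe.

```python
from dataclasses import dataclass


@dataclass
class FakeLinear:
    """Stand-in for a linear layer (hypothetical; avoids a torch import)."""
    in_features: int
    out_features: int


# Hypothetical policy: name fragments of layers kept in high precision.
SKIP_SUBSTRINGS = ("embed", "norm", "proj_out")
# Hypothetical policy: small layers gain little from low-precision matmuls.
MIN_DIM = 1024


def should_quantize(module, fqn: str) -> bool:
    """Return True if this layer should be quantized."""
    # Only linear-like layers are candidates.
    if not (hasattr(module, "in_features") and hasattr(module, "out_features")):
        return False
    # Keep layers on the skip list in their original precision.
    if any(fragment in fqn for fragment in SKIP_SUBSTRINGS):
        return False
    # Quantize only when both dimensions are large enough.
    return min(module.in_features, module.out_features) >= MIN_DIM


if __name__ == "__main__":
    layers = {
        "transformer_blocks.0.attn.to_q": FakeLinear(3072, 3072),
        "time_embed.linear_1": FakeLinear(256, 3072),
        "proj_out": FakeLinear(3072, 64),
    }
    for fqn, layer in layers.items():
        print(fqn, "->", "quantize" if should_quantize(layer, fqn) else "keep")
```

For a model that is more sensitive to quantization, the same predicate could simply be made stricter (a longer skip list or a higher threshold), which is one way to read the per-model accuracy budget described above.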

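In practical terms, the post emphasizes the value of software stack tuning over a GPU refresh alone. PyTorch also mentions TorchAO-side PRs that improve the NVFP4 kernels, a sign that the open-source inference stack is still improving rapidly. For teams running diffusion workloads, what matters is probably not a one-off headline benchmark but the fact that a reproducible tuning recipe was shared together with its accuracy trade-offs.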
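One way a team might act on a shared recipe and its accuracy trade-offs is to pick, per model, the fastest configuration whose measured LPIPS drift stays within a budget. The sketch below illustrates that selection rule only. The recipe names, speedups, and LPIPS values are placeholders invented for the example, not measurements from the blog.

```python
from typing import NamedTuple, Optional, Sequence


class Recipe(NamedTuple):
    name: str
    speedup: float  # relative to the bfloat16 baseline
    lpips: float    # drift versus the bfloat16 baseline (lower is better)


def pick_recipe(recipes: Sequence[Recipe], lpips_budget: float) -> Optional[Recipe]:
    """Return the fastest recipe whose LPIPS drift fits the budget."""
    eligible = [r for r in recipes if r.lpips <= lpips_budget]
    return max(eligible, key=lambda r: r.speedup, default=None)


# Placeholder numbers for illustration only (NOT results from the blog).
CANDIDATES = [
    Recipe("bf16-baseline", 1.00, 0.00),
    Recipe("mxfp8-selective", 1.20, 0.05),
    Recipe("nvfp4-selective", 1.60, 0.15),
]

if __name__ == "__main__":
    for budget in (0.02, 0.10, 0.20):
        choice = pick_recipe(CANDIDATES, budget)
        print(f"budget {budget:.2f} -> {choice.name if choice else 'none'}")
```

A tighter budget for a quantization-sensitive model would push the choice back toward the higher-precision recipes, while a looser one would admit the more aggressive format.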