Google will pay SpaceX $920M per month from October 2026 through June 2029 for access to about 110,000 NVIDIA GPUs and related compute. The deal shows how fast AI demand can pressure even one of the world’s largest infrastructure operators.
#gpu
LocalLLaMA readers noticed the infrastructure lesson: Zai claimed 15% more GPU inference throughput and 40.6% lower first-token P99 latency with the same GPUs, model, and software stack.
ZOZO’s ppf-contact-solver brings a production-grade cloth and soft-body contact engine into the open. The headline number is more than 180 million contacts in one scene, plus Blender support and an Apache 2.0 license.
A community user achieved 110 tokens/second running Qwen3.6 35B A3B on an RTX 4070 Super 12GB via ik_llama.cpp, a fork with superior CPU offload optimization that significantly outperforms upstream llama.cpp's Multi-Token Prediction implementation.
A Tom's Hardware survey reveals that 60% of PC gamers won't build a new system in the next two years, as AI infrastructure demand has caused RAM prices to triple and GPU costs to surge.
AMD has officially confirmed FSR Upscaling 4.1 will arrive on Radeon RX 7000 GPUs in July 2026, with RX 6000 series support following in 2027. The news extends AI-enhanced upscaling to a broader range of AMD hardware.
A joint statement dated April 29 says Palit now handles GALAX operations and customer support. Existing owners are being directed to Palit’s RMA channels, and the previous GALAX structure has been shut down.
HN found this interesting because it tests a real boundary: whether Apple Silicon unified memory can make a Wasm sandbox and a GPU buffer operate on the same bytes.
Why it matters: Cloudflare is attacking the memory-bandwidth bottleneck in LLM serving rather than only buying more GPUs. Its post reports 15-22% model-size reduction, about 3 GB VRAM saved on Llama 3.1 8B, and open-sourced GPU kernels.
Hugging Face is trying to turn optimized GPU code into a Hub-native artifact, removing one of the messier deployment steps for PyTorch users. Clement Delangue says the new Kernels flow ships precompiled binaries matched to a specific GPU, PyTorch build, and OS, with claimed 1.7x to 2.5x speedups over PyTorch baselines.
A front-page Hacker News discussion resurfaced an EE Times interview outlining how AMD wants ROCm, Triton, OneROCm, and an open-source release model to chip away at CUDA dependence. The real test is not a headline compatibility claim, but whether stacks like vLLM and SGLang work in a boring, dependable way.
A MachineLearning thread argues that cuBLAS may be choosing an inefficient kernel for batched FP32 matrix multiplication on RTX 5090. The significance is not just the claimed slowdown, but the fact that the post includes reproducible benchmark tables, profiling notes, and linked repro material.