LLM Reddit Apr 28, 2026 3 min read
LocalLLaMA latched onto a very concrete claim: if a 27B model fits entirely in VRAM across two mismatched cards, even a weak second GPU can be better than spilling into system RAM for long-context decoding.
LocalLLaMA latched onto a very concrete claim: if a 27B model fits entirely in VRAM across two mismatched cards, even a weak second GPU can be better than spilling into system RAM for long-context decoding.
A high-scoring LocalLLaMA thread treated merged PR #19378 as a meaningful step toward more practical multi-GPU inference in llama.cpp. The catch is that the new <code>--split-mode tensor</code> path is still explicitly experimental, strongest today on CUDA, and still rough on ROCm and Vulkan.