LocalLLaMA Pushes GreenBoost, a Linux Driver That Extends NVIDIA GPU Memory with RAM and NVMe
Original: Open-Source "GreenBoost" Driver Aims To Augment NVIDIA GPUs vRAM With System RAM & NVMe To Handle Larger LLMs
GreenBoost is exactly the kind of infrastructure idea that LocalLLaMA notices quickly because it targets one of the ecosystem’s hardest practical limits: not enough GPU memory. At crawl time the Reddit thread had 141 upvotes and 38 comments. The linked Phoronix report was published on March 14, 2026, and described GreenBoost as an independently developed open-source Linux kernel module meant to augment NVIDIA GPU memory with system RAM and NVMe storage for larger LLM workloads.
According to Phoronix, GreenBoost does not replace NVIDIA’s official Linux driver stack. Instead it works alongside it through a dedicated kernel module, greenboost.ko, plus a CUDA user-space shim. The kernel side allocates pinned DDR4 pages with the buddy allocator, exports them as DMA-BUF file descriptors, and lets the GPU import them as CUDA external memory. The article says PCIe 4.0 x16 handles the actual data movement, while a sysfs interface and watchdog thread monitor RAM and NVMe pressure.
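Phoronix does not publish the import code, but the flow it describes lines up with CUDA's external-memory API. The sketch below is purely illustrative: the assumption that a DMA-BUF fd exported by greenboost.ko can be imported as an opaque file descriptor, and the helper name used here, are not details from the article.

```c
/* Hypothetical sketch of the described path: greenboost.ko pins system RAM,
 * exports it as a DMA-BUF fd, and user space maps that fd into CUDA as
 * external memory. The handle type and fd acquisition are assumptions. */
#include <cuda_runtime.h>
#include <stdio.h>

static void *import_greenboost_buffer(int dmabuf_fd, size_t bytes) {
    cudaExternalMemoryHandleDesc hdesc = {0};
    hdesc.type = cudaExternalMemoryHandleTypeOpaqueFd; /* assumed handle type */
    hdesc.handle.fd = dmabuf_fd;                       /* fd from greenboost.ko */
    hdesc.size = bytes;

    cudaExternalMemory_t ext_mem;
    if (cudaImportExternalMemory(&ext_mem, &hdesc) != cudaSuccess) {
        fprintf(stderr, "import of external memory failed\n");
        return NULL;
    }

    /* Map the imported range so kernels can address it like device memory;
     * reads and writes then travel over PCIe to the pinned host pages. */
    cudaExternalMemoryBufferDesc bdesc = {0};
    bdesc.offset = 0;
    bdesc.size = bytes;

    void *dev_ptr = NULL;
    if (cudaExternalMemoryGetMappedBuffer(&dev_ptr, ext_mem, &bdesc) != cudaSuccess) {
        fprintf(stderr, "mapping imported buffer failed\n");
        return NULL;
    }
    return dev_ptr;
}
```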
Why the design caught the community’s attention
- The CUDA shim can let small allocations pass through normally while redirecting larger ones, such as overflowing model weights or KV cache, into the expanded memory path.
- The user-space layer hooks allocation calls and even symbol lookups, so applications such as Ollama see a larger usable pool without any changes to the applications themselves (a minimal interposition sketch follows this list).
- The developer’s motivating example was trying to run a 31.8 GB model on a GeForce RTX 5070 with 12 GB of dedicated vRAM.
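The article does not show how the shim intercepts allocations, so the following is only a sketch of the general technique (LD_PRELOAD-style interposition of cudaMalloc). The 256 MiB threshold and the greenboost_alloc helper are invented for illustration, and GreenBoost reportedly also hooks symbol lookups, which this sketch omits.

```c
/* Illustrative LD_PRELOAD-style interposer: small requests fall through to the
 * real cudaMalloc, large ones are redirected to an (assumed) expanded pool. */
#define _GNU_SOURCE
#include <cuda_runtime.h>
#include <dlfcn.h>
#include <stddef.h>

#define REDIRECT_THRESHOLD (256UL << 20)  /* hypothetical 256 MiB cutoff */

/* Hypothetical entry point into the RAM/NVMe-backed pool. */
extern cudaError_t greenboost_alloc(void **ptr, size_t size);

cudaError_t cudaMalloc(void **devPtr, size_t size) {
    static cudaError_t (*real_cudaMalloc)(void **, size_t) = NULL;
    if (!real_cudaMalloc)  /* resolve the real symbol once */
        real_cudaMalloc =
            (cudaError_t (*)(void **, size_t))dlsym(RTLD_NEXT, "cudaMalloc");

    if (size >= REDIRECT_THRESHOLD)
        return greenboost_alloc(devPtr, size);  /* overflow path (assumed) */

    return real_cudaMalloc(devPtr, size);       /* normal device allocation */
}
```

Built as a shared object and injected with LD_PRELOAD, a shim along these lines would let an unmodified application keep calling cudaMalloc while oversized requests land in the expanded pool, which is the behavior the thread found most interesting.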
That makes the project interesting for local LLM operators because the usual fallback options are all painful. Offloading layers to system memory can crush throughput, while heavier quantization can reduce quality. GreenBoost proposes a different tradeoff by treating the storage hierarchy more aggressively as part of the usable GPU memory surface. Whether that pays off in practice will depend on bandwidth, latency, and workload shape, and the code is clearly experimental. But the enthusiasm on LocalLLaMA is easy to understand. The memory ceiling on consumer GPUs is still one of the biggest reasons people cannot run the models they want at the precision they want.
Source: Phoronix · Code: GitLab · Community discussion: r/LocalLLaMA
Related Articles
HN did not push Browser Harness because it was another browser wrapper. It took off because the repo lets an LLM patch its own browser helpers in the middle of a task, trading safety rails for raw flexibility.
Multimodal agents still pay a tax for chaining separate vision, audio, and text models. NVIDIA says Nemotron 3 Nano Omni collapses that stack into a 30B model with 256K context and up to 9.2x higher effective video system capacity at the same responsiveness target.
LocalLLaMA latched onto a very concrete claim: if a 27B model fits entirely in VRAM across two mismatched cards, even a weak second GPU can be better than spilling into system RAM for long-context decoding.