Hacker News Spots GreenBoost, a Linux stack that stretches GPU VRAM with system RAM and NVMe

Original: Nvidia greenboost: transparently extend GPU VRAM using system RAM/NVMe

LLM · Mar 19, 2026 · By Insights AI (HN) · 2 min read

Why this HN submission mattered

On March 15, 2026, a Hacker News post about GreenBoost reached 124 points and 25 comments. The open-source project proposes a three-tier memory system for local AI workloads: keep hot data in GPU VRAM, spill colder allocations into system RAM, and use NVMe as a last-resort overflow tier. The pitch is simple: run larger LLMs on consumer hardware without rewriting the inference stack.

The README is framed around a specific frustration. The author wanted to run a 31.8 GB model on an RTX 5070 with 12 GB of VRAM. CPU offload was too slow, smaller quantization reduced quality, and upgrading to a much larger GPU was too expensive. GreenBoost is presented as an attempt to keep the GPU in the loop by letting CUDA-visible allocations extend beyond native VRAM.

How the design works

The project has two main pieces. A Linux kernel module allocates pinned DDR memory, exports it as DMA-BUF, and lets the GPU import it as CUDA external memory. A userspace shim injected through LD_PRELOAD intercepts allocation calls such as cudaMalloc and cudaMallocAsync, sending large allocations to the extended pool while leaving smaller ones alone. The README also says the shim hooks symbol resolution so apps like Ollama report the larger memory budget correctly.

The tiering model is the core idea: 12 GB of local VRAM at roughly 336 GB/s for hot layers, 51 GB of DDR4 over PCIe 4.0 for colder weights and KV cache, and 64 GB of NVMe as a safety valve. That does not make RAM or storage behave like true VRAM. What it does promise is a smoother way to trade bandwidth for capacity while keeping existing CUDA applications largely unchanged.

Why people are watching it

Hacker News reacts strongly to tools that make local AI cheaper, and GreenBoost lands squarely in that zone. It is Linux-only, highly experimental, and tightly coupled to low-level CUDA behavior, so nobody should treat it as a drop-in mainstream solution yet. But as a piece of systems engineering, it is interesting because it tries to attack the real bottleneck in consumer LLM inference: memory capacity, not just raw compute.

If the approach proves stable across more workloads, projects like this could matter for developers trying to squeeze more useful work out of midrange GPUs. Even if it stays niche, the repo is a clear sign of how aggressive the local-model community has become about bending memory hierarchies to fit frontier-sized workloads onto smaller boxes.

Primary source: GitLab repository. Community discussion: Hacker News.


© 2026 Insights. All rights reserved.