Running Llama 3.1 70B on a Single RTX 3090 via NVMe-to-GPU
Original: Show HN: Llama 3.1 70B on a single RTX 3090 via NVMe-to-GPU bypassing the CPU
70B Model Inference on a Single Consumer GPU
An open-source project called ntransformer, shared on Hacker News, demonstrates running Llama 3.1 70B on a single RTX 3090 GPU with 24GB of VRAM. At 16-bit precision (2 bytes per parameter), a 70B-parameter model needs about 140GB for its weights alone, far beyond what any consumer GPU offers.
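The 140GB figure follows from simple arithmetic, sketched below. The function names are hypothetical, and the 4-bit row is an illustrative assumption (the summary does not say which precision ntransformer uses); the estimate counts weights only and ignores the KV cache and activations.

```python
VRAM_GB = 24  # RTX 3090, as stated in the article


def weight_footprint_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


def fits_in_vram(footprint_gb: float, vram_gb: float = VRAM_GB) -> bool:
    """True if the weights alone would fit in the given amount of VRAM."""
    return footprint_gb <= vram_gb


N_PARAMS = 70e9  # 70B parameters

fp16_gb = weight_footprint_gb(N_PARAMS, 2.0)  # 16-bit: 2 bytes per parameter
int4_gb = weight_footprint_gb(N_PARAMS, 0.5)  # assumed 4-bit quantization

for name, gb in [("fp16", fp16_gb), ("int4 (assumed)", int4_gb)]:
    print(f"{name}: {gb:.0f} GB, fits in {VRAM_GB} GB VRAM: {fits_in_vram(gb)}")
```

Under these assumptions, neither precision fits in 24GB, which is why the weights have to live somewhere other than VRAM.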
The Core Technique: NVMe-to-GPU Direct Transfer
The key idea is to bypass CPU RAM entirely. Standard inference moves weights along the path storage → CPU RAM → GPU VRAM; ntransformer instead streams them directly from the NVMe SSD into GPU VRAM. This approach:
- Eliminates CPU memory as a bottleneck
- Leverages NVMe's high bandwidth directly
- Loads only the currently needed layers into GPU memory (layer-by-layer streaming)
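The layer-by-layer idea can be illustrated with a small CPU-only sketch. This is not ntransformer's code and it does not touch a GPU: an ordinary file stands in for the NVMe drive, a Python array stands in for VRAM, and each "layer" is a toy elementwise scale. The names `write_layers` and `stream_forward` are hypothetical. The point is only that a single layer's weights are resident at any time.

```python
import os
import tempfile
from array import array


def write_layers(path, layers):
    """Store each layer's float32 weights back to back.

    Returns a list of (byte_offset, element_count) pairs, one per layer.
    """
    index = []
    with open(path, "wb") as f:
        for weights in layers:
            index.append((f.tell(), len(weights)))
            array("f", weights).tofile(f)
    return index


def stream_forward(path, index, x):
    """Run the toy model while holding only one layer's weights in memory.

    Returns the output vector and the peak number of weight bytes resident.
    """
    peak_resident = 0
    with open(path, "rb") as f:
        for offset, count in index:
            f.seek(offset)
            buf = array("f")
            buf.fromfile(f, count)  # stand-in for the storage -> VRAM transfer
            peak_resident = max(peak_resident, len(buf) * buf.itemsize)
            # Toy "layer": elementwise scale. A real layer would be a
            # transformer block executed on the GPU.
            x = [xi * wi for xi, wi in zip(x, buf)]
            # buf is replaced on the next iteration, so the previous
            # layer's weights are released before the next ones load.
    return x, peak_resident


with tempfile.TemporaryDirectory() as tmp:
    weights_path = os.path.join(tmp, "weights.bin")
    layer_index = write_layers(weights_path, [[2.0] * 4, [0.5] * 4, [3.0] * 4])
    output, peak = stream_forward(weights_path, layer_index, [1.0, 2.0, 3.0, 4.0])

print(output, peak)
```

In this sketch the peak resident size equals one layer, regardless of how many layers the file holds. A real implementation would also overlap the next layer's transfer with the current layer's compute, which this sequential loop does not attempt.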
Implications
This approach makes large-model experimentation accessible to developers who have a high-end consumer GPU but no access to expensive server hardware. Inference is slower than with the full model resident in VRAM, because weights must be read from the SSD as each layer is needed, but the gain in accessibility is significant.
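A rough way to reason about the slowdown is a bandwidth bound: if weights must be re-read from storage for every generated token, the time per token is at least the streamed bytes divided by the read bandwidth. The sketch below uses a hypothetical helper and assumed numbers (a 7 GB/s drive, 140GB of weights); these are illustrative figures, not measurements of ntransformer.

```python
def seconds_per_token(model_gb: float, read_gb_per_s: float,
                      resident_gb: float = 0.0) -> float:
    """Lower bound on time per token when non-resident weights are
    re-read from storage for every token.

    resident_gb is the portion of the weights assumed to stay in VRAM.
    Compute time and transfer overlap are ignored.
    """
    if read_gb_per_s <= 0:
        raise ValueError("read bandwidth must be positive")
    streamed_gb = max(model_gb - resident_gb, 0.0)
    return streamed_gb / read_gb_per_s


# Assumed figures for illustration only.
MODEL_GB = 140.0      # 16-bit weights of a 70B model
NVME_GB_PER_S = 7.0   # assumed sequential read rate of a fast NVMe drive

bound = seconds_per_token(MODEL_GB, NVME_GB_PER_S)
print(f"at least {bound:.1f} s per token under these assumptions")
```

Smaller weights (for example through quantization) or keeping some layers resident in VRAM would lower this bound; the summary does not state which of these, if any, the project uses.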
The project is available as open source on GitHub. It received 233 upvotes on Hacker News, reflecting strong interest in democratizing access to large language models.