Running Llama 3.1 70B on a Single RTX 3090 via NVMe-to-GPU
Original: Show HN: Llama 3.1 70B on a single RTX 3090 via NVMe-to-GPU bypassing the CPU
70B Model Inference on a Single Consumer GPU
An open-source project called ntransformer, shared on Hacker News, demonstrates running Llama 3.1 70B on a single RTX 3090 GPU with 24GB of VRAM. A 70B-parameter model stored in 16-bit precision needs around 140GB for its weights alone — far beyond what any consumer GPU offers.
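The 140GB figure is simple arithmetic — parameter count times bytes per parameter. A quick sketch (the byte widths are standard precisions, not details from the project):

```python
# Back-of-envelope weight footprint for a 70B-parameter model.
# Assumed precisions: fp16 = 2 bytes/param, int4 = 0.5 bytes/param.
def weight_footprint_gb(n_params: float, bytes_per_param: float) -> float:
    """Return model weight size in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

fp16 = weight_footprint_gb(70e9, 2)    # 140.0 GB -- nearly 6x a 3090's 24 GB
int4 = weight_footprint_gb(70e9, 0.5)  # 35.0 GB  -- still too large to fit
print(f"fp16: {fp16:.0f} GB, int4: {int4:.0f} GB")
```

Even aggressive 4-bit quantization leaves the weights larger than the GPU's memory, which is why some form of streaming is unavoidable here.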
The Core Technique: NVMe-to-GPU Direct Transfer
The key innovation is bypassing CPU RAM entirely. Standard model inference loads weights through: storage → CPU RAM → GPU VRAM. ntransformer instead streams weights directly from NVMe SSD to GPU VRAM.
- Eliminates CPU memory as a bottleneck
- Leverages NVMe's high bandwidth directly
- Loads only the currently needed layers into GPU memory (layer-by-layer streaming)
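The layer-by-layer streaming idea can be sketched in plain Python. This is a conceptual stand-in, not ntransformer's actual code: ordinary file reads into a reused buffer play the role of the direct NVMe-to-VRAM transfers, and the layer sizes and file layout are invented for the demo.

```python
# Conceptual sketch of layer-by-layer weight streaming. A fixed buffer,
# sized for the largest layer, is refilled from storage for each layer;
# in the real technique this refill is a direct NVMe->GPU transfer.
import io

def stream_layers(f, layer_sizes, buf):
    """Yield each layer's weights, read one at a time into a reused buffer."""
    offset = 0
    for size in layer_sizes:
        f.seek(offset)
        view = memoryview(buf)[:size]
        f.readinto(view)   # stand-in for the NVMe-to-VRAM DMA step
        yield view         # run this layer's forward pass, then reuse buf
        offset += size

# Toy demo: three "layers" stored back-to-back in one blob.
blob = io.BytesIO(b"AAAA" + b"BB" + b"CCC")
sizes = [4, 2, 3]
buf = bytearray(max(sizes))
chunks = [bytes(v) for v in stream_layers(blob, sizes, buf)]
print(chunks)  # [b'AAAA', b'BB', b'CCC']
```

The key property is that peak memory is bounded by the largest layer, not the whole model — which is how a 140GB model can pass through 24GB of VRAM.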
Implications
This approach makes large-model experimentation accessible to developers with high-end consumer GPUs who lack access to expensive server hardware. Inference is slower than with the full model resident in VRAM, since weights must be re-streamed from disk, but the accessibility improvement is significant.
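Why it is slower follows from bandwidth arithmetic. A rough upper bound, using assumed numbers (a PCIe 4.0 NVMe drive sustaining ~7 GB/s, and every weight re-read from disk for each generated token — these are not measurements from the project):

```python
# Rough ceiling on generation speed when each token requires streaming
# the full weight set from NVMe. Assumed drive bandwidth: ~7 GB/s.
def tokens_per_second(model_gb: float, nvme_gbps: float) -> float:
    """Upper bound on tokens/s if one full weight pass is read per token."""
    return nvme_gbps / model_gb

print(f"fp16 (140 GB): {tokens_per_second(140, 7):.3f} tok/s")  # 0.050
print(f"int4 (35 GB):  {tokens_per_second(35, 7):.3f} tok/s")   # 0.200
```

Caching hot layers in VRAM, quantizing weights, or batching tokens per disk pass would raise this ceiling, but disk bandwidth remains the limiting factor.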
The project is available as open source on GitHub. It received 233 upvotes on Hacker News, reflecting strong interest in democratizing access to large language models.
Related Articles
A new r/MachineLearning post pushes TurboQuant beyond KV-cache talk and into weight compression, with a GitHub implementation that targets drop-in low-bit LLM inference.
NVIDIA announced Dynamo 1.0 on March 16, 2026 as a production-grade open-source layer for generative and agentic inference. The release matters because it ties Blackwell performance gains, lower per-token costs, and native integration with major open-source frameworks into one operating model.
A fresh LocalLLaMA thread argues that some early Gemma 4 failures are really inference-stack bugs rather than model quality problems. By linking active llama.cpp pull requests and user reports after updates, the post reframes launch benchmarks as a full-stack issue.