Running Llama 3.1 70B on a Single RTX 3090 via NVMe-to-GPU
Original: Show HN: Llama 3.1 70B on a single RTX 3090 via NVMe-to-GPU bypassing the CPU
70B Model Inference on a Single Consumer GPU
An open-source project called ntransformer, shared on Hacker News, demonstrates running Llama 3.1 70B on a single RTX 3090 GPU with 24GB of VRAM. At 16-bit precision, a 70B-parameter model requires around 140GB of memory for its weights alone — far beyond what any consumer GPU offers.
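The 140GB figure follows from simple arithmetic, assuming two-byte (FP16/BF16) weights and ignoring activations and KV cache:

```python
# Back-of-the-envelope VRAM math. Assumptions: FP16/BF16 weights at
# 2 bytes per parameter; activation memory and KV cache excluded.
params = 70_000_000_000          # Llama 3.1 70B parameter count
bytes_per_param = 2              # FP16 / BF16
weight_gb = params * bytes_per_param / 1e9
print(f"weights: {weight_gb:.0f} GB vs 24 GB VRAM on an RTX 3090")
```

Even aggressive 4-bit quantization (~35GB of weights) would still exceed a 3090's 24GB, which is why streaming weights from storage is needed at all.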
The Core Technique: NVMe-to-GPU Direct Transfer
The key innovation is bypassing CPU RAM entirely. Standard model inference loads weights through: storage → CPU RAM → GPU VRAM. ntransformer instead streams weights directly from NVMe SSD to GPU VRAM.
- Eliminates CPU memory as a bottleneck
- Leverages NVMe's high bandwidth directly
- Loads only the currently needed layers into GPU memory (layer-by-layer streaming)
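The layer-by-layer streaming idea can be sketched in plain Python. This is a hypothetical toy model, not ntransformer's implementation: ordinary file reads stand in for the NVMe-to-VRAM DMA transfer, and the key property shown is that only one layer's weights are ever resident at a time.

```python
import os
import tempfile

# Toy sketch of layer-by-layer weight streaming. In the real system,
# weights would be DMA'd from NVMe directly into GPU VRAM (bypassing
# CPU RAM); here a seek + read per layer stands in for that transfer.
# All names below are illustrative, not from the ntransformer codebase.

LAYER_SIZE = 1024  # bytes per layer in this toy example

def write_fake_checkpoint(path, n_layers):
    """Write n_layers contiguous weight blobs into one checkpoint file."""
    with open(path, "wb") as f:
        for i in range(n_layers):
            f.write(bytes([i % 256]) * LAYER_SIZE)

def stream_layers(path, n_layers):
    """Yield one layer's weights at a time; peak residency is one layer."""
    with open(path, "rb") as f:
        for i in range(n_layers):
            f.seek(i * LAYER_SIZE)   # jump straight to this layer's offset
            weights = f.read(LAYER_SIZE)
            yield i, weights         # caller runs the layer, then drops it

if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "weights.bin")
    write_fake_checkpoint(path, n_layers=4)
    for idx, w in stream_layers(path, 4):
        assert len(w) == LAYER_SIZE  # only this layer is in memory now
        print(f"layer {idx}: first byte {w[0]}")
```

The trade-off is visible even in the sketch: each forward pass pays a storage read per layer, so throughput is bounded by NVMe bandwidth rather than VRAM bandwidth.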
Implications
This approach makes large-model experimentation accessible to developers with high-end consumer GPUs who lack access to expensive server hardware. Inference is slower than with the full model resident in VRAM, but the gain in accessibility is significant.
The project is available as open source on GitHub. It received 233 upvotes on Hacker News, reflecting strong interest in democratizing access to large language models.
Related Articles
In a February 12, 2026 post, NVIDIA said major inference providers are reducing token costs with open-source frontier models on Blackwell. The article includes partner-reported gains across healthcare, gaming, and enterprise support workloads.
A high-engagement Hacker News thread spotlights Taalas’ claim that model-specific silicon can cut inference latency and cost, including a hard-wired Llama 3.1 8B deployment reportedly reaching 17K tokens/sec per user.
A popular r/LocalLLaMA thread points to karpathy/autoresearch, a small open-source setup where an agent edits one training file, runs 5-minute experiments, and iterates toward lower validation bits per byte.