Running Llama 3.1 70B on a Single RTX 3090 via NVMe-to-GPU
Original: Show HN: Llama 3.1 70B on a single RTX 3090 via NVMe-to-GPU bypassing the CPU
70B Model Inference on a Single Consumer GPU
An open-source project called ntransformer, shared on Hacker News, demonstrates running Llama 3.1 70B on a single RTX 3090 GPU with 24GB of VRAM. A 70B-parameter model stored in 16-bit precision needs around 140GB for its weights alone — far beyond what any consumer GPU offers.
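The 140GB figure is simple arithmetic — parameter count times bytes per parameter. A quick sketch (the byte widths are standard precisions, not details from the project):

```python
# Back-of-envelope weight footprint for a 70B-parameter model.
# Assumed precisions: fp16 = 2 bytes/param, int4 = 0.5 bytes/param.
def weight_footprint_gb(n_params: float, bytes_per_param: float) -> float:
    """Return model weight size in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

fp16 = weight_footprint_gb(70e9, 2)    # 140.0 GB -- nearly 6x a 3090's 24 GB
int4 = weight_footprint_gb(70e9, 0.5)  # 35.0 GB  -- still too large to fit
print(f"fp16: {fp16:.0f} GB, int4: {int4:.0f} GB")
```

Even aggressive 4-bit quantization leaves the weights larger than the GPU's memory, which is why some form of streaming is unavoidable here.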
The Core Technique: NVMe-to-GPU Direct Transfer
The key innovation is bypassing CPU RAM entirely. Standard model inference loads weights through: storage → CPU RAM → GPU VRAM. ntransformer instead streams weights directly from NVMe SSD to GPU VRAM.
- Eliminates CPU memory as a bottleneck
- Leverages NVMe's high bandwidth directly
- Loads only the currently needed layers into GPU memory (layer-by-layer streaming)
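The layer-by-layer streaming idea can be sketched in plain Python. This is a conceptual stand-in, not ntransformer's actual code: ordinary file reads into a reused buffer play the role of the direct NVMe-to-VRAM transfers, and the layer sizes and file layout are invented for the demo.

```python
# Conceptual sketch of layer-by-layer weight streaming. A fixed buffer,
# sized for the largest layer, is refilled from storage for each layer;
# in the real technique this refill is a direct NVMe->GPU transfer.
import io

def stream_layers(f, layer_sizes, buf):
    """Yield each layer's weights, read one at a time into a reused buffer."""
    offset = 0
    for size in layer_sizes:
        f.seek(offset)
        view = memoryview(buf)[:size]
        f.readinto(view)   # stand-in for the NVMe-to-VRAM DMA step
        yield view         # run this layer's forward pass, then reuse buf
        offset += size

# Toy demo: three "layers" stored back-to-back in one blob.
blob = io.BytesIO(b"AAAA" + b"BB" + b"CCC")
sizes = [4, 2, 3]
buf = bytearray(max(sizes))
chunks = [bytes(v) for v in stream_layers(blob, sizes, buf)]
print(chunks)  # [b'AAAA', b'BB', b'CCC']
```

The key property is that peak memory is bounded by the largest layer, not the whole model — which is how a 140GB model can pass through 24GB of VRAM.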
Implications
This approach makes large-model experimentation accessible to developers with high-end consumer GPUs who lack access to expensive server hardware. Inference is slower than with the full model resident in VRAM, since weights must be re-streamed from disk, but the accessibility improvement is significant.
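Why it is slower follows from bandwidth arithmetic. A rough upper bound, using assumed numbers (a PCIe 4.0 NVMe drive sustaining ~7 GB/s, and every weight re-read from disk for each generated token — these are not measurements from the project):

```python
# Rough ceiling on generation speed when each token requires streaming
# the full weight set from NVMe. Assumed drive bandwidth: ~7 GB/s.
def tokens_per_second(model_gb: float, nvme_gbps: float) -> float:
    """Upper bound on tokens/s if one full weight pass is read per token."""
    return nvme_gbps / model_gb

print(f"fp16 (140 GB): {tokens_per_second(140, 7):.3f} tok/s")  # 0.050
print(f"int4 (35 GB):  {tokens_per_second(35, 7):.3f} tok/s")   # 0.200
```

Caching hot layers in VRAM, quantizing weights, or batching tokens per disk pass would raise this ceiling, but disk bandwidth remains the limiting factor.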
The project is available as open source on GitHub. It received 233 upvotes on Hacker News, reflecting strong interest in democratizing access to large language models.
Related Articles
A new r/MachineLearning post pushes TurboQuant beyond KV-cache talk and into weight compression, with a GitHub implementation that targets drop-in low-bit LLM inference.
NVIDIA announced Dynamo 1.0 on March 16, 2026 as a production-grade open-source layer for generative and agentic inference. The release matters because it ties Blackwell performance gains, lower per-token costs, and native integration with major open-source frameworks into one operating model.
A fresh LocalLLaMA thread argues that some early Gemma 4 failures are really inference-stack bugs rather than model quality problems. By linking active llama.cpp pull requests and user reports after updates, the post reframes launch benchmarks as a full-stack issue.