LocalLLaMA’s Budget VRAM Trick: Add an Old GPU to Keep 27B Models Off the CPU

Original post: "To 16GB VRAM users, plug in your old GPU"

LLM · Apr 28, 2026 · By Insights AI (Reddit) · 3 min read

Why the thread resonated

LocalLLaMA loves expensive workstation builds, but this post took off because it offered a cheaper kind of optimism. The author was not showing off dual flagship cards. They were telling 16GB VRAM users to stop treating their old GPU as junk and start treating it as memory. The basic claim was simple: if you can keep a dense 27B model entirely on GPUs, even with a lopsided pair such as an RTX 5070 Ti 16GB plus an older RTX 2060 6GB, you can avoid the brutal slowdown that comes from pushing part of the model or KV cache into system RAM.

That framing gave the thread immediate traction because it speaks to the actual bottleneck of local inference in 2026. For many hobbyists, the question is not peak benchmark glamour. It is whether one more cheap card can make long-context use finally tolerable.

What the poster actually ran

The post used llama-server with a deliberately memory-conscious configuration: both devices enabled, the GPU layer count pushed to the ceiling, mmap disabled, a q8-quantized KV cache, and a 128k context target. The author's practical point was that split-mode layer offload works even when the two cards are uneven. In the anecdotal run, with about 71k tokens of actual context, the setup produced 186.76 tokens per second for prompt processing and 19.21 tokens per second for generation, which the poster contrasted with roughly 4 tokens per second when a single-card setup was forced to drag too much through CPU memory.
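The thread did not include a verbatim command, but the description maps onto standard llama.cpp flags. A minimal sketch, assuming a recent llama-server build; the model filename and the -ts ratio are hypothetical, and some builds also need flash attention enabled (-fa) before they accept a quantized V cache:

```bash
# Sketch of the setup described above, not the poster's exact command.
#   --device CUDA0,CUDA1   enable both cards (16GB main + 6GB secondary)
#   -ngl 99                offload effectively every layer to the GPUs
#   --no-mmap              load weights directly instead of memory-mapping them
#   -sm layer              split whole layers across the two cards
#   -ts 16,6               weight the split roughly by VRAM size (hypothetical ratio)
#   -ctk/-ctv q8_0         8-bit quantized KV cache to shrink 128k-context memory
#   -c 131072              the 128k context target
llama-server \
  -m ./model-27b-q4_k_m.gguf \
  --device CUDA0,CUDA1 \
  -ngl 99 \
  --no-mmap \
  -sm layer \
  -ts 16,6 \
  -ctk q8_0 -ctv q8_0 \
  -c 131072
```

The point of -sm layer is that each card holds complete layers, so the slow card's only job is to run its own slice; nothing has to round-trip through system RAM.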

They then posted more structured llama-bench numbers. On CUDA 12.4 at 8k context, generation moved from 16.54 t/s on the main card alone to 25.40 t/s with both GPUs. At 16k context, it moved from 12.03 t/s to 24.31 t/s. The message was not that two uneven cards are magically ideal. It was that staying inside VRAM can dominate the ugly cost of asymmetry.
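Those A/B numbers are straightforward to reproduce with llama-bench. A hedged sketch, with a hypothetical model path and illustrative layer counts; the -d (depth) flag, which measures generation against a prefilled context, exists only on newer llama-bench builds:

```bash
# Baseline: main card alone, partial offload so the model fits in 16GB.
llama-bench -m ./model-27b-q4_k_m.gguf -ngl 40 -sm none -n 128 -d 8192,16384

# Both cards: full offload, layers split across the pair.
llama-bench -m ./model-27b-q4_k_m.gguf -ngl 99 -sm layer -n 128 -d 8192,16384
```

Comparing the tg rows at each depth should give the same shape as the poster's table: the dual-GPU gain grows as the context, and therefore the KV cache, gets larger.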

Where the comments pushed back

The top reply immediately said the author should be using CUDA rather than Vulkan with NVIDIA cards. Another commenter agreed with the general principle that every bit of VRAM is usually better than RAM, and said they also enable an extra GPU only for the largest models. But a caveat showed up quickly too. One user with a 3090 Ti plus a 2070 said the weaker second card can bottleneck short-context performance badly, even while still helping compared with CPU offload at larger contexts.

That pushback made the thread more useful. The community was not treating this as a universal recipe. It was treating it as a tradeoff: sacrifice some balance and maybe some prompt speed, but rescue long-context generation from the much worse fate of falling back to RAM.
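That tradeoff is also tunable rather than all-or-nothing. A hedged sketch, with hypothetical ratios: biasing the layer split toward the faster card keeps most of the compute there and uses the weaker card only as overflow VRAM.

```bash
# Hypothetical bias, not from the thread: push most layers onto the fast card
# and give the slow card just enough to keep the model fully off the CPU.
llama-server -m ./model-27b-q4_k_m.gguf \
  --device CUDA0,CUDA1 -ngl 99 -sm layer \
  -ts 3,1 \
  -c 131072 -ctk q8_0 -ctv q8_0 --no-mmap
```

How far the ratio can be pushed depends on the quantization and the KV cache: the fast card still has to fit its share of the weights plus its slice of a 128k cache.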

Why LocalLLaMA upvoted it

This thread landed because it matched the community's current mood. People are no longer only chasing bigger open models. They are hunting practical ways to fit those models on hardware they already own. The clever move here is conceptual as much as technical: an old gaming card is not necessarily extra compute; it is extra addressable model memory. For users stuck below 24GB, that is a much more actionable insight than another leaderboard screenshot.

Source: r/LocalLLaMA thread.
