
Discontinued Intel Optane Memory Runs 1 Trillion Parameter LLM Locally at 4 Tokens/Sec

Original post: "Computer build using Intel Optane Persistent Memory - Can run 1 trillion parameter model at over 4 tokens/sec"

LLM | May 12, 2026 | By Insights AI (Reddit) | 1 min read

The Build

A post on r/LocalLLaMA detailed a custom system using Intel Optane Persistent Memory (PMem) to run Kimi K2.5, a 1-trillion-parameter model, locally at over 4 tokens per second. The post gathered 677 upvotes, with the community particularly interested in the novel use of discontinued hardware.

What Intel Optane PMem Is

Intel Optane Persistent Memory is a DIMM-form-factor module that sits between DRAM and SSDs in the memory hierarchy. Intel discontinued the product line, which means secondhand Optane sticks now sell for a fraction of the cost of equivalent DRAM capacity. The builder assembled 768GB of effective RAM using PMem in Memory Mode, where the Optane serves as system RAM and the standard DRAM sticks act as a transparent cache layer in front of it.
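For context, provisioning PMem in Memory Mode is normally done with Intel's `ipmctl` utility before the OS can see the modules as ordinary RAM. The post doesn't show the builder's exact steps, so the following is a minimal sketch assuming `ipmctl` is installed; the `MemoryMode=100` goal is the documented way to dedicate all PMem capacity to Memory Mode:

```python
import subprocess

# Hypothetical provisioning sketch (not from the original post):
# ask ipmctl to place 100% of installed Optane PMem in Memory Mode,
# where DRAM becomes a cache in front of the larger PMem capacity.
subprocess.run(
    ["ipmctl", "create", "-goal", "MemoryMode=100"],
    check=True,
)
# A reboot is required before the goal takes effect; afterwards the OS
# reports the PMem capacity (768GB in this build) as system RAM.
```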

How the Model Runs

Kimi K2.5's mixture-of-experts (MoE) architecture made it well suited to this setup. Using llama.cpp's hybrid GPU/CPU inference, the builder placed the attention weights, the dense layer, and the shared-expert components on a 12GB GPU, with the bulk of the sparse expert weights living in the Optane PMem.
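The post doesn't give the exact launch command, but llama.cpp's `--override-tensor` flag is the usual way to pin MoE expert tensors to system RAM (here, PMem-backed) while everything else goes to the GPU. A hedged sketch of such an invocation; the model filename, regex, and context size are illustrative assumptions, not the builder's settings:

```python
import subprocess

# Hypothetical llama-server launch: offload all layers to the GPU
# (-ngl 99), then override placement so tensors whose names match the
# expert-weight pattern (ffn_*_exps) stay in system RAM, which in
# Memory Mode means the Optane PMem. Attention, dense-layer, and
# shared-expert weights remain on the 12GB GPU.
cmd = [
    "./llama-server",
    "-m", "kimi-k2.5-q4.gguf",                    # illustrative model file
    "-ngl", "99",                                  # put all layers on GPU...
    "--override-tensor", r"\.ffn_.*_exps\.=CPU",   # ...except expert weights
    "-c", "8192",                                  # illustrative context size
]
subprocess.run(cmd, check=True)
```

Because only a small subset of experts is activated per token, each forward pass touches a fraction of the trillion parameters, which is what keeps PMem's lower bandwidth from making inference unusably slow.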

Why This Matters

Running trillion-parameter models locally has until now required datacenter-class hardware. This build demonstrates that creative use of secondhand discontinued hardware can bring that capability to a single workstation, opening a path for more researchers to work with frontier-scale models locally.


