
Discontinued Intel Optane Memory Runs 1 Trillion Parameter LLM Locally at 4 Tokens/Sec

Original post: "Computer build using Intel Optane Persistent Memory - Can run 1 trillion parameter model at over 4 tokens/sec"

LLM | May 12, 2026 | By Insights AI (Reddit) | 1 min read

The Build

A post on r/LocalLLaMA detailed a custom system using Intel Optane Persistent Memory (PMem) to run Kimi K2.5, a 1-trillion-parameter model, locally at over 4 tokens per second. The post gathered 677 upvotes, with the community particularly interested in the novel use of discontinued hardware.

What Intel Optane PMem Is

Intel Optane Persistent Memory is a DIMM-form-factor module that sits between DRAM and SSDs in the memory hierarchy. Intel discontinued the product line, which means secondhand Optane sticks now sell for a fraction of the cost of equivalent DRAM capacity. The builder assembled 768GB of effective RAM using PMem in Memory Mode, where the Optane serves as system RAM and the standard DRAM sticks act as a transparent cache layer in front of it.
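For context, provisioning PMem in Memory Mode is normally done with Intel's `ipmctl` utility before the OS can see the modules as ordinary RAM. The post doesn't show the builder's exact steps, so the following is a minimal sketch assuming `ipmctl` is installed; the `MemoryMode=100` goal is the documented way to dedicate all PMem capacity to Memory Mode:

```python
import subprocess

# Hypothetical provisioning sketch (not from the original post):
# ask ipmctl to place 100% of installed Optane PMem in Memory Mode,
# where DRAM becomes a cache in front of the larger PMem capacity.
subprocess.run(
    ["ipmctl", "create", "-goal", "MemoryMode=100"],
    check=True,
)
# A reboot is required before the goal takes effect; afterwards the OS
# reports the PMem capacity (768GB in this build) as system RAM.
```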

How the Model Runs

Kimi K2.5's mixture-of-experts (MoE) architecture made it well suited to this setup. Using llama.cpp's hybrid GPU/CPU inference, the builder placed the attention weights, the dense layer, and the shared-expert components on a 12GB GPU, with the bulk of the sparse expert weights living in the Optane PMem.
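The post doesn't give the exact launch command, but llama.cpp's `--override-tensor` flag is the usual way to pin MoE expert tensors to system RAM (here, PMem-backed) while everything else goes to the GPU. A hedged sketch of such an invocation; the model filename, regex, and context size are illustrative assumptions, not the builder's settings:

```python
import subprocess

# Hypothetical llama-server launch: offload all layers to the GPU
# (-ngl 99), then override placement so tensors whose names match the
# expert-weight pattern (ffn_*_exps) stay in system RAM, which in
# Memory Mode means the Optane PMem. Attention, dense-layer, and
# shared-expert weights remain on the 12GB GPU.
cmd = [
    "./llama-server",
    "-m", "kimi-k2.5-q4.gguf",                    # illustrative model file
    "-ngl", "99",                                  # put all layers on GPU...
    "--override-tensor", r"\.ffn_.*_exps\.=CPU",   # ...except expert weights
    "-c", "8192",                                  # illustrative context size
]
subprocess.run(cmd, check=True)
```

Because only a small subset of experts is activated per token, each forward pass touches a fraction of the trillion parameters, which is what keeps PMem's lower bandwidth from making inference unusably slow.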

Why This Matters

Running trillion-parameter models locally has until now required datacenter-class hardware. This build demonstrates that creative use of secondhand discontinued hardware can bring that capability to a single workstation, opening a path for more researchers to work with frontier-scale models locally.


