#llama-cpp

LLM Reddit May 22, 2026 1 min read

110 tok/s on a 35B Model with 12GB VRAM Using ik_llama.cpp

A community user reports 110 tokens/second running Qwen3.6 35B A3B on a 12GB RTX 4070 Super with ik_llama.cpp, a llama.cpp fork whose CPU-offload optimizations outperformed upstream llama.cpp's Multi-Token Prediction implementation in the user's testing.

#llama-cpp #qwen #local-llm

LLM Reddit May 12, 2026 1 min read

Discontinued Intel Optane Memory Runs 1 Trillion Parameter LLM Locally at 4 Tokens/Sec

A LocalLLaMA user built a 768GB RAM system using discontinued Intel Optane Persistent Memory from the secondhand market, running the 1-trillion-parameter Kimi K2.5 model locally at over 4 tokens per second.

#intel-optane #local-llm #llama-cpp

LLM Reddit May 10, 2026 1 min read

Running Qwen3.6 35B A3B at 80+ tok/sec on 12GB VRAM With llama.cpp MTP

A LocalLLaMA user shares their config for running Qwen3.6 35B A3B at over 80 tok/sec with 128K context on a 12GB VRAM GPU, using llama.cpp's Multi-Token Prediction support and achieving a draft acceptance rate above 80%.

#local-llm #qwen #llama-cpp

LLM Reddit May 4, 2026 1 min read

llama.cpp Multi-Token Prediction Support Enters Beta, Closing the vLLM Performance Gap

llama.cpp's Multi-Token Prediction (MTP) support has entered beta, currently covering Qwen3.5 MTP. Combined with maturing tensor-parallel support, it is expected to close most of the token-generation speed gap between llama.cpp and vLLM.

#llama-cpp #mtp #local-llm

LLM Reddit Apr 29, 2026 2 min read

Qwen 3.6 27B’s quant test gave LocalLLaMA a favorite, and a methodology fight

The community liked this post for the same reason it immediately started arguing with it: it had real numbers. Q4_K_M came out looking like the practical sweet spot, but commenters quickly pushed on error bars, KV-cache settings, and whether the reported scores made sense at all.

#qwen #gguf #quantization

LLM Reddit Apr 28, 2026 3 min read

LocalLLaMA’s Budget VRAM Trick: Add an Old GPU to Keep 27B Models Off the CPU

LocalLLaMA latched onto a very concrete claim: if a 27B model fits entirely in VRAM across two mismatched cards, even a weak second GPU can be better than spilling into system RAM for long-context decoding.

#local-llms #vram #multi-gpu

LLM Reddit Apr 22, 2026 2 min read

A Rust manga translator showed LocalLLaMA what local OCR plus LLMs can feel like

LocalLLaMA reacted because this was not just a translation app; it chained detection, visual OCR, inpainting, and local LLM choices into one workflow.

#llama-cpp #ocr #local-llm

LLM Reddit Apr 22, 2026 2 min read

llama.cpp --fit made LocalLLaMA rethink the VRAM wall

LocalLLaMA reacted because --fit challenged the old rule of thumb that anything outside VRAM means painfully slow inference.

#llama-cpp #local-llm #vram

LLM Reddit Apr 19, 2026 1 min read

A Qwen3.6 tuning post made --n-cpu-moe the LocalLLaMA knob of the day

r/LocalLLaMA cared because the numbers were concrete: 79 t/s on an RTX 5070 Ti with 128K context, tied to one llama.cpp flag choice.

#qwen #llama-cpp #local-llm

LLM Reddit Apr 16, 2026 2 min read

LocalLLaMA Finds a Practical Speed Trick in Caching Hot MoE Experts in VRAM

LocalLLaMA reacted because the post attacks a very real pain point for running large MoE models on limited VRAM. The author tested a llama.cpp fork that tracks recently routed experts and keeps the hot ones in VRAM for Qwen3.5-122B-A10B, reporting 26.8% faster token generation than layer-based offload at a similar 22GB VRAM budget.

#local-llm #llama-cpp #moe

LLM Reddit Apr 16, 2026 2 min read

LocalLLaMA Gets Excited About an LLM That Tunes Its Own llama.cpp Flags

LocalLLaMA reacted because the joke-like idea of an LLM tuning its own runtime came with concrete benchmark numbers. The author says llm-server v2 adds --ai-tune, which feeds llama-server's help output into a tuning loop that searches flag combinations and caches the fastest config; on their rig, Qwen3.5-27B Q4_K_M went from 18.5 tok/s to 40.05 tok/s.

#local-llm #llama-cpp #optimization

LLM Hacker News Apr 16, 2026 2 min read

HN Turns the Ollama Backlash Into a Trust Check for Local LLM Tools

HN reacted because this was less about one wrapper and more about who gets credit and control in the local LLM stack. The Sleeping Robots post argues that Ollama won mindshare on top of llama.cpp while weakening trust through its attribution, packaging, cloud routing, and model storage choices; commenters pushed back that its UX still solved a real problem.

#local-llm #ollama #llama-cpp