#llama-cpp

LLM Reddit 4d ago 2 min read

A Rust manga translator showed LocalLLaMA what local OCR plus LLMs can feel like

LocalLLaMA reacted because this was not just a translation app; it chained detection, visual OCR, inpainting, and local LLM choices into one workflow.

#llama-cpp #ocr #local-llm

LLM Reddit 4d ago 2 min read

llama.cpp --fit made LocalLLaMA rethink the VRAM wall

LocalLLaMA reacted because --fit challenged the old rule of thumb that anything outside VRAM means painfully slow inference.

#llama-cpp #local-llm #vram

LLM Reddit Apr 19, 2026 1 min read

A Qwen3.6 tuning post made --n-cpu-moe the LocalLLaMA knob of the day

r/LocalLLaMA cared because the numbers were concrete: 79 t/s on an RTX 5070 Ti with 128K context, tied to one llama.cpp flag choice.

#qwen #llama-cpp #local-llm

LLM Reddit Apr 16, 2026 2 min read

LocalLLaMA Finds a Practical Speed Trick in Caching Hot MoE Experts in VRAM

LocalLLaMA reacted because the post attacks a very real pain point for running large MoE models on limited VRAM. The author tested a llama.cpp fork that tracks recently routed experts and keeps the hot ones in VRAM for Qwen3.5-122B-A10B, reporting 26.8% faster token generation than layer-based offload at a similar 22GB VRAM budget.

#local-llm #llama-cpp #moe

LLM Reddit Apr 16, 2026 2 min read

LocalLLaMA Gets Excited About an LLM That Tunes Its Own llama.cpp Flags

LocalLLaMA reacted because the joke-like idea of an LLM tuning its own runtime came with concrete benchmark numbers. The author says llm-server v2 adds --ai-tune, feeding llama-server help into a tuning loop that searches flag combinations and caches the fastest config; on their rig, Qwen3.5-27B Q4_K_M moved from 18.5 tok/s to 40.05 tok/s.

#local-llm #llama-cpp #optimization

LLM Hacker News Apr 16, 2026 2 min read

HN Turns the Ollama Backlash Into a Trust Check for Local LLM Tools

HN reacted because this was less about one wrapper and more about who gets credit and control in the local LLM stack. The Sleeping Robots post argues that Ollama won mindshare on top of llama.cpp while weakening trust through attribution, packaging, cloud routing, and model storage choices, while commenters pushed back that its UX still solved a real problem.

#local-llm #ollama #llama-cpp

LLM Reddit Apr 15, 2026 1 min read

Reddit sees a cleaner local speech stack as audio lands in llama-server with Gemma 4

LocalLLaMA jumped on this because native audio in llama-server promises a much cleaner speech workflow for local AI. The first wave of comments loves the idea of dropping the extra Whisper service, but it is also documenting where long-form audio still breaks.

#llama-cpp #speech-to-text #gemma

LLM Hacker News Apr 14, 2026 2 min read

Hacker News picks up a practical Gemma 4 local-agent recipe for moving Codex CLI off the cloud

Daniel Vaughan’s Gemma 4 writeup tests whether a local model can function as a real Codex CLI agent, with the answer depending less on benchmark claims than on very specific serving choices. The key lesson is that Apple Silicon required llama.cpp plus `--jinja`, KV-cache quantization, and `web_search = "disabled"`, while a GB10 box worked through Ollama 0.20.5.

#gemma-4 #codex-cli #local-llm

LLM Reddit Apr 13, 2026 2 min read

r/LocalLLaMA tracks the llama.cpp merge that brings in Qwen3 audio support

A 54-point Reddit post flagged merged PR #19441 as the moment qwen3-omni-moe and qwen3-asr support reached llama.cpp, with commenters focused on local multimodal and ASR use cases.

#qwen3 #llama-cpp #audio

LLM Reddit Apr 13, 2026 2 min read

LocalLLaMA Benchmark Claims Gemma 4 Speculative Decoding Gains of 29% on Average

A detailed `r/LocalLLaMA` benchmark reports that pairing `Gemma 4 31B` with `Gemma 4 E2B` as a draft model in `llama.cpp` lifted average throughput from `57.17 t/s` to `73.73 t/s`.

#speculative-decoding #gemma-4 #llama-cpp

LLM Reddit Apr 12, 2026 2 min read

LocalLLaMA Benchmarks Gemma 4 Speculative Decoding at a 29% Average Speedup

A new r/LocalLLaMA benchmark reports that Gemma 4 31B paired with an E2B draft model can gain about 29% average throughput, with code generation improving by roughly 50%.

#gemma-4 #speculative-decoding #llama-cpp

LLM Reddit Apr 12, 2026 2 min read

A Gemma 4 26B User Pushes Local Context to 245K Tokens

A r/LocalLLaMA stress test claims Gemma 4 26B A4B remained coherent at roughly 94% of a 262,144-token context window in llama.cpp. The post is anecdotal, but it is valuable because it pairs the claim with concrete tuning details and failure modes.

#localllm #gemma-4 #long-context