Daniel Vaughan’s Gemma 4 writeup tests whether a local model can function as a real Codex CLI agent, with the answer depending less on benchmark claims than on very specific serving choices. The key lesson is that Apple Silicon required llama.cpp plus `--jinja`, KV-cache quantization, and `web_search = "disabled"`, while a GB10 box worked through Ollama 0.20.5.
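The serving recipe the writeup describes can be sketched as a Codex CLI `config.toml` pointing at a local llama.cpp server. This is a hedged reconstruction, not the author's exact file: the provider name, model path, and port are placeholders, and only the `--jinja`, KV-cache quantization, and `web_search = "disabled"` details come from the post.

```toml
# Assumed llama.cpp launch (flags from the writeup; model path is a placeholder):
#   llama-server -m gemma-4.gguf --jinja --cache-type-k q8_0 --cache-type-v q8_0 --port 8080
model = "gemma-4"                      # placeholder model name
model_provider = "llama-cpp"           # placeholder provider key
web_search = "disabled"                # the setting the writeup calls out

[model_providers.llama-cpp]
name = "llama.cpp (local)"
base_url = "http://localhost:8080/v1"  # llama.cpp's OpenAI-compatible endpoint
```

The `--jinja` flag matters because tool-calling agents depend on the model's chat template being applied server-side; without it, function-call formatting silently degrades.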
#local-llm
A detailed `r/LocalLLaMA` benchmark reports that pairing `Gemma 4 31B` with `Gemma 4 E2B` as a draft model in `llama.cpp` lifted average throughput from `57.17 t/s` to `73.73 t/s`.
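The uplift is easy to sanity-check from the thread's own numbers: the small draft model proposes cheap candidate tokens that the 31B model verifies in a single batched pass.

```python
# Speculative-decoding uplift implied by the benchmark's reported throughputs.
baseline_tps = 57.17   # Gemma 4 31B alone
drafted_tps = 73.73    # with Gemma 4 E2B as draft model
uplift = drafted_tps / baseline_tps - 1
print(f"speculative decoding uplift: {uplift:.1%}")  # ~29%
```

Roughly a 29% gain, which is consistent with draft acceptance rates typically seen when the draft and target models share a tokenizer and training lineage.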
A high-scoring LocalLLaMA post argued that merging llama.cpp PR #21534 finally cleared the known Gemma 4 issues in current master. The community focus was not just the fix itself, but the operational details around tokenizer correctness, chat templates, memory flags, and the warning to avoid CUDA 13.2.
A LocalLLaMA post argues that recent llama.cpp fixes justify refreshed Gemma 4 GGUF downloads, especially for users relying on local inference pipelines.
A practical Reddit debugging post argues that a Qwen 3.5 chat-template issue, not the inference engine itself, can invalidate prefix-cache reuse after tool-heavy turns and waste large amounts of compute.
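The failure mode is easy to demonstrate in miniature: if the template re-serializes earlier turns differently after a tool call, the token prefix no longer matches and the engine must re-prefill the entire conversation. A toy sketch, with invented token strings purely for illustration:

```python
# Illustrative only: why a template that re-serializes history breaks prefix caching.
def shared_prefix_len(a, b):
    """Number of leading tokens two serialized conversations have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

turn1 = ["<sys>", "tools:", "A", "B", "<user>", "hi"]
# A buggy template that reorders tool declarations on the next turn:
turn2 = ["<sys>", "tools:", "B", "A", "<user>", "hi", "<tool>", "ok"]
print(shared_prefix_len(turn1, turn2))  # only 2 cached tokens reusable
```

With long tool transcripts, losing the cached prefix means re-processing tens of thousands of tokens per turn, which is exactly the wasted compute the post describes.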
A recent r/LocalLLaMA post presents Qwen 3.5 27B as an unusually strong local inference sweet spot. The author reports about 19.7 tokens per second on an RTX A6000 48GB with llama.cpp and a 32K context, while the comments turn into a detailed debate about dense-versus-MoE VRAM economics.
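The VRAM side of that debate reduces to simple arithmetic. A hedged back-of-envelope sketch: the bits-per-weight figure is an assumption for a typical 4-bit quant, and KV cache, activations, and runtime overhead are ignored.

```python
# Rough weight-only memory footprint for a dense model (illustrative assumptions).
def weight_gib(params_billions, bits_per_weight):
    """Weights-only memory in GiB, ignoring KV cache and runtime overhead."""
    return params_billions * 1e9 * bits_per_weight / 8 / 2**30

dense_27b_q4 = weight_gib(27, 4.5)   # ~4.5 bits/weight assumed for a Q4-class quant
print(f"dense 27B @ ~4.5 bpw: {dense_27b_q4:.1f} GiB")
```

A dense model must keep every parameter resident, whereas an MoE model of the same total size activates only a fraction per token, which is the crux of the VRAM-economics argument in the comments.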
A high-signal r/LocalLLaMA thread is circulating practical Gemma 4 fine-tuning guidance from Unsloth. The post claims Gemma-4-E2B and E4B can be adapted locally with 8GB VRAM, about 1.5x faster training, roughly 60% less VRAM than FA2 setups, and several fixes for early Gemma 4 training and inference bugs.
A LocalLLaMA post with roughly 350 points argues that Gemma 4 26B A3B becomes unusually effective for local coding-agent and tool-calling workflows when paired with the right runtime settings, and contrasts that result with the prompt-caching and function-calling issues the poster hit in other local-model setups.
A recent LocalLLaMA discussion shared results from Mac LLM Bench, an open benchmark workflow for Apple Silicon systems. The most useful takeaway is practical: dense 32B models hit a clear wall on a 32 GB MacBook Air M5, while some MoE models offer a much better latency-to-capability tradeoff.
A high-signal LocalLLaMA post described a port of llama2.c to classic Mac OS that runs Karpathy’s TinyStories 260K model on a stock iMac G3. The project is compelling because most of the work is systems engineering: endianness fixes, memory partition management, and layout debugging on vintage hardware.
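The endianness work is the interesting part: llama2.c checkpoints are written as little-endian float32 by x86 exporters, so a big-endian PowerPC host must read them with an explicit byte order rather than the native one. The original port is C; this minimal Python sketch just illustrates the byte-order issue.

```python
import struct

# Bytes as an x86 exporter would write a float32 weight (little-endian).
raw = struct.pack("<f", 1.5)

# Reading with an explicit "<" gives the right value on any host;
# native-order reads on a big-endian G3 would misinterpret these bytes.
value = struct.unpack("<f", raw)[0]
print(value)  # 1.5
```

The same discipline applies to every int32 header field in the checkpoint, which is why a naive recompile on PowerPC loads garbage dimensions before it ever touches the weights.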
A practical HN gist lays out how to run Ollama and Gemma 4 on an Apple Silicon Mac mini, including auto-start, periodic preload, and `OLLAMA_KEEP_ALIVE=-1`. The author says `gemma4:26b` nearly exhausted 24GB unified memory, making the default 8B model a safer operational choice.
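The keep-warm setup the gist describes can be sketched in a few lines; the binary path and preload interval below are assumptions, while the env var and model tag come from the gist.

```shell
# Keep the loaded model resident indefinitely instead of unloading after idle.
export OLLAMA_KEEP_ALIVE=-1

# Periodic preload: running with an empty prompt forces Ollama to (re)load weights.
# Example crontab entry (30-minute interval is an assumption, path is a placeholder):
# */30 * * * * /usr/local/bin/ollama run gemma4:26b "" >/dev/null 2>&1
```

The preload matters because even with `OLLAMA_KEEP_ALIVE=-1` a reboot or OOM eviction leaves the first request paying the full model-load latency.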
A strong r/LocalLLaMA reaction suggests PrismML’s Bonsai launch is landing as more than another compression headline. The discussion combines the company’s end-to-end 1-bit claims with early hands-on reports that the models feel materially more usable than earlier BitNet-style experiments.