A March 2026 r/LocalLLaMA post with 126 points and 45 comments highlighted a practical guide for running Qwen3.5-27B through llama.cpp and wiring it into OpenCode. The post stands out because it covers the operational details that usually break local coding setups: quant choice, chat-template fixes, VRAM budgeting, Tailscale networking, and tool-calling behavior.
#llama-cpp
RSS FeedA LocalLLaMA self-post shared an open-source TurboQuant implementation for llama.cpp that skips value dequantization when attention weights are negligible. The author reports a 22.8% decode gain at 32K context on Qwen3.5-35B-A3B over Apple M5 Max, with unchanged perplexity and better needle-in-a-haystack retrieval.
A March 17, 2026 r/LocalLLaMA post about Hugging Face hf-agents reached 624 points and 78 comments at crawl time. The extension uses llmfit to detect hardware, recommends a runnable model and quant, starts llama.cpp, and launches the Pi coding agent.
A March 17, 2026 r/LocalLLaMA post about Unsloth Studio reached 898 points and 236 comments in the latest available crawl. Unsloth positions Studio as a beta web UI that combines local inference, dataset generation, fine-tuning, code execution, and export in one interface.
A r/LocalLLaMA field report showed how a very specific local inference workload was tuned for throughput. The author reported about 2,000 tokens per second while classifying markdown documents with Qwen 3.5 27B, and the comment thread turned the post into a practical optimization discussion.
A well-received PSA on r/LocalLLaMA argues that convenience layers such as Ollama and LM Studio can change model behavior enough to distort evaluation. The more durable lesson from the thread is reproducibility: hold templates, stop tokens, sampling, runtime versions, and quantization constant before judging a model.
A LocalLLaMA thread highlighted ongoing work to add NVFP4 quantization support to llama.cpp GGUF, pointing to potential memory savings and higher throughput for compatible GPU setups.
A high-engagement LocalLLaMA follow-up benchmark reports that Qwen3.5-35B-A3B runs best on the tested RTX 5080 setup with Q4_K_M quantization, KV q8_0, and --fit without explicit batch flags.
A high-engagement r/LocalLLaMA thread reports strong early results for Qwen3.5-35B-A3B in local agentic coding workflows. The original poster cites 100+ tokens/sec on a single RTX 3090 setup, while comments show mixed reproducibility and emphasize tooling, quantization, and prompt pipeline differences.
A high-signal LocalLLaMA thread points to llama.cpp Discussion #19759, where maintainers say the ggml team is joining Hugging Face while continuing full-time support for ggml and llama.cpp.
A technical r/LocalLLaMA thread pointed to llama.cpp PR #19765, merged on February 20, 2026. The patch unifies parser paths as a stop-gap for Qwen3-Coder-Next issues and adds parallel tool-calling plus JSON schema fixes.
A high-scoring Hacker News thread highlighted announcement #19759 in ggml-org/llama.cpp: the ggml.ai founding team is joining Hugging Face, while maintainers state ggml/llama.cpp will remain open-source and community-driven.