#local-inference

LLM Reddit 10h ago 2 min read

Qwen3.6 27B Hits 100 tps on One RTX 5090, and LocalLLaMA Immediately Asks About Quality

LocalLLaMA was interested for a reason beyond a flashy speed number. A post claiming 105-108 tps and a full 256k native context window for Qwen3.6-27B-INT4 on a single RTX 5090 turned the thread into a practical discussion about how much quality survives once local inference gets this fast.

#qwen #vllm #rtx-5090

LLM Reddit 1d ago 2 min read

LocalLLaMA Sees a New Local Bar: Qwen 3.6 27B at ~80 t/s on One RTX 5090

r/LocalLLaMA reacted because this was not just another “new model out” post. The claim was concrete: Qwen3.6-27B running at about 80 tokens per second with a 218k context window on a single RTX 5090 via vLLM 0.19.

#qwen #vllm #rtx-5090

LLM Reddit Apr 12, 2026 2 min read

r/LocalLLaMA Treats MiniMax M2.7 as More Than a Chat Model

A r/LocalLLaMA thread quickly elevated MiniMax M2.7 because the Hugging Face release is framed less as a chat model and more as an agent system with tool use, Agent Teams, and ready-made deployment guides. Early interest is as much about operational packaging as about the benchmark numbers themselves.

#llm #agents #tool-use

LLM Reddit Apr 3, 2026 2 min read

r/LocalLLaMA Turns Gemma 4 Into a Major Local-Model Discussion

r/LocalLLaMA pushed Gemma 4 into one of the strongest community signals in this crawl as Google shipped an open model family spanning edge devices through workstation-class local servers.

#gemma #google #open-models

LLM Reddit Apr 1, 2026 2 min read

PrismML introduces 1-bit Bonsai for edge-ready LLM deployment

A well-received r/LocalLLaMA post spotlighted PrismML’s 1-bit Bonsai launch, which claims to shrink an 8.2B model to 1.15GB with an end-to-end 1-bit design. The pitch is not just compression, but practical on-device throughput and energy efficiency.

#prismml #1-bit-llm #edge-ai

LLM Reddit Mar 31, 2026 2 min read

r/LocalLLaMA Reacts to CoPaw-9B With Interest in Small Agent Models

A Reddit thread in r/LocalLLaMA drew 142 upvotes and 29 comments around CoPaw-9B. The discussion focused on its Qwen3.5-based 9B agent positioning, 262,144-token context window, and whether local users would get GGUF or other quantized builds quickly.

#llm #qwen #agentic

LLM Hacker News Mar 28, 2026 2 min read

Hacker News spotlights ATLAS and the economics of local coding agents

A Hacker News post pushed ATLAS into the spotlight by framing a consumer-GPU coding agent as a serious cost challenger to hosted systems. The headline benchmark is interesting, but the repository itself makes clear that its 74.6% result is not a controlled head-to-head against Claude 4.5 Sonnet because the task counts and evaluation protocols differ.

#coding-agents #benchmarks #local-inference

LLM Reddit Mar 25, 2026 2 min read

r/artificial highlights ATLAS reaching 74.6% LiveCodeBench on a $500 GPU

r/artificial focused on ATLAS because it shows how planning, verification, and repair infrastructure can push a frozen 14B local model far closer to frontier coding performance.

#atlas #livecodebench #local-inference

LLM Hacker News Mar 22, 2026 2 min read

Flash-MoE: Running a 397B Parameter Model on a Laptop

Flash-MoE is a C and Metal inference engine that claims to run Qwen3.5-397B-A17B on a 48 GB MacBook Pro. The key idea is to keep a 209 GB MoE model on SSD and stream only the active experts needed for each token.

#llm #moe #metal

LLM Reddit Mar 21, 2026 3 min read

r/LocalLLaMA Spots Native MTP for Qwen3.5 in mlx-lm and Faster Single-Stream Inference

A Reddit thread in r/LocalLLaMA spotlighted mlx-lm PR #990, which uses Qwen3.5's built-in MTP head for native speculative decoding and reports 15.3 -> 23.3 tok/s (~1.5x throughput boost) with ~80.6% acceptance rate on Qwen3.5-27B 4-bit on an M4 Pro. The gain is meaningful, but so are the constraints around converted checkpoints, disabled batching, and untested MoE variants.

#mlx-lm #qwen3.5 #mtp

LLM Reddit Mar 17, 2026 2 min read

Unsloth Studio beta goes after the local model workflow in one interface

A high-engagement r/LocalLLaMA post highlighted Unsloth Studio, a beta open-source web UI that aims to train, run, and export open models from one local interface. The discussion framed it as a possible LM Studio challenger in the GGUF ecosystem, while top commenters noted that many advanced users still lean on vLLM or direct llama.cpp workflows.

#llm #unsloth #gguf

AI Reddit Mar 14, 2026 2 min read

r/LocalLLaMA flags Tenstorrent QuietBox 2 as a desk-side RISC-V box for local AI inference

r/LocalLLaMA highlighted Tenstorrent's desk-side TT-QuietBox 2, a liquid-cooled RISC-V inference workstation aimed at 120B-scale local AI workloads. The launch combines open tooling, a standard 120V power target, and ambitious performance claims that Reddit immediately debated.

#tenstorrent #risc-v #ai-hardware