A new r/LocalLLaMA benchmark post says an M5 Max system pushed Qwen3.5-397B to 20.34 tok/s through SSD streaming, with I/O parallelism, temporal expert prediction, and Q3-GGUF experts doing most of the work.
LLM
Penfield Labs argues that LoCoMo still circulates as a major memory benchmark even though 99 of its 1,540 answer-key entries are score-corrupting and its gpt-4o-mini judge passed 62.81% of intentionally wrong answers in an audit.
A Hacker News thread turned Zach Manson's Copilot incident into a broader argument about whether coding assistants should be allowed to insert vendor messaging into PR text and other repo metadata.
NVIDIA announced Dynamo 1.0 on March 16, 2026 as a production-grade open-source layer for generative and agentic inference. The release matters because it ties Blackwell performance gains, better token economics, and native integration with major open-source frameworks into one operating model.
A March 1 r/MachineLearning post compared 94 LLM endpoints across 25 providers and argued that open models were narrowing the quality gap with top proprietary systems to single digits. The real takeaway is operational: model choice is now about intelligence, price, speed, and deployment freedom at the same time.
A March 2026 r/LocalLLaMA post with 123 points and 25 comments spotlighted `voxtral-voice-clone`, a project trying to train the missing codec encoder for Mistral’s Voxtral-4B-TTS-2603. The repo targets zero-shot cloning via `ref_audio`, which the original open-weight release could not support because the encoder weights were not included.
A March 2026 r/singularity post shared Google Research’s TurboQuant work and drew 114 points with 18 comments. Google says the method can shrink KV cache memory by at least 6x on needle tasks, quantize caches to 3 bits without training, and deliver up to 8x attention-logit speedups on H100 GPUs.
Mistral announced Mistral Small 4 on March 16, 2026 as a single open model that combines reasoning, multimodal input, and agentic coding. Key specs include 119B total parameters, 6B active parameters per token, a 256k context window, Apache 2.0 licensing, and configurable reasoning effort.
Mistral introduced Leanstral on March 16, 2026 as an open-source code agent built specifically for Lean 4. The release combines 6B active parameters, an Apache 2.0 license, a new FLTEval benchmark, and immediate availability in Mistral Vibe, API form, and downloadable weights.
OpenAIDevs pointed developers to Codex Security on March 29, 2026, positioning it as a way to find, validate, and remediate likely vulnerabilities in connected GitHub repositories. OpenAI's docs say the system scans commit by commit, uses repo-specific threat models, validates high-signal findings in an isolated environment, and can move reviewed findings toward GitHub pull requests.
A new r/MachineLearning post pushes TurboQuant beyond KV-cache talk and into weight compression, with a GitHub implementation that targets drop-in low-bit LLM inference.
A LocalLLaMA post points to IBM's Granite-4.0-3B-Vision, a compact VLM built for charts, tables, and document key-value extraction rather than generic multimodal chat.