Anthropic introduced Claude Sonnet 4.6 on Feb 17, 2026 as its most capable Sonnet model yet. The release combines a 1M-token context window in beta with upgrades to coding, computer use, and agent workflows while keeping Sonnet 4.5 pricing.
#llm
Anthropic announced Claude Sonnet 4.6 on February 17, 2026. The release combines a 1M-token context beta, unchanged pricing, and broader upgrades across coding, computer use, and long-context reasoning.
A Hacker News discussion highlighted Flash-MoE, a pure C/Metal inference stack that streams Qwen3.5-397B-A17B from SSD and reaches interactive speeds on a 48 GB M3 Max laptop.
A Show HN post points to llm-circuit-finder, a toolkit that duplicates selected transformer layers inside GGUF models and claims sizable reasoning gains without changing weights or fine-tuning. The strongest benchmark numbers come from the project author’s own evaluations rather than independent validation.
OpenCode drew 1,238 points and 614 comments on Hacker News, highlighting an open-source AI coding agent that spans terminal, IDE, and desktop clients. The project site emphasizes broad provider support, LSP integration, multi-session workflows, and a privacy-first posture.
Google has introduced Gemini 3.1 Flash-Lite in preview through Google AI Studio and Vertex AI. The company is positioning it as the fastest and most cost-efficient model in the Gemini 3 family for large-scale inference jobs.
Flash-MoE is a C and Metal inference engine whose author claims it can run Qwen3.5-397B-A17B on a 48 GB MacBook Pro. The key idea is to keep the 209 GB MoE checkpoint on SSD and stream in only the experts the router activates for each token.
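The blurb contains enough of the design to sketch the access pattern: memory-map the expert weights on disk, and per token pull in only the router's top-k experts, with a small LRU cache absorbing repeated hits so per-token I/O scales with the activated experts rather than the full checkpoint. Below is a toy NumPy reconstruction of that idea, not Flash-MoE's C/Metal code; the file name, sizes, cache policy, and ReLU experts are all assumptions.

```python
# Toy sketch of SSD-backed expert streaming (assumed mechanics, not Flash-MoE's).
from collections import OrderedDict
import numpy as np

D_MODEL, D_FF = 64, 256             # toy dims; the real model is vastly larger
N_EXPERTS, TOP_K = 16, 4
EXPERT_FLOATS = 2 * D_MODEL * D_FF  # up-projection + down-projection per expert

# Stand-in for the on-SSD checkpoint: all experts back to back in one memmap.
ssd = np.memmap("experts.bin", dtype=np.float32, mode="w+",
                shape=(N_EXPERTS, EXPERT_FLOATS))

cache = OrderedDict()               # tiny in-RAM LRU: expert id -> weights
CACHE_CAP = 6

def load_expert(idx):
    """Return one expert's weights, touching 'SSD' only on a cache miss."""
    if idx in cache:
        cache.move_to_end(idx)
        return cache[idx]
    w = np.asarray(ssd[idx]).copy() # the streaming read: one expert, not 209 GB
    cache[idx] = w
    if len(cache) > CACHE_CAP:
        cache.popitem(last=False)   # evict the least recently used expert
    return w

def moe_forward(x, router_logits):
    """Run one token through only its TOP_K routed experts."""
    top = np.argsort(router_logits)[-TOP_K:]
    gates = np.exp(router_logits[top])
    gates /= gates.sum()            # softmax over the selected experts only
    out = np.zeros_like(x)
    for g, idx in zip(gates, top):
        w = load_expert(idx)
        w_up = w[:D_MODEL * D_FF].reshape(D_MODEL, D_FF)
        w_down = w[D_MODEL * D_FF:].reshape(D_FF, D_MODEL)
        out += g * (np.maximum(x @ w_up, 0.0) @ w_down)  # simple ReLU expert
    return out

x = np.random.randn(D_MODEL).astype(np.float32)
print(moe_forward(x, np.random.randn(N_EXPERTS)).shape)  # (64,)
```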
A Show HN repo claims that duplicating a few LLM layers can improve reasoning without training or weight changes. The underlying README, however, shows real tradeoffs, making this more convincing as capability steering than as a universal model upgrade.
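For readers who want to see what "duplicating layers without changing weights" means mechanically, here is a hypothetical analogue in Hugging Face Transformers rather than GGUF: the same decoder block object is inserted into the layer list twice, so it runs twice per forward pass with shared weights and zero new parameters. The checkpoint name and layer indices are placeholders, not the repo's choices.

```python
# Hypothetical PyTorch analogue of GGUF layer duplication (placeholder model).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B"    # placeholder; any Llama-style decoder works
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

dup = {10, 11}                    # illustrative layer indices to run twice
layers = []
for i, layer in enumerate(model.model.layers):
    layers.append(layer)
    if i in dup:
        layers.append(layer)      # same module object: shared weights, no training

model.model.layers = torch.nn.ModuleList(layers)
model.config.num_hidden_layers = len(layers)

ids = tok("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    # use_cache=False: both copies of a duplicated block share a layer_idx,
    # which would collide in the KV cache during cached generation.
    logits = model(ids, use_cache=False).logits
print(logits.shape)
```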
A merged Hugging Face Transformers PR that surfaced on r/LocalLLaMA shows Mistral 4 as a hybrid instruct/reasoning model with 128 experts, 4 active per token, 6.5B activated parameters per token, 256k context, and Apache 2.0 licensing.
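A quick back-of-envelope shows how "4 of 128 experts" squares with roughly 6.5B activated parameters per token. The layer count, hidden sizes, and vocabulary below are guesses (the PR summary doesn't state them), picked only to show the accounting lands in the right neighborhood; the structural point is that only the 4 routed expert FFNs count per token, so the other 124 experts add capacity without adding per-token compute.

```python
# Rough activated-parameter count for a SwiGLU MoE decoder (dims are guesses).
def activated_params(n_layers, d_model, d_ff, n_experts, n_active, vocab):
    attn = 4 * d_model * d_model           # q, k, v, o projections (ignores GQA)
    expert_ffn = 3 * d_model * d_ff        # SwiGLU gate/up/down for ONE expert
    router = d_model * n_experts           # negligible next to the rest
    per_layer = attn + n_active * expert_ffn + router
    return n_layers * per_layer + vocab * d_model  # plus tied embeddings

# Placeholder shapes, NOT Mistral 4's published ones:
total = activated_params(n_layers=28, d_model=2560, d_ff=6144,
                         n_experts=128, n_active=4, vocab=131072)
print(f"~{total / 1e9:.1f}B activated per token")  # ~6.4B with these guesses
```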
The LocalLLaMA discussion around NVIDIA’s new model focused on an unusual mix of scale efficiency and benchmark ambition: 30B total parameters, 3B activated, plus separate thinking and instruct modes.
The March 20, 2026 HN discussion around Attention Residuals focused on a simple claim with large implications: replace fixed residual addition with learned depth-wise attention and recover performance with modest overhead.
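One plausible reading of that claim, sketched below: instead of the fixed skip connection h + f(h), each layer forms its residual path as a learned softmax over all earlier hidden states. This is a reconstruction from the one-line summary, not the paper's code; the depth weights here are per-layer scalars, whereas the paper's version may condition them on the token.

```python
# Assumed reading of "learned depth-wise attention" replacing fixed residuals.
import torch
import torch.nn as nn

class DepthAttentionResidual(nn.Module):
    """Replace the fixed skip h + f(h) with softmax(w) . [h_0 .. h_l] + f(h_l)."""
    def __init__(self, depth):
        super().__init__()
        self.depth_logits = nn.Parameter(torch.zeros(depth))  # one per earlier layer

    def forward(self, history, f_out):
        # history: (depth, batch, seq, d_model) = outputs of layers 0..l
        w = torch.softmax(self.depth_logits, dim=0)        # learned depth attention
        skip = torch.einsum("l,lbsd->bsd", w, history)     # weighted residual path
        return skip + f_out

# Toy stack: Linear blocks stand in for attention/FFN sublayers.
d_model, n_layers = 32, 4
blocks = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_layers))
mixers = nn.ModuleList(DepthAttentionResidual(i + 1) for i in range(n_layers))

x = torch.randn(2, 8, d_model)                             # (batch, seq, d_model)
history = [x]
for block, mixer in zip(blocks, mixers):
    x = mixer(torch.stack(history), block(history[-1]))
    history.append(x)
print(x.shape)
```

The overhead matches the "modest" framing: each layer adds one scalar weight per earlier layer plus a weighted sum over the stored hidden states.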
Q Labs says an 18B-parameter ensemble trained on 100M tokens can match a baseline trained on 1B tokens, and Hacker News immediately focused on whether that gain survives serving and deployment.
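The serving worry is easy to make concrete: data efficiency at training time does nothing to the per-token cost of a wide ensemble at inference. Using the standard ~2N FLOPs-per-token estimate for a dense forward pass (the baseline size below is a placeholder; the summary doesn't give it):

```python
# Back-of-envelope serving cost; the 7B baseline size is an assumption.
def flops_per_token(params_billion):
    return 2 * params_billion * 1e9    # ~2 FLOPs per parameter per token

ratio = flops_per_token(18) / flops_per_token(7)
print(f"18B ensemble vs 7B dense baseline: ~{ratio:.1f}x per-token FLOPs")  # ~2.6x
```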