r/LocalLLaMA did not just cheer the numbers. The moment 80 tps and a 218k context window appeared, the thread shifted to prompt length, quantization tradeoffs, and whether the vLLM setup really holds up in practice.
#long-context
HN did not latch onto DeepSeek V4 because of a polished launch page. The thread took off when commenters realized the front-page link was just updated docs while the weights and base models were already live for inspection.
A popular r/LocalLLaMA thread described using Gemma 4’s 256k context window to analyze a 100k+ token personal journal locally, turning privacy into a practical reason to run an LLM on-device.
An r/LocalLLaMA stress test claims Gemma 4 26B A4B remained coherent at roughly 94% of a 262,144-token context window in llama.cpp. The post is anecdotal, but it is valuable because it pairs the claim with concrete tuning details and failure modes.
A post on r/MachineLearning argues that LoCoMo’s leaderboard is being treated with more confidence than its evaluation setup deserves. The audit claims the benchmark has a 6.4% ground-truth error rate and that its judge accepts intentionally wrong but topically adjacent answers far too often, turning attention from raw scores to benchmark reliability.
Together Research said on March 27, 2026 that a smaller model using divide-and-conquer can match or outperform GPT-4o on long-context tasks, with the work accepted at ICLR 2026. Together's blog and the arXiv paper say the method uses a planner-worker-manager pipeline and explain long-context failures in terms of task, model, and aggregator noise.
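The planner-worker-manager framing can be sketched structurally. This is a toy illustration under loose assumptions, not the paper's implementation: the real workers and manager are LLM calls, while here the "worker" is a deterministic keyword counter and the chunking heuristic is invented for the example.

```python
# Structural sketch of a divide-and-conquer long-context pipeline:
# a planner splits the input, workers process chunks independently,
# and a manager aggregates partial results. All names and the
# chunk-size heuristic are illustrative assumptions.

def planner(document: str, chunk_size: int = 150) -> list[str]:
    """Split the long input into worker-sized chunks."""
    return [document[i:i + chunk_size]
            for i in range(0, len(document), chunk_size)]

def worker(chunk: str, query: str) -> int:
    """Toy worker: count query hits in one chunk.
    In the real pipeline this step is an LLM call per chunk."""
    return chunk.lower().count(query.lower())

def manager(partials: list[int]) -> int:
    """Aggregate worker outputs. Errors introduced at this step are
    what the paper's framing calls aggregator noise."""
    return sum(partials)

doc = "error " * 50 + "ok " * 50
hits = manager([worker(c, "error") for c in planner(doc)])
print(hits)  # 50
```

Note that a careless planner can itself inject error: if a chunk boundary splits a match across two chunks, no worker sees it, which is a concrete instance of task noise in this decomposition.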
Anthropic says 1M context is now generally available for Opus 4.6 and Sonnet 4.6 with standard pricing, no long-context premium, and media limits expanded to 600 images or PDF pages. Hacker News treated the announcement as a practical deployment story rather than a simple spec bump.
Azure posted on March 14, 2026 that Claude Opus 4.6 and Sonnet 4.6 now support 1M-token context in Microsoft Foundry with flat pricing and higher media limits. Microsoft and Anthropic documentation confirm the 1M window, 600 image/PDF-page cap, and standard pricing across the full context range.
A Hacker News discussion highlighted LoGeR, a Google DeepMind and UC Berkeley project that uses hybrid memory to scale dense 3D reconstruction across extremely long videos without post-hoc optimization.
A high-engagement r/LocalLLaMA post surfaced the Qwen3.5-35B-A3B model card on Hugging Face. The card emphasizes MoE efficiency, long context handling, and deployment paths across common open-source inference stacks.