A March 14, 2026 r/LocalLLaMA post outlined a CUTLASS and FlashInfer patch for SM120 Blackwell workstations, claiming major gains for Qwen3.5-397B NVFP4 inference and linking the work to FlashInfer PR #2786.
StepFun released more than a model card: the Step-3.5-Flash-SFT dataset is now on Hugging Face. The repo bundles raw JSON data, tokenizer snapshots, and StepTronOSS-oriented compiled shards, while the Reddit discussion focused on reproducibility, reasoning traces, and the implications of the dual-license setup.
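For readers who want to poke at the release, a minimal sketch using the standard Hugging Face `datasets` API is below; the repo id "stepfun-ai/Step-3.5-Flash-SFT" is an assumption based on the dataset name, and the raw JSON layout may require a config name or data files argument.

```python
# Minimal sketch for inspecting the SFT release; the repo id below is a
# hypothetical guess from the dataset name, not a confirmed path.
from datasets import load_dataset

ds = load_dataset("stepfun-ai/Step-3.5-Flash-SFT", split="train")  # hypothetical repo id
print(ds[0])  # inspect one record, e.g. prompt/response and any reasoning-trace fields
```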
A r/LocalLLaMA field report showed how a narrowly scoped local inference workload was tuned for throughput: the author reported roughly 2,000 tokens per second while classifying markdown documents with Qwen 3.5 27B, and the comment thread turned the post into a practical optimization discussion.
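The usual route to aggregate numbers like that is continuous batching with short, deterministic outputs. A minimal sketch with vLLM's offline engine follows; the model id "Qwen/Qwen3.5-27B" is an assumption, and real throughput depends on hardware, quantization, and batch composition.

```python
# Batched markdown classification via vLLM's offline engine; the model id is
# a hypothetical placeholder for the 27B checkpoint the post used.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3.5-27B")  # hypothetical model id
params = SamplingParams(temperature=0.0, max_tokens=8)  # short labels keep decode cheap

docs = ["# Quarterly report\n...", "# Install guide\n..."]
prompts = [f"Classify this markdown document as report/guide/other:\n{d}\nLabel:" for d in docs]
outputs = llm.generate(prompts, params)  # all prompts are scheduled in one batched pass
for out in outputs:
    print(out.outputs[0].text.strip())
```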
Anthropic says 1M context is now generally available for Opus 4.6 and Sonnet 4.6 with standard pricing, no long-context premium, and media limits expanded to 600 images or PDF pages. Hacker News treated the announcement as a practical deployment story rather than a simple spec bump.
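If the announcement holds, long-context calls should go through the standard Messages API with no special header. A minimal sketch with the Anthropic Python SDK is below; the model id string "claude-opus-4-6" is an assumption, and per the GA framing no long-context beta flag should be needed.

```python
# Long-context request sketch with the Anthropic SDK; the model id is a
# hypothetical guess at how Opus 4.6 would be named.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
msg = client.messages.create(
    model="claude-opus-4-6",  # hypothetical model id
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize the attached corpus..."}],
)
print(msg.content[0].text)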
Google is rolling out new Gemini beta features for Docs, Sheets, Slides, and Drive to Google AI Ultra and Pro subscribers. The update lets Gemini create and edit documents using content pulled from files, emails, and the web, while Drive adds AI Overviews and a new Ask Gemini flow.
Perplexity said on March 11, 2026 that its Sandbox API will become both an Agent API tool and a standalone service. Existing docs already frame Agent API as a multi-provider interface with explicit tool configuration, so the update pushes code execution closer to a first-class orchestration primitive.
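To make "explicit tool configuration" concrete, here is a purely illustrative payload shape; the endpoint, field names, and tool ids are hypothetical and do not reflect Perplexity's documented schema.

```python
# Illustrative only: what declaring code execution as a first-class tool in a
# request payload might look like. Endpoint and schema are hypothetical.
import os
import requests

payload = {
    "model": "sonar-agent",            # hypothetical model name
    "tools": [{"type": "sandbox"}],    # hypothetical: code execution declared as a tool
    "input": "Plot the CSV at ./data.csv and report the max value.",
}
resp = requests.post(
    "https://api.perplexity.ai/agent",  # hypothetical endpoint
    headers={"Authorization": f"Bearer {os.environ['PPLX_API_KEY']}"},
    json=payload,
)
print(resp.json())
```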
Together AI said on March 13, 2026 that v2 of Open Deep Research is fully free and open source. The companion blog describes a planner and self-reflection workflow for multi-hop web research and ships code plus evaluation assets for developers.
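The planner-plus-reflection pattern the blog describes reduces to a loop of proposing sub-queries, gathering evidence, and critiquing coverage before replanning. A generic sketch follows; the `llm_plan`, `llm_reflect`, and `search` helpers are hypothetical stand-ins, not Open Deep Research's actual interfaces.

```python
# Generic planner/self-reflection loop for multi-hop web research; the three
# callables are hypothetical stand-ins for LLM and search backends.
def deep_research(question, llm_plan, llm_reflect, search, max_rounds=3):
    notes = []
    plan = llm_plan(question, notes)                  # propose sub-queries for this hop
    for _ in range(max_rounds):
        for query in plan:
            notes.append(search(query))               # gather web evidence per sub-query
        verdict, plan = llm_reflect(question, notes)  # critique coverage, replan the gaps
        if verdict == "sufficient":
            break
    return notes
```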
A r/MachineLearning post argues that Meta’s COCONUT results may owe more to curriculum design and sequential processing than to the headline mechanism of recycling hidden states as latent thought tokens.
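For context on the mechanism under debate, a minimal sketch of latent-thought recycling is shown below: instead of decoding a token, the final hidden state is appended as the next input embedding. It assumes a decoder whose forward accepts `inputs_embeds` and can return hidden states, in the style of Hugging Face models; shapes and the step count are illustrative.

```python
# Sketch of recycling hidden states as "latent thought tokens": feed the last
# position's final hidden state back in as the next input embedding.
import torch

def latent_thought_steps(model, inputs_embeds, n_latent=4):
    embeds = inputs_embeds                                # (batch, seq, d_model)
    for _ in range(n_latent):
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        last_hidden = out.hidden_states[-1][:, -1:, :]    # final position's top-layer state
        embeds = torch.cat([embeds, last_hidden], dim=1)  # recycle it as the next "token"
    return embeds
```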
A LocalLLaMA post claims a QLoRA-tuned 14B Qwen coder model can beat frontier proprietary models on Ada compilation tasks, reviving interest in domain-specific coding models for niche but high-stakes languages.
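A fine-tune like that typically means the standard QLoRA recipe: NF4 quantization of the frozen base plus low-rank adapters. The sketch below uses the usual `bitsandbytes` and `peft` setup; the base model id is an assumption, and target modules vary by architecture.

```python
# Standard QLoRA setup: 4-bit NF4 base model with trainable LoRA adapters.
# The base model id is an assumed placeholder for a 14B Qwen coder checkpoint.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",                  # NF4 quantization, the "Q" in QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-14B", quantization_config=bnb  # assumed base model id
)
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(base, lora)              # only the adapter weights train
```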
A Hacker News thread amplified a March 12 analysis arguing that LLM coding progress looks much weaker when measured by maintainer merge decisions rather than test-passing SWE-bench scores.
The arXiv paper Ares, submitted on March 9, 2026, proposes dynamically selecting reasoning effort at each step of a multi-step LLM agent run. The authors report up to 52.7% lower reasoning token usage versus fixed high-effort settings, with only minimal drops in task success.
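The core idea is a controller that spends reasoning tokens only where a step looks hard. A generic sketch in that spirit is below; the difficulty heuristic and the two effort levels are invented for illustration, not Ares's actual policy.

```python
# Generic per-step effort controller: cheap reasoning on easy steps, expensive
# reasoning on hard ones. The heuristic and effort levels are hypothetical.
def run_agent(task, agent_step, estimate_difficulty, max_steps=10):
    state = task
    for _ in range(max_steps):
        effort = "high" if estimate_difficulty(state) > 0.5 else "low"  # choose per step
        state, done = agent_step(state, reasoning_effort=effort)        # cheap steps save tokens
        if done:
            return state
    return state
```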
IBM unveiled Granite 4.0 1B Speech on March 9, 2026 as a compact multilingual speech-language model for ASR and bidirectional speech translation. The company says it improves English transcription accuracy over its predecessor while cutting model size in half and adding Japanese support.
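For the ASR side, a minimal sketch via the Hugging Face pipeline API is below; the repo id "ibm-granite/granite-4.0-1b-speech" is an assumption based on IBM's naming pattern, and the model may also expose a chat-style translation interface not shown here.

```python
# Minimal ASR sketch with the transformers pipeline; the repo id is an
# assumed guess from IBM's Granite naming conventions.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="ibm-granite/granite-4.0-1b-speech")  # assumed id
print(asr("meeting_clip.wav")["text"])  # transcribe a local audio file
```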