Azure posted on March 14, 2026 that Claude Opus 4.6 and Sonnet 4.6 now support 1M-token context in Microsoft Foundry with flat pricing and higher media limits. Microsoft and Anthropic documentation confirm the 1M window, 600 image/PDF-page cap, and standard pricing across the full context range.
LLM
A fast-rising r/LocalLLaMA thread says the community has already submitted nearly 10,000 Apple Silicon benchmark runs across more than 400 models. The post matters because it replaces scattered anecdotes with a shared dataset that begins to show consistent throughput patterns across M-series chips and context lengths.
A recent r/LocalLLaMA benchmark thread argues that tokens-per-second screenshots hide the real trade-offs between MLX and llama.cpp on Apple Silicon. MLX still wins on short-context generation, but long-context workloads can erase that headline speedup because prefill dominates total latency.
A March 13, 2026 Hacker News thread focused on Anthropic's 1M context GA update for Claude Opus 4.6 and Sonnet 4.6, especially the removal of long-context premiums. The release also raises media limits to 600 images or PDF pages and rolls 1M context into Claude Code for Max, Team, and Enterprise users.
Google introduced the Developer Knowledge API and an open-source MCP Server on February 4, 2026. The tools are meant to connect internal documentation, public URLs, and other team knowledge sources to Gemini Code Assist and AI-agent workflows with less custom plumbing.
Andrej Karpathy says his autoresearch setup reduced nanochat's Time to GPT-2 from 2.02 hours to 1.80 hours. He said the agent explored roughly 700 changes over about two days and found around 20 additive improvements, but the result should still be read as a source claim rather than an independently audited benchmark.
A Reddit thread surfaced arXiv paper 2603.10145, which argues that the output layer of language models poses not just a softmax expressivity problem but an optimization bottleneck that suppresses 95-99% of the gradient norm. The discussion centered on whether better head designs could unlock more efficient LLM training.
A high-scoring discussion in r/MachineLearning asks what benchmarking papers are for when proprietary models change monthly and old versions disappear. The strongest replies argued that model rankings go stale fast, but the datasets and failure cases can remain useful as durable eval assets.
Percepta's March 11 post says it built a computer inside a transformer that can execute arbitrary C programs for millions of steps with exponentially faster inference via 2D attention heads. HN readers saw a provocative research direction, but they also asked for clearer writing, harder benchmarks, and evidence that the idea scales.
CanIRun.ai runs entirely in the browser, detects GPU, CPU, and RAM through WebGL, WebGPU, and navigator APIs, and estimates which quantized models fit your machine. HN readers liked the idea but immediately pushed on missing hardware entries, calibration, and reverse-lookup features.
NVIDIA introduced Nemotron 3 Super on March 11, 2026 as an open 120B-parameter model built for agentic AI systems. The company says the model tackles long-context cost and reasoning overhead with a 1M-token window, a hybrid MoE design, and up to 5x higher throughput.
Google has put Gemini Embedding 2 into public preview through the Gemini API and Vertex AI. The model is Google’s first natively multimodal embedding system, combining text, image, video, audio, and document inputs in one embedding space.