A March 21, 2026 Hacker News discussion sent tinygrad's tinybox page back to the front page, putting a shipping local AI workstation in front of builders looking beyond rented GPU time. The pitch is notable because it pairs concrete specs with pricing aimed at labs and startups trying to run bigger models on premises.
NVIDIA used GTC 2026 to describe how telecom operators are turning distributed network assets into AI grids. The pitch is that inference for low-latency, edge-heavy workloads should move closer to users, devices, and data.
A few weeks after release, r/LocalLLaMA is converging on task-specific sampler and reasoning-budget presets for Qwen3.5 rather than one default setup.
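Task-specific presets like these are typically just per-task sampler settings swapped in at call time. A minimal sketch of the idea, assuming an OpenAI-compatible chat client; the preset names, the `reasoning_budget` key, and every number here are illustrative, not the community's actual settings:

```python
# Hypothetical per-task presets; values are illustrative only, not the
# r/LocalLLaMA-converged numbers for Qwen3.5.
PRESETS = {
    "coding":     {"temperature": 0.2, "top_p": 0.90, "reasoning_budget": 4096},
    "creative":   {"temperature": 0.9, "top_p": 0.95, "reasoning_budget": 1024},
    "extraction": {"temperature": 0.0, "top_p": 1.00, "reasoning_budget": 512},
}

def sampler_args(task: str) -> dict:
    """Return sampler kwargs for a chat-completion call, defaulting to
    the conservative 'extraction' preset for unknown tasks. The
    reasoning-budget key is stripped because it is not a sampler knob."""
    preset = PRESETS.get(task, PRESETS["extraction"])
    return {k: v for k, v in preset.items() if k != "reasoning_budget"}
```

The point of the pattern is that the preset, not the model, changes per workload, so one local server can serve coding and creative traffic with different defaults.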
A LocalLLaMA thread on March 18, 2026 pushed fresh attention toward Mamba-3, a new state space model release from researchers at Carnegie Mellon University, Princeton, Cartesia AI, and Together AI. The project shifts its design goal from training speed to inference efficiency and claims prefill+decode latency wins over Mamba-2, Gated DeltaNet, and Llama-3.2-1B at the 1.5B scale.
Meta said on March 11, 2026 that it is developing and deploying four new generations of MTIA custom chips, from MTIA 300 through MTIA 500, within the next two years. The company is positioning MTIA as a central part of its AI infrastructure strategy for ranking, recommendation, and GenAI inference workloads.
At GTC on March 16, 2026, NVIDIA announced Dynamo 1.0 as a production-grade open source inference stack for generative and agentic AI. NVIDIA says Dynamo can boost Blackwell inference performance by up to 7x while integrating with major frameworks and cloud providers.
A March 15, 2026 Hacker News post about GreenBoost reached 124 points and 25 comments. The open-source Linux project combines a kernel module and CUDA shim to tier model memory across VRAM, DDR4, and NVMe so larger local LLMs can run without changing inference apps.
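The tiering idea described above can be sketched as a greedy placement over capacity-ordered tiers: put each tensor in the fastest tier that still has room. This is an assumption-laden illustration of the concept, not GreenBoost's kernel-module or CUDA-shim logic, and the capacities are made up:

```python
# Fastest-first tiers with hypothetical capacities (bytes).
TIERS = [("vram", 24 * 2**30), ("ddr4", 64 * 2**30), ("nvme", 1024 * 2**30)]

def place(tensors):
    """Greedily assign (name, size_bytes) tensors, largest first, to the
    fastest tier with remaining capacity. Returns {name: tier_name}."""
    free = {name: cap for name, cap in TIERS}
    placement = {}
    for tname, size in sorted(tensors, key=lambda t: -t[1]):
        for tier, _ in TIERS:
            if free[tier] >= size:
                free[tier] -= size
                placement[tname] = tier
                break
        else:
            raise MemoryError(f"no tier can hold {tname}")
    return placement
```

The appeal of doing this below the inference app (via a shim) is that the app keeps calling ordinary CUDA allocations while the placement decision happens underneath it.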
Google DeepMind updated Gemini 3.1 Flash-Lite on March 3, 2026 as a low-cost model for high-volume, low-latency work. Google says it supports a 128k-token input window, 8k-token output, multimodal input, native audio generation, and pricing from $0.10 per 1M input tokens.
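At the quoted input price, per-request cost is simple arithmetic. A quick sketch; the helper name is ours, and only the $0.10 per 1M input-token figure comes from the announcement (output-token pricing is not covered here):

```python
def input_cost_usd(tokens: int, price_per_million: float = 0.10) -> float:
    """Input-token cost at the quoted $0.10 per 1M input tokens."""
    return tokens / 1_000_000 * price_per_million

# A full 128k-token input context costs about 1.28 cents on input alone,
# which is what makes high-volume classification/routing workloads viable.
cost = input_cost_usd(128_000)
```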
A March 14, 2026 LocalLLaMA post outlined a CUTLASS and FlashInfer patch for SM120 Blackwell workstations, claiming major gains for Qwen3.5-397B NVFP4 inference and linking the work to FlashInfer PR #2786.
An r/LocalLLaMA field report showed how a narrowly scoped local inference workload was tuned for throughput. The author reported about 2,000 tokens per second while classifying markdown documents with Qwen 3.5 27B, and the comment thread turned the post into a practical optimization discussion.
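At the reported rate, batch runtime is straightforward to estimate. A small sketch; the document count and average token length below are hypothetical, and only the ~2,000 tok/s figure comes from the post:

```python
def batch_seconds(num_docs: int, avg_tokens_per_doc: int,
                  tokens_per_second: float = 2000.0) -> float:
    """Estimated wall-clock seconds to push a classification batch through
    at a fixed aggregate token throughput."""
    return num_docs * avg_tokens_per_doc / tokens_per_second

# Hypothetical batch: 10,000 markdown docs at ~500 tokens each is
# 5,000,000 tokens, or 2,500 seconds (~42 minutes) at 2,000 tok/s.
eta = batch_seconds(10_000, 500)
```

Estimates like this are why classification workloads reward throughput tuning: halving per-token latency halves the batch runtime directly.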