Meta says its in-house MTIA roadmap now spans the MTIA 300, 400, 450, and 500, with the 2026 and 2027 deployments aimed at lowering the cost and latency of serving GenAI workloads at massive scale.
AI Mar 29, 2026 2 min read
LLM Reddit Mar 22, 2026 2 min read
A fresh r/LocalLLaMA post argues that the main bottleneck in Graph-RAG multi-hop QA is often reasoning rather than retrieval. The linked paper suggests structured prompting and graph-based context compression can let an open Llama 8B model match or beat a plain 70B baseline at a much lower cost.
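The paper's "graph-based context compression" idea can be sketched in a few lines: instead of dumping every retrieved passage into the prompt, keep only the knowledge-graph triples within a few hops of the query entities and serialize them tersely. This is a minimal illustration, not the paper's actual algorithm; the triple format, hop limit, and serialization are assumptions.

```python
from collections import deque

def compress_graph_context(triples, query_entities, max_hops=2):
    """Keep only triples reachable within max_hops of the query entities,
    then serialize them as one compact line per fact (hypothetical scheme)."""
    # Build an undirected adjacency index over (head, relation, tail) triples.
    adj = {}
    for h, r, t in triples:
        adj.setdefault(h, []).append((h, r, t))
        adj.setdefault(t, []).append((h, r, t))

    # BFS outward from the query entities, collecting triples along the way.
    seen_entities = set(query_entities)
    kept, kept_set = [], set()
    frontier = deque((e, 0) for e in query_entities)
    while frontier:
        entity, depth = frontier.popleft()
        if depth >= max_hops:
            continue
        for h, r, t in adj.get(entity, []):
            if (h, r, t) not in kept_set:
                kept_set.add((h, r, t))
                kept.append((h, r, t))
            for nxt in (h, t):
                if nxt not in seen_entities:
                    seen_entities.add(nxt)
                    frontier.append((nxt, depth + 1))

    # Terse "head -[relation]-> tail" lines cost far fewer tokens than prose.
    return "\n".join(f"{h} -[{r}]-> {t}" for h, r, t in kept)

triples = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
    ("Poland", "member_of", "EU"),
    ("Einstein", "born_in", "Ulm"),  # unrelated branch, should be dropped
]
context = compress_graph_context(triples, ["Marie Curie"], max_hops=2)
print(context)
```

The point of the compression is that an 8B model only has to reason over the handful of facts that matter for the hop chain, rather than sift a long retrieved context.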
LLM Hacker News Feb 22, 2026 1 min read
A new open-source project called ntransformer enables running the 140GB Llama 3.1 70B model on a single consumer RTX 3090 by streaming weights directly from NVMe storage to GPU, completely bypassing CPU RAM.
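The principle behind this kind of weight streaming can be shown with a CPU-only toy: memory-map the checkpoint file and materialize only one layer's slice per forward step, so resident memory stays near one layer rather than the full model. This is a simplified sketch with made-up sizes and a plain ReLU MLP; the real project uses NVMe-to-GPU transfers, not numpy.

```python
import os
import tempfile
import numpy as np

HIDDEN = 64     # toy sizes; a real 70B model is vastly larger
N_LAYERS = 4
DTYPE = np.float32

def write_dummy_checkpoint(path):
    # One (HIDDEN x HIDDEN) weight matrix per layer, concatenated on disk.
    weights = np.random.randn(N_LAYERS, HIDDEN, HIDDEN).astype(DTYPE)
    weights.tofile(path)
    return weights

def forward_streaming(path, x):
    """Run a toy forward pass while mapping only one layer's weights
    at a time; memmap slices are paged in lazily from storage."""
    mm = np.memmap(path, dtype=DTYPE, mode="r",
                   shape=(N_LAYERS, HIDDEN, HIDDEN))
    for i in range(N_LAYERS):
        w = np.asarray(mm[i])        # page in just this layer's slice
        x = np.maximum(x @ w, 0.0)   # ReLU MLP layer
        del w                        # allow the pages to be evicted
    return x

path = os.path.join(tempfile.mkdtemp(), "weights.bin")
ref = write_dummy_checkpoint(path)

x0 = np.ones(HIDDEN, dtype=DTYPE)
out = forward_streaming(path, x0)

# Sanity check against an all-in-RAM forward pass.
y = x0
for i in range(N_LAYERS):
    y = np.maximum(y @ ref[i], 0.0)
assert np.allclose(out, y)
```

The trade-off is the same one the project makes: every token's forward pass re-reads the weights from storage, so throughput is bounded by NVMe bandwidth rather than VRAM capacity.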
LLM Hacker News Feb 20, 2026 2 min read
A high-engagement Hacker News thread spotlights Taalas’ claim that model-specific silicon can cut inference latency and cost, including a hard-wired Llama 3.1 8B deployment reportedly reaching 17K tokens/sec per user.