Meta says its in-house MTIA roadmap now spans the MTIA 300, 400, 450, and 500, with the 2026 and 2027 deployments aimed at lowering the cost and latency of serving GenAI workloads at massive scale.
AI Mar 29, 2026 2 min read
LLM Reddit Mar 22, 2026 2 min read
A fresh r/LocalLLaMA post argues that the main bottleneck in Graph-RAG multi-hop QA is often reasoning rather than retrieval. The linked paper suggests structured prompting and graph-based context compression can let an open Llama 8B model match or beat a plain 70B baseline at a much lower cost.
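The paper's "graph-based context compression" idea can be sketched in a few lines: instead of dumping every retrieved passage into the prompt, keep only the knowledge-graph triples within a few hops of the query entities and serialize them tersely. This is a minimal illustration, not the paper's actual algorithm; the triple format, hop limit, and serialization are assumptions.

```python
from collections import deque

def compress_graph_context(triples, query_entities, max_hops=2):
    """Keep only triples reachable within max_hops of the query entities,
    then serialize them as one compact line per fact (hypothetical scheme)."""
    # Build an undirected adjacency index over (head, relation, tail) triples.
    adj = {}
    for h, r, t in triples:
        adj.setdefault(h, []).append((h, r, t))
        adj.setdefault(t, []).append((h, r, t))

    # BFS outward from the query entities, collecting triples along the way.
    seen_entities = set(query_entities)
    kept, kept_set = [], set()
    frontier = deque((e, 0) for e in query_entities)
    while frontier:
        entity, depth = frontier.popleft()
        if depth >= max_hops:
            continue
        for h, r, t in adj.get(entity, []):
            if (h, r, t) not in kept_set:
                kept_set.add((h, r, t))
                kept.append((h, r, t))
            for nxt in (h, t):
                if nxt not in seen_entities:
                    seen_entities.add(nxt)
                    frontier.append((nxt, depth + 1))

    # Terse "head -[relation]-> tail" lines cost far fewer tokens than prose.
    return "\n".join(f"{h} -[{r}]-> {t}" for h, r, t in kept)

triples = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
    ("Poland", "member_of", "EU"),
    ("Einstein", "born_in", "Ulm"),  # unrelated branch, should be dropped
]
context = compress_graph_context(triples, ["Marie Curie"], max_hops=2)
print(context)
```

The point of the compression is that an 8B model only has to reason over the handful of facts that matter for the hop chain, rather than sift a long retrieved context.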
LLM Hacker News Feb 22, 2026 1 min read
A new open-source project called ntransformer enables running the 140GB Llama 3.1 70B model on a single consumer RTX 3090 by streaming weights directly from NVMe storage to GPU, completely bypassing CPU RAM.
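The principle behind this kind of weight streaming can be shown with a CPU-only toy: memory-map the checkpoint file and materialize only one layer's slice per forward step, so resident memory stays near one layer rather than the full model. This is a simplified sketch with made-up sizes and a plain ReLU MLP; the real project uses NVMe-to-GPU transfers, not numpy.

```python
import os
import tempfile
import numpy as np

HIDDEN = 64     # toy sizes; a real 70B model is vastly larger
N_LAYERS = 4
DTYPE = np.float32

def write_dummy_checkpoint(path):
    # One (HIDDEN x HIDDEN) weight matrix per layer, concatenated on disk.
    weights = np.random.randn(N_LAYERS, HIDDEN, HIDDEN).astype(DTYPE)
    weights.tofile(path)
    return weights

def forward_streaming(path, x):
    """Run a toy forward pass while mapping only one layer's weights
    at a time; memmap slices are paged in lazily from storage."""
    mm = np.memmap(path, dtype=DTYPE, mode="r",
                   shape=(N_LAYERS, HIDDEN, HIDDEN))
    for i in range(N_LAYERS):
        w = np.asarray(mm[i])        # page in just this layer's slice
        x = np.maximum(x @ w, 0.0)   # ReLU MLP layer
        del w                        # allow the pages to be evicted
    return x

path = os.path.join(tempfile.mkdtemp(), "weights.bin")
ref = write_dummy_checkpoint(path)

x0 = np.ones(HIDDEN, dtype=DTYPE)
out = forward_streaming(path, x0)

# Sanity check against an all-in-RAM forward pass.
y = x0
for i in range(N_LAYERS):
    y = np.maximum(y @ ref[i], 0.0)
assert np.allclose(out, y)
```

The trade-off is the same one the project makes: every token's forward pass re-reads the weights from storage, so throughput is bounded by NVMe bandwidth rather than VRAM capacity.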
LLM Hacker News Feb 20, 2026 2 min read
A high-engagement Hacker News thread spotlights Taalas’ claim that model-specific silicon can cut inference latency and cost, including a hard-wired Llama 3.1 8B deployment reportedly reaching 17K tokens/sec per user.