Taalas Claims to Bake Entire LLMs Into Silicon for 17K Tokens/Second
Original: Taalas: LLMs baked into hardware. No HBM, weights and model architecture in silicon -> 16,000 tokens/second
The Idea: Etch the Entire LLM Into Silicon
Startup Taalas is proposing a radical departure from standard AI inference architecture: instead of streaming LLM weights from memory on general-purpose GPUs or cloud clusters, it etches the entire model, weights and architecture alike, directly onto a custom ASIC. No HBM, no memory bottleneck.
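To see why removing HBM matters, a rough roofline sketch helps: in single-stream decoding, every weight must be read once per generated token, so per-user throughput is capped by weight bandwidth divided by model size. The numbers below are illustrative assumptions (an 8B model with 8-bit weights, roughly one H100's HBM bandwidth, and a made-up on-die figure), not Taalas specifications.

```python
# Back-of-envelope roofline for single-stream decoding: each generated token
# reads every weight once, so per-user throughput is capped by
# (weight bandwidth) / (model size in bytes). All numbers are illustrative
# assumptions for a Llama-3.1-8B-class model, not Taalas specifications.

MODEL_PARAMS = 8e9        # 8B parameters
BYTES_PER_PARAM = 1.0     # assume 8-bit weights
model_bytes = MODEL_PARAMS * BYTES_PER_PARAM


def decode_ceiling(weight_bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-bound ceiling on single-user decode throughput, in tokens/s."""
    return weight_bandwidth_bytes_per_s / model_bytes


# ~3.35 TB/s is roughly one H100's HBM3 bandwidth.
print(f"HBM-streamed ceiling: {decode_ceiling(3.35e12):,.0f} tokens/s")   # ~419
# A hypothetical 150 TB/s of aggregate on-die weight bandwidth, for contrast.
print(f"On-die ceiling:       {decode_ceiling(150e12):,.0f} tokens/s")    # ~18,750
```

At HBM speeds the ceiling for an 8B model is a few hundred tokens per second per user, which is why a per-user figure in the tens of thousands only makes sense if the weights never leave the die.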
Key Claims
- >17,000 tokens per second per user
- <1ms latency
- 20x cheaper than cloud inference
- 60-day turnaround from model selection to custom chip
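The throughput and latency claims above can be sanity-checked against each other. Note that the article does not say what the <1ms figure measures, so treating it as time-to-first-token is an assumption.

```python
# Sanity-check the headline numbers against each other. At a sustained
# single-user rate of 17,000 tokens/s, the mean time per generated token is:
tokens_per_sec = 17_000
per_token_s = 1 / tokens_per_sec
print(f"{per_token_s * 1e6:.0f} us per token")            # ~59 microseconds

# If the separate "<1 ms" figure is time-to-first-token (an assumption; the
# article does not specify), it corresponds to roughly this many token intervals:
print(f"{1e-3 / per_token_s:.0f} token intervals per millisecond")
```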
The Trade-off
In an era where model architectures evolve every few weeks, locking a model into silicon is a significant bet. Taalas acknowledges this risk and positions its approach for domains where latency matters more than raw intelligence: real-time speech models, avatar generation, and computer vision applications.
The 60-day chip cycle is its answer to the obsolescence problem: faster than traditional ASIC timelines, though still slower than a model weight update. A Llama 3.1 8B demo is available at ChatJimmy.ai for anyone to test the claimed speeds directly.
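For anyone who wants to time the demo rather than eyeball it, here is a minimal measurement sketch. It assumes an OpenAI-compatible streaming endpoint, which the article does not confirm ChatJimmy.ai provides (it only mentions a browser demo), so the URL, API key, and model id below are placeholders.

```python
# Minimal sketch for timing streamed generation against an OpenAI-compatible
# chat endpoint. Whether ChatJimmy.ai exposes such an API is an assumption
# (the article only mentions a browser demo), so base_url, api_key, and the
# model id are placeholders to be replaced.
import time

from openai import OpenAI

client = OpenAI(base_url="https://example.invalid/v1", api_key="placeholder")

start = time.perf_counter()
chunk_times = []
stream = client.chat.completions.create(
    model="llama-3.1-8b",  # placeholder model id
    messages=[{"role": "user", "content": "Write 500 words about silicon."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices:  # skip bookkeeping chunks with no choices
        chunk_times.append(time.perf_counter())

ttft = chunk_times[0] - start                    # time to first streamed chunk
decode_window = chunk_times[-1] - chunk_times[0]
n_chunks = len(chunk_times) - 1                  # chunks roughly track tokens
print(f"time to first chunk: {ttft * 1e3:.1f} ms")
print(f"streaming rate: {n_chunks / decode_window:,.0f} chunks/s")
```

Streamed chunks only approximate tokens, so this measures an upper-level user-visible rate rather than the exact decoder throughput.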
Related Articles
Taalas has released an ASIC chip that physically etches Llama 3.1 8B model weights into silicon, achieving 17,000 tokens per second—10x faster, 10x cheaper, and 10x more power-efficient than GPU-based inference systems.
A high-engagement Hacker News thread spotlights Taalas’ claim that model-specific silicon can cut inference latency and cost, including a hard-wired Llama 3.1 8B deployment reportedly reaching 17K tokens/sec per user.
Andrej Karpathy highlights the fundamental memory+compute trade-off challenge in LLMs: fast but small on-chip SRAM versus large but slow off-chip DRAM. He calls optimizing this the most intellectually rewarding puzzle in AI infrastructure today, pointing to NVIDIA's $4.6T market cap as proof.