Taalas Prints LLM Weights into Silicon: 17,000 Tokens/sec at 10x Lower Cost

Original: How Taalas "prints" an LLM onto a chip

LLM · Feb 22, 2026 · By Insights AI (HN) · 2 min read

Printing an LLM onto Silicon

A startup called Taalas has released a fixed-function ASIC chip that physically etches the weights of Llama 3.1 8B (3/6-bit quantized) directly into silicon—achieving 17,000 tokens per second. That's roughly 30 pages of A4 text generated every second.
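The "30 pages per second" figure checks out as a back-of-envelope estimate. A quick sanity check, assuming roughly 0.75 English words per token and about 425 words per A4 page of plain text (typical figures, not from the article):

```python
# Sanity-check the "~30 A4 pages per second" claim at 17,000 tokens/sec.
# Assumed conversion factors (not from the article):
TOKENS_PER_SEC = 17_000
WORDS_PER_TOKEN = 0.75      # rough average for English text
WORDS_PER_A4_PAGE = 425     # single-spaced plain text

words_per_sec = TOKENS_PER_SEC * WORDS_PER_TOKEN
pages_per_sec = words_per_sec / WORDS_PER_A4_PAGE
print(f"{pages_per_sec:.0f} A4 pages/sec")  # → 30
```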

According to Taalas, the chip is 10x faster, 10x cheaper in total ownership cost, and uses 10x less power than state-of-the-art GPU-based inference systems.

The Memory Wall Problem

Traditional GPU-based LLM inference faces a fundamental bottleneck: the memory wall. For every token generated, the GPU must repeatedly fetch layer weights from VRAM, perform matrix multiplications, write intermediate results back to VRAM, then fetch the next layer's weights. For Llama 3.1 8B with its 32 layers, this cycle repeats 32 times per token, consuming bandwidth and energy at every step. This is one manifestation of the von Neumann bottleneck.
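The memory wall puts a hard ceiling on single-stream decode speed: if every token must stream all the weights through the memory bus once, bandwidth alone bounds throughput. A rough estimate, assuming 8B parameters at an average of ~4.5 bits after the 3/6-bit quantization and the ~3.35 TB/s HBM3 bandwidth of a current flagship datacenter GPU (both figures are my assumptions, not from the article):

```python
# Memory-bandwidth upper bound on single-stream decode throughput.
# Assumptions (not from the article): all weights are read once per token;
# 8e9 params at an average ~4.5 bits/weight; ~3.35 TB/s HBM bandwidth.
PARAMS = 8e9
BITS_PER_WEIGHT = 4.5
HBM_BANDWIDTH = 3.35e12  # bytes/sec

bytes_per_token = PARAMS * BITS_PER_WEIGHT / 8
max_tokens_per_sec = HBM_BANDWIDTH / bytes_per_token
print(f"~{max_tokens_per_sec:.0f} tokens/sec upper bound")  # → ~744
```

Even this optimistic bound sits well below 17,000 tokens/sec, which is why removing the weight traffic entirely changes the picture.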

How Taalas Breaks Through

Taalas sidesteps the memory wall entirely by etching all 32 layers of Llama 3.1 8B sequentially onto a chip. Model weights become physical transistors. When user input arrives, it's converted to a vector and flows directly through Layer 1 transistors, with electrical signals propagating through physical wires into Layer 2, and so on until the final output token is generated—no VRAM fetches, no memory bus congestion.
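The dataflow idea above can be sketched in software: weights become fixed constants baked into each stage, and only the activation vector moves. This is a toy illustration with made-up dimensions, not Taalas's design:

```python
# Minimal sketch of a fixed-weight dataflow pipeline. Weights are
# immutable constants "baked in" at build time and never fetched from
# external memory; only the activation vector flows stage to stage.
import numpy as np

rng = np.random.default_rng(0)
N_LAYERS, DIM = 4, 8  # toy sizes; the real chip hard-wires 32 layers

# "Etched" weights: created once, then treated as read-only constants.
LAYERS = [rng.standard_normal((DIM, DIM)).astype(np.float32)
          for _ in range(N_LAYERS)]

def forward(x: np.ndarray) -> np.ndarray:
    # The signal propagates through each fixed stage in sequence,
    # like activations flowing through hard-wired layers.
    for w in LAYERS:
        x = np.maximum(w @ x, 0.0)  # stand-in for a transformer block
    return x

out = forward(rng.standard_normal(DIM).astype(np.float32))
print(out.shape)  # (8,)
```

In the silicon version there is no loop at all: each "iteration" is a physically separate region of the chip, and all stages can work on different tokens at once.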

Taalas also claims to have developed a proprietary "magic multiplier"—a hardware scheme capable of storing 4-bit data and performing its associated multiplication using a single transistor.
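The article gives no detail on how the single-transistor cell works. As background only, this is the arithmetic such a 4-bit weight cell must implement: a weight stored as a signed 4-bit integer plus a shared scale, multiplied against an incoming activation. This is the standard quantized-multiply definition, not Taalas's circuit:

```python
# Background sketch (not Taalas's scheme): the function a 4-bit weight
# cell computes. A weight is stored as a signed 4-bit integer q plus a
# per-group scale; the cell's "multiply" is q * scale * activation.
def quantize_4bit(w: float, scale: float) -> int:
    q = round(w / scale)
    return max(-8, min(7, q))  # clamp to signed 4-bit range [-8, 7]

def cell_multiply(q: int, activation: float, scale: float) -> float:
    return q * scale * activation

scale = 0.1
q = quantize_4bit(0.33, scale)  # → 3
print(cell_multiply(q, 2.0, scale))
```

The claim, then, is that this store-and-multiply, which normally takes a memory cell plus a multiplier circuit, collapses into one transistor.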

The Fixed-Model Tradeoff

Like a game cartridge, the chip only runs one model and cannot be reprogrammed. However, Taalas designed a base chip with a generic logic gate grid—customizing only the top two mask layers to map a specific model. Developing the Llama 3.1 8B chip took two months, which, in the world of custom silicon, is impressively fast.

On-chip SRAM handles the KV cache for the context window and LoRA adapters for fine-tuning. No external DRAM or HBM is required. If production costs fall with scale, this architecture could enable low-power, high-throughput AI inference at the edge, without dependence on large GPU clusters.
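On-chip SRAM is a scarce resource, so KV-cache size matters. A rough sizing for Llama 3.1 8B, using the model's published architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and assuming fp16 cache entries (the article does not state the cache precision):

```python
# Rough KV-cache sizing for Llama 3.1 8B, to gauge on-chip SRAM needs.
# Architecture figures are standard for this model; fp16 entries are an
# assumption, since the article does not state cache precision.
LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_ENTRY = 32, 8, 128, 2

# Factor of 2: one K vector and one V vector per layer per token.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ENTRY
print(kv_bytes_per_token / 1024, "KiB per token")               # → 128.0
print(kv_bytes_per_token * 8192 / 2**20, "MiB for 8k context")  # → 1024.0
```

At fp16 an 8k-token context needs about 1 GiB of cache, far beyond typical on-die SRAM budgets, so the chip presumably quantizes the cache and/or limits the context length; the article doesn't say which.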


© 2026 Insights. All rights reserved.