Taalas Prints LLM Weights into Silicon: 17,000 Tokens/sec at 10x Lower Cost
Original: How Taalas "prints" an LLM onto a chip
Printing an LLM onto Silicon
A startup called Taalas has released a fixed-function ASIC in which the weights of Llama 3.1 8B (3/6-bit quantized) are physically etched directly into silicon, achieving 17,000 tokens per second. That's roughly 30 pages of A4 text generated every second.
According to Taalas, the chip is 10x faster, 10x cheaper in total ownership cost, and uses 10x less power than state-of-the-art GPU-based inference systems.
The Memory Wall Problem
Traditional GPU-based LLM inference faces a fundamental bottleneck: the memory wall. For every token generated, the GPU must repeatedly fetch layer weights from VRAM, perform matrix multiplication, write intermediate results back to VRAM, then fetch the next layer's weights. For Llama 3.1 8B with its 32 layers, this cycle repeats 32 times per token, consuming bandwidth and energy at every step. This is sometimes called the von Neumann bottleneck.
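A back-of-the-envelope sketch makes the bottleneck concrete. The figures below are illustrative assumptions, not Taalas or NVIDIA benchmarks: an 8B-parameter model at fp16 is about 16 GB of weights, and an H100-class HBM3 bus moves roughly 3.35 TB/s.

```python
# Rough model of the memory wall for batch-size-1 decoding:
# every generated token re-reads every layer's weights from VRAM.
NUM_LAYERS = 32
BYTES_PER_LAYER = 8e9 * 2 / NUM_LAYERS  # ~8B params * 2 bytes (fp16), split over 32 layers
HBM_BANDWIDTH = 3.35e12                 # bytes/s, H100-class memory bus (assumed)

def bytes_moved_per_token(num_layers=NUM_LAYERS, bytes_per_layer=BYTES_PER_LAYER):
    """Weight traffic per token: all layers re-fetched, once each."""
    return num_layers * bytes_per_layer

# ~16 GB re-read per token caps single-stream throughput at roughly
# 3.35e12 / 16e9 ≈ 209 tokens/s, regardless of how fast the ALUs are.
tokens_per_sec = HBM_BANDWIDTH / bytes_moved_per_token()
```

Real systems recover throughput by batching many users per weight fetch, but per-user latency stays pinned to this bandwidth ceiling, which is the gap Taalas targets.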
How Taalas Breaks Through
Taalas sidesteps the memory wall entirely by etching all 32 layers of Llama 3.1 8B sequentially onto a chip. Model weights become physical transistors. When user input arrives, it's converted to a vector and flows directly through Layer 1 transistors, with electrical signals propagating through physical wires into Layer 2, and so on until the final output token is generated—no VRAM fetches, no memory bus congestion.
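Functionally, a hard-wired layer stack behaves like function composition with the weights fixed at fabrication time. This toy sketch (names and matrices are hypothetical, not Taalas's design) shows the contrast: there is no weight loading step anywhere in the forward pass, because each "layer" permanently closes over its weights.

```python
from functools import reduce

def make_layer(weights):
    # Weights are captured once, at "fabrication time", and are
    # immutable thereafter -- the software analogue of etched transistors.
    def layer(x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return layer

# Two tiny fixed layers standing in for the 32 etched layers.
etched_layers = [
    make_layer([[1, 0], [0, 1]]),  # identity
    make_layer([[2, 0], [0, 2]]),  # scale by 2
]

def forward(x):
    # The activation vector flows stage to stage; no weight fetches occur.
    return reduce(lambda acc, layer: layer(acc), etched_layers, x)
```

On the actual chip this composition is spatial rather than sequential: signals from one layer's output wires feed the next layer's inputs directly.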
Taalas also claims to have developed a proprietary "magic multiplier"—a hardware scheme capable of storing 4-bit data and performing its associated multiplication using a single transistor.
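Taalas has not published how the "magic multiplier" works, but the operation it must compute is standard low-bit quantized arithmetic: a signed 4-bit integer weight plus a scale factor, multiplied by an activation. A minimal sketch of that computation (the *what*, not the single-transistor *how*):

```python
def quant4(w, scale):
    """Quantize a real weight to the signed 4-bit range [-8, 7]."""
    return max(-8, min(7, round(w / scale)))

def mul4(q, scale, x):
    """Dequantize-and-multiply: the step the hardware must implement."""
    return (q * scale) * x

# Example with a power-of-two scale for exact arithmetic.
q = quant4(0.5, 0.125)        # -> 4
y = mul4(q, 0.125, 2.0)       # -> 1.0
```

Collapsing this multiply into one transistor per weight, if it works as claimed, is what lets an 8-billion-parameter model fit on a single die.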
The Fixed-Model Tradeoff
Like a game cartridge, the chip only runs one model and cannot be reprogrammed. However, Taalas designed a base chip with a generic logic gate grid—customizing only the top two mask layers to map a specific model. Developing the Llama 3.1 8B chip took two months, which, in the world of custom silicon, is impressively fast.
On-chip SRAM handles the KV cache for context windows and LoRA adapters for fine-tuning. No external DRAM or HBM is required. If production costs scale down, this architecture could enable low-power, high-throughput AI inference at the edge—without dependence on large GPU clusters.
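LoRA is what makes a fixed-weight chip adaptable at all: the etched base weights W stay frozen, and only a small low-rank adapter (B·A) lives in rewritable SRAM. A pure-Python sketch with illustrative dimensions (not Taalas specifics):

```python
d, r = 4, 1  # hidden size and LoRA rank (illustrative)

# Base weight: frozen, the analogue of etched silicon (identity here).
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
# Low-rank adapter: small, trainable, SRAM-resident. Zero-initialized B
# means the adapter starts as a no-op, as in standard LoRA.
A = [[0.1] * d]                  # r x d
B = [[0.0] for _ in range(d)]    # d x r

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def forward(x):
    # Output = W @ x (fixed hardware path) + B @ (A @ x) (SRAM path).
    ax = matvec(A, x)
    return [base + adj for base, adj in zip(matvec(W, x), matvec(B, ax))]
```

Updating a fine-tune then means rewriting a few kilobytes of SRAM rather than fabricating a new mask set.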
Related Articles
Startup Taalas proposes baking entire LLM weights and architecture into custom ASICs, claiming 17K+ tokens/second per user, sub-1ms latency, and 20x lower cost than cloud — all achievable within a 60-day chip production cycle.
A high-engagement Hacker News thread spotlights Taalas’ claim that model-specific silicon can cut inference latency and cost, including a hard-wired Llama 3.1 8B deployment reportedly reaching 17K tokens/sec per user.
Andrej Karpathy highlights the fundamental memory+compute trade-off challenge in LLMs: fast but small on-chip SRAM versus large but slow off-chip DRAM. He calls optimizing this the most intellectually rewarding puzzle in AI infrastructure today, pointing to NVIDIA's $4.6T market cap as proof.