Taalas Prints LLM Weights into Silicon: 17,000 Tokens/sec at 10x Lower Cost
Original: How Taalas "prints" an LLM onto a chip
Printing an LLM onto Silicon
A startup called Taalas has released a fixed-function ASIC in which the weights of Llama 3.1 8B (3/6-bit quantized) are physically etched directly into silicon, achieving 17,000 tokens per second. That's roughly 30 pages of A4 text generated every second.
According to Taalas, the chip is 10x faster, 10x cheaper in total ownership cost, and uses 10x less power than state-of-the-art GPU-based inference systems.
The Memory Wall Problem
Traditional GPU-based LLM inference faces a fundamental bottleneck: the memory wall. For every token generated, the GPU must repeatedly fetch layer weights from VRAM, perform matrix multiplication, write intermediate results back to VRAM, then fetch the next layer's weights. For Llama 3.1 8B with its 32 layers, this cycle repeats 32 times per token, consuming bandwidth and energy at every step. This is sometimes called the von Neumann bottleneck.
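A back-of-the-envelope sketch makes the bottleneck concrete. The figures below are illustrative assumptions, not Taalas or NVIDIA benchmarks: an 8B-parameter model at fp16 is about 16 GB of weights, and an H100-class HBM3 bus moves roughly 3.35 TB/s.

```python
# Rough model of the memory wall for batch-size-1 decoding:
# every generated token re-reads every layer's weights from VRAM.
NUM_LAYERS = 32
BYTES_PER_LAYER = 8e9 * 2 / NUM_LAYERS  # ~8B params * 2 bytes (fp16), split over 32 layers
HBM_BANDWIDTH = 3.35e12                 # bytes/s, H100-class memory bus (assumed)

def bytes_moved_per_token(num_layers=NUM_LAYERS, bytes_per_layer=BYTES_PER_LAYER):
    """Weight traffic per token: all layers re-fetched, once each."""
    return num_layers * bytes_per_layer

# ~16 GB re-read per token caps single-stream throughput at roughly
# 3.35e12 / 16e9 ≈ 209 tokens/s, regardless of how fast the ALUs are.
tokens_per_sec = HBM_BANDWIDTH / bytes_moved_per_token()
```

Real systems recover throughput by batching many users per weight fetch, but per-user latency stays pinned to this bandwidth ceiling, which is the gap Taalas targets.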
How Taalas Breaks Through
Taalas sidesteps the memory wall entirely by etching all 32 layers of Llama 3.1 8B sequentially onto a chip. Model weights become physical transistors. When user input arrives, it's converted to a vector and flows directly through Layer 1 transistors, with electrical signals propagating through physical wires into Layer 2, and so on until the final output token is generated—no VRAM fetches, no memory bus congestion.
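Functionally, a hard-wired layer stack behaves like function composition with the weights fixed at fabrication time. This toy sketch (names and matrices are hypothetical, not Taalas's design) shows the contrast: there is no weight loading step anywhere in the forward pass, because each "layer" permanently closes over its weights.

```python
from functools import reduce

def make_layer(weights):
    # Weights are captured once, at "fabrication time", and are
    # immutable thereafter -- the software analogue of etched transistors.
    def layer(x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return layer

# Two tiny fixed layers standing in for the 32 etched layers.
etched_layers = [
    make_layer([[1, 0], [0, 1]]),  # identity
    make_layer([[2, 0], [0, 2]]),  # scale by 2
]

def forward(x):
    # The activation vector flows stage to stage; no weight fetches occur.
    return reduce(lambda acc, layer: layer(acc), etched_layers, x)
```

On the actual chip this composition is spatial rather than sequential: signals from one layer's output wires feed the next layer's inputs directly.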
Taalas also claims to have developed a proprietary "magic multiplier"—a hardware scheme capable of storing 4-bit data and performing its associated multiplication using a single transistor.
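Taalas has not published how the "magic multiplier" works, but the operation it must compute is standard low-bit quantized arithmetic: a signed 4-bit integer weight plus a scale factor, multiplied by an activation. A minimal sketch of that computation (the *what*, not the single-transistor *how*):

```python
def quant4(w, scale):
    """Quantize a real weight to the signed 4-bit range [-8, 7]."""
    return max(-8, min(7, round(w / scale)))

def mul4(q, scale, x):
    """Dequantize-and-multiply: the step the hardware must implement."""
    return (q * scale) * x

# Example with a power-of-two scale for exact arithmetic.
q = quant4(0.5, 0.125)        # -> 4
y = mul4(q, 0.125, 2.0)       # -> 1.0
```

Collapsing this multiply into one transistor per weight, if it works as claimed, is what lets an 8-billion-parameter model fit on a single die.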
The Fixed-Model Tradeoff
Like a game cartridge, the chip only runs one model and cannot be reprogrammed. However, Taalas designed a base chip with a generic logic gate grid—customizing only the top two mask layers to map a specific model. Developing the Llama 3.1 8B chip took two months, which, in the world of custom silicon, is impressively fast.
On-chip SRAM handles the KV cache for context windows and LoRA adapters for fine-tuning. No external DRAM or HBM is required. If production costs scale down, this architecture could enable low-power, high-throughput AI inference at the edge—without dependence on large GPU clusters.
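LoRA is what makes a fixed-weight chip adaptable at all: the etched base weights W stay frozen, and only a small low-rank adapter (B·A) lives in rewritable SRAM. A pure-Python sketch with illustrative dimensions (not Taalas specifics):

```python
d, r = 4, 1  # hidden size and LoRA rank (illustrative)

# Base weight: frozen, the analogue of etched silicon (identity here).
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
# Low-rank adapter: small, trainable, SRAM-resident. Zero-initialized B
# means the adapter starts as a no-op, as in standard LoRA.
A = [[0.1] * d]                  # r x d
B = [[0.0] for _ in range(d)]    # d x r

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def forward(x):
    # Output = W @ x (fixed hardware path) + B @ (A @ x) (SRAM path).
    ax = matvec(A, x)
    return [base + adj for base, adj in zip(matvec(W, x), matvec(B, ax))]
```

Updating a fine-tune then means rewriting a few kilobytes of SRAM rather than fabricating a new mask set.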
Related Articles
Startup Taalas proposes baking entire LLM weights and architecture into custom ASICs, claiming 17K+ tokens/second per user, sub-1ms latency, and 20x lower cost than cloud — all achievable within a 60-day chip production cycle.
A high-engagement Hacker News thread spotlights Taalas’ claim that model-specific silicon can cut inference latency and cost, including a hard-wired Llama 3.1 8B deployment reportedly reaching 17K tokens/sec per user.
Andrej Karpathy highlights the fundamental memory+compute trade-off challenge in LLMs: fast but small on-chip SRAM versus large but slow off-chip DRAM. He calls optimizing this the most intellectually rewarding puzzle in AI infrastructure today, pointing to NVIDIA's $4.6T market cap as proof.