Taalas: Etching LLM Weights Directly into Silicon Achieves 16,000 Tokens/Second
Original post: "Taalas: LLMs baked into hardware. No HBM, weights and model architecture in silicon -> 16,000 tokens/second"
Baking LLMs into Hardware
Startup Taalas has unveiled a radical approach to AI inference hardware, earning 785 upvotes on Reddit's r/singularity. Their method: etch an LLM's weights and architecture directly into a silicon chip, eliminating the need for High Bandwidth Memory (HBM) entirely.
How It Works
Conventional AI inference hardware stores model weights in HBM and streams them into the processor on every forward pass. Taalas flips this entirely (see the back-of-the-envelope sketch after this list):
- Model weights etched directly into silicon (no HBM required)
- Llama 3.1 8B demo achieves 16,000 tokens per second
- Dramatically higher inference speed vs. conventional GPU setups
- Demo available at chatjimmy.ai
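To see why removing HBM matters, here is a rough back-of-the-envelope sketch (our own illustration in Python, not figures published by Taalas): at batch size 1, autoregressive decoding reads every weight once per generated token, so token throughput is capped by effective weight bandwidth.

```python
# Illustrative arithmetic: weight-read bandwidth implied by 16,000 tok/s.
# Assumes Llama 3.1 8B quantized to 8-bit weights; Taalas has not disclosed
# its precision, so treat these numbers as order-of-magnitude only.

params = 8e9                 # ~8 billion parameters
bytes_per_param = 1          # 8-bit weights (assumption)
weight_bytes = params * bytes_per_param   # ~8 GB of weights read per token

tokens_per_second = 16_000
required_bandwidth_tb_s = weight_bytes * tokens_per_second / 1e12

print(f"Effective weight bandwidth: ~{required_bandwidth_tb_s:,.0f} TB/s")
# ~128 TB/s -- far beyond the roughly 3-8 TB/s of a current HBM3/HBM3e GPU.
# Weights etched into the datapath never cross a memory bus, so that cost
# disappears entirely.
```

A conventional GPU closes this gap by batching many requests so each weight read is amortized across them; fixing the weights in silicon instead removes the read altogether, which is what makes single-stream throughput at this level plausible.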
The Trade-off: Speed vs. Flexibility
The approach eliminates memory bandwidth as a bottleneck, enabling blazingly fast inference. However, the community flagged a significant risk: in a landscape where model architectures evolve in weeks rather than years, permanently etching a specific architecture into hardware is a high-stakes bet.
If a superior architecture emerges — which happens regularly — the hardware becomes obsolete. This limits the approach to specialized, stable deployments where a specific model will be used long-term.
Potential Applications
For edge devices, embedded systems, and high-frequency inference use cases where model stability is acceptable, Taalas's approach could offer a compelling combination of speed and power efficiency. The question is whether model architectures will stabilize enough to make fixed-silicon inference economically viable at scale.