Karpathy on LLM Memory+Compute: SRAM vs DRAM Trade-offs and the Next Hardware Frontier
The Core Infrastructure Challenge of the LLM Era
AI researcher Andrej Karpathy posted on X in February 2026, noting that with the coming "tsunami of demand for tokens," there are significant opportunities to orchestrate the underlying memory+compute just right for LLMs.
The Fundamental Constraint: SRAM vs DRAM
Karpathy explains a fundamental and non-obvious constraint arising from the chip fabrication process: there are two completely distinct pools of memory with different physical implementations.
- On-chip SRAM: Immediately next to the compute units, incredibly fast, but of very low capacity
- Off-chip DRAM (HBM): Extremely high capacity, but data can only be accessed through what Karpathy calls "a long straw" — meaning bandwidth-limited access
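To see why the "long straw" dominates, consider a back-of-envelope roofline sketch of decode (all numbers below are illustrative assumptions, not figures from Karpathy's post): generating each token requires streaming the model's weights from memory, so single-stream decode speed is capped at bandwidth divided by bytes read per token.

```python
# Back-of-envelope "long straw" model. All numbers are illustrative
# assumptions, not measured figures. During decode, every generated token
# must stream the model weights from memory, so throughput is bounded by
# memory bandwidth, not compute.

def decode_tokens_per_sec(param_bytes: float, bandwidth_bytes_per_sec: float) -> float:
    """Upper bound on single-stream decode speed when memory-bandwidth-bound."""
    return bandwidth_bytes_per_sec / param_bytes

# Hypothetical 70B-parameter model at 8-bit weights: ~70e9 bytes read per token.
weights = 70e9

hbm = decode_tokens_per_sec(weights, 3.35e12)  # ~3.35 TB/s, HBM-class (assumed)
sram = decode_tokens_per_sec(weights, 1e15)    # ~PB/s-class on-chip SRAM (assumed)

print(f"HBM-bound decode:  ~{hbm:.0f} tok/s per stream")
print(f"SRAM-bound decode: ~{sram:.0f} tok/s per stream")
```

Under these assumed numbers the SRAM path is hundreds of times faster per stream, which is the upside of keeping data next to the compute units; the catch, as the bullets above note, is that SRAM capacity is tiny by comparison.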
The Design Challenge
Karpathy argues that designing the optimal physical substrate and orchestrating memory+compute across the top-volume LLM workflows — inference prefill/decode, training/fine-tuning — to achieve the best throughput/latency/dollar ratio is "probably today's most interesting intellectual puzzle with the highest rewards." He cites NVIDIA's $4.6 trillion market cap as evidence.
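The throughput/latency/dollar objective can be made concrete with a toy cost model (a minimal sketch with made-up numbers, not a real pricing analysis):

```python
# Toy throughput-per-dollar metric. Both inputs are assumptions chosen for
# illustration; real accounting would include utilization, batching, and
# power, which this sketch ignores.

def cost_per_million_tokens(tokens_per_sec: float, dollars_per_hour: float) -> float:
    """Serving cost in dollars per 1M generated tokens at a given throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return dollars_per_hour / tokens_per_hour * 1e6

# Hypothetical accelerator: 1,000 tok/s sustained at $3/hour rental.
print(f"${cost_per_million_tokens(1000, 3.0):.2f} per 1M tokens")
```

The point of the exercise is that the same hardware dollar can be spent on bandwidth, capacity, or compute, and the best split depends on which workload (prefill, decode, training) you are optimizing for.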
The Current Dilemma
The workflow that matters most, inference decode over long token contexts in tight agentic loops, is arguably the one that neither of today's two hardware camps serves well:
- HBM-first (NVIDIA-adjacent): High capacity but bandwidth-constrained
- SRAM-first (Cerebras-adjacent): Fast but capacity-limited
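The dilemma can be sketched as a capacity check (every size below is a hypothetical assumption, not a vendor spec): a long-context agentic workload needs both the weights and a large KV cache resident, which blows the SRAM budget, while fitting easily in HBM at the price of bandwidth-limited access.

```python
# Why neither camp wins outright. All sizes are hypothetical assumptions
# for illustration, not vendor specifications.

def fits_on_chip(param_bytes: float, kv_cache_bytes: float, memory_bytes: float) -> bool:
    """True if weights plus KV cache fit in the given memory pool."""
    return param_bytes + kv_cache_bytes <= memory_bytes

weights = 70e9        # assumed 70B model at 8-bit weights
kv_cache = 40e9       # assumed long-context KV cache for agentic loops
sram_capacity = 44e9  # assumed tens-of-GB-class on-chip SRAM pool
hbm_capacity = 640e9  # assumed 8 accelerators x 80 GB HBM

print(fits_on_chip(weights, kv_cache, sram_capacity))  # over the SRAM budget
print(fits_on_chip(weights, kv_cache, hbm_capacity))   # fits, but bandwidth-bound
```

Under these assumptions the SRAM-first design cannot hold the working set at all, and the HBM-first design holds it but must pull it through the "straw" on every decoded token, which is exactly the gap Karpathy describes.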
A Note on MatX
Karpathy closes by praising the MatX team as "A++ grade" and mentions having a small involvement, congratulating them on a recent fundraise. His analysis underscores how critical getting the hardware architecture right will be in the race to produce many tokens, fast and cheap.