LocalLLaMA reacted with genuine wonder because the demo is simple to grasp: a 1.7B Bonsai model, about 290MB, running in a browser through WebGPU. The same thread also did the useful reality check, asking about tokens per second, hallucinations, llama.cpp support, and whether 1-bit models are ready for anything beyond narrow tasks.
#webgpu
A Show HN thread highlighted Gemma Gem, a Chrome extension that runs Gemma 4 locally via WebGPU and exposes page-reading, clicking, typing, scrolling, screenshot, and JavaScript tools without API keys or server-side inference.
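For readers curious how an extension wires tools like these into a page, here is a minimal TypeScript sketch of a content-script tool dispatcher. The tool names and message shapes are assumptions for illustration, not Gemma Gem's actual API.

```typescript
// Hypothetical tool surface for a browser-agent extension; the local
// model picks a ToolCall, the content script executes it in the page.
type ToolCall =
  | { tool: "read_page" }
  | { tool: "click"; selector: string }
  | { tool: "type"; selector: string; text: string }
  | { tool: "scroll"; dy: number };

function runTool(call: ToolCall): string {
  switch (call.tool) {
    case "read_page":
      // Return the visible text for the local model to reason over.
      return document.body.innerText;
    case "click": {
      const el = document.querySelector<HTMLElement>(call.selector);
      el?.click();
      return el ? "clicked" : "selector not found";
    }
    case "type": {
      const el = document.querySelector<HTMLInputElement>(call.selector);
      if (!el) return "selector not found";
      el.value = call.text;
      // Fire an input event so frameworks (React, Vue) see the change.
      el.dispatchEvent(new Event("input", { bubbles: true }));
      return "typed";
    }
    case "scroll":
      window.scrollBy({ top: call.dy, behavior: "smooth" });
      return "scrolled";
  }
}

// Content scripts typically receive tool calls from the extension's
// background worker over chrome.runtime messaging.
chrome.runtime.onMessage.addListener((msg: ToolCall, _sender, sendResponse) => {
  sendResponse(runTool(msg));
});
```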
Cohere said on March 28, 2026 that Transcribe is setting a new bar for speech recognition accuracy in real-world noise and linked to a demo where users can try it. The supporting Hugging Face materials position Transcribe as an Apache 2.0, 2B-parameter ASR model for 14 languages, while a companion WebGPU demo shows the model running locally in the browser.
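One common way to run ASR locally in a browser is Transformers.js on WebGPU; the sketch below shows that pattern. The model id is a placeholder, and whether the Transcribe demo uses this exact stack is an assumption.

```typescript
// Minimal sketch of browser-local speech recognition with
// Transformers.js running on WebGPU.
import { pipeline } from "@huggingface/transformers";

const transcriber = await pipeline(
  "automatic-speech-recognition",
  "onnx-community/whisper-tiny.en", // placeholder model id, not Transcribe's
  { device: "webgpu" },
);

// Transformers.js accepts a URL (or a Float32Array of 16 kHz samples).
const result = await transcriber("https://example.com/clip.wav");
console.log(result);
```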
A LocalLLaMA post claiming that Liquid AI’s LFM2-24B-A2B can run at roughly 50 tokens per second in a browser on an M4 Max reached 79 points and 11 comments. Community interest centered on sparse MoE architecture, ONNX packaging, and whether WebGPU can make the browser a credible local AI deployment target.
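To see why the 2B-active design is what makes the 50 tok/s claim plausible, a back-of-envelope decode estimate helps: per generated token, a sparse MoE streams roughly its active weights once, so memory bandwidth sets the ceiling. Every figure below is an illustrative assumption, not a measurement.

```typescript
// Rough decode-speed ceiling: bandwidth divided by active bytes per token.
const activeParams = 2e9;   // A2B: ~2B parameters active per token
const bytesPerParam = 0.5;  // assuming ~4-bit quantized weights
const bandwidthGBs = 300;   // assumed usable memory bandwidth on an M4 Max

const bytesPerToken = activeParams * bytesPerParam;       // ~1.0 GB/token
const ceilingTokS = (bandwidthGBs * 1e9) / bytesPerToken; // ~300 tok/s

console.log(`theoretical ceiling: ${ceilingTokS.toFixed(0)} tok/s`);
// Browser and runtime overhead eat most of that headroom, which is why
// ~50 tok/s for a 24B-total model is believable only because just 2B
// parameters are active per token.
```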
CanIRun.ai runs entirely in the browser, detects GPU, CPU, and RAM through WebGL, WebGPU, and navigator APIs, and estimates which quantized models fit your machine. HN readers liked the idea but immediately pressed on gaps: missing hardware entries, calibration, and reverse-lookup features.
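The probing itself is straightforward browser API work. Here is a sketch of the general approach; the heuristics are assumptions, not CanIRun.ai's actual code, and note that navigator.deviceMemory is capped at 8 by spec and absent in Safari and Firefox.

```typescript
// Probe CPU, RAM, and GPU capabilities from inside the browser.
async function probeHardware() {
  const cores = navigator.hardwareConcurrency;           // logical CPU count
  const ramGB = (navigator as any).deviceMemory ?? null; // coarse, capped at 8

  // WebGPU: adapter limits hint at how large a model buffer can fit.
  const adapter = await (navigator as any).gpu?.requestAdapter();
  const maxBufferMB = adapter
    ? adapter.limits.maxBufferSize / 2 ** 20
    : null;

  // WebGL fallback: the unmasked renderer string often names the GPU.
  const gl = document.createElement("canvas").getContext("webgl");
  const ext = gl?.getExtension("WEBGL_debug_renderer_info");
  const renderer = ext
    ? gl!.getParameter(ext.UNMASKED_RENDERER_WEBGL)
    : null;

  return { cores, ramGB, maxBufferMB, renderer };
}

probeHardware().then(console.log);
```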
A demo running Qwen 3.5 0.8B entirely in the browser using WebGPU and Transformers.js scored 440 points on r/LocalLLaMA. No server, no API key, no installation required: just a modern browser with GPU access.
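The pattern behind such demos is compact. A hedged Transformers.js sketch follows; the model id is a placeholder, and the demo's exact checkpoint and settings are assumptions.

```typescript
// Minimal browser-local text generation via Transformers.js on WebGPU.
import { pipeline } from "@huggingface/transformers";

const generator = await pipeline(
  "text-generation",
  "onnx-community/Qwen2.5-0.5B-Instruct", // placeholder model id
  { device: "webgpu", dtype: "q4" },      // 4-bit weights keep the download small
);

const out = await generator(
  [{ role: "user", content: "Explain WebGPU in one sentence." }],
  { max_new_tokens: 64 },
);
console.log(out);
```

The whole pipeline (weight download, tokenization, GPU inference) happens client-side, which is exactly why no server, API key, or installation is needed.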