Qwen 3.5 0.8B Runs Fully In-Browser via WebGPU and Transformers.js
Original: Running Qwen 3.5 0.8B locally in the browser on WebGPU w/ Transformers.js
LLMs Running Without a Server
A demo of Qwen 3.5 0.8B running entirely in the browser, with no server backend, drew 440 upvotes on r/LocalLLaMA. It combines Hugging Face's Transformers.js library with the WebGPU API, so inference runs on the user's own GPU from inside the browser.
How It Works
Transformers.js is a JavaScript library for running Transformer-based models client-side. WebGPU is a modern web API that exposes the device's GPU to web pages for general-purpose compute as well as rendering. As of 2026, browsers that support WebGPU (Chrome, Edge, and Safari) account for approximately 85–90% of global browser traffic. Together, these technologies make it possible to run small LLMs without any server infrastructure.
Hugging Face has published a qwen3-webgpu example in its Transformers.js examples repository, and the Transformers.js v4 release (February 2026) deepened its ONNX Runtime integration, with reported speedups of 3–10x on supported models.
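Because WebGPU is not available in every browser, a page like this typically has to check for it before asking the GPU to run a model. The following is a minimal sketch of that check, not the demo's actual code: the `GpuNavigator` interface and the `chooseDevice` helper are hypothetical names introduced here, while `navigator.gpu.requestAdapter()` is the standard WebGPU entry point and resolves to `null` when no adapter is available.

```typescript
// Minimal shape of the part of `navigator` this sketch relies on.
// (Hypothetical helper type; real browsers expose a richer GPU interface.)
interface GpuNavigator {
  gpu?: { requestAdapter(): Promise<object | null> };
}

type Device = "webgpu" | "wasm";

// Returns "webgpu" only when the browser exposes `navigator.gpu` AND
// hands back a usable adapter; otherwise falls back to "wasm" (CPU).
async function chooseDevice(nav: GpuNavigator): Promise<Device> {
  if (!nav.gpu) return "wasm"; // API not exposed at all
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm"; // null means no suitable GPU
  } catch {
    return "wasm"; // treat any failure as "no WebGPU"
  }
}

// Demo with a mock navigator that has no `gpu` property.
chooseDevice({}).then((device) => console.log(device));
```

In a browser, the result would be passed as the `device` option when creating a Transformers.js pipeline, along the lines of `pipeline("text-generation", MODEL_ID, { device })`, where `MODEL_ID` is a placeholder for whichever ONNX export of the model is being loaded.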
Why Qwen 3.5 0.8B
The 0.8B model in the Qwen 3.5 generation pairs a 262K-token context window and multimodal support with weights small enough to load in a browser. It performs far better than 0.8B-class models from prior generations, which makes in-browser AI genuinely useful rather than just a proof of concept.
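A rough back-of-the-envelope calculation shows why a model of this size is feasible to load in a browser. The sketch below assumes a nominal 0.8 billion parameters and a uniform number of bits per weight; both are simplifications, and the names `BITS_PER_WEIGHT` and `weightSizeGB` are illustrative rather than taken from any library. Real quantized files usually keep some tensors at higher precision and store extra scale data, and the KV cache for long contexts adds memory on top of the weights.

```typescript
// Nominal bits per weight for common precision levels (simplified).
const BITS_PER_WEIGHT = { fp32: 32, fp16: 16, q8: 8, q4: 4 } as const;
type Dtype = keyof typeof BITS_PER_WEIGHT;

// Estimated size of the weights alone, in gigabytes (10^9 bytes).
function weightSizeGB(paramCount: number, dtype: Dtype): number {
  return (paramCount * BITS_PER_WEIGHT[dtype]) / 8 / 1e9;
}

// Assumption: exactly 0.8e9 parameters, as the "0.8B" name suggests.
const PARAMS = 0.8e9;

for (const dtype of Object.keys(BITS_PER_WEIGHT) as Dtype[]) {
  console.log(`${dtype}: ~${weightSizeGB(PARAMS, dtype).toFixed(1)} GB`);
}
```

Under these assumptions the weights come to about 1.6 GB at 16-bit precision and about 0.4 GB at 4-bit, which is within reach of a one-time browser download and a consumer GPU.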
Implications
Browser-native AI deployment enables privacy-first applications (data never leaves the device), zero server costs, and offline AI capabilities. Use cases include translation extensions, document analysis, coding assistants, and more.