Ternary Bonsai squeezes 8B models to 1.75GB at 1.58 bits
Original: Today we’re announcing Ternary Bonsai: Top intelligence at 1.58 bits
PrismML's April 16 X post matters because it gives open-model builders a concrete efficiency claim. The source tweet says Ternary Bonsai uses "ternary weights {-1, 0, +1}" and frames the family as 1.58-bit language models. It was posted at 2026-04-16 17:39:18 UTC, inside the requested 48-hour window.
The numbers are the story. PrismML says the models are 9x smaller than their 16-bit counterparts and are released under Apache 2.0 in three sizes: 8B at 1.75GB, 4B at 0.86GB, and 1.7B at 0.37GB. The public Hugging Face listing includes the Ternary Bonsai collection, MLX model entries, and a demo collection, all updated on April 16. Community replies also point to ONNX, MLX, and browser WebGPU demos, but the model cards and benchmark details are what need close reading next.
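The headline sizes can be sanity-checked with back-of-envelope arithmetic. A ternary weight carries log2(3) ≈ 1.58 bits of information; the small gap between the idealized figures and the published ones plausibly comes from tensors kept at higher precision (embeddings, norms, quantization scales), though only the model cards can confirm that:

```python
import math

BITS_PER_WEIGHT = math.log2(3)  # ternary {-1, 0, +1} -> ~1.585 bits


def ternary_weight_gb(params_billion: float) -> float:
    """Idealized weight storage: params * log2(3) bits, converted to GB."""
    return params_billion * 1e9 * BITS_PER_WEIGHT / 8 / 1e9


for n in (8, 4, 1.7):
    print(f"{n}B params -> {ternary_weight_gb(n):.2f} GB (weights only)")

# 16-bit baseline for the 8B model: 8e9 params * 2 bytes = 16 GB.
# 16 / 1.75 is roughly 9.1, consistent with the "9x smaller" claim.
print(16 / 1.75)
```

The idealized figures (about 1.58, 0.79, and 0.34 GB) sit just under the published 1.75, 0.86, and 0.37 GB, which is the right direction for a release that keeps some tensors above ternary precision.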
The technical hook is the ternary weight format. Instead of storing each weight as a higher-precision floating-point value, the model family restricts weights to three values and relies on training and kernels to keep quality usable. That is why the size numbers are so aggressive, and why deployment support matters as much as the headline benchmark image. The Hugging Face collection's MLX entries point to Apple Silicon as one intended local path, while browser and WebGPU demos would make the release more interesting for client-side agents. Independent perplexity, coding, and instruction-following tests will decide whether the compression is practical or mostly a research artifact.
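PrismML has not published its quantization recipe in this thread. As a hedged illustration only, the absmean scheme used by BitNet b1.58 (a prior 1.58-bit family) maps each weight to {-1, 0, +1} with a single per-tensor scale; a minimal pure-Python sketch of that scheme:

```python
def ternarize(weights, eps=1e-8):
    """Absmean ternarization (BitNet b1.58-style, for illustration):
    scale by the mean |w|, then round and clip each weight to {-1, 0, +1}."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Coarse reconstruction: each ternary code times the shared scale."""
    return [v * scale for v in q]


w = [0.42, -1.3, 0.05, 0.9, -0.2]
q, s = ternarize(w)
print(q)                 # only -1, 0, or +1 appears
print(dequantize(q, s))  # coarse approximation of w
```

Actual 1.58-bit storage then packs these codes densely (five ternary digits fit in one byte, since 3^5 = 243 < 256), which is where the roughly 9x reduction over 16-bit weights comes from.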
PrismML describes itself as focused on AI efficiency, so this post fits its usual lane: making local and low-memory inference more practical. The next watch item is replication. If the benchmark image and model cards hold up across independent tests, a 1.58-bit family that stays usable at 8B, 4B, and 1.7B sizes could matter for browser demos, phones, and private local agents. If not, the release will still be a useful stress test for how much reasoning quality survives extreme quantization.
Related Articles
Quantization only matters when the accuracy hit stays small enough to use in production. Red Hat AI says its quantized Gemma 4 31B keeps 99%+ accuracy while delivering nearly 2x tokens/sec at half the memory footprint, with weights released openly via LLM Compressor.
r/LocalLLaMA liked this comparison because it replaces reputation and anecdote with a more explicit distribution-based yardstick. The post ranks community Qwen3.5-9B GGUF quants by mean KLD versus a BF16 baseline, with Q8_0 variants leading on fidelity and several IQ4/Q5 options standing out on size-to-drift trade-offs.
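KLD here is the Kullback-Leibler divergence between a quant's next-token distributions and the BF16 baseline's, averaged over positions; lower means the quant behaves more like the reference. The post's exact tooling is not reproduced here, but the metric itself is simple, sketched with toy distributions:

```python
import math


def mean_kld(baseline, quant):
    """Mean per-position KL(baseline || quant) over next-token probability
    distributions; 0.0 means the two models agree exactly."""
    total = 0.0
    for p, q in zip(baseline, quant):
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(baseline)


# Toy 3-way next-token distributions at two positions.
base = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
drift = [[0.6, 0.25, 0.15], [0.5, 0.3, 0.2]]
print(mean_kld(base, base))    # 0.0
print(mean_kld(base, drift))   # small positive drift
```

A distribution-level metric like this is stricter than top-1 accuracy: a quant can pick the same argmax token while still reshaping the tail of the distribution, and KLD catches that.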
A Vulmon X post on April 7, 2026 surfaced CVE-2026-1839, an arbitrary code execution issue in Hugging Face Transformers Trainer checkpoint loading. CVE.org says affected versions before v5.0.0rc3 can execute malicious code from crafted rng_state.pth files under PyTorch below 2.6, and the fix adds weights_only=True.
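The vulnerability class is worth spelling out: a .pth checkpoint such as rng_state.pth is a Python pickle, and unpickling can invoke arbitrary callables via `__reduce__`. A self-contained demonstration of the mechanism, using `eval` on a harmless expression as a stand-in for malicious code (the real exploit details live in the CVE, not here):

```python
import pickle


class NotARngState:
    """A hostile object: __reduce__ tells pickle to call eval at load
    time, so merely loading the file executes attacker-chosen code."""

    def __reduce__(self):
        return (eval, ("40 + 2",))


payload = pickle.dumps(NotARngState())
result = pickle.loads(payload)   # the expression runs during loading
print(result)                    # 42
```

This is why the fix matters: `torch.load(..., weights_only=True)` restricts unpickling to tensors and plain containers instead of arbitrary callables, closing this path for untrusted checkpoint files.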