Ternary Bonsai squeezes 8B models to 1.75GB at 1.58 bits
Original: "Today we're announcing Ternary Bonsai: Top intelligence at 1.58 bits"
PrismML's April 16 X post matters because it gives open-model builders a concrete efficiency claim: the tweet says Ternary Bonsai uses "ternary weights {-1, 0, +1}" and frames the family as 1.58-bit language models. The post was created at 2026-04-16 17:39:18 UTC, inside the requested 48-hour window.
The numbers are the story. PrismML says the models are 9x smaller than their 16-bit counterparts and are released under Apache 2.0 in three sizes: 8B at 1.75GB, 4B at 0.86GB, and 1.7B at 0.37GB. The public Hugging Face page lists the Ternary Bonsai collection, MLX model entries, and a demo collection, all updated on April 16. Community replies also point to ONNX, MLX, and browser WebGPU demos, but the model cards and benchmark details are what need close reading next.
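Those size claims can be sanity-checked with back-of-envelope arithmetic. The sketch below computes the theoretical storage floor for pure ternary weights, where each of three values needs log2(3) ≈ 1.585 bits; the published file sizes should sit slightly above this floor once embeddings, norms, and packing overhead are included. The figures here are from the post, not independently verified.

```python
import math

BITS_PER_WEIGHT = math.log2(3)  # ~1.585 bits to encode {-1, 0, +1}

def ternary_floor_gb(params_billions: float) -> float:
    """Minimum storage in GB if every parameter were stored ternary."""
    total_bits = params_billions * 1e9 * BITS_PER_WEIGHT
    return total_bits / 8 / 1e9  # bits -> bytes -> GB

# Compare the floor against PrismML's claimed file sizes.
for params, claimed in [(8, 1.75), (4, 0.86), (1.7, 0.37)]:
    floor = ternary_floor_gb(params)
    print(f"{params}B: floor {floor:.2f} GB vs claimed {claimed} GB")
```

Each claimed size lands a little above its ternary floor (e.g. 1.58 GB for 8B parameters), which is what you'd expect if the weights really are packed at ~1.58 bits each.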
The technical hook is the ternary weight format. Instead of storing each weight as a higher-precision floating-point value, the model family restricts weights to three values, {-1, 0, +1}, which takes only log2(3) ≈ 1.58 bits per weight, and relies on training and kernels to keep quality usable. That is why the size numbers are so aggressive, and why deployment support matters as much as the headline benchmark image. The Hugging Face collection's MLX entries point to Apple Silicon as one intended local path, while browser and WebGPU demos would make the release more interesting for client-side agents. Independent perplexity, coding, and instruction-following tests will decide whether the compression is practical or mostly a research artifact.
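To make the format concrete, here is a minimal sketch of one common ternary quantization scheme: round each weight to {-1, 0, +1} after dividing by a per-tensor absolute-mean scale. PrismML has not published its training or quantization recipe, so this is illustrative of the weight format only, not their method.

```python
def ternarize(weights, eps=1e-8):
    """Map a flat list of float weights to {-1, 0, +1} plus one scale."""
    # Per-tensor absmean scale; eps avoids division by zero on all-zero tensors.
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to nearest integer, then clamp into the ternary set.
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from ternary codes."""
    return [v * scale for v in q]

q, s = ternarize([0.03, -0.01, 0.0, 0.05])
print(q)  # -> [1, 0, 0, 1]
```

Small weights collapse to 0 and large ones saturate at ±1, which is why training-aware methods (rather than post-hoc rounding) are usually needed to keep quality usable at this precision.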
PrismML describes itself as focused on AI efficiency, so this post fits its usual lane: making local and low-memory inference more practical. The next watch item is replication. If the benchmark image and model cards hold up across independent tests, a 1.58-bit family that stays usable at 8B, 4B, and 1.7B sizes could matter for browser demos, phones, and private local agents. If not, the release will still be a useful stress test for how much reasoning quality survives extreme quantization.
Related Articles
A merged Hugging Face Transformers PR that surfaced on r/LocalLLaMA shows Mistral 4 as a hybrid instruct/reasoning model with 128 experts, 4 active experts, 6.5B activated parameters per token, 256k context, and Apache 2.0 licensing.
Google's AI Edge team said on April 2, 2026 that Gemma 4 is bringing multi-step agentic workflows to phones, desktops, and edge hardware under an Apache 2.0 license. The launch combines open models, Agent Skills, and LiteRT-LM deployment tooling.
Quantization only matters when the accuracy hit stays small enough to use in production. Red Hat AI says its quantized Gemma 4 31B keeps 99%+ accuracy while delivering nearly 2x tokens/sec at half the memory footprint, with weights released openly via LLM Compressor.