Ternary Bonsai hit LocalLLaMA, where compression claims get tested

Original: Ternary Bonsai: Top intelligence at 1.58 bits

LLM · Apr 17, 2026 · By Insights AI (Reddit) · 2 min read

A smaller model family with a sharp caveat

PrismML's Ternary Bonsai post reached LocalLLaMA with 112 points and 34 comments because it speaks directly to the community's favorite constraint: how much useful model can fit on ordinary hardware. PrismML says Ternary Bonsai uses 1.58-bit weights with three states, {-1, 0, +1}, across embeddings, attention layers, MLPs, and the LM head. The family includes 1.7B, 4B, and 8B parameter models, with the 8B version reported at 1.75GB and a 75.5 average benchmark score.
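
The 1.58 figure is just log2(3), the information content of a three-way choice. The post does not spell out the quantization recipe, but a common way to get three-state weights is absmean quantization in the style of BitNet b1.58: scale each tensor by its mean absolute value, round to the nearest of {-1, 0, +1}, and keep the scale around for dequantization. A minimal numpy sketch of that general technique, illustrative rather than PrismML's confirmed method:

```python
import numpy as np

def ternarize(w: np.ndarray, eps: float = 1e-8):
    """Quantize a weight tensor to {-1, 0, +1} plus a per-tensor scale.

    Absmean-style scheme (as in BitNet b1.58). PrismML's actual recipe
    may differ; this is a sketch of the technique, not their code.
    """
    scale = np.abs(w).mean() + eps                      # per-tensor scale
    w_t = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return w_t, scale                                   # dequant: w_t * scale

w = np.random.randn(4096, 4096).astype(np.float32)
w_t, s = ternarize(w)
# each stored entry carries log2(3) ~ 1.585 bits of information
```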

The headline claim is attractive. PrismML says Ternary Bonsai 8B improves on 1-bit Bonsai 8B by 5 average benchmark points while adding about 600MB, and runs natively on Apple devices through MLX. It also reports 82 toks/sec on M4 Pro and 27 toks/sec on iPhone 17 Pro Max. For edge AI and local assistants, those are the kinds of numbers that make people stop scrolling.
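
The 1.75GB figure is also consistent with back-of-envelope arithmetic, give or take packing and metadata overhead (the overhead reasoning below is an assumption; the post does not break the size down):

```python
params = 8e9                      # 8B parameters
bpw = 1.585                       # log2(3) bits per ternary weight

raw_gb = params * bpw / 8 / 1e9
print(f"raw ternary payload: {raw_gb:.2f} GB")        # ~1.6 GB
# Per-tensor scales, packing granularity, and any tensors kept at
# higher precision (norms, etc.) plausibly account for the gap up
# to the reported 1.75 GB. The same model in FP16 would be ~16 GB.
```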

LocalLLaMA asked for fair comparisons

The top comments were not hostile, but they were skeptical in a very LocalLLaMA way. Several users questioned whether comparing Ternary Bonsai size against full 16-bit peers makes the advantage look larger than it would against Q4 quantized models. Others wanted benchmarks against quantized Qwen variants, since the community already lives in the world of GGUF files and mixed quantization tradeoffs rather than clean FP16 baselines.
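
The objection is easy to quantify. GGUF Q4 schemes land around 4.5 to 5 effective bits per weight once block scales and metadata are counted, so the honest size gap is roughly 3x, not the 10x an FP16 baseline implies. A rough comparison, treating the bits-per-weight figures as approximations:

```python
params = 8e9
# approximate effective bits per weight, metadata included
formats = {"fp16": 16.0, "gguf q4 (approx)": 4.8, "ternary": 1.585}
for name, bpw in formats.items():
    print(f"{name:>16}: {params * bpw / 8 / 1e9:5.2f} GB")
# fp16 ~16 GB, q4 ~4.8 GB, ternary ~1.6 GB -- against Q4 the
# advantage is ~3x, which is why commenters want that comparison.
```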

Another point was provenance. Commenters noted that the models appear to be quantized from Qwen3 rather than trained from scratch with quantization awareness. That does not make the work useless, but it changes how users interpret the claim. LocalLLaMA wants practical models, not just clever tables. If a 1.58-bit model is smaller but loses too much quality compared with a well-tuned Q4 model, the memory win may not be enough.
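
The distinction being drawn is between post-training quantization, where a finished full-precision checkpoint is converted after the fact, and quantization-aware training, where the weights adapt to the ternary grid during training, typically via a straight-through estimator. A minimal PyTorch sketch of the QAT forward pass, again a generic technique rather than PrismML's confirmed method:

```python
import torch

def ternary_ste(w: torch.Tensor) -> torch.Tensor:
    """Quantization-aware forward pass with a straight-through estimator.

    The forward computation uses the ternary weights; the backward pass
    treats (w_q - w) as constant, so gradients flow to the
    full-precision master weights directly.
    """
    scale = w.abs().mean().clamp(min=1e-8)
    w_q = (w / scale).round().clamp(-1, 1) * scale
    return w + (w_q - w).detach()
```

A model trained this way has already adapted to the ternary constraint; converting a finished Qwen3 checkpoint skips that adaptation, which is exactly why provenance affects expected quality.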

The real question is the Pareto frontier

Ternary Bonsai is interesting because it lands between two instincts. One instinct wants the smallest possible model that can run everywhere, even in a browser or on a phone. The other wants the best quality per watt and per gigabyte, especially for always-on local workflows. A 1.58-bit family may be useful if it genuinely shifts that performance-size curve, not merely if it beats uncompressed models in a table.
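
"Shifts the curve" has a concrete test: a release earns a spot on the frontier only if no existing model is both smaller and at least as strong. A small dominance-check sketch; the ternary entry uses the numbers from the post, while the Qwen rows are hypothetical placeholders, not real benchmark results:

```python
def pareto_frontier(models):
    """Keep models that are not dominated: no rival is <= in size and
    >= in score, with at least one inequality strict."""
    return [
        name for name, size, score in models
        if not any(
            s2 <= size and sc2 >= score and (s2 < size or sc2 > score)
            for n2, s2, sc2 in models if n2 != name
        )
    ]

models = [
    ("ternary-bonsai-8b", 1.75, 75.5),   # figures from the post
    ("qwen3-8b-q4",       4.80, 78.0),   # hypothetical placeholder
    ("qwen3-4b-q4",       2.40, 72.0),   # hypothetical placeholder
]
print(pareto_frontier(models))   # the 4B placeholder is dominated
```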

The thread's energy was therefore constructive pressure. Users asked for larger variants, especially 35B or 122B-style releases, and for stronger comparisons against the formats they actually run. That is a healthy sign. The community is excited by extreme compression, but it has learned to demand reproducible numbers, realistic baselines, and downloads that survive contact with real prompts.

PrismML post · Reddit discussion
