Ternary Bonsai hit LocalLLaMA, where compression claims get tested

Original: Ternary Bonsai: Top intelligence at 1.58 bits

LLM · Apr 17, 2026 · By Insights AI (Reddit) · 2 min read

A smaller model family with a sharp caveat

PrismML's Ternary Bonsai post reached LocalLLaMA with 112 points and 34 comments because it speaks directly to the community's favorite constraint: how much useful model can fit on ordinary hardware. PrismML says Ternary Bonsai uses 1.58-bit weights with three states, {-1, 0, +1}, across embeddings, attention layers, MLPs, and the LM head. The family includes 1.7B, 4B, and 8B parameter models, with the 8B version reported at 1.75GB and a 75.5 average benchmark score.
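
The 1.58 figure is just log2(3), the information content of a three-way choice. The post does not spell out the quantization recipe, but a common way to get three-state weights is absmean quantization in the style of BitNet b1.58: scale each tensor by its mean absolute value, round to the nearest of {-1, 0, +1}, and keep the scale around for dequantization. A minimal numpy sketch of that general technique, illustrative rather than PrismML's confirmed method:

```python
import numpy as np

def ternarize(w: np.ndarray, eps: float = 1e-8):
    """Quantize a weight tensor to {-1, 0, +1} plus a per-tensor scale.

    Absmean-style scheme (as in BitNet b1.58). PrismML's actual recipe
    may differ; this is a sketch of the technique, not their code.
    """
    scale = np.abs(w).mean() + eps                      # per-tensor scale
    w_t = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return w_t, scale                                   # dequant: w_t * scale

w = np.random.randn(4096, 4096).astype(np.float32)
w_t, s = ternarize(w)
# each stored entry carries log2(3) ~ 1.585 bits of information
```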

The headline claim is attractive. PrismML says Ternary Bonsai 8B improves on 1-bit Bonsai 8B by 5 average benchmark points while adding about 600MB, and runs natively on Apple devices through MLX. It also reports 82 toks/sec on M4 Pro and 27 toks/sec on iPhone 17 Pro Max. For edge AI and local assistants, those are the kinds of numbers that make people stop scrolling.
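
The 1.75GB figure is also consistent with back-of-envelope arithmetic, give or take packing and metadata overhead (the overhead reasoning below is an assumption; the post does not break the size down):

```python
params = 8e9                      # 8B parameters
bpw = 1.585                       # log2(3) bits per ternary weight

raw_gb = params * bpw / 8 / 1e9
print(f"raw ternary payload: {raw_gb:.2f} GB")        # ~1.6 GB
# Per-tensor scales, packing granularity, and any tensors kept at
# higher precision (norms, etc.) plausibly account for the gap up
# to the reported 1.75 GB. The same model in FP16 would be ~16 GB.
```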

LocalLLaMA asked for fair comparisons

The top comments were not hostile, but they were skeptical in a very LocalLLaMA way. Several users questioned whether comparing Ternary Bonsai size against full 16-bit peers makes the advantage look larger than it would against Q4 quantized models. Others wanted benchmarks against quantized Qwen variants, since the community already lives in the world of GGUF files and mixed quantization tradeoffs rather than clean FP16 baselines.
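
The objection is easy to quantify. GGUF Q4 schemes land around 4.5 to 5 effective bits per weight once block scales and metadata are counted, so the honest size gap is roughly 3x, not the 10x an FP16 baseline implies. A rough comparison, treating the bits-per-weight figures as approximations:

```python
params = 8e9
# approximate effective bits per weight, metadata included
formats = {"fp16": 16.0, "gguf q4 (approx)": 4.8, "ternary": 1.585}
for name, bpw in formats.items():
    print(f"{name:>16}: {params * bpw / 8 / 1e9:5.2f} GB")
# fp16 ~16 GB, q4 ~4.8 GB, ternary ~1.6 GB -- against Q4 the
# advantage is ~3x, which is why commenters want that comparison.
```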

Another point was provenance. Commenters noted that the models appear to be quantized from Qwen3 rather than trained from scratch with quantization awareness. That does not make the work useless, but it changes how users interpret the claim. LocalLLaMA wants practical models, not just clever tables. If a 1.58-bit model is smaller but loses too much quality compared with a well-tuned Q4 model, the memory win may not be enough.
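
The distinction being drawn is between post-training quantization, where a finished full-precision checkpoint is converted after the fact, and quantization-aware training, where the weights adapt to the ternary grid during training, typically via a straight-through estimator. A minimal PyTorch sketch of the QAT forward pass, again a generic technique rather than PrismML's confirmed method:

```python
import torch

def ternary_ste(w: torch.Tensor) -> torch.Tensor:
    """Quantization-aware forward pass with a straight-through estimator.

    The forward computation uses the ternary weights; the backward pass
    treats (w_q - w) as constant, so gradients flow to the
    full-precision master weights directly.
    """
    scale = w.abs().mean().clamp(min=1e-8)
    w_q = (w / scale).round().clamp(-1, 1) * scale
    return w + (w_q - w).detach()
```

A model trained this way has already adapted to the ternary constraint; converting a finished Qwen3 checkpoint skips that adaptation, which is exactly why provenance affects expected quality.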

The real question is the Pareto frontier

Ternary Bonsai is interesting because it lands between two instincts. One instinct wants the smallest possible model that can run everywhere, even in a browser or on a phone. The other wants the best quality per watt and per gigabyte, especially for always-on local workflows. A 1.58-bit family may be useful if it genuinely shifts that performance-size curve, not merely if it beats uncompressed models in a table.
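
"Shifts the curve" has a concrete test: a release earns a spot on the frontier only if no existing model is both smaller and at least as strong. A small dominance-check sketch; the ternary entry uses the numbers from the post, while the Qwen rows are hypothetical placeholders, not real benchmark results:

```python
def pareto_frontier(models):
    """Keep models that are not dominated: no rival is <= in size and
    >= in score, with at least one inequality strict."""
    return [
        name for name, size, score in models
        if not any(
            s2 <= size and sc2 >= score and (s2 < size or sc2 > score)
            for n2, s2, sc2 in models if n2 != name
        )
    ]

models = [
    ("ternary-bonsai-8b", 1.75, 75.5),   # figures from the post
    ("qwen3-8b-q4",       4.80, 78.0),   # hypothetical placeholder
    ("qwen3-4b-q4",       2.40, 72.0),   # hypothetical placeholder
]
print(pareto_frontier(models))   # the 4B placeholder is dominated
```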

The thread's energy was therefore constructive pressure. Users asked for larger variants, especially 35B or 122B-style releases, and for stronger comparisons against the formats they actually run. That is a healthy sign. The community is excited by extreme compression, but it has learned to demand reproducible numbers, realistic baselines, and downloads that survive contact with real prompts.

PrismML post · Reddit discussion
