Ternary Bonsai shrinks a 1.58-bit open 8B model to 1.75GB

PrismML's April 16 X post matters because it puts a concrete efficiency claim in front of open-model builders. The source tweet says Ternary Bonsai uses "ternary weights {-1, 0, +1}" and frames the family as 1.58-bit language models. The post was published at 2026-04-16 17:39:18 UTC, which is inside the requested 48-hour window. The source tweet is kept alongside this article.

The numbers are the core of the claim. PrismML wrote that these models are 9x smaller than their 16-bit counterparts and are released under the Apache 2.0 license in three sizes: 8B at 1.75GB, 4B at 0.86GB, and 1.7B at 0.37GB. The public Hugging Face collection shows a Ternary Bonsai collection, MLX model entries, and a demo collection, all stamped with an April 16 update. Community replies also mention ONNX, MLX, and browser WebGPU demos, but the next things to examine closely are the model cards and the benchmark details.
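The quoted sizes can be sanity-checked with simple arithmetic. The sketch below is not from PrismML; it assumes the nominal labels (8B, 4B, 1.7B) are exact parameter counts and that "GB" means decimal gigabytes, and the names `MODELS` and `size_summary` are made up for illustration. It derives the effective bits per weight and the reduction against a 16-bit copy, next to the information-theoretic floor of log2(3) bits for a three-state weight.

```python
import math

# Sizes as quoted in the post. Parameter counts are the nominal labels,
# assumed exact here (real counts usually differ a little).
MODELS = {
    "8B": (8.0e9, 1.75),
    "4B": (4.0e9, 0.86),
    "1.7B": (1.7e9, 0.37),
}

def size_summary(params, size_gb):
    """Return (bits per weight, size ratio vs. a 16-bit copy) for one model."""
    size_bytes = size_gb * 1e9           # assumption: decimal gigabytes
    bits_per_weight = size_bytes * 8 / params
    fp16_bytes = params * 2              # 16 bits = 2 bytes per weight
    return bits_per_weight, fp16_bytes / size_bytes

# A weight with three possible states carries at most log2(3) bits.
IDEAL_TERNARY_BITS = math.log2(3)

for name, (params, size_gb) in MODELS.items():
    bpw, ratio = size_summary(params, size_gb)
    print(f"{name}: {bpw:.2f} bits/weight, {ratio:.1f}x smaller than 16-bit")
```

Under these assumptions the three sizes work out to roughly 1.72 to 1.75 bits per weight and a 9.1x to 9.3x reduction, which is consistent with the "9x" headline. The gap above log2(3), about 1.585 bits, could come from scale factors, packing overhead, or layers kept at higher precision, but the post does not say which.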

The technical hook is the ternary weight format. Instead of storing each weight as a higher-precision floating-point value, the model family restricts weights to three values and relies on training and kernels to keep quality usable. That is why the size numbers look so aggressive, and why deployment support matters as much as the headline benchmark image. The MLX entries in the Hugging Face collection point to Apple Silicon as one intended local path. If the browser and WebGPU demos run reliably, the models also become relevant for client-side agents. Independent perplexity, coding, and instruction-following tests will decide how practical the compression really is.
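To make the format concrete, here is a minimal sketch of one way to produce and store ternary weights. The post does not describe PrismML's actual training recipe or storage layout, so this swaps in the absmean rounding described in the BitNet b1.58 paper and a common base-3 packing of five values per byte; the function names are hypothetical.

```python
def ternarize(weights):
    """Absmean ternary rounding: returns (values in {-1, 0, +1}, scale)."""
    # Scale by the mean absolute weight; fall back to 1.0 if all are zero.
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    trits = [max(-1, min(1, round(w / scale))) for w in weights]
    return trits, scale

def pack_trits(trits):
    """Pack five ternary values per byte (3**5 = 243 fits in one byte)."""
    out = bytearray()
    for i in range(0, len(trits), 5):
        byte = 0
        for t in reversed(trits[i:i + 5]):
            byte = byte * 3 + (t + 1)    # map -1, 0, +1 to digits 0, 1, 2
        out.append(byte)
    return bytes(out)

def unpack_trits(data, count):
    """Inverse of pack_trits; count drops padding from the last byte."""
    trits = []
    for byte in data:
        for _ in range(5):
            trits.append(byte % 3 - 1)   # least significant digit first
            byte //= 3
    return trits[:count]

weights = [0.9, -0.05, -1.2, 0.4, 0.0, -0.7, 0.02]
trits, scale = ternarize(weights)
packed = pack_trits(trits)
restored = [t * scale for t in unpack_trits(packed, len(trits))]
```

Five values per byte costs 1.6 bits per weight, close to the log2(3) floor. A real kernel would typically also store one scale per group of weights and might choose a layout that is faster to decode at the cost of some space.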

PrismML describes itself as a research organization centered on AI efficiency, so this post fits its existing push to make local and low-memory inference a realistic option. The next thing to watch is replication. If the benchmark image and model cards hold up in independent tests, the 1.58-bit family could matter more for browser demos, phones, and private local agents. Even if they do not, the release is a useful stress test of how far reasoning quality survives under extreme quantization.
