Cohere gives LocalLLaMA first hands-on access to an unreleased coding model
Original: Cohere's unreleased coding model (early access for localllama)
Cohere’s Nick Frosst posted an early-access invitation in r/LocalLLaMA for an unreleased coding model. The post describes the model as 30B parameters with 3B active parameters, available for now through CohereLabs/BLS-Mini-Code-1.0 on Hugging Face. More platform support is expected around the formal release.
The notable part is the release path. Instead of leading with a polished benchmark page and then waiting for community testing, Cohere put the weights in front of LocalLLaMA before the model was fully launched. Frosst said the team wanted users to test it against what they are actually trying to do, and that feedback from this release could shape how Cohere continues developing the line.
The positioning is local-first. A 30B total / 3B active setup suggests a model meant to feel larger than a small dense model while keeping runtime costs manageable on some local machines. The post claims internal token-output tests are in line with similar models in its size class, but it also treats the model as unfinished. That makes community feedback more useful than a single leaderboard result.
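The trade-off behind the 30B total / 3B active figures can be sketched with back-of-envelope arithmetic. The snippet below is a rough illustration, not anything from Cohere's post: it assumes that weight memory scales with total parameters and per-token compute with active parameters, and the bit widths are hypothetical quantization levels chosen for illustration. It ignores KV cache, activations, and runtime overhead.

```python
# Rough sizing sketch for a sparse model with 30B total / 3B active parameters.
# Assumptions (not from the source): weights dominate memory, and the listed
# bit widths are illustrative quantization levels, not published formats.

TOTAL_PARAMS = 30e9   # total parameters, as stated in the post
ACTIVE_PARAMS = 3e9   # parameters active per token, as stated in the post


def weight_memory_gib(total_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    bytes_needed = total_params * bits_per_weight / 8
    return bytes_needed / 2**30


def active_fraction(total_params: float, active_params: float) -> float:
    """Share of the parameters that take part in computing each token."""
    return active_params / total_params


for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    gib = weight_memory_gib(TOTAL_PARAMS, bits)
    print(f"{label}: ~{gib:.1f} GiB of weights")

share = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
print(f"active share per token: {share:.0%}")
```

Under these assumptions, all 30B parameters still have to be stored somewhere, while only about a tenth of them are used for any given token. That is why memory capacity, rather than raw compute, tends to be the first constraint local users check.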
LocalLLaMA is a hard audience for this sort of experiment. Users will quickly test quantization, VRAM behavior, llama.cpp support, coding tasks, and real throughput, often with less patience for launch messaging than a general developer audience. For Cohere, that is also the point. If the model works well in that environment, the feedback will be unusually concrete; if it does not, the failure modes will show up early. Either way, this looks like a model rollout with the community inside the loop rather than waiting at the end of it.