Covenant-72B puts permissionless distributed GPU training ahead of raw hype
Original: 1Covenant/Covenant-72B: Largest model so far to be trained on decentralized permissionless GPU nodes
Covenant-72B drew attention on r/LocalLLaMA because of how it was trained, not because the thread claimed a clean benchmark sweep. The post reached 92 points and 25 comments and framed the release as the largest model so far to be trained on decentralized permissionless GPU nodes. According to the Hugging Face model card, Covenant-72B is a 72B-parameter language model trained from scratch on 1.1 trillion English tokens. The same card describes it as the largest permissionless collaboratively trained language model released so far.
The engineering claim that matters most is the participation model. The model card says 20+ globally distributed participants coordinated through decentralized infrastructure on the Bittensor blockchain. The technical report abstract adds important context: earlier globally distributed training efforts were either smaller or relied on whitelisted participation. Covenant-72B instead targeted fully permissionless, dynamic participation over the public internet. In practical terms, that makes this project interesting as a systems milestone. It suggests that large-scale pre-training does not necessarily require a closed consortium with tightly controlled membership, provided the training stack is built to tolerate unstable connectivity and changing contributors.
The published architecture details are straightforward and worth separating from the broader narrative. Covenant-72B uses 80 layers, 64 attention heads with 8 KV heads, and a hidden size of 8192, and it is released under the Apache 2.0 license. The release is also explicitly a base model, with a separate instruction-tuned variant named Covenant-72B-Chat. That distinction mattered in the Reddit discussion. One commenter viewed the Apache 2.0 license and base-model positioning positively, which is consistent with how open-model users often evaluate reuse potential. Another commenter argued that the raw performance was not state of the art. Taken together, the thread reads less like consensus around a frontier model and more like a debate over what kind of progress should count most.
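The published shape numbers can be sanity-checked against the 72B headline with a back-of-the-envelope calculation. The sketch below uses only the figures from the model card (80 layers, hidden size 8192, 64 query heads, 8 KV heads); the FFN intermediate size and vocabulary size are not in the source notes, so the values used here are illustrative assumptions typical of 72B-class models.

```python
# Rough parameter count for the published Covenant-72B shape.
# HIDDEN, LAYERS, Q_HEADS, KV_HEADS come from the model card;
# FFN_INTERMEDIATE and VOCAB are assumptions for illustration.

HIDDEN = 8192
LAYERS = 80
Q_HEADS = 64
KV_HEADS = 8
HEAD_DIM = HIDDEN // Q_HEADS            # 128
FFN_INTERMEDIATE = 29_568               # assumed, typical for this size class
VOCAB = 152_064                         # assumed

attn = (
    HIDDEN * Q_HEADS * HEAD_DIM         # Q projection
    + 2 * HIDDEN * KV_HEADS * HEAD_DIM  # K and V projections (grouped-query)
    + Q_HEADS * HEAD_DIM * HIDDEN       # output projection
)
ffn = 3 * HIDDEN * FFN_INTERMEDIATE     # SwiGLU: gate, up, and down matrices
per_layer = attn + ffn
total = LAYERS * per_layer + 2 * VOCAB * HIDDEN  # plus embeddings and LM head

print(f"~{total / 1e9:.1f}B parameters")  # lands in the low 70s of billions
```

Under these assumptions the total comes out close to 72B, which is consistent with the card's headline figure; grouped-query attention (8 KV heads against 64 query heads) is what keeps the K/V projections small relative to Q.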
The training method is central to that debate. The Reddit post highlighted SparseLoCo, described as building on DiLoCo while cutting synchronization frequency. The write-up specifically called out local AdamW, top-k sparsification, and 2-bit quantization as tools for reducing communication cost. That matters because globally distributed training over the public internet is usually constrained more by communication than by arithmetic throughput. The SparseLoCo abstract says the method reaches 1-3% sparsity while outperforming full-precision DiLoCo in communication-constrained settings. That is a targeted claim about the training regime, not a blanket statement about overall model quality, and it helps explain why the project could support dynamic, non-whitelisted participation.
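The compression idea is easy to see in miniature. The sketch below is not the paper's exact algorithm, only an illustration of the principle the write-up names: between infrequent sync rounds each worker accumulates a pseudo-gradient locally, then ships only its top-k entries by magnitude, quantized to a few bits. All function names and the uniform 2-bit scheme here are illustrative choices.

```python
import numpy as np

def compress(delta: np.ndarray, sparsity: float = 0.02, bits: int = 2):
    """Keep the top `sparsity` fraction of entries by magnitude,
    then uniformly quantize the survivors to 2**bits levels.
    Returns only what would actually go over the wire."""
    k = max(1, int(sparsity * delta.size))
    flat = delta.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of top-k entries
    vals = flat[idx]
    lo, hi = vals.min(), vals.max()
    scale = (hi - lo) / (2**bits - 1) or 1.0       # avoid div-by-zero if flat
    codes = np.round((vals - lo) / scale).astype(np.uint8)
    return idx, codes, lo, scale

def decompress(idx, codes, lo, scale, shape):
    """Rebuild a dense (mostly zero) update from the sparse payload."""
    out = np.zeros(int(np.prod(shape)))
    out[idx] = lo + codes * scale
    return out.reshape(shape)

rng = np.random.default_rng(0)
delta = rng.normal(size=(1024, 1024))              # stand-in pseudo-gradient
idx, codes, lo, scale = compress(delta)
# Payload: ~21k indices plus ~21k 2-bit codes, versus ~1M float32 values.
```

At 2% density the wire payload is dominated by the indices rather than the values, which is why pairing sparsification with aggressive quantization pays off: once only 1-3% of entries survive, spending 32 bits per value would waste most of the savings.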
Benchmark discussion should stay narrow. The model card includes comparisons against INTELLECT-1, Psyche Consilience, LLM360 K2, and LLaMA-2-70B, but the source notes here do not justify declaring Covenant-72B a new performance leader. A more defensible takeaway is that the release packages several meaningful signals at once: a 72B base model trained from scratch, a permissionless collaborative setup involving 20+ participants, and a communication-efficient method intended for unstable wide-area coordination. For the open LLM community, that combination may matter as much as any single benchmark table because it points to a different way of organizing large-model development.
Related Articles
A high-signal LocalLLaMA thread on March 15, 2026 focused on a license swap for NVIDIA’s Nemotron model family. Comparing the current NVIDIA Nemotron Model License with the older Open Model License shows why the community reacted: the old guardrail-termination clause and Trustworthy AI cross-reference are no longer present, while the newer text leans on a simpler NOTICE-style attribution structure.
A high-engagement r/LocalLLaMA thread tracked the MiniMax-M2.5 release on Hugging Face. The model card emphasizes agentic coding/search benchmarks, runtime speedups, and aggressive cost positioning.
NVIDIA AI Developer introduced Nemotron 3 Super on March 11, 2026 as an open 120B-parameter hybrid MoE model with 12B active parameters and a native 1M-token context window. NVIDIA says the model targets agentic workloads with up to 5x higher throughput than the previous Nemotron Super model.