Dante-2B pitches an Italian-first open model instead of an English-first fine-tune
Original: [P] Dante-2B: I'm training a 2.1B bilingual fully open Italian/English LLM from scratch on 2×H200. Phase 1 done - here's what I've built.
A small open model project with an explicit Italian-first thesis
In a 2026-04-05 post on r/MachineLearning, the creator of Dante-2B argued that many open models still treat Italian as an afterthought and therefore lose efficiency and quality before fine-tuning even begins. When reviewed, the post had a score of 54 and 16 comments. The thesis was not that another generic open model is needed. It was that an Italian/English bilingual model should be designed from the tokenizer upward, rather than inheriting an English-first setup and hoping instruction tuning fixes the gap later.
According to the post, Dante-2B is a 2.1B-parameter decoder-only dense transformer trained from random initialization. The architecture uses LLaMA-style GQA, SwiGLU FFN, RMSNorm, and RoPE, with d_model=2560, 28 layers, d_head=128, and a 20/4 query-to-KV head split. The most interesting piece is the tokenizer. The author said they built a custom 64K BPE tokenizer for Italian, English, and code so that Italian apostrophe contractions and accented characters are handled more naturally. The example given was l'intelligenza, which an English-centric tokenizer may split into several pieces, wasting context and weakening morphology.
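The tokenizer argument can be made measurable with a "fertility" metric: the average number of tokens produced per whitespace-delimited word. The sketch below is illustrative only. Both tokenizer functions are hypothetical toys written for this example, not Dante-2B's tokenizer or any real library's, and the sample sentence is invented. The first toy loosely mimics an English-centric vocabulary by splitting at apostrophes and isolating accented letters; the second keeps elided articles and accented letters intact.

```python
import re
import unicodedata
from typing import Callable, List


def fertility(tokenize: Callable[[str], List[str]], text: str) -> float:
    """Return the average number of tokens per whitespace-delimited word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


def toy_english_centric(text: str) -> List[str]:
    # HYPOTHETICAL toy: runs of ASCII letters form one token; every other
    # non-space character (apostrophe, accented letter) is its own token.
    return re.findall(r"[A-Za-z]+|[^\sA-Za-z]", text)


def toy_italian_aware(text: str) -> List[str]:
    # HYPOTHETICAL toy: a letter run followed by an apostrophe ("l'") stays
    # together, and accented letters count as ordinary letters.
    return re.findall(r"[^\W\d_]+'|[^\W\d_]+|\d+|[^\s\w]|_", text)


# Invented sample; NFC normalization keeps accented letters as single code points.
sample = unicodedata.normalize("NFC", "l'intelligenza è la città")

print(toy_english_centric(sample))  # ['l', "'", 'intelligenza', 'è', 'la', 'citt', 'à']
print(toy_italian_aware(sample))    # ["l'", 'intelligenza', 'è', 'la', 'città']
print(fertility(toy_english_centric, sample))  # 1.75
print(fertility(toy_italian_aware, sample))    # 1.25
```

A real comparison would pass actual tokenizers as the `tokenize` callable and use a large held-out Italian corpus; the toy numbers here say nothing about how any specific production tokenizer behaves.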
- The reported corpus size was roughly 300B tokens, drawing from FineWeb-2 IT, FineWeb-Edu, Italian public-domain literature, legal and parliamentary text, Wikipedia, and StarCoderData.
- Phase 1 covered 100B tokens at seq_len 2048 using DeepSpeed ZeRO-2, torch.compile, and torchao FP8 on 2× H200 GPUs, with the author claiming about 16 days of runtime and roughly 28% MFU.
- Phase 2 is described as a 20B-token extension toward 4096 context length, after which the creator plans a HuggingFace release, tokenizer release, and later SFT work.
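The Phase 1 figures above can be sanity-checked with the common rule of thumb that training a dense transformer costs about 6 FLOPs per parameter per token. The sketch below applies that rule to the post's reported numbers (2.1B parameters, 100B tokens, about 16 days, 2 GPUs). It is a rough estimate, not the author's calculation: the 6ND approximation ignores attention FLOPs, and the post does not state which peak throughput its 28% MFU was measured against, so the code only derives the peak that figure would imply.

```python
def achieved_flops_per_gpu(params: float, tokens: float, days: float, gpus: int) -> float:
    """Estimate sustained training FLOP/s per GPU using the 6 * N * D rule."""
    total_flops = 6.0 * params * tokens   # approximate forward + backward cost
    seconds = days * 24 * 3600
    return total_flops / seconds / gpus


def tokens_per_second(tokens: float, days: float) -> float:
    """Aggregate training throughput across all GPUs."""
    return tokens / (days * 24 * 3600)


def mfu(achieved: float, peak: float) -> float:
    """Model FLOPs utilization: achieved FLOP/s over hardware peak FLOP/s."""
    return achieved / peak


def implied_peak(achieved: float, reported_mfu: float) -> float:
    """Peak FLOP/s per GPU that would make reported_mfu consistent."""
    return achieved / reported_mfu


# Figures as reported in the post (self-reported, approximate).
achieved = achieved_flops_per_gpu(params=2.1e9, tokens=100e9, days=16, gpus=2)
throughput = tokens_per_second(tokens=100e9, days=16)
peak = implied_peak(achieved, reported_mfu=0.28)

print(f"achieved per GPU: {achieved / 1e12:.0f} TFLOP/s")
print(f"aggregate throughput: {throughput:,.0f} tokens/s")
print(f"peak implied by 28% MFU: {peak / 1e12:.0f} TFLOP/s")
```

Under these assumptions the estimate comes to roughly 456 TFLOP/s per GPU, which at 28% MFU would imply a reference peak near 1.6 PFLOP/s per GPU. This is only a consistency check on self-reported numbers; comparing it against a vendor datasheet requires knowing which precision the author used as the baseline.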
Why does this matter? Because the post puts tokenizer design back at the center of multilingual model quality. A lot of open-model discussion still focuses on parameter count, benchmark scores, and downstream alignment, but the Dante-2B writeup argues that the earliest design choices can already bias a model against languages that are not well represented in default English-centric vocabularies. Several commenters reacted along those lines. One said tokenizer work is exactly where multilingual systems often fail quietly. Another asked how clean the Italian corpus is from a licensing perspective, which highlights that local-language quality and release-ready provenance are equally important if the model is meant to become a serious open artifact.
This is still a self-reported progress update, not an independently validated release. There are no public weights or third-party benchmarks yet, and the author explicitly did not claim frontier-level reasoning. But the post does offer a concrete blueprint for a different class of open model project: smaller scale, language-specific, tokenizer-aware, and willing to publish the full pipeline instead of only a checkpoint.
Source link: Reddit thread.