Dante-2B pitches an Italian-first open model instead of an English-first fine-tune
Original: [P] Dante-2B: I'm training a 2.1B bilingual fully open Italian/English LLM from scratch on 2×H200. Phase 1 done — here's what I've built.
A small open model project with an explicit Italian-first thesis
In a 2026-04-05 post on r/MachineLearning, the creator of Dante-2B argued that many open models still treat Italian as an afterthought and therefore lose efficiency and quality before fine-tuning even begins. When reviewed, the post had a score of 54 and 16 comments. The thesis was not that another generic open model is needed. It was that an Italian/English bilingual model should be designed from the tokenizer upward, rather than inheriting an English-first setup and hoping instruction tuning fixes the gap later.
According to the post, Dante-2B is a 2.1B-parameter decoder-only dense transformer trained from random initialization. The architecture uses LLaMA-style GQA, SwiGLU FFN, RMSNorm, and RoPE, with d_model=2560, 28 layers, d_head=128, and a 20/4 query-to-KV head split. The most interesting piece is the tokenizer. The author said they built a custom 64K BPE tokenizer for Italian, English, and code so that Italian apostrophe contractions and accented characters are handled more naturally. The example given was l'intelligenza, which an English-centric tokenizer may split into several pieces, wasting context and weakening morphology.
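The reported architecture numbers are enough for a back-of-envelope parameter count. The post does not state the FFN width or whether embeddings are tied, so the sketch below assumes d_ff = 6912 and tied input/output embeddings purely for illustration; with those assumptions the total lands near the stated 2.1B.

```python
# Back-of-envelope parameter count from the architecture reported in the post.
# d_ff and embedding tying are NOT stated there; both are illustrative assumptions.

d_model = 2560
n_layers = 28
d_head = 128
n_q_heads, n_kv_heads = 20, 4   # the 20/4 query-to-KV head split (GQA)
vocab = 64_000
d_ff = 6912                      # assumption: not given in the post

# Attention: Q and output projections are d_model x d_model;
# K and V project down to n_kv_heads * d_head, which is the GQA saving.
attn = 2 * (d_model * n_q_heads * d_head) + 2 * (d_model * n_kv_heads * d_head)
# SwiGLU FFN has three matrices: gate, up, and down projections.
ffn = 3 * d_model * d_ff
per_layer = attn + ffn

embeddings = vocab * d_model     # assumption: tied input/output embeddings
total = n_layers * per_layer + embeddings
print(f"{total / 1e9:.2f}B parameters")
```

With these assumptions the count comes out at roughly 2.09B, consistent with the 2.1B figure; an untied LM head would add another ~164M.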
- The reported corpus size was roughly 300B tokens, drawing from FineWeb-2 IT, FineWeb-Edu, Italian public-domain literature, legal and parliamentary text, Wikipedia, and StarCoderData.
- Phase 1 covered 100B tokens at seq_len 2048 using DeepSpeed ZeRO-2, torch.compile, and torchao FP8 on 2× H200 GPUs, with the author claiming about 16 days of runtime and roughly 28% MFU.
- Phase 2 is described as a 20B-token extension toward a 4096 context length, after which the creator plans a Hugging Face release of the weights, a tokenizer release, and later SFT work.
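The Phase 1 numbers can be sanity-checked with the standard 6·N·D FLOPs approximation. The post does not say which peak-FLOPS basis its MFU uses, so the sketch below assumes the H200's dense FP8 peak of roughly 1.98 PFLOPS, which is an assumption on my part.

```python
# Rough MFU sanity check for Phase 1 using the 6*N*D training-FLOPs
# approximation. The peak-FLOPS basis is an assumption: the post does not
# say whether its ~28% MFU was computed against BF16 or FP8 peak.

n_params = 2.1e9
tokens = 100e9
train_flops = 6 * n_params * tokens          # ~1.26e21 FLOPs total

n_gpus = 2
seconds = 16 * 24 * 3600                     # ~16 days of runtime, as reported
peak_flops_fp8 = 1.979e15                    # assumed H200 dense FP8 peak

achieved = train_flops / (n_gpus * seconds)  # sustained FLOPS per GPU
mfu = achieved / peak_flops_fp8
print(f"~{mfu:.0%} MFU")
```

Under these assumptions the estimate comes out around 23%, in the same ballpark as the claimed 28%; the gap could reflect a different peak basis, runtime rounding, or a more exact FLOPs accounting than 6·N·D.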
Why does this matter? Because the post puts tokenizer design back at the center of multilingual model quality. A lot of open-model discussion still focuses on parameter count, benchmark scores, and downstream alignment, but the Dante-2B writeup argues that the earliest design choices can already bias a model against languages that are not well represented in default English-centric vocabularies. Several commenters reacted along those lines. One said tokenizer work is exactly where multilingual systems often fail quietly. Another asked how clean the Italian corpus is from a licensing perspective, which highlights that local-language quality and release-ready provenance are equally important if the model is meant to become a serious open artifact.
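The fragmentation argument can be made concrete with a toy greedy longest-match segmenter. Both vocabularies below are invented for illustration — real BPE merge tables differ — but the pattern is the same in spirit: a vocabulary without Italian contractions shatters l'intelligenza into many pieces, while an Italian-aware one keeps it whole.

```python
# Toy illustration of the apostrophe problem: greedy longest-match
# segmentation of "l'intelligenza" under two invented vocabularies.
# Real BPE tokenizers work on learned merges, not longest match, but the
# fragmentation effect being illustrated is the same.

def segment(word: str, vocab: set[str]) -> list[str]:
    """Greedily match the longest vocabulary entry at each position."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # fall back to a single character
            i += 1
    return pieces

english_centric = {"l", "'", "int", "ell", "ig", "en", "enza"}
italian_aware = {"l'", "intelligenza", "l'intelligenza"}

print(segment("l'intelligenza", english_centric))
print(segment("l'intelligenza", italian_aware))
```

The English-centric vocabulary yields six pieces against one for the Italian-aware vocabulary — each extra piece spends context budget and splits morphological signal across tokens, which is the inefficiency the post argues compounds from pretraining onward.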
This is still a self-reported progress update, not an independently validated release. There are no public weights or third-party benchmarks yet, and the author explicitly did not claim frontier-level reasoning. But the post does offer a concrete blueprint for a different class of open model project: smaller scale, language-specific, tokenizer-aware, and willing to publish the full pipeline instead of only a checkpoint.
Source link: Reddit thread.