Dante-2B pitches an Italian-first open model instead of an English-first fine-tune

Original: [P] Dante-2B: I'm training a 2.1B bilingual fully open Italian/English LLM from scratch on 2×H200. Phase 1 done — here's what I've built.

LLM · Apr 11, 2026 · By Insights AI (Reddit)

A small open model project with an explicit Italian-first thesis

In a 2026-04-05 post on r/MachineLearning, the creator of Dante-2B argued that many open models still treat Italian as an afterthought and therefore lose efficiency and quality before fine-tuning even begins. When reviewed, the post had a score of 54 and 16 comments. The thesis was not that another generic open model is needed. It was that an Italian/English bilingual model should be designed from the tokenizer upward, rather than inheriting an English-first setup and hoping instruction tuning fixes the gap later.

According to the post, Dante-2B is a 2.1B-parameter decoder-only dense transformer trained from random initialization. The architecture uses LLaMA-style GQA, SwiGLU FFNs, RMSNorm, and RoPE, with d_model=2560, 28 layers, d_head=128, and a 20/4 query-to-KV head split. The most interesting piece is the tokenizer. The author said they built a custom 64K BPE tokenizer for Italian, English, and code so that Italian apostrophe contractions and accented characters are handled as natural units. The example given was l'intelligenza, which an English-centric tokenizer may split into several pieces, wasting context budget and obscuring the word's morphological structure.
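The reported dimensions can be sanity-checked against the 2.1B headline figure. A minimal sketch, assuming a 65,536-entry vocabulary, tied embeddings, and a SwiGLU hidden size of 6912 (the post states neither d_ff nor the exact vocabulary size, so both are assumptions):

```python
# Rough parameter count for the reported Dante-2B shape.
# d_ff=6912 and vocab=65536 are assumptions; the post gives neither.
d_model, n_layers, d_head = 2560, 28, 128
n_q_heads, n_kv_heads = 20, 4               # GQA: 20 query heads, 4 shared KV heads
d_ff, vocab = 6912, 65536

attn = (d_model * n_q_heads * d_head        # Q projection
        + 2 * d_model * n_kv_heads * d_head # K and V projections
        + n_q_heads * d_head * d_model)     # output projection
ffn = 3 * d_model * d_ff                    # SwiGLU: gate, up, and down matrices
embed = vocab * d_model                     # tied input/output embeddings

total = n_layers * (attn + ffn) + embed
print(f"{total / 1e9:.2f}B parameters")     # lands close to the claimed 2.1B
```

Under these assumptions the count comes out at about 2.09B, which is consistent with the stated size; a slightly different d_ff or untied embeddings would shift it by a few percent either way.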

  • The reported corpus size was roughly 300B tokens, drawing from FineWeb-2 IT, FineWeb-Edu, Italian public-domain literature, legal and parliamentary text, Wikipedia, and StarCoderData.
  • Phase 1 covered 100B tokens at seq_len 2048 using DeepSpeed ZeRO-2, torch.compile, and torchao FP8 on 2× H200 GPUs, with the author claiming about 16 days of runtime and roughly 28% MFU.
  • Phase 2 is described as a 20B-token extension to a 4096 context length, after which the creator plans a Hugging Face weights release, a tokenizer release, and later SFT work.
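The Phase 1 claims imply a concrete throughput, which is worth working out. A back-of-envelope sketch, using the standard 6N FLOPs-per-token training estimate and an assumed FP8 dense peak of ~1979 TFLOPS per H200 (the post does not say how its 28% MFU was computed, so the peak and the FLOP accounting here are assumptions):

```python
# Implied throughput from the Phase 1 claims: 100B tokens, ~16 days, 2 GPUs.
tokens = 100e9
seconds = 16 * 86400
gpus = 2

tok_per_sec = tokens / seconds                    # aggregate across both GPUs
params = 2.1e9
flops_per_token = 6 * params                      # standard 6N training estimate

achieved = tok_per_sec * flops_per_token / gpus   # FLOP/s per GPU
peak = 1979e12                                    # assumed H200 FP8 dense peak
mfu = achieved / peak

print(f"{tok_per_sec:,.0f} tok/s aggregate, ~{mfu:.0%} MFU under these assumptions")
```

This lands in the low twenties rather than at the claimed 28%, but the gap is within the slack of the 6N approximation and the choice of peak; measured against the BF16 peak instead, the same throughput reads as roughly double the MFU, so the author's figure is plausible depending on accounting.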

Why does this matter? Because the post puts tokenizer design back at the center of multilingual model quality. A lot of open-model discussion still focuses on parameter count, benchmark scores, and downstream alignment, but the Dante-2B writeup argues that the earliest design choices can already bias a model against languages that are not well represented in default English-centric vocabularies. Several commenters reacted along those lines. One said tokenizer work is exactly where multilingual systems often fail quietly. Another asked how clean the Italian corpus is from a licensing perspective, which highlights that local-language quality and release-ready provenance are equally important if the model is meant to become a serious open artifact.
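The fragmentation argument can be made concrete at the byte level. Byte-level BPE starts from UTF-8 bytes, and every accented Italian character costs two bytes, so a merge table learned mostly on English text leaves Italian words starting from more, smaller pieces. A stdlib-only illustration of that starting disadvantage (this is not the author's tokenizer, which is unreleased):

```python
# Byte-level BPE operates on UTF-8 bytes before any merges apply.
# Accented characters begin at a 2-byte disadvantage, and an
# English-trained merge table rarely recovers them into whole words.
for word in ["intelligence", "l'intelligenza", "perché", "città"]:
    raw = word.encode("utf-8")
    print(f"{word!r}: {len(word)} chars -> {len(raw)} initial byte tokens")
```

A tokenizer trained on Italian text can learn merges that absorb those extra bytes (and the apostrophe contraction) into single vocabulary entries, which is exactly the efficiency the post is after.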

This is still a self-reported progress update, not an independently validated release. There are no public weights or third-party benchmarks yet, and the author explicitly did not claim frontier-level reasoning. But the post does offer a concrete blueprint for a different class of open model project: smaller scale, language-specific, tokenizer-aware, and willing to publish the full pipeline instead of only a checkpoint.

Source link: Reddit thread.


© 2026 Insights. All rights reserved.