Dante-2B pitches an Italian-first open model instead of an English-first fine-tune

Original: [P] Dante-2B: I'm training a 2.1B bilingual fully open Italian/English LLM from scratch on 2×H200. Phase 1 done — here's what I've built.

LLM · Apr 11, 2026 · By Insights AI (Reddit) · 2 min read

A small open model project with an explicit Italian-first thesis

In a 2026-04-05 post on r/MachineLearning, the creator of Dante-2B argued that many open models still treat Italian as an afterthought and therefore lose efficiency and quality before fine-tuning even begins. At the time of review, the post had a score of 54 and 16 comments. The thesis was not that the world needs another generic open model, but that an Italian/English bilingual model should be designed from the tokenizer upward, rather than inheriting an English-first setup and hoping instruction tuning closes the gap later.

According to the post, Dante-2B is a 2.1B-parameter decoder-only dense transformer trained from random initialization. The architecture uses LLaMA-style GQA, SwiGLU FFN, RMSNorm, and RoPE, with d_model=2560, 28 layers, d_head=128, and a 20/4 query-to-KV head split. The most interesting piece is the tokenizer. The author said they built a custom 64K BPE tokenizer for Italian, English, and code so that Italian apostrophe contractions and accented characters are handled more naturally. The example given was l'intelligenza, which an English-centric tokenizer may split into several pieces, wasting context and weakening the morphological signal.
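The fragmentation argument can be illustrated with a toy greedy longest-match segmenter and two made-up vocabularies. Both vocabularies below are hypothetical stand-ins, not Dante-2B's actual 64K BPE merges; the point is only that a vocabulary trained without Italian contractions has no merged piece for l'intelligenza and must spend several tokens on it, while an Italian-aware vocabulary can cover it in one.

```python
def segment(text, vocab):
    """Greedy longest-match segmentation over a token vocabulary.
    Falls back to single characters when no longer piece matches."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

# Hypothetical English-centric subword vocab: no Italian contraction merges.
english_vocab = {"l", "'", "int", "el", "ig", "enza"}
# Hypothetical Italian-aware vocab: the full contraction survives as one piece.
italian_vocab = {"l'", "intelligenza", "l'intelligenza"}

word = "l'intelligenza"
print(segment(word, english_vocab))  # several fragments
print(segment(word, italian_vocab))  # one token
```

Real BPE tokenizers segment byte sequences via learned merge ranks rather than longest match, but the context-budget effect is the same: every extra fragment per Italian word is context window and training signal spent on bookkeeping instead of content.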

  • The reported corpus size was roughly 300B tokens, drawing from FineWeb-2 IT, FineWeb-Edu, Italian public-domain literature, legal and parliamentary text, Wikipedia, and StarCoderData.
  • Phase 1 covered 100B tokens at seq_len 2048 using DeepSpeed ZeRO-2, torch.compile, and torchao FP8 on 2× H200 GPUs, with the author claiming about 16 days of runtime and roughly 28% MFU.
  • Phase 2 is described as a 20B-token extension toward 4096 context length, after which the creator plans a HuggingFace release, tokenizer release, and later SFT work.
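The self-reported throughput numbers can be roughly cross-checked with the standard C ≈ 6·N·D training-FLOPs approximation. The peak figure below is an assumption on my part (dense FP8 peak for an H200, around 2e15 FLOP/s); the post only claims "about 16 days" and "roughly 28% MFU", so this is a ballpark consistency check, not a verification.

```python
# Sanity-check the reported Phase 1 numbers with C ~= 6 * N * D.
N = 2.1e9            # parameters (from the post)
D = 100e9            # Phase 1 tokens (from the post)
gpus = 2             # 2x H200
seconds = 16 * 86400 # "about 16 days" of runtime

total_flops = 6 * N * D                            # ~1.26e21 FLOPs
achieved_per_gpu = total_flops / (seconds * gpus)  # FLOP/s per GPU

# Assumed dense FP8 peak per H200 (~1979 TFLOP/s); this is my assumption,
# and MFU conventions vary (e.g. whether attention FLOPs are counted).
peak_fp8 = 1.979e15

print(f"achieved: {achieved_per_gpu / 1e12:.0f} TFLOP/s per GPU")
print(f"implied MFU vs dense FP8 peak: {achieved_per_gpu / peak_fp8:.0%}")
```

Under these assumptions the implied utilization lands in the low-to-mid 20s of percent, the same ballpark as the claimed ~28%, so the reported runtime and MFU are at least internally plausible.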

Why does this matter? Because the post puts tokenizer design back at the center of multilingual model quality. A lot of open-model discussion still focuses on parameter count, benchmark scores, and downstream alignment, but the Dante-2B writeup argues that the earliest design choices can already bias a model against languages that are not well represented in default English-centric vocabularies. Several commenters reacted along those lines. One said tokenizer work is exactly where multilingual systems often fail quietly. Another asked how clean the Italian corpus is from a licensing perspective, which highlights that local-language quality and release-ready provenance are equally important if the model is meant to become a serious open artifact.

This is still a self-reported progress update, not an independently validated release. There are no public weights or third-party benchmarks yet, and the author explicitly did not claim frontier-level reasoning. But the post does offer a concrete blueprint for a different class of open model project: smaller scale, language-specific, tokenizer-aware, and willing to publish the full pipeline instead of only a checkpoint.

Source link: Reddit thread.




© 2026 Insights. All rights reserved.