r/MachineLearning Follows Dante-2B, a Bilingual Italian/English LLM Trained From Scratch on 2×H200

Original: [P] Dante-2B: I'm training a 2.1B bilingual fully open Italian/English LLM from scratch on 2×H200. Phase 1 done — here's what I've built.

LLM Apr 8, 2026 By Insights AI (Reddit) 2 min read

An r/MachineLearning post from researcher and entrepreneur angeletti89 has drawn attention to Dante-2B, a 2.1B-parameter Italian/English model being trained from scratch on 2× H200 GPUs. The author says the project is not a fine-tune of Llama or Mistral but a fresh dense decoder-only transformer with 28 layers, d_model=2560, grouped-query attention (GQA), SwiGLU, RMSNorm, RoPE, and a custom 64K BPE tokenizer tuned specifically for Italian, English, and code.
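The stated specs are consistent with a roughly 2.1B-parameter dense model. A back-of-the-envelope check is sketched below; only the layer count, d_model, and vocabulary size come from the post, while the head counts, SwiGLU hidden width, and tied embeddings are assumptions chosen for illustration.

```python
# Rough parameter count for the architecture described above.
# n_layers=28, d_model=2560, and the 64K vocab are from the post;
# n_heads, n_kv_heads, ffn_mult, and tied embeddings are ASSUMED.

def approx_params(n_layers=28, d_model=2560, vocab=64_000,
                  n_heads=20, n_kv_heads=4, ffn_mult=2.7):
    head_dim = d_model // n_heads                        # 128
    kv_dim = n_kv_heads * head_dim                       # GQA: fewer K/V heads
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim  # Wq, Wo + Wk, Wv
    ffn_hidden = int(ffn_mult * d_model)                 # ~6912 for SwiGLU
    ffn = 3 * d_model * ffn_hidden                       # gate, up, down projections
    norms = 2 * d_model                                  # two RMSNorms per block
    embed = vocab * d_model                              # tied input/output embedding
    return n_layers * (attn + ffn + norms) + embed

print(f"{approx_params() / 1e9:.2f}B parameters")  # lands near 2.1B
```

With these assumed widths the total comes out around 2.09B, so the headline figure is plausible without any exotic components.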

The tokenizer is the center of the argument. The post points out that English-centric tokenizers often split Italian contractions like l'intelligenza inefficiently, wasting context window and harming morphology handling. Dante-2B's tokenizer was trained on a character-balanced mixture of roughly 42% Italian, 36% English, and 22% code, with custom pre-tokenization rules that keep apostrophe contractions intact and treat accented characters as atomic units.
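The elision point is easy to demonstrate. Dante-2B's actual pre-tokenization rules are not published, so the regex below is purely illustrative: one branch keeps an apostrophe contraction as a single pre-token, and accented letters fall inside the word character class rather than being split off.

```python
import re

# Illustrative pre-tokenizer: keeps Italian elisions (l'intelligenza,
# un'idea) intact and treats accented letters as ordinary word characters.
# NOT the actual Dante-2B rules, which have not been released.
WORD = r"[A-Za-zÀ-ÖØ-öø-ÿ]+"
CONTRACTION_AWARE = re.compile(
    rf"{WORD}'{WORD}"   # apostrophe contraction kept whole
    rf"|{WORD}"         # plain word, accents included
    r"|\d+|\S"          # digits and punctuation
)

print(CONTRACTION_AWARE.findall("L'intelligenza artificiale è qui"))
# An English-centric splitter instead breaks on the apostrophe:
print(re.findall(r"[A-Za-z]+|\S", "L'intelligenza"))
```

The first pattern yields `L'intelligenza` as one unit; the naive ASCII pattern produces three pieces (`L`, `'`, `intelligenza`), which is exactly the fragmentation the post complains about.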

A small-model recipe aimed at language efficiency

The training setup is unusually detailed for a community post. The author describes a roughly 300B token corpus built from FineWeb-2 Italian, FineWeb-Edu, 171K Italian public-domain books, legal and parliamentary text, bilingual Wikipedia, and StarCoderData. Phase 1 has already finished: 100B tokens at sequence length 2048 using DeepSpeed ZeRO-2, torch.compile, and FP8 via torchao. According to the post, that run took about 16 days, avoided NaNs and OOMs, and sustained about 28% MFU. Phase 2 is now extending context to 4096 with another 20B tokens.
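The reported numbers roughly hang together under the standard 6·N·D FLOP approximation. The sketch below assumes a dense FP8 peak of about 1979 TFLOP/s per H200 (a spec-sheet figure, not from the post); 6·N·D also ignores attention FLOPs, so the true utilization should land somewhat above this estimate, plausibly near the quoted 28%.

```python
# Sanity check on Phase 1 throughput via the 6*N*D approximation.
# The FP8 peak per H200 is an ASSUMED spec-sheet value; 6*N*D excludes
# attention FLOPs, so this estimate is a lower bound on the real MFU.

params = 2.1e9        # model size (from the post)
tokens = 100e9        # Phase 1 token count (from the post)
days = 16             # reported wall-clock time
gpus = 2
peak_fp8 = 1.979e15   # assumed dense FP8 peak per H200, FLOP/s

total_flops = 6 * params * tokens         # ~1.26e21 training FLOPs
achieved = total_flops / (days * 86_400)  # sustained FLOP/s, both GPUs
mfu = achieved / (gpus * peak_fp8)
print(f"implied MFU ~ {mfu:.0%}")         # ~23%, in the ballpark of the claimed 28%
```

A lower-bound estimate of about 23% against a claimed 28% is the right shape of agreement for a two-GPU FP8 run at this scale.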

The bigger reason the thread matters is strategic rather than purely benchmark-driven. Most open multilingual models still treat languages like Italian as secondary to English. Dante-2B is making the opposite bet: start with tokenizer efficiency and corpus composition, then scale a smaller model cleanly. The author says weights, tokenizer, config, and the pretraining pipeline will all be released after Phase 2, with an SFT phase planned afterward. Even if Dante-2B remains modest compared with frontier models, the project is a concrete reminder that language-specific quality still depends as much on data and tokenization choices as on raw parameter count.




© 2026 Insights. All rights reserved.