An r/MachineLearning post from researcher and entrepreneur angeletti89 has started drawing attention to Dante-2B, a 2.1B-parameter Italian/English model being trained from scratch on 2× H200 GPUs. The author says the project is not a fine-tune of Llama or Mistral but a new dense decoder-only transformer with 28 layers, d_model=2560, GQA, SwiGLU, RMSNorm, RoPE, and a custom 64K BPE tokenizer built specifically for Italian, English, and code.
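To see how those architecture choices add up to roughly 2B parameters, here is a minimal parameter-count sketch. Only the layer count, d_model, and the 64K vocabulary come from the post. The head count, number of KV heads, feed-forward width, and tied embeddings are hypothetical placeholders chosen for illustration, since the post as summarized here does not state them.

```python
from dataclasses import dataclass


@dataclass
class DenseConfig:
    # Values stated in the post
    n_layers: int = 28
    d_model: int = 2560
    vocab_size: int = 64_000  # "64K" in the post; the exact size is not given
    # ASSUMPTIONS (not in the post): placeholders for illustration only
    n_heads: int = 20
    n_kv_heads: int = 4
    d_ff: int = 6912
    tie_embeddings: bool = True


def count_params(cfg: DenseConfig) -> int:
    """Approximate parameter count for a bias-free dense decoder."""
    head_dim = cfg.d_model // cfg.n_heads
    kv_dim = cfg.n_kv_heads * head_dim
    # GQA attention: full-width Q and output projections, narrower K and V
    attn = 2 * cfg.d_model * cfg.d_model + 2 * cfg.d_model * kv_dim
    # SwiGLU feed-forward: gate, up, and down projections
    ffn = 3 * cfg.d_model * cfg.d_ff
    # Two RMSNorm weight vectors per layer (RoPE adds no parameters)
    norms = 2 * cfg.d_model
    per_layer = attn + ffn + norms
    embed = cfg.vocab_size * cfg.d_model
    unembed = 0 if cfg.tie_embeddings else embed
    final_norm = cfg.d_model
    return cfg.n_layers * per_layer + embed + unembed + final_norm


if __name__ == "__main__":
    total = count_params(DenseConfig())
    print(f"{total / 1e9:.2f}B parameters")
```

With these placeholder values the count lands close to the stated 2.1B, but many other combinations of head counts and feed-forward widths would too, so this should not be read as the project's actual configuration.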

The tokenizer is the center of the argument. The post points out that English-centric tokenizers often split Italian elided forms such as l'intelligenza into several tokens, which wastes context window and makes Italian morphology harder to model. Dante-2B's tokenizer was trained on a character-balanced mixture of roughly 42% Italian, 36% English, and 22% code, with custom pre-tokenization rules that keep apostrophe contractions intact and treat accented characters as atomic units.
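The effect of such a pre-tokenization rule can be sketched with the standard library alone. The regular expressions and function name below are illustrative assumptions, not Dante-2B's actual rules, which the post as summarized here does not spell out. NFC normalization is used as one plausible way to keep each accented letter as a single code point.

```python
import re
import unicodedata

# [^\W\d_] is the stdlib-re idiom for "any Unicode letter".
_LETTERS = r"[^\W\d_]+"

# ASSUMPTION: a hypothetical rule that keeps letter runs joined by an
# apostrophe (straight or typographic) in one pre-token.
KEEP_ELISION = re.compile(rf"{_LETTERS}(?:['\u2019]{_LETTERS})*|\d+|[^\w\s]+|_+")

# Baseline for comparison: every apostrophe becomes its own piece.
SPLIT_ELISION = re.compile(rf"{_LETTERS}|\d+|[^\w\s]+|_+")


def pre_tokenize(text: str, pattern: re.Pattern = KEEP_ELISION) -> list[str]:
    """Split text into pre-tokens; BPE merges would then run inside each one."""
    # NFC composes "e" + combining grave accent into the single code point "è"
    text = unicodedata.normalize("NFC", text)
    return pattern.findall(text)


if __name__ == "__main__":
    sample = "l'intelligenza è qui."
    print(pre_tokenize(sample))                 # elided form kept whole
    print(pre_tokenize(sample, SPLIT_ELISION))  # elided form split in three
```

Because BPE merges never cross pre-token boundaries, the baseline pattern can never learn a single token covering the article and apostrophe together with the following word, while the first pattern leaves that option open to the merge training.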

A small-model recipe aimed at language efficiency

The training setup is unusually detailed for a community post. The author describes a roughly 300B-token corpus built from FineWeb-2 Italian, FineWeb-Edu, 171K Italian public-domain books, legal and parliamentary text, bilingual Wikipedia, and StarCoderData. Phase 1 has already finished: 100B tokens at sequence length 2048 using DeepSpeed ZeRO-2, torch.compile, and FP8 via torchao. According to the post, that run took about 16 days, hit no NaNs or out-of-memory errors, and sustained about 28% model FLOPs utilization (MFU). Phase 2 is now extending the context length to 4096 with another 20B tokens.
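The Phase 1 figures can be sanity-checked with a back-of-the-envelope MFU estimate. The sketch below uses the common approximation of about 6 FLOPs per parameter per token and ignores attention FLOPs. The peak-throughput constants are assumptions taken from NVIDIA's public H200 SXM datasheet (approximate dense figures), and the post as summarized here does not say which peak its 28% was measured against.

```python
def estimate_mfu(n_params, tokens, days, n_gpus, peak_flops_per_gpu):
    """Return (mfu, tokens_per_second) using the 6 * N * D approximation."""
    seconds = days * 86_400
    tokens_per_second = tokens / seconds
    # ~6 FLOPs per parameter per token (forward + backward); attention ignored
    achieved_flops_per_second = 6 * n_params * tokens_per_second
    mfu = achieved_flops_per_second / (n_gpus * peak_flops_per_gpu)
    return mfu, tokens_per_second


# Figures reported in the post
N_PARAMS = 2.1e9
TOKENS = 100e9
DAYS = 16
N_GPUS = 2

# ASSUMPTION: approximate dense peak throughput per H200 SXM GPU
PEAK_FP8 = 1979e12
PEAK_BF16 = 989e12

if __name__ == "__main__":
    for name, peak in [("FP8", PEAK_FP8), ("BF16", PEAK_BF16)]:
        mfu, tps = estimate_mfu(N_PARAMS, TOKENS, DAYS, N_GPUS, peak)
        print(f"{name} peak: {tps:,.0f} tokens/s, MFU ~ {mfu:.1%}")
```

Against the FP8 peak this rough estimate comes out a little over 20%, in the same ballpark as the reported 28%. Some gap is expected, because the approximation leaves out attention FLOPs and the 16-day figure is itself rounded.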

The bigger reason the thread matters is strategic rather than purely benchmark-driven. Most open multilingual models still treat languages like Italian as secondary to English. Dante-2B is making the opposite bet: start with tokenizer efficiency and corpus composition, then scale a smaller model cleanly. The author says weights, tokenizer, config, and the pretraining pipeline will all be released after Phase 2, with an SFT phase planned afterward. Even if Dante-2B remains modest compared with frontier models, the project is a concrete reminder that language-specific quality still depends as much on data and tokenization choices as on raw parameter count.


r/MachineLearning Follows Dante-2B, a Bilingual Italian/English LLM Trained From Scratch on 2×H200