r/MachineLearning Follows Dante-2B, a Bilingual Italian/English LLM Trained From Scratch on 2×H200
Original: [P] Dante-2B: I'm training a 2.1B bilingual fully open Italian/English LLM from scratch on 2×H200. Phase 1 done — here's what I've built.
An r/MachineLearning post from researcher and entrepreneur angeletti89 has started drawing attention to Dante-2B, a 2.1B-parameter Italian/English model being trained from scratch on 2× H200 GPUs. The author says the project is not a fine-tune of Llama or Mistral but a fresh dense decoder-only transformer with 28 layers, d_model=2560, GQA, SwiGLU, RMSNorm, RoPE, and a custom 64K BPE tokenizer tuned specifically for Italian, English, and code.
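To see how the listed architecture choices add up to a parameter count, here is a rough estimator for a GQA + SwiGLU + RMSNorm decoder. It is a sketch, not the author's code: the post gives only the layer count, d_model, and a "64K" vocabulary, so the feed-forward width, head counts, exact vocabulary size, and weight tying below are all hypothetical placeholders.

```python
def estimate_params(n_layers, d_model, vocab_size, d_ff,
                    n_heads, n_kv_heads, tie_embeddings=True):
    """Approximate parameter count for a bias-free decoder-only transformer."""
    head_dim = d_model // n_heads
    kv_dim = n_kv_heads * head_dim           # GQA: K/V use fewer heads than Q
    attn = 2 * d_model * d_model             # Q and output projections
    attn += 2 * d_model * kv_dim             # K and V projections
    mlp = 3 * d_model * d_ff                 # SwiGLU: gate, up, and down matrices
    norms = 2 * d_model                      # two RMSNorm weight vectors per layer
    embed = vocab_size * d_model
    total = n_layers * (attn + mlp + norms) + embed + d_model  # + final norm
    if not tie_embeddings:
        total += embed                       # separate output projection
    return total


# From the post: 28 layers, d_model=2560, "64K" vocabulary.
# ASSUMED (not stated in the post): vocab_size=65_536, d_ff=6912,
# n_heads=20, n_kv_heads=4, tied embeddings.
total = estimate_params(n_layers=28, d_model=2560, vocab_size=65_536,
                        d_ff=6912, n_heads=20, n_kv_heads=4)
print(f"{total / 1e9:.2f}B parameters under these assumptions")
```

Different placeholder values shift the total by a few hundred million parameters, so the result should be read as a plausibility check on the stated size rather than a reconstruction of the actual configuration.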
The tokenizer is the center of the argument. The post points out that English-centric tokenizers often split Italian contractions like l'intelligenza inefficiently, which wastes context-window space and hurts the model's handling of morphology. Dante-2B's tokenizer was trained on a character-balanced mixture of roughly 42% Italian, 36% English, and 22% code, with custom pre-tokenization rules that keep apostrophe contractions intact and treat accented characters as atomic units.
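The post does not publish its pre-tokenization rules, so the following is only one possible reading of "keep apostrophe contractions intact and treat accented characters as atomic units". The pattern, the function name pre_tokenize, and the choice of NFC normalization are assumptions made for illustration.

```python
import re
import unicodedata

# Hypothetical pattern, not taken from the Dante-2B pipeline.
# [^\W\d_] matches any Unicode letter in Python's re module, so a word may
# contain internal apostrophes (straight or typographic) without being split.
WORD = r"[^\W\d_]+(?:['\u2019][^\W\d_]+)*"
PRETOKEN_RE = re.compile(rf"{WORD}|\d+|\S")


def pre_tokenize(text: str) -> list[str]:
    """Split text into pre-tokens before BPE merges are learned or applied."""
    # NFC composes "e" + combining accent into a single code point such as
    # "è", so an accented letter is never separated from its base letter.
    text = unicodedata.normalize("NFC", text)
    # Whitespace matches none of the alternatives and is skipped by finditer.
    return [m.group(0) for m in PRETOKEN_RE.finditer(text)]


print(pre_tokenize("l'intelligenza è qui."))
```

Under this reading the whole contraction survives as one pre-token, leaving the BPE merges free to learn pieces such as l' on their own. A real tokenizer would also need to decide how to represent whitespace, which this sketch simply drops.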
A small-model recipe aimed at language efficiency
The training setup is unusually detailed for a community post. The author describes a roughly 300B-token corpus built from FineWeb-2 Italian, FineWeb-Edu, 171K Italian public-domain books, legal and parliamentary text, bilingual Wikipedia, and StarCoderData. Phase 1 has already finished: 100B tokens at sequence length 2048 using DeepSpeed ZeRO-2, torch.compile, and FP8 via torchao. According to the post, that run took about 16 days, hit no NaNs or out-of-memory errors, and sustained about 28% model FLOPs utilization (MFU). Phase 2 is now extending context to 4096 with another 20B tokens.
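The Phase 1 figures imply a sustained throughput that is easy to back out. The sketch below does that arithmetic and shows the common 6 × parameters × tokens/s approximation for MFU. The post does not say which formula or which peak-FLOPS figure sits behind its 28% number, so the approximation and the peak_flops_per_gpu argument are assumptions, and this calculation will not necessarily reproduce 28%.

```python
SECONDS_PER_DAY = 86_400


def tokens_per_second(total_tokens: float, days: float) -> float:
    """Average training throughput over the whole run."""
    return total_tokens / (days * SECONDS_PER_DAY)


def approx_mfu(n_params: float, tok_per_s: float,
               n_gpus: int, peak_flops_per_gpu: float) -> float:
    """Approximate MFU using ~6 FLOPs per parameter per token.

    This covers forward plus backward matmuls and ignores attention FLOPs,
    so it slightly understates the work actually done.
    """
    achieved_flops = 6 * n_params * tok_per_s
    return achieved_flops / (n_gpus * peak_flops_per_gpu)


# From the post: 100B tokens in about 16 days on 2 GPUs, 2.1B parameters.
tps = tokens_per_second(100e9, 16)
print(f"{tps:,.0f} tokens/s overall, {tps / 2:,.0f} per GPU")

# PLACEHOLDER peak: substitute the vendor's dense peak for the precision
# you want to compare against (BF16 and FP8 peaks differ by about 2x).
PEAK_FLOPS_PLACEHOLDER = 1.0e15
print(f"MFU vs placeholder peak: "
      f"{approx_mfu(2.1e9, tps, 2, PEAK_FLOPS_PLACEHOLDER):.1%}")
```

The throughput works out to roughly 72K tokens per second across both GPUs. The MFU line is only as meaningful as the peak you plug in, which is why utilization figures quoted for FP8 runs are hard to compare without knowing the reference precision.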
The bigger reason the thread matters is strategic rather than purely benchmark-driven. Most open multilingual models still treat languages like Italian as secondary to English. Dante-2B is making the opposite bet: start with tokenizer efficiency and corpus composition, then scale a smaller model cleanly. The author says weights, tokenizer, config, and the pretraining pipeline will all be released after Phase 2, with an SFT phase planned afterward. Even if Dante-2B remains modest compared with frontier models, the project is a concrete reminder that language-specific quality still depends as much on data and tokenization choices as on raw parameter count.