r/MachineLearning Follows Dante-2B, a Bilingual Italian/English LLM Trained From Scratch on 2×H200

An r/MachineLearning post from researcher and entrepreneur angeletti89 has started drawing attention to Dante-2B, a 2.1B-parameter Italian/English model being trained from scratch on 2× H200 GPUs. The author says the project is not a fine-tune of Llama or Mistral but a fresh dense decoder-only transformer with 28 layers, d_model=2560, GQA, SwiGLU, RMSNorm, RoPE, and a custom 64K BPE tokenizer tuned specifically for Italian, English, and code.
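
As a rough sanity check on the 2.1B figure, the parameter count of a dense decoder with this shape can be estimated from its dimensions. The post gives only the layer count, d_model, and the approximate vocabulary size; the head counts, the SwiGLU hidden size, the exact vocabulary size, and the tied embeddings in the sketch below are assumptions chosen for illustration, not values from the author.

```python
from dataclasses import dataclass


@dataclass
class Config:
    # Stated in the post
    n_layers: int = 28
    d_model: int = 2560
    # ASSUMPTIONS: none of the values below are given in the post
    vocab_size: int = 65_536      # "64K" read as 2**16
    n_heads: int = 20             # gives head_dim = 128
    n_kv_heads: int = 4           # GQA: fewer key/value heads than query heads
    d_ff: int = 6_912             # about 8/3 * d_model, rounded up to a multiple of 256
    tie_embeddings: bool = True   # input and output embeddings share weights


def count_params(cfg: Config) -> int:
    head_dim = cfg.d_model // cfg.n_heads
    kv_dim = cfg.n_kv_heads * head_dim

    # Attention: full-size Q and output projections, reduced K and V (GQA)
    attn = 2 * cfg.d_model * cfg.d_model + 2 * cfg.d_model * kv_dim
    # SwiGLU MLP: gate, up, and down projections
    mlp = 3 * cfg.d_model * cfg.d_ff
    # Two RMSNorm weight vectors per layer; RoPE adds no parameters
    norms = 2 * cfg.d_model

    embed = cfg.vocab_size * cfg.d_model
    if not cfg.tie_embeddings:
        embed *= 2

    final_norm = cfg.d_model
    return cfg.n_layers * (attn + mlp + norms) + embed + final_norm


if __name__ == "__main__":
    print(f"{count_params(Config()) / 1e9:.2f}B parameters")
```

Under these assumed values the estimate lands close to 2.1B, which suggests the stated depth and width are consistent with the headline size. Other combinations of head count and MLP width would fit as well.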

The tokenizer is the center of the argument. The post points out that English-centric tokenizers often split Italian contractions like l'intelligenza into more pieces than necessary, which wastes context window and makes Italian morphology harder to model. Dante-2B's tokenizer was trained on a character-balanced mixture of roughly 42% Italian, 36% English, and 22% code, with custom pre-tokenization rules that keep apostrophe contractions intact and treat accented characters as atomic units.
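
The post does not publish the actual pre-tokenization rules, so the following is only a minimal sketch of the idea using Python's standard library. The regular expression and the `pretokenize` helper are hypothetical. The sketch normalizes text to NFC so that an accented letter is a single code point, then keeps letter runs joined by an internal apostrophe as one chunk before any BPE merges would apply.

```python
import re
import unicodedata

# Hypothetical pattern, not the author's rules:
#   1. letter runs, optionally joined by an internal apostrophe (l'intelligenza)
#   2. digit runs
#   3. any other single non-space character (punctuation, symbols)
_PRETOKEN_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*|\d+|\S")


def pretokenize(text: str) -> list[str]:
    # NFC composes "e" + combining accent into one code point, so an
    # accented letter is treated as a single letter by the pattern above.
    text = unicodedata.normalize("NFC", text)
    return _PRETOKEN_RE.findall(text)


if __name__ == "__main__":
    print(pretokenize("L'intelligenza artificiale è qui."))
```

A BPE trainer that sees "L'intelligenza" as one chunk is free to learn merges across the apostrophe, whereas a pre-tokenizer that always splits on it cannot. Whitespace handling is left out of the sketch for brevity.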

A small-model recipe aimed at language efficiency

The training setup is unusually detailed for a community post. The author describes a roughly 300B token corpus built from FineWeb-2 Italian, FineWeb-Edu, 171K Italian public-domain books, legal and parliamentary text, bilingual Wikipedia, and StarCoderData. Phase 1 has already finished: 100B tokens at sequence length 2048 using DeepSpeed ZeRO-2, torch.compile, and FP8 via torchao. According to the post, that run took about 16 days, avoided NaNs and OOMs, and sustained about 28% MFU. Phase 2 is now extending context to 4096 with another 20B tokens.
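
The Phase 1 numbers can be related to each other with a back-of-the-envelope calculation. The sketch below uses the common approximation of about 6 FLOPs per parameter per training token, which ignores attention-score FLOPs, and treats the reported 16 days as pure training time; both are simplifying assumptions, and the function names are illustrative. The post does not say which hardware peak its MFU figure is measured against, so the script reports the per-GPU peak that a 28% MFU would imply rather than assuming one.

```python
SECONDS_PER_DAY = 86_400

# Figures reported in the post
TOKENS = 100e9        # Phase 1 tokens
DAYS = 16             # approximate wall-clock time
N_GPUS = 2            # 2x H200
N_PARAMS = 2.1e9      # model size
REPORTED_MFU = 0.28   # approximate sustained MFU


def tokens_per_second(tokens: float, days: float) -> float:
    return tokens / (days * SECONDS_PER_DAY)


def model_flops_per_gpu(tok_per_s: float, n_params: float, n_gpus: int) -> float:
    # ASSUMPTION: ~6 FLOPs per parameter per token (forward + backward),
    # without the extra attention FLOPs that grow with sequence length.
    return 6 * n_params * tok_per_s / n_gpus


if __name__ == "__main__":
    tps = tokens_per_second(TOKENS, DAYS)
    per_gpu = model_flops_per_gpu(tps, N_PARAMS, N_GPUS)
    implied_peak = per_gpu / REPORTED_MFU

    print(f"throughput:            {tps:,.0f} tokens/s")
    print(f"model FLOP/s per GPU:  {per_gpu / 1e12:,.0f} TFLOP/s")
    print(f"peak implied by MFU:   {implied_peak / 1e12:,.0f} TFLOP/s per GPU")
```

With these inputs the run works out to roughly 72K tokens per second across both GPUs. The implied peak is only as good as the 6N approximation and the rounded figures in the post, so it should be read as an order-of-magnitude consistency check.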

The bigger reason the thread matters is strategic rather than purely benchmark-driven. Most open multilingual models still treat languages like Italian as secondary to English. Dante-2B is making the opposite bet: start with tokenizer efficiency and corpus composition, then scale a smaller model cleanly. The author says weights, tokenizer, config, and the pretraining pipeline will all be released after Phase 2, with an SFT phase planned afterward. Even if Dante-2B remains modest compared with frontier models, the project is a concrete reminder that language-specific quality still depends as much on data and tokenization choices as on raw parameter count.
