r/MachineLearning Spotlights Dante-2B, a Bilingual Italian/English LLM Trained From Scratch on 2×H200
Original: [P] Dante-2B: I'm training a 2.1B bilingual fully open Italian/English LLM from scratch on 2×H200. Phase 1 done — here's what I've built.
A r/MachineLearning post from researcher and entrepreneur angeletti89 has started drawing attention to Dante-2B, a 2.1B parameter Italian/English model being trained from scratch on 2× H200 GPUs. The author says the project is not a fine-tune of Llama or Mistral but a fresh dense decoder-only transformer with 28 layers, d_model=2560, GQA, SwiGLU, RMSNorm, RoPE, and a custom 64K BPE tokenizer tuned specifically for Italian, English, and code.
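The stated dimensions can be sanity-checked with a back-of-the-envelope parameter count. Below is a sketch that takes only the published figures (28 layers, d_model=2560, a 64K vocabulary) from the post; the head counts, FFN width, and tied embeddings are plausible assumptions, not details the author gave:

```python
# Back-of-the-envelope parameter count for a Dante-2B-style model.
# Only n_layers=28, d_model=2560, and vocab=64_000 come from the post;
# head counts, FFN width, and tied embeddings are illustrative assumptions.

def param_count(n_layers=28, d_model=2560, vocab=64_000,
                n_heads=20, n_kv_heads=4, ffn_dim=6912,
                tied_embeddings=True):
    head_dim = d_model // n_heads                      # 128
    # GQA attention: full-width Q and O projections, narrower shared K/V
    attn = (d_model * d_model                          # Q proj
            + 2 * d_model * n_kv_heads * head_dim      # K and V projs
            + d_model * d_model)                       # O proj
    # SwiGLU uses three matrices: gate, up, and down projections
    ffn = 3 * d_model * ffn_dim
    norms = 2 * d_model                                # two RMSNorms per layer
    per_layer = attn + ffn + norms
    emb = vocab * d_model * (1 if tied_embeddings else 2)
    return n_layers * per_layer + emb + d_model        # plus final norm

print(f"{param_count()/1e9:.2f}B")  # ≈ 2.09B with these assumptions
```

With these (assumed) values the count lands at roughly 2.09B, consistent with the stated 2.1B, which suggests the published dimensions hang together.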
The tokenizer is the center of the argument. The post points out that English-centric tokenizers often split Italian contractions like l'intelligenza inefficiently, wasting context window and harming morphology handling. Dante-2B's tokenizer was trained on a character-balanced mixture of roughly 42% Italian, 36% English, and 22% code, with custom pre-tokenization rules that keep apostrophe contractions intact and treat accented characters as atomic units.
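The contraction-preserving idea can be illustrated with plain regular expressions. The patterns below are hypothetical stand-ins, not the project's actual pre-tokenization rules (which have not been published); they just contrast an apostrophe-aware split with a naive English-centric one:

```python
import re

# Hypothetical contraction-aware pattern: a word optionally followed by
# an apostrophe and more letters, so "l'intelligenza" stays one pre-token.
# Python's \w is Unicode-aware, so accented characters match too.
CONTRACTION_AWARE = re.compile(r"\w+(?:'\w+)?|\s+|[^\w\s]+")

# Naive English-centric style: splits at every non-word character,
# breaking Italian contractions at the apostrophe.
NAIVE = re.compile(r"\w+|\s+|[^\w\s]+")

text = "l'intelligenza è qui"
print(CONTRACTION_AWARE.findall(text))  # ["l'intelligenza", " ", "è", " ", "qui"]
print(NAIVE.findall(text))              # ["l", "'", "intelligenza", ...]
```

The first pattern hands the BPE trainer whole contractions, so frequent forms like l'intelligenza can become single tokens instead of always costing three.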
A small-model recipe aimed at language efficiency
The training setup is unusually detailed for a community post. The author describes a roughly 300B token corpus built from FineWeb-2 Italian, FineWeb-Edu, 171K Italian public-domain books, legal and parliamentary text, bilingual Wikipedia, and StarCoderData. Phase 1 has already finished: 100B tokens at sequence length 2048 using DeepSpeed ZeRO-2, torch.compile, and FP8 via torchao. According to the post, that run took about 16 days, avoided NaNs and OOMs, and sustained about 28% MFU. Phase 2 is now extending context to 4096 with another 20B tokens.
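The reported throughput and MFU can be roughly checked from the post's own numbers. The sketch below uses the standard ~6N-FLOPs-per-token estimate plus an attention term; the peak-FLOPS figure for the H200 is an assumption on our part (vendors quote both dense and sparsity-accelerated peaks, and the post does not say which was used):

```python
# Sanity-check the reported throughput and MFU. Inputs from the post:
# 100B tokens, ~16 days, 2x H200, 2.1B params, sequence length 2048.
# The peak figure below is an assumed H200 FP8 tensor-core peak.

tokens = 100e9
seconds = 16 * 86_400
n_gpus = 2
n_params = 2.1e9
seq_len, n_layers, d_model = 2048, 28, 2560

tok_per_s = tokens / seconds                       # ~72k tokens/s
# ~6N FLOPs/token for the weights, plus ~12 * L * d * s
# for the attention score/value matmuls
flops_per_token = 6 * n_params + 12 * n_layers * d_model * seq_len
achieved = tok_per_s * flops_per_token / n_gpus    # per-GPU FLOP/s

peak_fp8 = 1979e12                                 # assumed peak, FLOP/s
print(f"{tok_per_s:,.0f} tok/s, {achieved/1e12:.0f} TFLOP/s per GPU, "
      f"MFU ≈ {achieved/peak_fp8:.0%}")
```

Under these assumptions the estimate lands in the mid-20s percent, the same ballpark as the reported ~28%; a different peak-FLOPS convention or FLOP accounting would shift the exact figure.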
The bigger reason the thread matters is strategic rather than purely benchmark-driven. Most open multilingual models still treat languages like Italian as secondary to English. Dante-2B is making the opposite bet: start with tokenizer efficiency and corpus composition, then scale a smaller model cleanly. The author says weights, tokenizer, config, and the pretraining pipeline will all be released after Phase 2, with an SFT phase planned afterward. Even if Dante-2B remains modest compared with frontier models, the project is a concrete reminder that language-specific quality still depends as much on data and tokenization choices as on raw parameter count.