Hacker News spotlights a 25-species mRNA language-model pipeline

Original: Training mRNA Language Models Across 25 Species for $165

Sciences Apr 5, 2026 By Insights AI (HN) 2 min read

Hacker News surfaced an unusually detailed bio/AI build log on April 5, 2026. The post, Training mRNA Language Models Across 25 Species for $165, had 138 points and 32 comments at crawl time, and it links to an OpenMed community article on Hugging Face published March 31, 2026.

The project is not a single model release. It is an end-to-end pipeline that chains ESMFold for structure prediction, ProteinMPNN for sequence design, and CodonRoBERTa for codon optimization. The OpenMed team frames the goal in practical terms: start from a therapeutic protein concept, predict a structure, design sequences that can fold into it, and end with synthesis-ready DNA optimized for expression in a target organism.
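The dataflow the article describes can be sketched as three chained stages. This is an illustrative sketch only: the function names and signatures below are stand-ins, not the OpenMed pipeline's actual API.

```python
# Illustrative dataflow of the described pipeline. Each function is a
# hypothetical stand-in for the named tool, not a real API call.

def predict_structure(protein_seq: str) -> str:
    """Stand-in for ESMFold: protein sequence -> predicted structure."""
    return f"STRUCTURE({protein_seq})"

def design_sequences(structure: str, n: int) -> list:
    """Stand-in for ProteinMPNN: structure -> n candidate sequences
    designed to fold into it."""
    return [f"CANDIDATE_{i}" for i in range(n)]

def optimize_codons(protein_seq: str, organism: str) -> str:
    """Stand-in for CodonRoBERTa: protein sequence -> codon-optimized,
    synthesis-ready DNA for the target organism."""
    return f"DNA({protein_seq}|{organism})"

# Concept -> structure -> candidate sequences -> optimized DNA.
structure = predict_structure("MKTAYIAKQR")
candidates = design_sequences(structure, n=3)
dna = [optimize_codons(seq, organism="e_coli") for seq in candidates]
print(len(dna))  # one synthesis-ready DNA string per candidate
```

The point of the chain is that each stage's output is the next stage's input, which is why the team can present it as one reproducible build rather than three separate model demos.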

The most interesting technical section is the codon-optimization study. OpenMed trained several transformer variants on 250,000 E. coli coding sequences and reports that CodonRoBERTa-large-v2 was the best overall model. The reported numbers are specific enough to matter: perplexity 4.10, Spearman CAI correlation 0.404, and a clear performance gap over ModernBERT, which the authors say underperformed badly on codon data despite newer attention mechanisms. The post argues that domain metrics mattered more than raw MLM loss, because a model can predict masked codons well without learning biologically meaningful preferences.
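The domain metric in question, the Codon Adaptation Index, is the geometric mean of per-codon "relative adaptiveness" weights, so it rewards sequences that use each amino acid's preferred codons in the target organism. A minimal sketch of the computation, using made-up weights for a few codons (real weights are derived from highly expressed E. coli genes):

```python
import math

# Hypothetical relative-adaptiveness weights (w values) for a few codons.
# In practice these come from a reference set of highly expressed genes
# in the target organism; the numbers here are illustrative only.
WEIGHTS = {
    "CTG": 1.00, "TTA": 0.06,  # leucine codons
    "AAA": 1.00, "AAG": 0.25,  # lysine codons
    "GAA": 1.00, "GAG": 0.44,  # glutamate codons
}

def cai(seq: str) -> float:
    """Codon Adaptation Index: geometric mean of per-codon weights."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    ws = [WEIGHTS[c] for c in codons if c in WEIGHTS]
    if not ws:
        return 0.0
    return math.exp(sum(math.log(w) for w in ws) / len(ws))

print(round(cai("CTGAAAGAA"), 3))  # all-preferred codons -> CAI of 1.0
print(cai("TTAAAGGAG") < cai("CTGAAAGAA"))  # rare codons score lower
```

A Spearman correlation of 0.404 between a model's codon choices and CAI means the model's ranking of sequences only partially tracks this organism-level preference signal, which is exactly why the authors treat it, rather than masked-LM loss, as the metric that matters.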

The multi-species scaling story is also notable. OpenMed says it trained four production models covering 25 species in 55 GPU-hours. In the same write-up, the team reports ESMFold runs on 30 protein chains with average PTM 0.79 and ProteinMPNN sequence recovery of 42% on scaffold 7K00. Those are not proof of therapeutic utility, but they are concrete engineering checkpoints that make the article more than a vague “AI for biology” pitch.
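Of those checkpoints, sequence recovery is the simplest to pin down: it is the fraction of positions at which the designed sequence reproduces the native residue for the scaffold. A minimal sketch with a toy pair of sequences:

```python
def sequence_recovery(designed: str, native: str) -> float:
    """Fraction of positions where the designed sequence matches the
    native sequence (the standard ProteinMPNN-style recovery metric)."""
    if len(designed) != len(native):
        raise ValueError("sequences must be the same length")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)

# Toy example: 3 of 7 residues match the native sequence.
print(round(sequence_recovery("MKTAYIA", "MKVLWAA"), 3))
```

A reported 42% recovery on scaffold 7K00 means the designed sequences agree with the native sequence at a little under half of the positions, a typical figure for structure-conditioned design rather than evidence of therapeutic function.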

The Hacker News discussion was skeptical in a useful way. One commenter questioned how much verifiable protein data exists for training and whether predicted outputs are biologically useful yet. Another commenter identifying as a structural biologist said this sort of system could be widely useful if it works. That split reaction is the right frame: the pipeline looks reproducible and inexpensive for open research, but wet-lab validation remains the real bottleneck.


Related Articles

Sciences Mar 6, 2026 1 min read

Google detailed new global conservation outcomes from SpeciesNet on March 6, 2026. The open-source model identifies nearly 2,500 animal categories from camera-trap imagery and is now being adapted by field teams across multiple regions.


Sciences Mar 17, 2026 2 min read

Google DeepMind said on X that it is expanding AlphaFold Database with millions of AI-predicted protein complex structures in collaboration with EMBL-EBI, NVIDIA, and Seoul National University. The release pushes AlphaFold beyond single-protein structure prediction toward a broader public resource for studying how proteins interact.


© 2026 Insights. All rights reserved.