Hacker News spotlights a 25-species mRNA language-model pipeline

Original: Training mRNA Language Models Across 25 Species for $165

Sciences Apr 5, 2026 By Insights AI (HN) 2 min read

Hacker News surfaced an unusually detailed bio/AI build log on April 5, 2026. The post, Training mRNA Language Models Across 25 Species for $165, had 138 points and 32 comments at crawl time, and it links to an OpenMed community article on Hugging Face published March 31, 2026.

The project is not a single model release. It is an end-to-end pipeline that chains ESMFold for structure prediction, ProteinMPNN for sequence design, and CodonRoBERTa for codon optimization. The OpenMed team frames the goal in practical terms: start from a therapeutic protein concept, predict a structure, design sequences that can fold into it, and end with synthesis-ready DNA optimized for expression in a target organism.
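The dataflow the article describes can be sketched as three chained stages. This is an illustrative sketch only: the function names and signatures below are stand-ins, not the OpenMed pipeline's actual API.

```python
# Illustrative dataflow of the described pipeline. Each function is a
# hypothetical stand-in for the named tool, not a real API call.

def predict_structure(protein_seq: str) -> str:
    """Stand-in for ESMFold: protein sequence -> predicted structure."""
    return f"STRUCTURE({protein_seq})"

def design_sequences(structure: str, n: int) -> list:
    """Stand-in for ProteinMPNN: structure -> n candidate sequences
    designed to fold into it."""
    return [f"CANDIDATE_{i}" for i in range(n)]

def optimize_codons(protein_seq: str, organism: str) -> str:
    """Stand-in for CodonRoBERTa: protein sequence -> codon-optimized,
    synthesis-ready DNA for the target organism."""
    return f"DNA({protein_seq}|{organism})"

# Concept -> structure -> candidate sequences -> optimized DNA.
structure = predict_structure("MKTAYIAKQR")
candidates = design_sequences(structure, n=3)
dna = [optimize_codons(seq, organism="e_coli") for seq in candidates]
print(len(dna))  # one synthesis-ready DNA string per candidate
```

The point of the chain is that each stage's output is the next stage's input, which is why the team can present it as one reproducible build rather than three separate model demos.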

The most interesting technical section is the codon-optimization study. OpenMed trained several transformer variants on 250,000 E. coli coding sequences and reports that CodonRoBERTa-large-v2 was the best overall model. The reported numbers are specific enough to matter: perplexity 4.10, Spearman CAI correlation 0.404, and a clear performance gap over ModernBERT, which the authors say underperformed badly on codon data despite newer attention mechanisms. The post argues that domain metrics mattered more than raw MLM loss, because a model can predict masked codons well without learning biologically meaningful preferences.
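The domain metric in question, the Codon Adaptation Index, is the geometric mean of per-codon "relative adaptiveness" weights, so it rewards sequences that use each amino acid's preferred codons in the target organism. A minimal sketch of the computation, using made-up weights for a few codons (real weights are derived from highly expressed E. coli genes):

```python
import math

# Hypothetical relative-adaptiveness weights (w values) for a few codons.
# In practice these come from a reference set of highly expressed genes
# in the target organism; the numbers here are illustrative only.
WEIGHTS = {
    "CTG": 1.00, "TTA": 0.06,  # leucine codons
    "AAA": 1.00, "AAG": 0.25,  # lysine codons
    "GAA": 1.00, "GAG": 0.44,  # glutamate codons
}

def cai(seq: str) -> float:
    """Codon Adaptation Index: geometric mean of per-codon weights."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    ws = [WEIGHTS[c] for c in codons if c in WEIGHTS]
    if not ws:
        return 0.0
    return math.exp(sum(math.log(w) for w in ws) / len(ws))

print(round(cai("CTGAAAGAA"), 3))  # all-preferred codons -> CAI of 1.0
print(cai("TTAAAGGAG") < cai("CTGAAAGAA"))  # rare codons score lower
```

A Spearman correlation of 0.404 between a model's codon choices and CAI means the model's ranking of sequences only partially tracks this organism-level preference signal, which is exactly why the authors treat it, rather than masked-LM loss, as the metric that matters.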

The multi-species scaling story is also notable. OpenMed says it trained four production models covering 25 species in 55 GPU-hours. In the same write-up, the team reports ESMFold runs on 30 protein chains with average PTM 0.79 and ProteinMPNN sequence recovery of 42% on scaffold 7K00. Those are not proof of therapeutic utility, but they are concrete engineering checkpoints that make the article more than a vague “AI for biology” pitch.
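Of those checkpoints, sequence recovery is the simplest to pin down: it is the fraction of positions at which the designed sequence reproduces the native residue for the scaffold. A minimal sketch with a toy pair of sequences:

```python
def sequence_recovery(designed: str, native: str) -> float:
    """Fraction of positions where the designed sequence matches the
    native sequence (the standard ProteinMPNN-style recovery metric)."""
    if len(designed) != len(native):
        raise ValueError("sequences must be the same length")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)

# Toy example: 3 of 7 residues match the native sequence.
print(round(sequence_recovery("MKTAYIA", "MKVLWAA"), 3))
```

A reported 42% recovery on scaffold 7K00 means the designed sequences agree with the native sequence at a little under half of the positions, a typical figure for structure-conditioned design rather than evidence of therapeutic function.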

The Hacker News discussion was skeptical in a useful way. One commenter questioned how much verifiable protein data exists for training and whether predicted outputs are biologically useful yet. Another commenter identifying as a structural biologist said this sort of system could be widely useful if it works. That split reaction is the right frame: the pipeline looks reproducible and inexpensive for open research, but wet-lab validation remains the real bottleneck.


Related Articles

Sciences Mar 6, 2026 1 min read

Google detailed new global conservation outcomes from SpeciesNet on March 6, 2026. The open-source model identifies nearly 2,500 animal categories from camera-trap imagery and is now being adapted by field teams across multiple regions.


Sciences Mar 17, 2026 2 min read

Google DeepMind said on X that it is expanding AlphaFold Database with millions of AI-predicted protein complex structures in collaboration with EMBL-EBI, NVIDIA, and Seoul National University. The release pushes AlphaFold beyond single-protein structure prediction toward a broader public resource for studying how proteins interact.


© 2026 Insights. All rights reserved.