Hacker News spotlights a 25-species mRNA language-model pipeline
Original: Training mRNA Language Models Across 25 Species for $165
Hacker News surfaced an unusually detailed bio/AI build log on April 5, 2026. The post, Training mRNA Language Models Across 25 Species for $165, had 138 points and 32 comments at crawl time, and it links to an OpenMed community article on Hugging Face published March 31, 2026.
The project is not a single model release. It is an end-to-end pipeline that chains ESMFold for structure prediction, ProteinMPNN for sequence design, and CodonRoBERTa for codon optimization. The OpenMed team frames the goal in practical terms: start from a therapeutic protein concept, predict a structure, design sequences that can fold into it, and end with synthesis-ready DNA optimized for expression in a target organism.
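The post does not publish the glue code, but the hand-off between the three stages can be sketched in a few lines. Every function below is a hypothetical placeholder standing in for the real tools, not the OpenMed API:

```python
# Hypothetical sketch of the pipeline's data flow: protein concept ->
# predicted structure -> designed sequences -> codon-optimized DNA.
# All function bodies are stubs; the real stages are ESMFold,
# ProteinMPNN, and CodonRoBERTa respectively.

def predict_structure(protein_seq: str) -> str:
    """Stage 1 (ESMFold-like): amino-acid sequence -> structure."""
    return f"PDB_FOR:{protein_seq}"  # placeholder for a PDB-format result

def design_sequences(structure: str, n: int = 4) -> list[str]:
    """Stage 2 (ProteinMPNN-like): structure -> candidate sequences."""
    return [f"SEQ{i}:{structure}" for i in range(n)]  # placeholders

def optimize_codons(protein_seq: str, species: str = "E. coli") -> str:
    """Stage 3 (CodonRoBERTa-like): protein -> codon-optimized DNA."""
    return f"DNA({species}):{protein_seq}"  # placeholder

def pipeline(concept_seq: str) -> list[str]:
    structure = predict_structure(concept_seq)
    candidates = design_sequences(structure)
    return [optimize_codons(seq) for seq in candidates]

print(len(pipeline("MKTAYIAKQR")))  # 4 synthesis-ready candidates
```

The point of the sketch is the data flow: each stage consumes the previous stage's output, so the whole chain can be driven from a single protein concept.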
The most interesting technical section is the codon-optimization study. OpenMed trained several transformer variants on 250,000 E. coli coding sequences and reports that CodonRoBERTa-large-v2 was the best overall model. The reported numbers are specific enough to matter: a perplexity of 4.10, a Spearman correlation of 0.404 against the Codon Adaptation Index (CAI), and a clear performance gap over ModernBERT, which the authors say underperformed badly on codon data despite its newer attention mechanisms. The post argues that domain metrics mattered more than raw MLM loss, because a model can predict masked codons well without learning biologically meaningful preferences.
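CAI, the domain metric the post correlates against, is a standard quantity: the geometric mean of each codon's relative adaptiveness (its frequency divided by that of the most-used synonymous codon) over a reference set of highly expressed genes. A minimal sketch, using made-up toy weights rather than real E. coli values:

```python
import math

# Toy relative-adaptiveness weights for a few codons; these are
# illustrative placeholders, not measured E. coli frequencies.
WEIGHTS = {
    "CTG": 1.00, "CTC": 0.19,  # Leu
    "AAA": 1.00, "AAG": 0.31,  # Lys
    "GAA": 1.00, "GAG": 0.47,  # Glu
}

def cai(codons):
    """Codon Adaptation Index: geometric mean of per-codon weights."""
    logs = [math.log(WEIGHTS[c]) for c in codons]
    return math.exp(sum(logs) / len(logs))

# A sequence built entirely from preferred codons scores exactly 1.0;
# swapping in rarer synonymous codons pulls the score down.
print(round(cai(["CTG", "AAA", "GAA"]), 3))  # 1.0
print(round(cai(["CTC", "AAG", "GAG"]), 3))
```

The Spearman number in the post then measures how well the model's codon preferences rank-correlate with this index across sequences, which is exactly the kind of domain metric raw MLM loss can miss.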
The multi-species scaling story is also notable. OpenMed says it trained four production models covering 25 species in 55 GPU-hours. In the same write-up, the team reports ESMFold runs on 30 protein chains with an average pTM of 0.79 and ProteinMPNN sequence recovery of 42% on scaffold 7K00. Those are not proof of therapeutic utility, but they are concrete engineering checkpoints that make the article more than a vague “AI for biology” pitch.
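Sequence recovery, the ProteinMPNN metric quoted above, is simply the fraction of designed residues that match the native sequence at each aligned position. A small illustration with a made-up sequence pair:

```python
def sequence_recovery(native: str, designed: str) -> float:
    """Fraction of designed residues identical to the native sequence."""
    if len(native) != len(designed):
        raise ValueError("sequences must be aligned to equal length")
    matches = sum(n == d for n, d in zip(native, designed))
    return matches / len(native)

# 42% recovery, as reported for scaffold 7K00, means roughly 4 of
# every 10 designed residues match the native protein.
print(sequence_recovery("MKTAY", "MATGF"))  # → 0.4
```

Recovery well above the ~5% expected by chance over 20 amino acids is a common sanity check that a design model has learned real structure-sequence constraints.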
The Hacker News discussion was skeptical in a useful way. One commenter questioned how much verifiable protein data exists for training and whether predicted outputs are biologically useful yet. Another commenter identifying as a structural biologist said this sort of system could be widely useful if it works. That split reaction is the right frame: the pipeline looks reproducible and inexpensive for open research, but wet-lab validation remains the real bottleneck.
Related Articles
Google DeepMind said on X that it is expanding AlphaFold Database with millions of AI-predicted protein complex structures in collaboration with EMBL-EBI, NVIDIA, and Seoul National University. The release pushes AlphaFold beyond single-protein structure prediction toward a broader public resource for studying how proteins interact.
An HN discussion around Cloudflare’s roadmap highlights a security story with direct IT relevance: the company now targets 2029 for full post-quantum protection, including authentication, because recent quantum and algorithmic advances are compressing the migration timeline.
In an April 7, 2026 post on X, OpenAI’s Kevin Weil introduced Paper Review, a new Prism workflow for reviewing technical and scientific papers. He said the tool goes beyond grammar, checking math, notation, units, structure, and evidence support, then writes an editable LaTeX review file back into the project.