Ornith-1.0 tests the open-model bar for agentic coding
Original: Ornith-1.0: self-improving open-source models for agentic coding View original →
Ornith-1.0 arrived as a set of open models aimed directly at agentic coding. The project README lists a 9B dense model plus 35B and 397B mixture-of-experts checkpoints, post-trained on Gemma 4 and Qwen 3.5 bases. It also emphasizes an MIT license, global availability, and deployment recipes for vLLM, SGLang, Transformers, llama.cpp, and Ollama-style local use.
The headline numbers are coding-agent benchmarks. The README compares the models across Terminal-Bench 2.1, SWE-bench Verified, SWE-bench Pro, SWE-bench Multilingual, NL2Repo, and ClawEval under stated harness settings. That gave HN enough material to debate the release, but the better discussion was about practical behavior: whether smaller open coding models now feel useful inside real development loops.
Several commenters focused on the 35B variant. Early users reported running quantized or FP8 versions locally, with one comparing it favorably to Qwen 3.6 35B-style models because it produced shorter reasoning traces and avoided some long loops. Other comments were more skeptical, asking who DeepReinforce is, whether the model is essentially a Qwen derivative, and what “self-improving” means outside the training framework.
That mix is the real signal. Open coding models are no longer judged only by a SWE-bench row. Developers want released weights, usable serving instructions, long context, tool-call parsing, reasoning separation, and enough speed to sit inside an agent loop without turning every task into a long wait. Ornith-1.0 is interesting because it packages those claims in one release, while still leaving provenance and replication questions for the community to test.
Source: Ornith-1.0 README, HN discussion.
Related Articles
Z.AI is pitching GLM-5.2 as a long-horizon coding model, not just another long-context release. Its docs claim 1M lossless context, 128K maximum output, 81.0 on Terminal-Bench 2.1, and a 1% gap behind Claude Opus 4.8 on FrontierSWE.
The r/MachineLearning post drew attention because OCR is becoming a measurable ingestion layer for agents and RAG, not just a text extraction demo.
Model choice is becoming a runtime routing problem instead of a static leaderboard check. OpenRouter says its Benchmarks API exposes live scores, including Artificial Analysis and Design Arena, and points to GLM-5.2 leading both coding and design among available models.