LiteCoder pushes terminal agents to 31.5% on Terminal Bench Pro
Original: Releasing LiteCoder-Terminal-SFT
Terminal agents still have a data problem, and LiteCoder is trying to solve it with more than a new checkpoint. In a Hugging Face community article published on 2026-04-13, the LiteCoder team released LiteCoder-Terminal-SFT alongside a full training dataset of 11,255 trajectories and 602 standard Harbor terminal environments with complete test cases. That combination is what makes the story interesting. Plenty of teams ship model weights. Far fewer ship the executable environments and task structure that let other groups reproduce, stress, and extend agent training for terminal work.
The release covers two models, LiteCoder-Terminal-30b-a3b-sft and LiteCoder-Terminal-4b-sft, plus multiple datasets. The article says the new training pipeline expanded beyond a Terminus-only setup and now includes trajectories from Claude Code and OpenHands as well. That matters because terminal agents break in different ways depending on the scaffold wrapped around them. LiteCoder says the final dataset spans 10 task categories with an average of 27.4 turns per trajectory, and the mixture is 86.6% Terminus-2, 7.1% OpenHands, and 6.3% Claude Code. In other words, the release is trying to push cross-scaffold generalization instead of optimizing for a single harness.
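As a back-of-the-envelope check, the published mixture percentages applied to the 11,255 trajectories imply rough per-scaffold counts. The article states only the percentages, so the figures below are estimates from those rounded shares, not official counts:

```python
# Estimate per-scaffold trajectory counts from the published mixture.
# The percentages are rounded in the article, so these are approximations.
TOTAL = 11_255
mixture = {"Terminus-2": 0.866, "OpenHands": 0.071, "Claude Code": 0.063}

counts = {name: round(TOTAL * share) for name, share in mixture.items()}
print(counts)                 # approximate trajectories per scaffold
print(sum(counts.values()))   # sanity check against the published total
```

Terminus-2 still dominates the mixture at roughly 9,700 of the 11,255 trajectories, so "cross-scaffold" here means a heavy Terminus base seasoned with OpenHands and Claude Code data rather than an even split.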
The benchmark table is the part that will get the most attention. On Terminal Bench 1.0, the 30B model posts 24.38% Pass@1, ahead of Qwen3-30B-A3B-Nex-N1 at 18.44% and well above the LiteCoder preview at 16.56%. On Terminal Bench 2.0, the 30B model reaches 12.36%, matching Qwen3-30B-A3B-Nex-N1 and doubling the 6.18% preview. On Terminal Bench Pro, the same 30B model lands at 31.5% Pass@1, up from 22.0% in the preview and ahead of Qwen3-30B-A3B-Nex-N1 at 21.0%. The 4B model is also notable: LiteCoder reports 15.5% on Terminal Bench Pro versus 3.5% for Qwen3-4B-Instruct.
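For readers unfamiliar with the metric, Pass@1 is simply the fraction of benchmark tasks a model solves on its first attempt. A minimal sketch of the computation (each benchmark ships its own evaluation harness; the task count below is illustrative, not Terminal Bench Pro's actual size):

```python
def pass_at_1(results: list[bool]) -> float:
    """Fraction of tasks whose single attempt passed all test cases."""
    return sum(results) / len(results)

# Illustrative only: a model solving 63 of 200 tasks on the first try.
results = [True] * 63 + [False] * 137
print(f"{pass_at_1(results):.1%}")  # prints "31.5%"
```

The single-attempt framing matters for terminal agents: a trajectory that recovers after ten wrong commands still counts as a pass, but a trajectory that corrupts the environment and never reaches the test criteria counts as a fail.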
There is a second story inside the release. LiteCoder also publishes exploratory data for terminal state prediction, arguing that real-time interaction is still too computationally expensive for reinforcement learning at scale and that better world modeling could ease that bottleneck. The team says 4B-scale models still drift badly when simulating environment dynamics, which is a useful reminder that coding-agent progress is not just about tool calling or longer context windows. If this release matters beyond one benchmark cycle, it will be because the open environments and trajectory data help the rest of the field train agents that can actually survive a messy shell session.
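The article does not publish a schema for the state-prediction data, but the task shape it describes (given a terminal state and a command, predict the resulting state) can be illustrated with a hypothetical record and the strictest possible scoring rule. All field names and the exact-match metric here are assumptions for illustration, not the released dataset's actual format:

```python
# Hypothetical terminal state-prediction example; field names are
# assumptions, not the released dataset's schema.
example = {
    "state_before": "$ ls\nmain.py  requirements.txt\n$ ",
    "command": "cat requirements.txt",
    "state_after": "requests==2.31.0\n",
}

def exact_match(predicted: str, reference: str) -> bool:
    """Strictest scoring: predicted state must match the reference verbatim."""
    return predicted.strip() == reference.strip()

# A model that drifts even slightly (as the article says 4B models do)
# fails this check:
print(exact_match("requests==2.30.0\n", example["state_after"]))  # prints "False"
```

The hard part the article points at is not this scoring step but generation: a world model must track file contents, working directory, and process state across dozens of turns, and small errors compound exactly the way the team reports at 4B scale.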
Related Articles
A March 20, 2026 Hacker News thread sent OpenCode up the charts, highlighting demand for a provider-agnostic coding agent with a TUI, built-in build/plan modes, and open deployment paths.
A Hacker News post pushed ATLAS into the spotlight by framing a consumer-GPU coding agent as a serious cost challenger to hosted systems. The headline benchmark is interesting, but the repository itself makes clear that its 74.6% result is not a controlled head-to-head against Claude 4.5 Sonnet because the task counts and evaluation protocols differ.
Cursor said on March 26, 2026 that real-time reinforcement learning lets it ship improved Composer 2 checkpoints every five hours. Cursor’s March 27 technical report says the model combines continued pretraining on Kimi K2.5 with large-scale RL in realistic Cursor sessions, scores 61.3 on CursorBench, and runs on an asynchronous multi-region RL stack with large sandbox fleets.