LiteCoder pushes terminal agents to 31.5% on Terminal Bench Pro

Original: Releasing LiteCoder-Terminal-SFT

LLM Apr 15, 2026 By Insights AI 2 min read

Terminal agents still have a data problem, and LiteCoder is trying to solve it with more than a new checkpoint. In a Hugging Face community article published on 2026-04-13, the LiteCoder team released LiteCoder-Terminal-SFT alongside a full training dataset of 11,255 trajectories and 602 standard Harbor terminal environments with complete test cases. That combination is what makes the story interesting. Plenty of teams ship model weights. Far fewer ship the executable environments and task structure that let other groups reproduce, stress, and extend agent training for terminal work.

The release covers two models, LiteCoder-Terminal-30b-a3b-sft and LiteCoder-Terminal-4b-sft, plus multiple datasets. The article says the new training pipeline expanded beyond a Terminus-only setup and now includes trajectories from Claude Code and OpenHands as well. That matters because terminal agents break in different ways depending on the scaffold wrapped around them. LiteCoder says the final dataset spans 10 task categories with an average of 27.4 turns per trajectory, and the mixture is 86.6% Terminus-2, 7.1% OpenHands, and 6.3% Claude Code. In other words, the release is trying to push cross-scaffold generalization instead of optimizing for a single harness.
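As a rough sanity check on the reported mixture (a reader's sketch, not code from the release), the per-scaffold trajectory counts implied by the percentages can be computed:

```python
# Approximate per-scaffold trajectory counts implied by the reported
# dataset size (11,255 trajectories) and mixture percentages.
# The derived counts are estimates, not figures stated in the release.
TOTAL_TRAJECTORIES = 11_255
mixture = {"Terminus-2": 0.866, "OpenHands": 0.071, "Claude Code": 0.063}

counts = {scaffold: round(TOTAL_TRAJECTORIES * share)
          for scaffold, share in mixture.items()}

for scaffold, n in counts.items():
    print(f"{scaffold}: ~{n} trajectories")

# The reported shares sum to 100.0%, so the rounded counts should land
# within a few trajectories of the reported total.
assert abs(sum(counts.values()) - TOTAL_TRAJECTORIES) <= 3
```

With these particular percentages the rounded counts (roughly 9,747 / 799 / 709) happen to sum to the reported total exactly.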

The benchmark table is the part that will get the most attention. On Terminal Bench 1.0, the 30B model posts 24.38% Pass@1, ahead of Qwen3-30B-A3B-Nex-N1 at 18.44% and well above the LiteCoder preview at 16.56%. On Terminal Bench 2.0, the 30B model reaches 12.36%, matching Qwen3-30B-A3B-Nex-N1 and exactly doubling the preview's 6.18%. On Terminal Bench Pro, the same 30B model lands at 31.5% Pass@1, up from 22.0% in the preview and ahead of Qwen3-30B-A3B-Nex-N1 at 21.0%. The 4B model is also notable: LiteCoder reports 15.5% on Terminal Bench Pro versus 3.5% for Qwen3-4B-Instruct.
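To keep the deltas straight, here is a small sketch tabulating the reported Pass@1 numbers (the figures are copied from the article; the table layout and delta arithmetic are mine):

```python
# Reported Pass@1 (%) for the 30B model vs. baselines, per the article.
results = {
    "Terminal Bench 1.0": {"LiteCoder-30B": 24.38,
                           "Qwen3-30B-A3B-Nex-N1": 18.44,
                           "preview": 16.56},
    "Terminal Bench 2.0": {"LiteCoder-30B": 12.36,
                           "Qwen3-30B-A3B-Nex-N1": 12.36,
                           "preview": 6.18},
    "Terminal Bench Pro": {"LiteCoder-30B": 31.5,
                           "Qwen3-30B-A3B-Nex-N1": 21.0,
                           "preview": 22.0},
}

# Improvement of the released 30B model over the LiteCoder preview.
for bench, scores in results.items():
    delta = scores["LiteCoder-30B"] - scores["preview"]
    print(f"{bench}: {scores['LiteCoder-30B']}% ({delta:+.2f} pts vs. preview)")
```

One detail worth noticing: on Terminal Bench 2.0 the 12.36% result is exactly 2× the 6.18% preview, while on Terminal Bench Pro the gain is 9.5 points.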

There is a second story inside the release. LiteCoder also publishes exploratory data for terminal state prediction, arguing that real-time interaction is still too computationally expensive for reinforcement learning at scale and that better world modeling could ease that bottleneck. The team says 4B-scale models still drift badly when simulating environment dynamics, which is a useful reminder that coding-agent progress is not just about tool calling or longer context windows. If this release matters beyond one benchmark cycle, it will be because the open environments and trajectory data help the rest of the field train agents that can actually survive a messy shell session.
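The article gives no schema for the state-prediction data, but the task it describes, predicting the next terminal state from the current state and a command so that RL need not execute everything for real, might be shaped like this. All field names here are my assumption, not the release's:

```python
from dataclasses import dataclass


@dataclass
class StatePredictionExample:
    """Hypothetical record for a terminal state-prediction task:
    given the current screen contents and a command, predict the
    screen after the command runs. Field names are illustrative,
    not taken from the LiteCoder release."""
    terminal_state: str   # current terminal screen contents
    command: str          # command the agent issues
    next_state: str       # ground-truth screen after execution

# Illustrative example (the output shown is a placeholder, not a
# claim about what these commands would actually print).
example = StatePredictionExample(
    terminal_state="$ ls\nmain.py  tests/",
    command="cat main.py | wc -l",
    next_state="42\n$",
)
```

A world model trained on records like this would stand in for the live shell during rollouts, which is exactly the bottleneck the team says 4B-scale models still drift on.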


Related Articles

LLM Hacker News Mar 28, 2026 2 min read

A Hacker News post pushed ATLAS into the spotlight by framing a consumer-GPU coding agent as a serious cost challenger to hosted systems. The headline benchmark is interesting, but the repository itself makes clear that its 74.6% result is not a controlled head-to-head against Claude 4.5 Sonnet because the task counts and evaluation protocols differ.

LLM sources.twitter Apr 5, 2026 2 min read

Cursor said on March 26, 2026 that real-time reinforcement learning lets it ship improved Composer 2 checkpoints every five hours. Cursor’s March 27 technical report says the model combines continued pretraining on Kimi K2.5 with large-scale RL in realistic Cursor sessions, scores 61.3 on CursorBench, and runs on an asynchronous multi-region RL stack with large sandbox fleets.


© 2026 Insights. All rights reserved.