Cursor publishes the Composer 2 technical report detailing continued pretraining and large-scale RL for coding agents
Overview
On March 24, 2026, Cursor announced on X a technical report for Composer 2, its latest model for agentic software engineering. In a follow-up reply, the company linked the full PDF, which turns the post from a simple launch note into one of the clearest first-party disclosures yet on how a production coding model is trained and evaluated.
According to the report, Composer 2 is a domain-specialized model built for long-horizon coding tasks in the same harness and tool environment used by deployed Cursor agents. Cursor says the goal was to minimize train-test mismatch by training against workflows that resemble real user sessions rather than narrow benchmark prompts.
Training recipe and architecture
Cursor says Composer 2 was trained in two stages. First came continued pretraining on a code-dominated data mix to improve knowledge and latent coding ability. The report says this stage used Kimi K2.5, a mixture-of-experts model with 1.04 trillion total parameters and 32 billion active parameters, as the base model. After training at a 32k-token context length and a long-context extension to 256k tokens, Cursor added a short supervised fine-tuning phase on targeted coding tasks.
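To put the quoted figures in proportion, the short sketch below works out two ratios from the numbers above: the share of parameters that are active and the size of the context extension. The function names are illustrative, and the token counts assume "32k" and "256k" mean multiples of 1,024, which does not change the ratio.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the article.
# Function names are illustrative; they do not come from the report.

def active_fraction(total_params: float, active_params: float) -> float:
    """Share of a mixture-of-experts model's parameters that are active."""
    return active_params / total_params


def context_extension_factor(base_tokens: int, extended_tokens: int) -> float:
    """How many times longer the extended context is than the base context."""
    return extended_tokens / base_tokens


TOTAL_PARAMS = 1.04e12        # 1.04 trillion total parameters
ACTIVE_PARAMS = 32e9          # 32 billion active parameters
BASE_CONTEXT = 32 * 1024      # assumption: "32k" read as 32 * 1024 tokens
EXTENDED_CONTEXT = 256 * 1024 # assumption: "256k" read as 256 * 1024 tokens

if __name__ == "__main__":
    share = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    factor = context_extension_factor(BASE_CONTEXT, EXTENDED_CONTEXT)
    print(f"Active share of parameters: {share:.2%}")
    print(f"Context extension factor: {factor:.0f}x")
```

Roughly 3% of the base model's parameters are active, and the long-context stage extends the training context eightfold.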
The second stage was large-scale reinforcement learning in environments designed to mirror real Cursor sessions. The company says it trained against tasks spanning debugging, new features, refactors, documentation, testing, code review, DevOps, and migrations. The report also describes self-summarization for long trajectories, multi-token prediction for faster serving, and reward shaping to balance speed, tool use, and code quality.
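The article does not give the report's actual reward formula. As a rough illustration of what reward shaping across speed, tool use, and code quality can look like, the sketch below combines three per-episode signals into one scalar. Every name, weight, and the time budget here is a hypothetical choice made for the example, not something taken from Cursor's report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeStats:
    """Hypothetical per-episode measurements; field names are illustrative."""
    tests_passed: int
    tests_total: int
    wall_clock_seconds: float
    tool_calls: int
    failed_tool_calls: int


def shaped_reward(
    stats: EpisodeStats,
    *,
    time_budget_seconds: float = 600.0,  # assumed budget, not from the report
    latency_weight: float = 0.2,         # assumed weight, not from the report
    tool_error_weight: float = 0.1,      # assumed weight, not from the report
) -> float:
    """Combine quality, speed, and tool-use signals into one scalar reward."""
    # Code quality proxy: fraction of tests that pass.
    quality = stats.tests_passed / stats.tests_total if stats.tests_total else 0.0
    # Speed penalty: fraction of the time budget consumed, capped at 1.
    latency_penalty = min(stats.wall_clock_seconds / time_budget_seconds, 1.0)
    # Tool-use penalty: fraction of tool calls that failed.
    tool_error_rate = (
        stats.failed_tool_calls / stats.tool_calls if stats.tool_calls else 0.0
    )
    return (
        quality
        - latency_weight * latency_penalty
        - tool_error_weight * tool_error_rate
    )


if __name__ == "__main__":
    fast = EpisodeStats(8, 10, 300.0, 20, 2)
    slow = EpisodeStats(8, 10, 600.0, 20, 2)
    print(shaped_reward(fast), shaped_reward(slow))
```

With these assumed weights, two episodes that pass the same tests receive different rewards if one takes longer, which is the kind of trade-off the report's reward shaping is described as balancing.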
Benchmark results
On evaluation, Cursor reports 61.3% on CursorBench, 73.7% on SWE-bench Multilingual, and 61.7% on Terminal-Bench in its harness. It frames the result as frontier-level coding performance at a lower serving cost than the API pricing of state-of-the-art models. The most interesting part is not the benchmark table itself but the methodology behind it: CursorBench is built from real internal engineering sessions, with larger code changes and shorter, less-specified prompts than public coding benchmarks.
That makes the report worth watching beyond Cursor itself. As coding agents move from autocomplete into longer autonomous workflows, first-party transparency about training environments, reward design, and benchmark construction is becoming strategically important. Primary source: Composer 2 Technical Report.