Cursor publishes the Composer 2 technical report detailing continued pretraining and large-scale RL for coding agents

Overview

On March 24, 2026, Cursor announced on X a technical report for Composer 2, its latest model for agentic software engineering. In a follow-up reply, the company linked the full PDF report, which turns the post from a simple launch note into one of the clearest first-party disclosures yet on how a production coding model is trained and evaluated.

According to the report, Composer 2 is a domain-specialized model built for long-horizon coding tasks in the same harness and tool environment used by deployed Cursor agents. Cursor says the goal was to minimize train-test mismatch by training against workflows that resemble real user sessions rather than narrow benchmark prompts.

Training recipe and architecture

Cursor says Composer 2 was trained in two stages. First came continued pretraining on a code-dominated data mix to improve knowledge and latent coding ability. The report says this stage used Kimi K2.5, a 1.04-trillion-parameter mixture-of-experts model with 32 billion active parameters, as the base model. After training at a 32k-token context length and a long-context extension to 256k tokens, Cursor added a short supervised fine-tuning phase on targeted coding tasks.
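To put the base-model figures in proportion, here is a minimal arithmetic sketch in Python. It uses only the numbers quoted above; the function names are illustrative, and reading "k" as 1,024 tokens is an assumption (the growth ratio is the same if "k" means 1,000).

```python
# Figures quoted in the article for the base model (Kimi K2.5).
TOTAL_PARAMS = 1.04e12   # 1.04 trillion total parameters
ACTIVE_PARAMS = 32e9     # 32 billion active parameters

# Context lengths. Assumption: "k" means 1,024 tokens.
BASE_CONTEXT = 32 * 1024
EXTENDED_CONTEXT = 256 * 1024


def active_fraction(total: float, active: float) -> float:
    """Share of the mixture-of-experts parameters that are active."""
    return active / total


def context_growth(base: int, extended: int) -> float:
    """How many times longer the extended context window is."""
    return extended / base


if __name__ == "__main__":
    share = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    growth = context_growth(BASE_CONTEXT, EXTENDED_CONTEXT)
    print(f"Active share of parameters: {share:.1%}")
    print(f"Context extension factor: {growth:.0f}x")
```

By these figures, roughly 3% of the model's parameters are active at a time, and the long-context extension multiplies the training context length by eight.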

The second stage was large-scale reinforcement learning in environments designed to mirror real Cursor sessions. The company says it trained against tasks spanning debugging, new features, refactors, documentation, testing, code review, DevOps, and migrations. The report also describes self-summarization for long trajectories, multi-token prediction for faster serving, and reward shaping to balance speed, tool use, and code quality.
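The article does not give Cursor's reward formula. As a purely illustrative sketch of what balancing speed, tool use, and code quality could look like, the Python below subtracts a capped cost for elapsed time and tool calls from a quality score. Every name, weight, and the cap itself is a hypothetical choice for illustration, not a detail from the report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeStats:
    """Hypothetical summary of one agent rollout."""
    quality: float        # assumed code-quality score in [0, 1]
    wall_seconds: float   # how long the rollout took
    tool_calls: int       # number of tool invocations


@dataclass(frozen=True)
class RewardWeights:
    """Illustrative weights; none of these values come from the report."""
    time_penalty_per_second: float = 0.001
    tool_penalty_per_call: float = 0.005
    max_penalty: float = 0.5  # cap so cost terms never swamp quality


def shaped_reward(stats: EpisodeStats, w: RewardWeights = RewardWeights()) -> float:
    """Quality score minus a capped cost for time and tool use."""
    penalty = (
        w.time_penalty_per_second * stats.wall_seconds
        + w.tool_penalty_per_call * stats.tool_calls
    )
    return stats.quality - min(penalty, w.max_penalty)


if __name__ == "__main__":
    fast = EpisodeStats(quality=1.0, wall_seconds=60, tool_calls=10)
    slow = EpisodeStats(quality=1.0, wall_seconds=600, tool_calls=80)
    # Both rollouts solve the task, but the faster, leaner one scores higher.
    print(round(shaped_reward(fast), 2), round(shaped_reward(slow), 2))
```

Capping the penalty is one way to keep the cost terms from outweighing correctness: here a slow but correct rollout still outscores a fast one with a quality score of zero.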

Benchmark results

On evaluation, Cursor reports 61.3% on CursorBench, 73.7% on SWE-bench Multilingual, and 61.7% on Terminal-Bench in its harness. It frames the result as frontier-level coding performance at a lower serving cost than the API pricing of state-of-the-art models. The more interesting part is not the benchmark table but the methodology: CursorBench is built from real internal engineering sessions, with larger code changes and shorter, less-specified prompts than public coding benchmarks.

That makes the report worth watching beyond Cursor itself. As coding agents move from autocomplete into longer autonomous workflows, first-party transparency about training environments, reward design, and benchmark construction is becoming strategically important. Primary source: Composer 2 Technical Report.
