Cursor publishes the Composer 2 technical report detailing continued pretraining and large-scale RL for coding agents
Overview
On March 24, 2026, Cursor announced on X a technical report for Composer 2, its latest model for agentic software engineering. In a follow-up reply, the company linked the full PDF report, which turns the post from a simple launch note into one of the clearest first-party disclosures yet of how a production coding model is trained and evaluated.
According to the report, Composer 2 is a domain-specialized model built for long-horizon coding tasks in the same harness and tool environment used by deployed Cursor agents. Cursor says the goal was to minimize train-test mismatch by training against workflows that resemble real user sessions rather than narrow benchmark prompts.
Training recipe and architecture
Cursor says Composer 2 was trained in two stages. First came continued pretraining on a code-dominated data mix to improve knowledge and latent coding ability. The report says this stage used Kimi K2.5, a 1.04 trillion parameter mixture-of-experts model with 32 billion active parameters, as the base model. After training at a 32k-token context and extending long-context support to 256k tokens, Cursor added a short supervised fine-tuning phase on targeted coding tasks.
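The staged recipe described above can be sketched as a simple training schedule. The stage names, exact token budgets, and objective strings below are illustrative assumptions, not values from the report; the report confirms only the 32k-to-256k context extension, the base-model figures, and the ordering of the stages.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    context_tokens: int  # max context length during this stage
    objective: str       # training objective (descriptive only)

# Hypothetical schedule mirroring the report's description:
# continued pretraining at 32k context, long-context extension
# to 256k, a short SFT phase, then RL in agent environments.
SCHEDULE = [
    Stage("continued_pretraining", 32_768, "next-token on code-heavy mix"),
    Stage("long_context_extension", 262_144, "next-token at extended context"),
    Stage("supervised_finetuning", 262_144, "targeted coding tasks"),
    Stage("reinforcement_learning", 262_144, "agent tasks in Cursor harness"),
]

def max_context(schedule: list[Stage]) -> int:
    """Largest context length any stage trains at."""
    return max(s.context_tokens for s in schedule)
```

Reading 32k/256k as the usual powers of two (32,768 and 262,144 tokens) is itself an assumption; the report may round differently.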
The second stage was large-scale reinforcement learning in environments designed to mirror real Cursor sessions. The company says it trained against tasks spanning debugging, new features, refactors, documentation, testing, code review, DevOps, and migrations. The report also describes self-summarization for long trajectories, multi-token prediction for faster serving, and reward shaping to balance speed, tool use, and code quality.
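The report's mention of reward shaping that balances speed, tool use, and code quality could take many forms. The scalarized reward below is a minimal sketch with made-up weights and signal names (`tests_passed`, `wall_clock_s`, `tool_calls`, `lint_errors`), not Cursor's actual reward function.

```python
def shaped_reward(tests_passed: bool,
                  wall_clock_s: float,
                  tool_calls: int,
                  lint_errors: int) -> float:
    """Combine task success with shaping penalties.

    All weights and thresholds are hypothetical; a production
    reward would likely be tuned or learned, not hand-set.
    """
    reward = 1.0 if tests_passed else 0.0
    reward -= 0.001 * wall_clock_s            # gentle speed penalty
    reward -= 0.01 * max(0, tool_calls - 20)  # discourage tool-call sprawl
    reward -= 0.05 * lint_errors              # code-quality proxy
    return reward
```

A design note: keeping shaping terms small relative to the success signal avoids the agent gaming speed or style at the expense of actually completing the task.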
Benchmark results
On evaluation, Cursor reports 61.3% on CursorBench, 73.7% on SWE-bench Multilingual, and 61.7% on Terminal-Bench in its harness. It frames the result as frontier-level coding performance at a lower serving cost than state-of-the-art API pricing. The most interesting claim is not the benchmark table but the methodology: CursorBench is built from real internal engineering sessions, with larger code changes and shorter, less-specified prompts than public coding benchmarks.
That makes the report worth watching beyond Cursor itself. As coding agents move from autocomplete into longer autonomous workflows, first-party transparency about training environments, reward design, and benchmark construction is becoming strategically important. Primary source: Composer 2 Technical Report.
Related Articles
Cursor said on March 26, 2026 that real-time reinforcement learning lets it ship improved Composer checkpoints as often as every five hours. Cursor's research post says the loop trains on billions of production tokens from real user interactions, runs evals including CursorBench before deployment, and has already shown gains in edit persistence, dissatisfied follow-ups, and latency.
A Show HN post points to llm-circuit-finder, a toolkit that duplicates selected transformer layers inside GGUF models and claims sizable reasoning gains without changing weights or running fine-tuning. The strongest benchmark numbers come from the project author’s own evaluations rather than independent validation.
A benchmark thread on r/LocalLLaMA compared ROCm 7 nightlies and Vulkan on an AMD Mi50 for llama.cpp, arguing that Vulkan wins short dense workloads while ROCm pulls ahead on long context and some MoE scenarios.