Hacker News spotlights SWE-CI, a benchmark for long-horizon code-maintenance agents
Original: SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration
A recent Hacker News front-page discussion highlighted SWE-CI, a new arXiv benchmark aimed at a question existing coding benchmarks usually sidestep: whether an agent can keep a codebase healthy over time, rather than merely landing a single correct patch. The paper argues that benchmarks such as HumanEval, LiveCodeBench, and SWE-bench mostly reward snapshot performance. In production software, however, requirements arrive in sequence, interfaces shift, and earlier design decisions make later changes easier or harder.
SWE-CI tries to model that reality directly. The benchmark contains 100 tasks drawn from 68 real repositories. Each task pairs a base commit with a later target commit, spanning on average 233 days and 71 consecutive commits. The evaluation starts from the base code, then asks agents to move toward the target through repeated rounds of analysis, implementation, and testing. Instead of a single issue-to-patch jump, the setup follows a Continuous Integration loop.
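The repeated analyze–implement–test loop described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's harness: the `run_round` callback, the `Round` record, and the toy agent are all hypothetical names standing in for whatever the real evaluation uses.

```python
from dataclasses import dataclass

@dataclass
class Round:
    """One CI round: analysis -> implementation -> test results (hypothetical shape)."""
    tests_passed: int
    tests_total: int

def run_ci_loop(rounds_budget, run_round):
    """Drive the analyze/implement/test loop until the budget is spent
    or every test passes; `run_round` stands in for one agent step."""
    history = []
    for _ in range(rounds_budget):
        r = run_round()
        history.append(r)
        if r.tests_passed == r.tests_total:
            break
    return history

def make_toy_agent(total=5):
    """Toy stand-in: each round fixes exactly one more failing test."""
    state = {"passed": 2}
    def step():
        state["passed"] = min(total, state["passed"] + 1)
        return Round(tests_passed=state["passed"], tests_total=total)
    return step

history = run_ci_loop(10, make_toy_agent())
```

The key structural point is that evaluation observes every intermediate state in `history`, not just the final one, which is what lets the benchmark score the trajectory rather than the endpoint.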
The protocol uses two roles. An Architect agent reviews failing tests, locates the likely gaps, and writes a short high-level requirements document. A Programmer agent then implements the next change set. The benchmark scores intermediate states with a future-weighted metric called EvoScore, which tries to reward code that stays easy to extend and punish code that accumulates technical debt or regressions. The paper’s framing is clear: maintainability is not visible at one snapshot, so it has to be measured through successive changes.
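To make the "future-weighted" idea concrete, here is a minimal sketch of one way such a metric could work; the paper defines EvoScore's actual formula, so treat this as an assumption-laden illustration, with `gamma` a hypothetical discount factor and `round_scores` a hypothetical per-round quality signal.

```python
def future_weighted_score(round_scores, gamma=0.9):
    """Illustrative future-weighted metric (NOT the paper's EvoScore):
    the value of the state after round t is its own quality plus the
    discounted quality of every later round, so decisions that keep
    future changes easy raise earlier scores, while technical debt
    that hurts later rounds drags them down."""
    scores = []
    for t in range(len(round_scores)):
        total, weight = 0.0, 1.0
        for s in round_scores[t:]:
            total += weight * s
            weight *= gamma
        scores.append(total)
    return scores
```

Under this toy formulation, two trajectories with identical final states can score differently if one accumulated regressions along the way, which captures the paper's framing that maintainability only shows up across successive changes.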
The early findings are useful for anyone evaluating code agents. The authors report experiments over 18 models from 8 providers, using more than 10 billion tokens in total. They say newer models are improving quickly, and that the Claude Opus family leads their chart. But the more important result is the ceiling: most models still show a zero-regression rate below 0.25 across long-horizon maintenance tasks, which suggests that reliable automated software upkeep remains harder than passing a single benchmark patch test.
That is why the HN attention makes sense. SWE-CI is less about another leaderboard and more about shifting the target from short-term functional correctness to long-term code quality. If coding agents are going to move from demo patches to real maintenance work, benchmarks like this are the kind that will expose whether they can actually survive a live repository.
Primary source: SWE-CI paper
Community source: Hacker News discussion
Project links: GitHub, dataset