Claude Opus 4.6 Sets New Benchmark Records, Outpacing GPT-5.2

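Record-Setting Reasoning Benchmarks

Released on February 4, Claude Opus 4.6 is Anthropic's new flagship model and sets a new bar for reasoning and coding. On ARC-AGI-2, a benchmark of problems that are easy for humans but very hard for AI, it scored 68.8%, ahead of OpenAI's GPT-5.2 at 54.2% and also ahead of Google's Gemini 3 Pro.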

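An Edge in Real-World Work

On GDPval-AA, a benchmark that evaluates professional tasks in fields such as finance and law, Opus 4.6 led GPT-5.2 by about 144 Elo points. It also scored 65.4% on Terminal Bench (up from 59.8% for Opus 4.5) and 72.7% on the OSWorld agentic benchmark (up from 66.3%), demonstrating its performance as a computer-use agent.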
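For a rough sense of what a 144-point Elo lead means, the gap can be converted into an expected head-to-head preference rate. The sketch below assumes GDPval-AA ratings follow the conventional logistic Elo model with a 400-point scale; the article does not confirm this, and the function name `elo_expected_score` is illustrative only.

```python
def elo_expected_score(rating_gap: float) -> float:
    """Expected score of the higher-rated side under the standard Elo model.

    Assumption: ratings use the conventional base-10 logistic curve with a
    400-point scale. The benchmark's actual rating method may differ.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


if __name__ == "__main__":
    gap = 144  # Elo lead of Opus 4.6 over GPT-5.2 reported in the article
    print(f"Expected preference rate: {elo_expected_score(gap):.3f}")
```

Under that assumption, a 144-point lead corresponds to the higher-rated model being preferred in roughly 70% of pairwise comparisons.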

1-Million-Token Context Window

Opus 4.6 supports a 1-million-token context window and scored 76% on MRCR v2 8-needle 1M, a benchmark of long-context performance (Sonnet 4.5 scored 18.5%).

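Industry Impact

Shares of software companies including GitHub, Atlassian, and ServiceNow fell immediately after the Opus 4.6 launch. Analysts voiced concern that powerful AI coding tools could reduce demand for developer productivity software.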

Availability

Opus 4.6 is available through Claude.ai, Claude Code, and the API. It integrates with the agent teams feature, allowing complex workflows to be automated.

Source: Anthropic, The New Stack, WinBuzzer
