OpenAI and Paradigm launched EVMbench, a benchmark for AI agent performance on smart contract detection, patching, and exploitation tasks. OpenAI reports GPT-5.3-Codex scored 72.2% in exploit mode versus 31.9% for GPT-5.
#benchmark
RSS FeedA trending Reddit post in r/singularity points to OpenAI's statement that it no longer evaluates on SWE-bench Verified, citing at least 16.4% flawed test cases. The announcement reframes how coding-model benchmark scores should be interpreted in production decision-making.
OpenAI introduced EVMbench, a new benchmark measuring how well AI agents can detect, exploit, and patch high-severity smart contract vulnerabilities in EVM-based blockchains.
Opper tested 53 leading LLMs with a deceptively simple logic question about whether to walk or drive to a car wash 50 meters away. Only 11 models answered correctly — the car must be driven to the car wash.
Zhipu AI's GLM-5 has claimed the top spot among open-weights models on the Extended NYT Connections benchmark with a score of 81.8, edging out Kimi K2.5 Thinking's 78.3.
Google DeepMind released Gemini 3.1 Pro on February 19, achieving 77.1% on ARC-AGI-2—more than double its predecessor's 31.1%—with a 1M-token context window and 80.6% on SWE-Bench Verified.
DeepSeek released V4 on Lunar New Year with 1 trillion parameters, 1M-token context windows, and novel mHC architecture. The open-weight model claims benchmark-topping coding performance at 10–40× lower inference costs than Western frontier models.
Alibaba launched Qwen3.5, a 397B-parameter open-weight multimodal model supporting 201 languages. The company claims it outperforms GPT-5.2, Claude Opus 4.5, and Gemini 3 on benchmarks, while costing 60% less than its predecessor.
Google's Gemini 3.1 Pro achieves 77.1% on ARC-AGI-2—more than doubling the previous Gemini 3 Pro's score. The mid-cycle upgrade brings Deep Think-level reasoning capabilities to all users and developers.
The Qwen research team has officially confirmed through a published paper that GPQA and HLE (Humanity's Last Exam) benchmark datasets contain serious quality issues — including OCR errors, incorrect gold-standard answers, and unverifiable questions — casting doubt on the reliability of current AI model evaluations.
Google DeepMind has released Gemini 3.1 Pro with over 2x reasoning performance versus Gemini 3 Pro. The model scores 77.1% on ARC-AGI-2 (up from 31.1%), 80.6% on SWE-bench Verified, and tops 12 of 18 tracked benchmarks at unchanged $2/$12 per million token pricing.
Claude Opus 4.6 achieved a 50%-time-horizon of approximately 14.5 hours on METR's software task benchmark — beating all predictions and suggesting a doubling time of under 3 months for AI task capabilities.