#benchmark

LLM Feb 27, 2026 2 min read

OpenAI and Paradigm introduce EVMbench for smart contract security testing

OpenAI and Paradigm launched EVMbench, a benchmark that measures how well AI agents detect, patch, and exploit smart contract vulnerabilities. OpenAI reports that GPT-5.3-Codex scored 72.2% in exploit mode, versus 31.9% for GPT-5.

#security #smart-contracts #benchmark

LLM Reddit Feb 27, 2026 2 min read

OpenAI Pauses SWE-bench Verified Evaluations After 16.4% Flaw Finding

A trending Reddit post in r/singularity points to OpenAI's statement that it no longer evaluates on SWE-bench Verified, citing flaws in at least 16.4% of test cases. The announcement calls into question how much weight coding-model benchmark scores should carry in production decisions.

#openai #swe-bench #benchmark

AI X/Twitter Feb 24, 2026 1 min read

OpenAI Launches EVMbench: New Standard for Measuring AI Agents in Smart Contract Security

OpenAI introduced EVMbench, a new benchmark measuring how well AI agents can detect, exploit, and patch high-severity smart contract vulnerabilities in EVM-based blockchains.

#openai #benchmark #smart-contracts

LLM Hacker News Feb 24, 2026 1 min read

The "Car Wash" Test: Only 11 of 53 AI Models Pass a Simple Logic Question

Opper tested 53 leading LLMs with a deceptively simple logic question: should you walk or drive to a car wash 50 meters away? Only 11 models gave the correct answer, which is to drive, because the car itself has to be at the car wash.

#llm #benchmark #reasoning

LLM Reddit Feb 24, 2026 1 min read

GLM-5 Becomes Top Open-Weights Model on Extended NYT Connections Benchmark

Zhipu AI's GLM-5 has claimed the top spot among open-weights models on the Extended NYT Connections benchmark with a score of 81.8, edging out Kimi K2.5 Thinking's 78.3.

#glm-5 #benchmark #open-weights

LLM Feb 24, 2026 1 min read

Google Releases Gemini 3.1 Pro: 77.1% ARC-AGI-2, Doubled Reasoning Performance

Google DeepMind released Gemini 3.1 Pro on February 19, achieving 77.1% on ARC-AGI-2—more than double its predecessor's 31.1%—with a 1M-token context window and 80.6% on SWE-Bench Verified.

#google #gemini #benchmark

LLM Feb 23, 2026 1 min read

DeepSeek V4 Launches: 1 Trillion Parameters, 1M Context, Open-Weight

DeepSeek released V4 on Lunar New Year with 1 trillion parameters, 1M-token context windows, and novel mHC architecture. The open-weight model claims benchmark-topping coding performance at 10–40× lower inference costs than Western frontier models.

#deepseek #open-source #benchmark

LLM Feb 23, 2026 1 min read

Alibaba Releases Qwen3.5: Open-Weight MoE Model Claims to Beat US Rivals

Alibaba launched Qwen3.5, a 397B-parameter open-weight multimodal model supporting 201 languages. The company claims it outperforms GPT-5.2, Claude Opus 4.5, and Gemini 3 on benchmarks, while costing 60% less than its predecessor.

#alibaba #qwen #open-source

LLM Feb 23, 2026 1 min read

Google Releases Gemini 3.1 Pro with 77.1% on ARC-AGI-2

Google's Gemini 3.1 Pro achieves 77.1% on ARC-AGI-2—more than doubling the previous Gemini 3 Pro's score. The mid-cycle upgrade brings Deep Think-level reasoning capabilities to all users and developers.

#google #gemini #benchmark

LLM Reddit Feb 23, 2026 1 min read

Qwen Team Confirms Serious Data Quality Problems in GPQA and HLE Benchmarks

The Qwen research team has officially confirmed through a published paper that GPQA and HLE (Humanity's Last Exam) benchmark datasets contain serious quality issues — including OCR errors, incorrect gold-standard answers, and unverifiable questions — casting doubt on the reliability of current AI model evaluations.

#qwen #benchmark #gpqa

LLM X/Twitter Feb 22, 2026 1 min read

Google DeepMind Releases Gemini 3.1 Pro: 2x Reasoning Boost and Record Benchmark Scores

Google DeepMind has released Gemini 3.1 Pro with over 2x the reasoning performance of Gemini 3 Pro. The model scores 77.1% on ARC-AGI-2 (up from 31.1%) and 80.6% on SWE-bench Verified, and tops 12 of 18 tracked benchmarks, with pricing unchanged at $2/$12 per million tokens.

#gemini #google-deepmind #llm

LLM Reddit Feb 22, 2026 1 min read

Claude Opus 4.6 Hits 14.5-Hour Mark on METR's Software Task Benchmark

Claude Opus 4.6 achieved a 50%-time-horizon of approximately 14.5 hours on METR's software task benchmark — beating all predictions and suggesting a doubling time of under 3 months for AI task capabilities.

#claude #anthropic #metr
