Articles

LLM X/Twitter Jul 10, 2026 2 min read

OpenAI says 30% of SWE-Bench Pro is broken and drops its recommendation

OpenAI says SWE-Bench Pro no longer reliably measures frontier coding capability after finding 30% of its public tasks broken. The cited issues include hidden requirements, contradictory instructions, strict tests and incomplete grading criteria.

#openai #swe-bench #coding-agents

LLM X/Twitter Jul 10, 2026 2 min read

GPT-5.6 reaches ChatGPT, Codex and API with an 80.0 agent score

GPT-5.6 moved from preview to availability across ChatGPT, Codex and the OpenAI API. OpenAI paired the rollout with an 80.0 Coding Agent Index score, 2.8 points above Claude Fable 5, while claiming lower token use, time and cost.

#openai #gpt-5-6 #codex

LLM Jul 3, 2026 2 min read

SkillOpt lifts agent scores by 23.5 points without changing weights

Microsoft Research turned agent skill files into trainable artifacts. SkillOpt raised GPT-5.5’s six-benchmark direct-chat average from 58.8 to 82.3 and improved or tied for best across all 52 evaluation cells, all without updating model weights.

#microsoft-research #agents #skillopt

Sciences X/Twitter Jul 1, 2026 1 min read

GeneBench-Pro turns biology-agent testing into 129 hard problems

Biology agents are being judged on research judgment, not just factual answers. GeneBench-Pro puts 129 computational-biology problems in front of agents, and indexed coverage says GPT-5.6 Sol reaches 28.7% at the highest reasoning level and 31.5% in Pro mode.

#openai #genebench-pro #biology

LLM Hacker News Jun 30, 2026 1 min read

Ornith-1.0 tests the open-model bar for agentic coding

HN interest centered on whether the model feels useful in real coding loops, not just on the benchmark table.

#ornith #coding-agents #open-models

LLM Jun 30, 2026 1 min read

Arena turns 10M model votes into a $100M AI-evaluation business

Arena says its commercial AI evaluation service has reached a $100M annualized run rate just eight months after launch. The milestone shows how crowdsourced model preferences are becoming paid infrastructure for labs and enterprises.

#arena #benchmarks #evaluations

LLM X/Twitter Jun 30, 2026 1 min read

OpenRouter links live GPQA and TAU-Bench scores to tool-call routing

OpenRouter says it continuously runs GPQA and TAU-Bench on open-weight models and feeds the results into AutoExacto routing. The linked GLM 5.2 page pairs benchmark rankings with production details such as a 1M-token context window and pricing of $0.94/$3 per 1M tokens.

#openrouter #benchmarks #routing

LLM X/Twitter Jun 30, 2026 1 min read

GitHub Copilot harness matches native agents across five coding benches

GitHub compared the Copilot agentic harness against native model harnesses on five task suites. With the model and task held fixed, it claims comparable task resolution and fewer tokens across most configurations.

#github #copilot #agents

LLM X/Twitter Jun 29, 2026 2 min read

Four open-weight models move from cheap tokens into agent pipelines

Open-weight LLMs are moving from cost comparisons into production agent design. OpenRouter singled out four June 2026 models, including DeepSeek V4 Flash at 79.0% on SWE-bench Verified and GLM 5.2 as the top open model on Artificial Analysis v4.1.

#openrouter #open-weight #benchmarks

LLM Jun 28, 2026 2 min read

Open-weight models narrow the gap to 3-6 months, OpenRouter says

OpenRouter’s June review frames open-weight competition around four models: DeepSeek V4 Flash, GLM 5.2, MiniMax M3, and NVIDIA Nemotron 3 Ultra. The numbers that matter are 79.0% on SWE-bench Verified, an Intelligence Index score of 51, 1M-token contexts, and sharply lower serving costs.

#openrouter #open-weight #llm

LLM X/Twitter Jun 26, 2026 2 min read

OpenRouter Benchmarks API lets agents query live model rankings

Model choice is becoming a runtime routing problem instead of a static leaderboard check. OpenRouter says its Benchmarks API exposes live scores, including Artificial Analysis and Design Arena, and points to GLM-5.2 leading both coding and design among available models.

#openrouter #benchmarks #glm-5.2

LLM Hacker News Jun 18, 2026 1 min read

GLM-5.2 pushes open weights into the cost-versus-reasoning debate

The community debate moved beyond rank: GLM-5.2 looks strong, but output-token hunger and latency now matter as much as benchmark position.

#glm #open-weights #benchmarks