#benchmarks

AI sources.twitter 3d ago 2 min read

ParseBench brings 2,000 enterprise pages and 167K OCR rules to Kaggle

Why it matters: enterprise OCR failures break agents long before they show up on academic PDF benchmarks. LlamaIndex says ParseBench evaluates about 2,000 human-verified pages with over 167,000 rules across 14 methods on Kaggle.

#llamaindex #parsebench #ocr

LLM sources.twitter 3d ago 2 min read

Qwen3.6-27B beats Qwen3.5-397B on coding and ships under Apache 2.0

Why it matters: an open-weight 27B dense model is now being pitched against much larger coding systems on real agent tasks. Qwen’s own model card lists SWE-bench Verified at 77.2 for Qwen3.6-27B versus 76.2 for Qwen3.5-397B-A17B, with Apache 2.0 licensing.

#qwen #open-weights #coding-models

LLM sources.twitter 3d ago 2 min read

GPT-5.5 jumps 3 points clear on Artificial Analysis, but cost rises 20%

Why it matters: this is one of the first external benchmark reads to land right after the GPT-5.5 launch. Artificial Analysis said GPT-5.5 moved 3 points clear on its Intelligence Index, while the full index run still became roughly 20% more expensive.

#gpt-5-5 #artificial-analysis #benchmarks

LLM sources.twitter 3d ago 1 min read

Cohere W4A8 vLLM path claims 58% faster first-token latency

Why it matters: inference cost is now a product constraint, not only an infrastructure problem. Cohere said its W4A8 path in vLLM is up to 58% faster on TTFT and 45% faster on TPOT versus W4A16 on Hopper.

#cohere #vllm #inference

AI sources.twitter 3d ago 1 min read

OpenAI gives U.S. clinicians free ChatGPT and a harder HealthBench

Why it matters: OpenAI is targeting a regulated workflow where accuracy claims carry direct clinical consequences. The linked rollout cites 6,924 physician-reviewed conversations and a 99.6% safe/accurate rating in internal review.

#openai #healthcare #chatgpt

LLM sources.twitter 3d ago 1 min read

Perplexity says Qwen post-training beats GPT on factuality cost

Why it matters: search products need factuality and citations, not just fluent answers. Perplexity said its SFT + RL pipeline lets Qwen models match or beat GPT models on factuality at lower cost.

#perplexity #qwen #retrieval

LLM 4d ago 2 min read

Qwen3.6-Max-Preview pushes coding benchmarks, but stays cloud-only

Alibaba’s April 22 Qwen3.6-Max-Preview post claims top scores across six coding benchmarks and clear gains over Qwen3.6-Plus. The caveat is just as important: this is a hosted proprietary preview, not a new open-weight Qwen release.

#qwen #alibaba #coding-agents

LLM Reddit Apr 19, 2026 2 min read

A 145-result coding eval put Kimi K2.6, Opus 4.7, GLM 5.1 and Minimax under LocalLLaMA review

LocalLLaMA cared about this eval post because it mixed leaderboard data with lived coding-agent pain: Opus 4.7 scored well, but the author says it felt worse in real use.

#coding-agents #benchmarks #kimi

LLM Apr 19, 2026 1 min read

LLM judges miss unsafe answers 30% more when stakes are named

A new arXiv preprint reports that LLM judges became meaningfully more lenient when prompts framed evaluation consequences, exposing a weak point in automated safety and quality benchmarks.

#llm-evals #ai-safety #benchmarks

LLM Reddit Apr 19, 2026 1 min read

A Qwen3.6 tuning post made --n-cpu-moe the LocalLLaMA knob of the day

r/LocalLLaMA cared because the numbers were concrete: 79 t/s on an RTX 5070 Ti with 128K context, tied to one llama.cpp flag choice.

#qwen #llama-cpp #local-llm

LLM Apr 18, 2026 2 min read

MM-WebAgent makes webpage agents coordinate images, code and layout

MM-WebAgent tackles a real flaw in AI-made webpages: models can generate pieces, but the page often loses visual coherence. The paper adds hierarchical planning, self-reflection, a benchmark, and released code/data so builders can test multimodal webpage agents beyond code-only output.

#web-agents #multimodal #aigc

LLM Reddit Apr 18, 2026 2 min read

Opus 4.7’s Reddit benchmark fight was really about refusals versus regression

The r/singularity thread did not just react to Opus 4.7 scoring 41.0% where Opus 4.6 scored 94.7%. The interesting part was the community trying to separate real capability loss from refusal behavior, routing, and benchmark interpretation.

#claude #benchmarks #opus