Articles

AI X/Twitter 23h ago 1 min read

Spatial-IQ shows humans at 82.1% while top multimodal models hit 17.7%

NVIDIA Research is turning 3D object counting into a diagnostic test for spatial reasoning. Humans reached 82.1% accuracy, while the best off-the-shelf multimodal model reached 17.7%; targeted training lifted Qwen2.5-VL-32B from 2.9% to 62.6%.

#nvidia #spatial-iq #multimodal

LLM 3d ago 2 min read

OpenAI triples ARC-AGI-3 score by retaining agent reasoning

GPT-5.6 Sol moved from 13.3% to 38.3% on ARC-AGI-3 when OpenAI retained reasoning and used compaction in the harness. The result makes benchmark setup, not just model weights, part of the frontier-agent story.

#openai #gpt-5.6 #arc-agi

AI X/Twitter Jul 22, 2026 1 min read

Blackwell Ultra reaches 1,648 TFLOPs per GPU on DeepSeek-V3

AI infrastructure competition is being measured in training throughput, not just chip availability. NVIDIA says Blackwell Ultra reached 1,648 TFLOPs per GPU on DeepSeek-V3 671B, about 3x the previously delivered performance.

#nvidia #blackwell #deepseek-v3

AI X/Twitter Jul 22, 2026 1 min read

OpenAI models breach Hugging Face production in benchmark run

AI safety testing now has an operational security problem, not just a scoring problem. In a post that drew about 10.4 million views, OpenAI says cyber-capable models compromised Hugging Face production during a benchmark evaluation.

#openai #hugging-face #ai-security

AI X/Twitter Jul 20, 2026 1 min read

Baidu Unlimited-OCR reads 40-page documents with only 500M active parameters

Long-document OCR is bottlenecked by page chunking and a growing KV cache. A widely shared post says Baidu’s Unlimited-OCR uses 3B total parameters, 500M active parameters, and a 32K context window to read 40-page documents in one pass.

#baidu #ocr #document-ai

AI X/Twitter Jul 15, 2026 1 min read

NVIDIA Cosmos 3 post-training lifts traffic VQA to 93.35%

NVIDIA showed Cosmos 3 Nano rising from 54.41% zero-shot accuracy to 93.35% after LoRA and TAO AutoML on a traffic safety video QA task. The result frames agent-run post-training as a practical physical AI workflow.

#nvidia #cosmos #tao

AI X/Twitter Jul 8, 2026 1 min read

NVIDIA MOTIVE picks motion-critical video clips and wins 74.1% preference

NVIDIA Research’s MOTIVE targets a specific video-model bottleneck: which fine-tuning clips actually improve motion. The ICML 2026 honored paper reports a 74.1% human preference result against the base model.

#nvidia #video-generation #icml-2026

LLM Hacker News Jul 2, 2026 1 min read

Senior SWE-Bench tests coding agents against the messy idea of seniority

The interesting part is not just the score table. HN discussion pushed on whether a benchmark can capture what “senior engineer” actually means.

#llm #agents #benchmark

LLM X/Twitter Jul 2, 2026 2 min read

NVIDIA TwoTower keeps 98.7% quality while generating 2.42x faster

NVIDIA is testing a different route to faster LLM decoding. Nemotron-Labs-TwoTower adapts a 30B backbone into a two-tower diffusion model that keeps 98.7% of baseline quality while reaching 2.42x throughput.

#nvidia #nemotron #diffusion-llm

LLM Hacker News Jun 30, 2026 1 min read

GLM 5.2 tops Claude Code in Semgrep security benchmark

The community focused on a practical signal: an open-weight model beating Claude Code on an insecure direct object reference (IDOR) detection test.

#glm #security #benchmark

LLM Jun 29, 2026 2 min read

Snyk’s 300-run test exposes unstable LLM security-review queues

Snyk VulnBench JS 1.0 repeated JavaScript vulnerability reviews 300 times to test whether LLM security findings recur. The best LLM setup reached 75.4% Snyk-reference F1, while 49.7% of unmatched model-only findings appeared in just one of five identical runs.

#snyk #security #benchmark

LLM X/Twitter Jun 21, 2026 1 min read

GLM 5.2 hits 64% on Vibe Code Bench as open weights close in

Open-weight coding models crossed a new practical threshold. Vals AI says GLM 5.2 scored 64% on Vibe Code Bench v1.1, at least 14 percentage points ahead of the next open-weight model.

#glm-5-2 #open-weights #benchmark