An r/artificial link post resurfaced BullshitBench v2, a community benchmark built around 100 nonsense prompts scored by a 3-judge panel. The current public leaderboard puts Claude Sonnet 4.6 with high reasoning at a 91% green rate and a 3% red rate, but the results should still be read as a community signal rather than a neutral standard.
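The post doesn't publish the scoring rules, but with a 3-judge panel a per-prompt majority vote is the usual scheme. A minimal sketch, assuming hypothetical green/yellow/red labels where green means the model calls out the nonsense and red means it plays along:

```python
from collections import Counter

def panel_rates(verdicts):
    """verdicts: one list of judge labels per prompt, e.g. ["green", "green", "red"].
    Scores each prompt by the panel's majority label, then returns the
    fraction of prompts landing green and red."""
    majority = Counter(Counter(panel).most_common(1)[0][0] for panel in verdicts)
    n = len(verdicts)
    return majority["green"] / n, majority["red"] / n

# 100 nonsense prompts, each judged by a 3-judge panel (illustrative labels only)
panels = ([["green", "green", "yellow"]] * 91
          + [["red", "red", "green"]] * 3
          + [["yellow", "yellow", "green"]] * 6)
green, red = panel_rates(panels)
print(f"green rate: {green:.0%}, red rate: {red:.0%}")  # green rate: 91%, red rate: 3%
```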
#benchmarking
A heavily discussed HN post focused on Epoch AI's confirmation that GPT-5.4 Pro helped solve a combinatorics challenge from the FrontierMath Open Problems set, shifting attention from benchmark scores toward expert-verified research workflows.
A new r/LocalLLaMA thread argues that NVIDIA's Nemotron-Cascade-2-30B-A3B deserves more attention after quick local coding evals came in stronger than expected. The post is interesting because it lines up community measurements with NVIDIA's own push for a reasoning-oriented open MoE model that keeps activated parameters low.
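As a back-of-envelope check on why the activated-parameter count matters, here is the arithmetic implied by the model name, assuming the usual reading of "30B-A3B" as roughly 30B total weights with roughly 3B active per token (the post doesn't give exact layer-level figures):

```python
total_params = 30e9   # "30B" in the model name (approximate)
active_params = 3e9   # "A3B": parameters activated per token (approximate)

active_fraction = active_params / total_params
print(f"activated per token: {active_fraction:.0%} of weights")  # ~10%

# Rough per-token compute vs a dense 30B model, at ~2 FLOPs per weight per token
dense_flops = 2 * total_params
moe_flops = 2 * active_params
print(f"per-token compute vs dense 30B: {moe_flops / dense_flops:.0%}")  # ~10%
```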
NVIDIA announced SOL-ExecBench on March 20, 2026, a benchmark for real-world GPU kernels that scores optimized CUDA and PyTorch code against Speed-of-Light hardware bounds on NVIDIA B200 systems. The release packages 235 kernel optimization problems drawn from 124 AI models across BF16, FP8, and NVFP4 workloads.
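NVIDIA's exact scoring formula isn't reproduced here, but a Speed-of-Light bound is conventionally the roofline limit: a kernel can't run faster than whichever of compute or memory traffic takes longer at peak hardware rates. A minimal sketch; the B200-class peak numbers below are rough public figures used as assumptions, not values taken from the benchmark:

```python
def sol_fraction(flops, bytes_moved, measured_s,
                 peak_flops=4.5e15,   # assumed FP8 dense peak for a B200-class GPU
                 peak_bw=8.0e12):     # assumed ~8 TB/s HBM bandwidth
    """Score a kernel against its Speed-of-Light (roofline) bound.

    The SOL time is the best case: limited by whichever of compute or
    memory traffic takes longer at peak hardware rates.
    """
    sol_s = max(flops / peak_flops, bytes_moved / peak_bw)
    return sol_s / measured_s  # 1.0 means the kernel runs at the hardware bound

# Example: a memory-bound elementwise kernel moving 1 GiB in 200 µs
print(f"SOL fraction: {sol_fraction(flops=1e9, bytes_moved=2**30, measured_s=200e-6):.1%}")
```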
A high-engagement LocalLLaMA follow-up benchmark reports that Qwen3.5-35B-A3B runs best on the tested RTX 5080 setup with Q4_K_M quantization, a q8_0 KV cache, and --fit with no explicit batch flags.
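The thread doesn't state the runtime outright, but the flag names read like llama.cpp. A reproduction sketch of the reported configuration; the binary and GGUF paths are placeholders, and --fit is echoed from the post rather than verified:

```python
import subprocess

# Sketch of the reported RTX 5080 config; assumes the post used llama.cpp
# (not stated outright) and that ./llama-server and the GGUF file exist locally.
cmd = [
    "./llama-server",
    "-m", "Qwen3.5-35B-A3B-Q4_K_M.gguf",  # Q4_K_M weights, per the post
    "-ctk", "q8_0", "-ctv", "q8_0",       # q8_0 quantized KV cache
    "--fit",                              # as reported; no explicit batch flags
]
subprocess.run(cmd, check=True)
```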
A high-engagement Reddit post summarized 2025 ML competition patterns across major platforms. The author reports tracking roughly 400 contests and digging into first-place solution details for 73 of them, highlighting shifts in tooling, model choices, and compute budgets.
NIST’s CAISI released draft guidance NIST AI 800-2 for automated language-model benchmark evaluations and opened comments through March 31, 2026. The draft focuses on objective setting, execution methodology, and analysis/reporting quality.