An r/artificial link post resurfaced BullshitBench v2, a community benchmark built around 100 nonsense prompts scored by a 3-judge panel. The current public leaderboard puts Claude Sonnet 4.6 with high reasoning at a 91% green rate and a 3% red rate, but the results should still be read as a community signal rather than a neutral standard.
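The post doesn't publish the scoring rules, but with a 3-judge panel a per-prompt majority vote is the usual scheme. A minimal sketch, assuming hypothetical green/yellow/red labels where green means the model calls out the nonsense and red means it plays along:

```python
from collections import Counter

def panel_rates(verdicts):
    """verdicts: one list of judge labels per prompt, e.g. ["green", "green", "red"].
    Scores each prompt by the panel's majority label, then returns the
    fraction of prompts landing green and red."""
    majority = Counter(Counter(panel).most_common(1)[0][0] for panel in verdicts)
    n = len(verdicts)
    return majority["green"] / n, majority["red"] / n

# 100 nonsense prompts, each judged by a 3-judge panel (illustrative labels only)
panels = ([["green", "green", "yellow"]] * 91
          + [["red", "red", "green"]] * 3
          + [["yellow", "yellow", "green"]] * 6)
green, red = panel_rates(panels)
print(f"green rate: {green:.0%}, red rate: {red:.0%}")  # green rate: 91%, red rate: 3%
```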
#benchmarking
A heavily discussed HN post focused on Epoch AI's confirmation that GPT-5.4 Pro helped solve a combinatorics challenge from the FrontierMath Open Problems set, shifting attention from benchmark scores toward expert-verified research workflows.
A new r/LocalLLaMA thread argues that NVIDIA's Nemotron-Cascade-2-30B-A3B deserves more attention after quick local coding evals came in stronger than expected. The post is interesting because it lines up community measurements with NVIDIA's own push for a reasoning-oriented open MoE model that keeps activated parameters low.
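As a back-of-envelope check on why the activated-parameter count matters, here is the arithmetic implied by the model name, assuming the usual reading of "30B-A3B" as roughly 30B total weights with roughly 3B active per token (the post doesn't give exact layer-level figures):

```python
total_params = 30e9   # "30B" in the model name (approximate)
active_params = 3e9   # "A3B": parameters activated per token (approximate)

active_fraction = active_params / total_params
print(f"activated per token: {active_fraction:.0%} of weights")  # ~10%

# Rough per-token compute vs a dense 30B model, at ~2 FLOPs per weight per token
dense_flops = 2 * total_params
moe_flops = 2 * active_params
print(f"per-token compute vs dense 30B: {moe_flops / dense_flops:.0%}")  # ~10%
```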
NVIDIA announced SOL-ExecBench on March 20, 2026, a benchmark for real-world GPU kernels that scores optimized CUDA and PyTorch code against Speed-of-Light hardware bounds on NVIDIA B200 systems. The release packages 235 kernel optimization problems drawn from 124 AI models across BF16, FP8, and NVFP4 workloads.
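NVIDIA's exact scoring formula isn't reproduced here, but a Speed-of-Light bound is conventionally the roofline limit: a kernel can't run faster than whichever of compute or memory traffic takes longer at peak hardware rates. A minimal sketch; the B200-class peak numbers below are rough public figures used as assumptions, not values taken from the benchmark:

```python
def sol_fraction(flops, bytes_moved, measured_s,
                 peak_flops=4.5e15,   # assumed FP8 dense peak for a B200-class GPU
                 peak_bw=8.0e12):     # assumed ~8 TB/s HBM bandwidth
    """Score a kernel against its Speed-of-Light (roofline) bound.

    The SOL time is the best case: limited by whichever of compute or
    memory traffic takes longer at peak hardware rates.
    """
    sol_s = max(flops / peak_flops, bytes_moved / peak_bw)
    return sol_s / measured_s  # 1.0 means the kernel runs at the hardware bound

# Example: a memory-bound elementwise kernel moving 1 GiB in 200 µs
print(f"SOL fraction: {sol_fraction(flops=1e9, bytes_moved=2**30, measured_s=200e-6):.1%}")
```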
A high-engagement LocalLLaMA follow-up benchmark reports that Qwen3.5-35B-A3B runs best on the tested RTX 5080 setup with Q4_K_M quantization, a q8_0 KV cache, and --fit with no explicit batch flags.
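The thread doesn't state the runtime outright, but the flag names read like llama.cpp. A reproduction sketch of the reported configuration; the binary and GGUF paths are placeholders, and --fit is echoed from the post rather than verified:

```python
import subprocess

# Sketch of the reported RTX 5080 config; assumes the post used llama.cpp
# (not stated outright) and that ./llama-server and the GGUF file exist locally.
cmd = [
    "./llama-server",
    "-m", "Qwen3.5-35B-A3B-Q4_K_M.gguf",  # Q4_K_M weights, per the post
    "-ctk", "q8_0", "-ctv", "q8_0",       # q8_0 quantized KV cache
    "--fit",                              # as reported; no explicit batch flags
]
subprocess.run(cmd, check=True)
```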
A high-engagement Reddit post summarized 2025 ML competition patterns across major platforms. The author reports tracking roughly 400 contests and digging into first-place solution details for 73 of them, highlighting shifts in tooling, model choices, and compute budgets.
NIST’s CAISI released draft guidance NIST AI 800-2 for automated language-model benchmark evaluations and opened comments through March 31, 2026. The draft focuses on objective setting, execution methodology, and analysis/reporting quality.