Google AI Developers has released Android Bench, an official leaderboard for LLMs on Android development tasks. In the first results, Gemini 3.1 Pro ranks first, and Google is also publishing the benchmark, dataset, and test harness.
A front-page Hacker News thread drew attention to SWE-CI, an arXiv benchmark that evaluates coding agents on 100 real repository evolution tasks rather than one-shot bug fixes. The paper frames software maintainability as a CI-loop problem and reports that even strong models still struggle to avoid regressions over long development arcs.
Microsoft Research introduced CORPGEN on February 26, 2026 to evaluate and improve agent performance in realistic multi-task office scenarios. The framework reports up to 3.5x higher task completion than baseline systems under heavy concurrent load.
A high-scoring r/LocalLLaMA post benchmarked Qwen3.5-27B Q4 GGUF variants against BF16, separating “closest-to-baseline” choices from “best efficiency” picks for constrained VRAM setups.
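To see why Q4 variants matter for constrained VRAM, a back-of-envelope weight-memory estimate helps. This sketch assumes 2 bytes per parameter for BF16 and roughly 4.5 effective bits per weight for a Q4-class GGUF quantization (an assumption; real GGUF files add metadata and per-block scale overhead that varies by variant):

```python
# Back-of-envelope weight-storage estimate for a 27B-parameter model.
# Illustrative only; bits-per-weight for Q4 variants is an assumption.
def weight_gib(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given precision."""
    return params_b * 1e9 * bits_per_weight / 8 / 2**30

bf16 = weight_gib(27, 16)   # ~50.3 GiB: out of reach for consumer GPUs
q4   = weight_gib(27, 4.5)  # ~14.1 GiB: fits a 16 GB card (before KV cache)
print(f"BF16: {bf16:.1f} GiB, Q4: {q4:.1f} GiB")
```

The roughly 3.5x reduction is what makes "best efficiency" picks viable on a single consumer GPU, at the cost of the quality gap the post's benchmarks try to quantify.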
A widely shared r/LocalLLaMA comparison of Qwen's smallest models across three generations (score: 681) shows large efficiency gains. The Qwen 3.5 9B now outperforms the previous-generation 80B on several benchmarks, while the 2B handles video understanding better than many 7B models.
Chinese AI lab DeepSeek plans to release its flagship V4 model this week: a 1-trillion-parameter native multimodal model built around Huawei Ascend chips, deliberately bypassing Nvidia and AMD.
A King's College London study tested ChatGPT, Claude, and Gemini in Cold War-style nuclear crisis simulations. AI models chose nuclear escalation in 95% of scenarios and left all eight de-escalation options entirely unused across 21 games.
r/LocalLLaMA Benchmarks: <code>Krasis</code> reports 3,324 tok/s prefill for 80B MoE on one RTX 5080
An r/LocalLLaMA post (score 180, 53 comments) shared benchmark data for <code>Krasis</code>, a hybrid CPU/GPU runtime aimed at large MoE models. The key claim is that GPU-heavy prefill plus CPU decode can reduce long-context waiting time even when full models do not fit in consumer VRAM.
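The intuition behind the claim can be sketched with a toy latency model: prompt ingestion runs at the prefill rate, generation at the decode rate, and the two add up. The 3,324 tok/s prefill figure comes from the post; the decode and CPU-only prefill rates below are hypothetical placeholders, not measured numbers:

```python
# Toy latency model for hybrid prefill/decode serving.
# Only the 3,324 tok/s prefill rate is from the post; other rates are
# hypothetical, chosen to illustrate the shape of the trade-off.
def total_seconds(prompt_toks: int, out_toks: int,
                  prefill_tps: float, decode_tps: float) -> float:
    """Wall-clock time = prompt ingestion + token generation."""
    return prompt_toks / prefill_tps + out_toks / decode_tps

# 32k-token prompt, 500 generated tokens.
hybrid   = total_seconds(32_000, 500, prefill_tps=3_324, decode_tps=10)
cpu_only = total_seconds(32_000, 500, prefill_tps=60,   decode_tps=10)
print(f"hybrid: {hybrid:.0f}s, cpu-only: {cpu_only:.0f}s")
```

Because prefill dominates at long context, moving only that phase to the GPU removes most of the waiting time even though decode speed is unchanged.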
An r/MachineLearning post surfaced AdderBoard, where community submissions report 100% 10-digit addition with extremely small transformer designs, including hand-coded models under 100 parameters.
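Part of why so few parameters can suffice is that long addition is a local operation: each output digit depends only on the two input digits at that position plus a single carry bit. A plain Python rendering of that schoolbook procedure (the function the tiny models compute, not the models themselves) makes the locality explicit:

```python
# Schoolbook addition as a right-to-left scan carrying one bit of state.
# Each step needs only the current digit pair and the carry, which is
# why the task is learnable (or hand-codable) with very few parameters.
def add_digits(a: str, b: str) -> str:
    """Add two equal-length decimal strings (most significant digit first)."""
    carry, out = 0, []
    for da, db in zip(reversed(a), reversed(b)):
        s = int(da) + int(db) + carry
        out.append(str(s % 10))
        carry = s // 10
    if carry:
        out.append(str(carry))
    return "".join(reversed(out))

print(add_digits("9999999999", "0000000001"))  # 10000000000
```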
A Hacker News thread analyzed a benchmark of 2,430 Claude Code runs, focusing on default stack choices, build-vs-buy behavior, and ecosystem lock-in risks.