A new paper discussed in r/MachineLearning argues that unofficial model-access providers can quietly substitute models and distort both research and production results.
A post in r/MachineLearning argues that duplicating a specific seven-layer block inside Qwen2-72B improved benchmark performance without changing any weights.
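The duplication trick can be sketched in a few lines. This is a minimal, illustrative stand-in (the `Layer` class, the 12-layer toy stack, and the block indices are hypothetical; the post targets a specific seven-layer block of Qwen2-72B's decoder stack):

```python
import copy

# Depth up-scaling by duplication: repeat a contiguous block of decoder
# layers so activations pass through it twice. No weights are trained or
# modified -- the block is simply cloned and re-inserted. In practice the
# list below would be something like model.model.layers of a Hugging Face
# Qwen2 checkpoint; here we use toy placeholder layers.

class Layer:
    def __init__(self, idx):
        self.idx = idx          # stand-in for the layer's weights
    def __call__(self, x):
        return x + 1            # placeholder forward pass

def duplicate_block(layers, start, end):
    """Return a new stack with layers[start:end] repeated in place."""
    block = [copy.deepcopy(layer) for layer in layers[start:end]]
    return layers[:end] + block + layers[end:]

stack = [Layer(i) for i in range(12)]      # toy 12-layer model
expanded = duplicate_block(stack, 4, 11)   # clone a seven-layer block
print(len(expanded))                       # 19
```

Whether the clones share weight tensors or hold deep copies makes no difference to the output at inference time, since either way the repeated block applies identical parameters.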
Anthropic said Claude Opus 4.6 found 22 Firefox vulnerabilities during a two-week collaboration with Mozilla. Mozilla classified 14 as high severity and shipped fixes in Firefox 148.0.
OpenAI released proof attempts for all 10 FirstProof problems and said expert feedback suggests at least five may be correct. The company positioned the result as a test of long-horizon reasoning beyond standard benchmarks.
Google opened the AI Center Berlin on March 5 and said the site will connect teams from Google DeepMind, Google Research and Google Cloud with researchers, businesses and policymakers. At the launch, Google also announced long-term research partnerships with TUM and Helmholtz Munich.
Google DeepMind said on February 11, 2026 that Gemini Deep Think is now helping tackle professional problems in mathematics, physics, and computer science under expert supervision. The company tied the claim to two new papers, a research agent called Aletheia, and examples ranging from autonomous math results to work on algorithms, optimization, economics, and cosmic-string physics.
Google DeepMind's Aletheia AI research agent solved 6 out of 10 open research-level math problems in the FirstProof Challenge as judged by expert mathematicians. The system also generated a fully autonomous research paper and solved 4 open conjectures from Bloom's Erdős database.
Chinese AI lab DeepSeek plans to release its flagship V4 model this week: a 1-trillion-parameter native multimodal model built around Huawei Ascend chips, deliberately bypassing Nvidia and AMD.
A counterintuitive study found that programming AI agents with more assertive, 'rude' conversational behaviors, including interrupting and strategic silence, significantly improved their performance on complex reasoning tasks.
Researchers have demonstrated that transformer models with fewer than 100 parameters can add two 10-digit numbers with 100% accuracy using digit tokenization, challenging assumptions about the minimum complexity needed for arithmetic reasoning.
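Digit tokenization, the key ingredient in that result, can be illustrated briefly. A minimal sketch, with the token ids, fixed width, and least-significant-first ordering all being illustrative choices rather than details from the paper:

```python
# Digit tokenization: split each number into single-digit tokens so a tiny
# model only needs a 10-symbol vocabulary and a carry rule, instead of
# embeddings for every large number. Emitting the least significant digit
# first means the carry at each position depends only on tokens already seen.

VOCAB = {str(d): d for d in range(10)}

def tokenize(n: int, width: int = 10):
    """Zero-pad n to a fixed width, then emit digits least-significant first."""
    digits = f"{n:0{width}d}"[::-1]
    return [VOCAB[d] for d in digits]

a, b = 1234567890, 9876543210
print(tokenize(a)[:3])                 # [0, 9, 8] -- last digits of a, reversed
print(tokenize(a + b, width=11))       # per-digit target sequence for the sum
```

Fixed-width padding keeps every training example the same length, so positional information lines up across examples.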
A King's College London study tested ChatGPT, Claude, and Gemini in Cold War-style nuclear crisis simulations. AI models chose nuclear escalation in 95% of scenarios and left all eight de-escalation options entirely unused across 21 games.
Inception Labs has released Mercury 2, the first production-ready diffusion language model for reasoning. Running at over 1,000 tokens per second on Blackwell GPUs, it is dramatically faster and cheaper than leading autoregressive competitors.