LLM

LLM Feb 23, 2026 1 min read

Google Releases Gemini 3.1 Pro with 77.1% on ARC-AGI-2

Google's Gemini 3.1 Pro scores 77.1% on ARC-AGI-2, more than double the score of its predecessor, Gemini 3 Pro. The mid-cycle upgrade brings Deep Think-level reasoning to all users and developers.

#google #gemini #benchmark

LLM Reddit Feb 23, 2026 1 min read

Qwen Team Confirms Serious Data Quality Problems in GPQA and HLE Benchmarks

The Qwen research team has published a paper confirming that the GPQA and HLE (Humanity's Last Exam) benchmark datasets contain serious quality issues, including OCR errors, incorrect gold-standard answers, and unverifiable questions. The findings cast doubt on the reliability of current AI model evaluations.

#qwen #benchmark #gpqa

LLM Reddit Feb 23, 2026 1 min read

Gemini 3.1 Pro Built a Fully Playable Space Game Through Natural Language Alone

A user created a fully playable space exploration game using only natural language instructions to Gemini 3.1 Pro over a few hours. The AI handled performance optimization, soundtrack generation, and UI design entirely from plain language requests, producing around 1,800 lines of HTML code.

#gemini #google #code-generation

LLM X/Twitter Feb 22, 2026 1 min read

Google DeepMind Releases Gemini 3.1 Pro: 2x Reasoning Boost and Record Benchmark Scores

Google DeepMind has released Gemini 3.1 Pro, claiming more than 2x the reasoning performance of Gemini 3 Pro. The model scores 77.1% on ARC-AGI-2 (up from 31.1%) and 80.6% on SWE-bench Verified, and it tops 12 of 18 tracked benchmarks, with pricing unchanged at $2/$12 per million tokens.

#gemini #google-deepmind #llm
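
The figures quoted above can be checked with a little arithmetic. The sketch below is illustrative only: it assumes that "$2/$12" means $2 per million input tokens and $12 per million output tokens, which the summary does not spell out, and the function name is hypothetical.

```python
# Assumption (not stated in the summary): $2 per million input tokens,
# $12 per million output tokens.
INPUT_USD_PER_MILLION = 2.0
OUTPUT_USD_PER_MILLION = 12.0


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request in US dollars under the assumed rates."""
    total = (
        input_tokens * INPUT_USD_PER_MILLION
        + output_tokens * OUTPUT_USD_PER_MILLION
    )
    return total / 1_000_000


# Ratio of the two ARC-AGI-2 scores quoted in the summary (77.1% vs 31.1%).
arc_agi_2_gain = 77.1 / 31.1

if __name__ == "__main__":
    print(f"ARC-AGI-2 gain: {arc_agi_2_gain:.2f}x")
    print(f"Cost of 10k input + 2k output tokens: ${request_cost_usd(10_000, 2_000):.4f}")
```

The score ratio comes out a little under 2.5, which is consistent with the "more than 2x" framing for this benchmark.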

LLM Hacker News Feb 22, 2026 2 min read

Taalas Prints LLM Weights into Silicon: 17,000 Tokens/sec at 10x Lower Cost

Taalas has released an ASIC with the Llama 3.1 8B model weights physically etched into the silicon. The chip reaches 17,000 tokens per second, a claimed 10x improvement in speed, cost, and power efficiency over GPU-based inference systems.

#taalas #asic #llm

LLM Feb 22, 2026 1 min read

Cohere Launches Tiny Aya: 3.35B Open-Weight Models Supporting 70+ Languages for Offline Use

At the India AI Summit on February 17, Cohere released Tiny Aya, a family of 3.35B open-weight multilingual models supporting 70+ languages that run offline on standard laptops, targeting global language accessibility.

#cohere #open-source #multilingual

LLM Feb 22, 2026 1 min read

ByteDance Launches Doubao 2.0 — Frontier-Level AI at One-Tenth the Cost

ByteDance released Doubao 2.0 ahead of Lunar New Year, claiming parity with GPT-5.2 and Gemini 3 Pro. The company cites a score of 98.3 on AIME 2025, a Codeforces rating of 3020, and pricing at one-tenth that of Western rivals.

#bytedance #llm #product-launch

LLM Reddit Feb 22, 2026 1 min read

Claude Opus 4.6 Hits 14.5-Hour Mark on METR's Software Task Benchmark

Claude Opus 4.6 achieved a 50% time horizon of approximately 14.5 hours on METR's software task benchmark, exceeding all predictions and suggesting a doubling time of under 3 months for AI task capabilities.

#claude #anthropic #metr

LLM Hacker News Feb 22, 2026 1 min read

Running Llama 3.1 70B on a Single RTX 3090 via NVMe-to-GPU

A new open-source project called ntransformer enables running the 140GB Llama 3.1 70B model on a single consumer RTX 3090 by streaming weights directly from NVMe storage to GPU, completely bypassing CPU RAM.

#llama #gpu #open-source
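
A quick back-of-the-envelope calculation shows where the 140GB figure comes from and why streaming is needed. This is a sketch under stated assumptions: the weights are taken to be stored at 16 bits (2 bytes) per parameter, which matches the 140GB figure but is not stated in the summary, and the helper name is hypothetical. The RTX 3090 ships with 24 GB of VRAM.

```python
PARAMETER_COUNT = 70e9        # Llama 3.1 70B
BYTES_PER_PARAMETER = 2       # assumption: 16-bit weights (FP16/BF16)
RTX_3090_VRAM_GB = 24         # VRAM on a single RTX 3090


def weights_size_gb(parameter_count: float, bytes_per_parameter: int) -> float:
    """Return the raw size of the model weights in decimal gigabytes."""
    return parameter_count * bytes_per_parameter / 1e9


model_gb = weights_size_gb(PARAMETER_COUNT, BYTES_PER_PARAMETER)

# Upper bound on the share of the weights that can sit in VRAM at once,
# ignoring activations and the KV cache, which need space too.
resident_fraction = RTX_3090_VRAM_GB / model_gb

if __name__ == "__main__":
    print(f"Weights: {model_gb:.0f} GB")
    print(f"Fits in VRAM at once: at most {resident_fraction:.0%}")
```

Under these assumptions, less than a fifth of the weights can be resident on the GPU at any moment, so the rest has to be streamed in from storage as each layer is needed.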

LLM Hacker News Feb 22, 2026 1 min read

Karpathy: "Claws" Are a New Layer on Top of LLM Agents

Andrej Karpathy coined a new term for OpenClaw-like AI agent systems: "Claws." Just as LLM agents were a new layer on top of LLMs, Claws provide orchestration, scheduling, persistent context, and tool calls on top of LLM agents.

#llm-agents #karpathy #openclaw

LLM Feb 22, 2026 1 min read

Anthropic's Claude Code Surpasses $2.5B ARR, Now Over Half of Enterprise Spending on Anthropic

Claude Code has grown to over $2.5 billion in annualized run-rate revenue as of February 2026, more than doubling since its first six months. The AI coding agent now accounts for over half of all enterprise spending on Anthropic, and users average 20 hours per week with the product.

#anthropic #claude-code #funding

LLM Feb 22, 2026 1 min read

xAI Launches Grok 4.20 Beta: The First Grok That Learns After Deployment

xAI released Grok 4.20 as a public beta on February 17, introducing a continuous post-deployment learning architecture that updates the model weekly from user feedback. The release also adds a four-agent collaboration system and medical document analysis via photo upload.

#xai #grok #product-launch