Articles

LLM Hacker News Jul 10, 2026 1 min read

Colibri Runs GLM-5.2 on a Slow PC, and the Real Debate Is Memory Movement

The community interest came from a practical question: can a huge MoE model be useful on ordinary hardware? Colibri uses GLM-5.2’s sparse activation pattern to avoid loading the whole model into RAM or a GPU at once.

#glm-5.2 #local-ai #inference

LLM Reddit Jul 4, 2026 1 min read

GLM-5.2 at home turns local LLM enthusiasm into a hardware bill

A LocalLLaMA build with five RTX PRO 6000 cards and a 5090 made the practical cost of serious local inference hard to ignore.

#glm #local-llm #gpu

AI Jul 2, 2026 2 min read

Etched puts working silicon and $1B in contracts behind inference ASICs

Etched came out of stealth with a working chip, $800 million raised and more than $1 billion in signed customer contracts. The bigger signal is that AI inference is becoming a full-stack systems race, not just a hunt for more general-purpose GPUs.

#etched #ai-chips #inference

AI Jul 2, 2026 1 min read

Together AI’s $800M round turns open-model inference into a scale race

Together AI raised $800 million at an $8.3 billion valuation, a large bet that open-model infrastructure can undercut closed-model economics. The company says annual bookings topped $1.15 billion last quarter, and it plans to expand capacity about 50-fold over five years.

#together-ai #funding #inference

AI X/Twitter Jun 27, 2026 2 min read

NVIDIA Inference Hub gives engineers one API for 100-plus AI models

Enterprise AI bottlenecks are shifting from model access to operational control. NVIDIA says its internal Enterprise Inference Hub serves more than 100 model endpoints and processes trillions of tokens every week.

#nvidia #inference #litellm

LLM Reddit Jun 26, 2026 1 min read

NVIDIA’s Nemotron-TwoTower tests diffusion-style generation for LLMs

LocalLLaMA focused on the practical question: can a diffusion LLM keep quality while making generation meaningfully faster?

#nvidia #nemotron #diffusion

LLM X/Twitter Jun 25, 2026 2 min read

OpenAI’s first Jalapeno chip targets LLM inference after 9-month tape-out

The AI bottleneck is shifting from model release cadence to inference infrastructure. OpenAI says Jalapeno was taped out in nine months and is planned for gigawatt-scale deployment beginning in late 2026.

#openai #broadcom #ai-chip

AI Hacker News Jun 24, 2026 2 min read

AI token pricing has reached the ROI phase

The HN discussion focused less on model quality and more on cost control. As generative AI moves from experimentation into operating budgets, token pricing is becoming a buying constraint.

#ai #pricing #inference

LLM Hacker News Jun 20, 2026 1 min read

Local Qwen is not a worse Opus; it is a different operating model

Alex Ellis’s post resonated because it framed local LLMs through business use, control, cost, and agent reliability instead of a simple benchmark ladder.

#qwen #local-llm #coding-agents

LLM Reddit Jun 14, 2026 1 min read

Xiaomi’s 1T MiMo speed claim puts DFlash and GPU codesign under LocalLLaMA scrutiny

The LocalLLaMA angle is not just the 1,000+ tokens-per-second headline, but whether FP4, DFlash, and commodity GPU kernels can be reproduced outside Xiaomi’s hosted trial.

#xiaomi #mimo #inference

LLM Jun 13, 2026 1 min read

AgentPerf reframes AI infra: GB300 serves 20x more coding agents per MW

NVIDIA says its GB300 NVL72 delivered up to 20x more concurrent agentic coding capacity per megawatt than H200 on Artificial Analysis’ new AA-AgentPerf benchmark. The test measures concurrent AI agents under service-level objectives, not just raw token throughput.

#nvidia #agentperf #benchmark

LLM Jun 12, 2026 2 min read

DiffusionGemma cuts the token bottleneck with a 26B open model

Google DeepMind released DiffusionGemma, a 26B MoE open model that uses text diffusion instead of token-by-token decoding. The pitch is up to 4x faster generation on dedicated GPUs for local, interactive workflows.

#google #deepmind #gemma