Cursor said on March 25, 2026 that cloud agents can now run on customer infrastructure while preserving the same agent harness and workflow experience. Cursor's product post says the generally available setup keeps code, tool execution, and build artifacts inside the customer's network while still giving agents isolated remote environments, multi-model support, and plugin/MCP extensibility.
AnthropicAI highlighted an Engineering Blog post on March 24, 2026 about using a multi-agent harness to keep Claude productive across frontend and long-running software engineering tasks. The underlying Anthropic post explains how initializer agents, incremental coding sessions, progress logs, structured feature lists, and browser-based testing can reduce context-window drift and premature task completion.
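To make the loop concrete, here is a minimal sketch of that kind of harness, assuming a JSON feature list and an append-only log; the file names and the `run_coding_session` stub are hypothetical stand-ins, not Anthropic's implementation.

```python
import json
from pathlib import Path

FEATURES_PATH = Path("features.json")  # structured feature list (hypothetical)
PROGRESS_PATH = Path("progress.log")   # append-only progress log (hypothetical)

def run_coding_session(feature: dict) -> bool:
    """Stub for one incremental agent session; returns whether tests passed."""
    return True  # replace with a real agent invocation

def next_feature(features: list[dict]) -> dict | None:
    return next((f for f in features if f["status"] != "done"), None)

def harness_loop(max_sessions: int = 50) -> None:
    for _ in range(max_sessions):
        features = json.loads(FEATURES_PATH.read_text())
        feature = next_feature(features)
        if feature is None:
            break  # every feature verified done; guards against premature completion
        passed = run_coding_session(feature)
        # The log and feature list, not the model's context window, carry
        # state between sessions, which is what limits context drift.
        with PROGRESS_PATH.open("a") as log:
            log.write(f"{feature['id']}: {'pass' if passed else 'fail'}\n")
        if passed:
            feature["status"] = "done"
            FEATURES_PATH.write_text(json.dumps(features, indent=2))
```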
A popular r/LocalLLaMA post revived attention around Google Research’s TurboQuant by tying it directly to local inference constraints. The method’s reported 3-bit KV cache compression and 6x memory reduction make it relevant well beyond research headlines, but its practical value will depend on whether it reaches real deployment stacks.
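For scale, a generic 3-bit round-trip (plain per-row uniform quantization, not necessarily TurboQuant's actual algorithm) shows where the memory ratio comes from:

```python
import numpy as np

def quantize_3bit(x: np.ndarray):
    """Per-row asymmetric uniform quantization to 3 bits (8 levels)."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = (hi - lo) / 7.0 + 1e-12  # 2**3 - 1 quantization levels
    q = np.clip(np.round((x - lo) / scale), 0, 7).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    return q * scale + lo

kv = np.random.randn(32, 128).astype(np.float32)  # toy KV rows
q, scale, lo = quantize_3bit(kv)
print(f"max abs error: {np.abs(dequantize(q, scale, lo) - kv).max():.3f}")
# A 16-bit to 3-bit payload is ~5.3x; the reported ~6x presumably reflects
# the full method and its baseline, not this toy payload arithmetic.
print(f"payload ratio: {16 / 3:.1f}x")
```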
A post on r/MachineLearning argues that LoCoMo’s leaderboard is being treated with more confidence than its evaluation setup deserves. The audit claims the benchmark has a 6.4% ground-truth error rate and that its judge accepts intentionally wrong but topically adjacent answers far too often, turning attention from raw scores to benchmark reliability.
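A back-of-envelope calculation shows why both findings matter; everything here except the audited 6.4% rate is a hypothetical input:

```python
label_error = 0.064    # ground-truth error rate from the audit
true_accuracy = 0.80   # hypothetical model
lenient_accept = 0.30  # hypothetical: judge passes 30% of wrong answers

# Simplified model: a perfect system still loses points on mislabeled items,
# while a lenient judge inflates scores by passing topically adjacent answers.
ceiling = 1 - label_error
measured = true_accuracy + (1 - true_accuracy) * lenient_accept
print(f"score ceiling under label noise: {ceiling:.1%}")   # 93.6%
print(f"measured vs true accuracy: {measured:.1%} vs {true_accuracy:.1%}")
```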
A Hacker News post pushed ATLAS into the spotlight by framing a consumer-GPU coding agent as a serious cost challenger to hosted systems. The headline benchmark is interesting, but the repository itself makes clear that its 74.6% result is not a controlled head-to-head against Claude 4.5 Sonnet because the task counts and evaluation protocols differ.
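Even setting protocol differences aside, a quick binomial check with hypothetical task counts shows how wide the uncertainty band around a single pass rate is:

```python
import math

def ci95(p: float, n: int) -> float:
    """95% normal-approximation half-width for a pass rate p over n tasks."""
    return 1.96 * math.sqrt(p * (1 - p) / n)

p = 0.746
for n in (200, 500):  # hypothetical task counts for the two evals
    print(f"n={n}: 74.6% +/- {ci95(p, n):.1%}")
# A few hundred tasks leaves a band of several points, before accounting for
# differing harnesses, scaffolding, or scoring protocols.
```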
A Hacker News discussion around the `.claude` folder guide frames Claude Code configuration as versioned project infrastructure rather than repeated prompt setup. The breakdown of `CLAUDE.md`, rules, commands, skills, and agents shows how teams can standardize workflows, but it also creates a new governance layer for instructions.
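As an illustration of the idea (the layout below is assembled from the components the guide names, not a canonical spec), a versioned setup might look like:

```
repo/
├── CLAUDE.md              # project-wide instructions, committed like code
└── .claude/
    ├── settings.json      # shared settings and permissions
    ├── rules/             # standing rules referenced from CLAUDE.md
    ├── commands/
    │   └── review.md      # a custom slash command
    ├── skills/
    │   └── deploy/SKILL.md
    └── agents/
        └── test-writer.md # a subagent definition
```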
A March 26, 2026 r/LocalLLaMA post about serving Qwen 3.5 27B on Google Cloud B200 clusters reached 205 points and 52 comments at crawl time. The linked write-up reports 1,103,941 total tokens per second on 12 nodes after switching from tensor to data parallelism, shrinking context length, enabling FP8 KV cache, and using MTP-1 speculative decoding.
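Dividing out the headline number is straightforward; the GPUs-per-node count below is an assumption, since the post's node configuration is not restated here:

```python
total_tps = 1_103_941  # reported total tokens per second
nodes = 12
gpus_per_node = 8      # assumption: typical B200 node, not stated in the post

print(f"per node: {total_tps / nodes:,.0f} tok/s")                    # ~92,000
print(f"per GPU:  {total_tps / (nodes * gpus_per_node):,.0f} tok/s")  # ~11,500
```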
A March 26, 2026 r/LocalLLaMA post linking NVIDIA's `gpt-oss-puzzle-88B` model card reached 284 points and 105 comments at crawl time. NVIDIA says the 88B MoE model uses its Puzzle post-training NAS pipeline to cut parameters and KV-cache costs while keeping reasoning accuracy near or above the parent model.
A March 27, 2026 Hacker News post linking Claude Code's new scheduling docs reached 282 points and 230 comments at crawl time. Anthropic says scheduled tasks run on Anthropic-managed infrastructure, can clone GitHub repos into fresh sessions, and are available to Pro, Max, Team, and Enterprise users.
Together Research said on March 27, 2026 that a smaller model using divide-and-conquer can match or outperform GPT-4o on long-context tasks, with the work accepted at ICLR 2026. Together's blog and the arXiv paper say the method uses a planner-worker-manager pipeline and explains long-context failures in terms of task, model, and aggregator noise.
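A minimal sketch of that pipeline shape, with a placeholder `llm` callable standing in for any model API (this is not Together's code), shows where each noise source enters:

```python
from typing import Callable

def chunk(text: str, size: int = 4000) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def divide_and_conquer(question: str, document: str,
                       llm: Callable[[str], str]) -> str:
    # Planner: turn the question into a per-chunk subtask (task noise enters here).
    plan = llm(f"Break this question into a per-chunk subtask: {question}")
    # Workers: each call sees only one chunk, staying well under the small
    # model's context limit (model noise enters here).
    partials = [
        llm(f"Subtask: {plan}\nChunk:\n{c}\nAnswer from this chunk only.")
        for c in chunk(document)
    ]
    # Manager: combine partial answers (aggregator noise enters here, even
    # when every worker is individually correct).
    return llm(
        f"Question: {question}\nPartial answers:\n" + "\n".join(partials)
        + "\nCombine these into one final answer."
    )
```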
OpenAI Devs said on March 26, 2026 that plugins are rolling out in Codex, letting the agent work with common tools such as Slack, Figma, Notion, and Gmail. OpenAI's Codex docs describe plugins as reusable bundles that package skills, app integrations, and MCP server settings, turning Codex into a more shareable workflow layer for teams.
A r/LocalLLaMA self-post shared an open-source TurboQuant implementation for llama.cpp that skips value dequantization when attention weights are negligible. The author reports a 22.8% decode gain at 32K context on Qwen3.5-35B-A3B running on an Apple M5 Max, with unchanged perplexity and better needle-in-a-haystack retrieval.
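The trick reduces to a cheap branch in the attention inner loop. This NumPy sketch shows the shape of it with a made-up threshold and a generic quantization layout; the real change lives in llama.cpp's kernels:

```python
import numpy as np

THRESH = 1e-4  # hypothetical negligibility cutoff, not the author's value

def attend(probs: np.ndarray, v_q: np.ndarray, scale: np.ndarray,
           lo: np.ndarray) -> np.ndarray:
    """probs: (T,) softmax weights; v_q: (T, D) quantized V rows with
    per-row scale/lo. Rows with negligible weight are never dequantized."""
    out = np.zeros(v_q.shape[-1], dtype=np.float32)
    for t, p in enumerate(probs):
        if p < THRESH:
            continue  # skip dequantization and accumulation entirely
        out += p * (v_q[t].astype(np.float32) * scale[t] + lo[t])
    return out

# Toy usage: a long context produces many near-zero attention weights,
# which is exactly where the skipped work comes from.
T, D = 1024, 64
probs = np.random.dirichlet(np.ones(T)).astype(np.float32)
v_q = np.random.randint(0, 8, size=(T, D), dtype=np.uint8)
scale = np.full(T, 0.1, dtype=np.float32)
lo = np.full(T, -0.35, dtype=np.float32)
print(attend(probs, v_q, scale, lo)[:4])
```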