A recent r/LocalLLaMA post presents Qwen3.5 27B as an unusually strong local inference sweet spot. The author reports about 19.7 tokens per second on an RTX A6000 48GB with llama.cpp and a 32K context, while the comments turn into a detailed debate about dense-versus-MoE VRAM economics.
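For readers following the VRAM debate, the dense-model side is easy to estimate with a back-of-the-envelope script: weight bytes at a given quantization plus KV cache for the chosen context. The architecture numbers below (layer count, KV heads, head dimension, bits per weight) are illustrative assumptions, not Qwen3.5 27B's actual config.

```python
# Rough VRAM estimate for a dense model: quantized weights plus KV cache.
# All architecture numbers here are illustrative assumptions.

def dense_vram_gb(params_b, bits_per_weight, n_layers, n_kv_heads,
                  head_dim, context, kv_bits=16):
    weights = params_b * 1e9 * bits_per_weight / 8                 # weight bytes
    # KV cache: 2 (K and V) * layers * kv_heads * head_dim * context * bytes/elem
    kv = 2 * n_layers * n_kv_heads * head_dim * context * kv_bits / 8
    return (weights + kv) / 1e9

# Hypothetical 27B dense config at ~4.5 bits/weight with a 32K context.
print(f"{dense_vram_gb(27, 4.5, 60, 8, 128, 32_768):.1f} GB")
```

The usual MoE counterpoint, and the crux of the thread, is that every expert still has to be resident in VRAM even though only a few are active per token, so total-parameter size rather than active-parameter size drives the memory bill.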
MegaTrain proposes training 100B+ parameter LLMs at full precision on a single GPU by keeping parameters and optimizer states in host memory and streaming layers through the device. The recent Hacker News interest is notable because the paper reframes the problem as one of memory-system design rather than simple GPU count.
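The core idea is easy to picture with a minimal offloading loop: parameters live in host memory and each layer is copied onto the GPU only while it executes. This is a toy sketch of the general technique (assuming a CUDA device), not MegaTrain's implementation, which also streams optimizer states through the backward pass and overlaps transfers with compute.

```python
import torch
import torch.nn as nn

# Layer streaming sketch: weights stay on the CPU and visit the GPU
# only for the duration of their forward pass.
layers = nn.ModuleList([nn.Linear(4096, 4096) for _ in range(8)])  # host memory
x = torch.randn(2, 4096, device="cuda")

for layer in layers:
    layer.to("cuda", non_blocking=True)   # stream this layer's weights in
    x = layer(x)                          # compute on the device
    layer.to("cpu")                       # free VRAM for the next layer

print(x.shape)
```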
On April 7, 2026, OpenAI’s Tibo Sottiaux said Codex reached 3 million weekly users. He added that the jump from 2 million to 3 million took less than a month, and OpenAI will reset usage limits at each additional million users until the product reaches 10 million weekly users.
A popular r/LocalLLaMA self-post lays out a concrete 2x H200 serving stack for GPT-OSS-120B, including routing, monitoring, and queueing tradeoffs. The appeal is not just the headline throughput, but the unusually detailed operational data behind it.
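As a reference point for the model-hosting layer of such a stack, a minimal vLLM tensor-parallel load across two GPUs looks like the sketch below; the model id and sampling settings are assumptions, and the post's routing, monitoring, and queueing layers would sit in front of a server like this rather than inside it.

```python
from vllm import LLM, SamplingParams

# Minimal two-GPU tensor-parallel load; illustration only, not the poster's stack.
llm = LLM(model="openai/gpt-oss-120b", tensor_parallel_size=2)
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Summarize the tradeoffs of tensor parallelism."], params)
print(outputs[0].outputs[0].text)
```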
Anthropic's April 7, 2026 security write-up for Claude Mythos Preview argues that frontier LLM gains are now translating into real exploit-development capability. Hacker News is treating the post as a sign that defensive tooling and offensive risk are accelerating together.
A high-signal r/LocalLLaMA thread is circulating practical Gemma 4 fine-tuning guidance from Unsloth. The post claims Gemma-4-E2B and E4B can be adapted locally with 8GB VRAM, about 1.5x faster training, roughly 60% less VRAM than FA2 setups, and several fixes for early Gemma 4 training and inference bugs.
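The flow the thread describes is roughly a 4-bit base model with LoRA adapters attached, which is how an 8GB VRAM figure becomes plausible. The model identifier and LoRA settings below are placeholders, so check Unsloth's Gemma 4 notebooks for the exact names and recommended hyperparameters.

```python
from unsloth import FastLanguageModel

# Sketch of the Unsloth recipe: quantized base plus LoRA adapters.
# The model id is a guessed placeholder, not a confirmed release name.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-4-e2b",   # hypothetical identifier
    max_seq_length=4096,
    load_in_4bit=True,                  # 4-bit base keeps weights within 8GB
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                               # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
)
# Training then proceeds with trl's SFTTrainer on an instruction dataset,
# as in Unsloth's published notebooks.
```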
A detailed r/MachineLearning post is drawing attention to Dante-2B, a 2.1B dense Italian/English model trained from scratch on 2×H200 GPUs. The project emphasizes tokenizer efficiency for Italian, a 300B token corpus, and a fully open release of weights, tokenizer, and training pipeline after phase 2.
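Tokenizer efficiency is the kind of claim that is easy to check once the release lands: count tokens per word on an Italian sample and compare against familiar baselines. The snippet below uses two public tokenizers as stand-ins; swap in whatever identifier the Dante-2B authors publish.

```python
from transformers import AutoTokenizer

# Tokens per word is a quick proxy for tokenizer efficiency: fewer tokens on the
# same Italian text means more effective context and a cheaper pass over the corpus.
sample = "Nel mezzo del cammin di nostra vita mi ritrovai per una selva oscura."

for name in ["gpt2", "EleutherAI/gpt-neox-20b"]:   # baselines; add Dante-2B's id here
    tok = AutoTokenizer.from_pretrained(name)
    n_tokens = len(tok.encode(sample))
    n_words = len(sample.split())
    print(f"{name}: {n_tokens} tokens, {n_tokens / n_words:.2f} tokens/word")
```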
Hacker News picked up Z.ai's GLM-5.1 as a model aimed less at one-shot wins and more at sustained agentic work. Z.ai reports 58.4 on SWE-Bench Pro, 42.7 on NL2Repo, 66.5 on Terminal Bench 2.0, and long-horizon runs that keep improving through hundreds of iterations and thousands of tool calls.
GitHub Changelog's March 19, 2026 X post announced that GPT-5.3-Codex is the first long-term support model for Copilot Business and Copilot Enterprise. GitHub says the model launched on February 5, 2026, stays available through February 4, 2027, and becomes the new base model by May 17, 2026.
GitHub Changelog said on April 3, 2026 that GPT-5.1 Codex, GPT-5.1-Codex-Max, and GPT-5.1-Codex-Mini were deprecated across all Copilot surfaces as of April 1. GitHub tells organizations to move workflows and model policies to supported models, with GPT-5.3-Codex named as the replacement.
GitHub Changelog's April 7, 2026 X post said Copilot CLI can now connect to Azure OpenAI, Anthropic, and other OpenAI-compatible endpoints, or run fully local models instead of GitHub-hosted routing. GitHub's changelog adds that offline mode disables telemetry, unauthenticated use is possible with provider credentials alone, and built-in sub-agents inherit the chosen provider.
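For anyone unsure what "OpenAI-compatible endpoint" buys them, the pattern is the standard chat-completions protocol with a different base URL and key. The snippet below shows a generic client pointed at a local server (vLLM, llama.cpp's llama-server, Ollama, and similar all expose this interface); it is an illustration of the protocol, not Copilot CLI's own configuration syntax.

```python
from openai import OpenAI

# Any server speaking the /v1/chat/completions protocol can be addressed by
# swapping base_url and api_key; the model name is whatever the server exposes.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed-locally")

resp = client.chat.completions.create(
    model="local-model",   # placeholder name for the locally served model
    messages=[{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)
```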
A LocalLLaMA thread drew attention to DFlash, a block-diffusion draft model for speculative decoding whose paper claims lossless acceleration above 6x and direct support for vLLM, SGLang, and selected Transformers backends.
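Speculative decoding itself is a simple draft-and-verify loop: a cheap drafter proposes a block of tokens, the target model checks the whole block in one forward pass, and the longest agreeing prefix is kept, so output quality matches plain target decoding while fewer target passes are needed. The toy sketch below shows where a block drafter like DFlash slots in; the functions are stand-ins, not DFlash's actual interface, and real implementations verify against the target model's distribution rather than matching toy token ids.

```python
# Toy draft-and-verify step for greedy speculative decoding.
def speculative_step(prefix, draft, target, k=4):
    proposal = draft(prefix, k)              # k cheap tokens from the draft model
    verified = target(prefix, proposal)      # one target pass over the whole block
    accepted = []
    for d, t in zip(proposal, verified):
        if d != t:                           # first disagreement: keep the target's
            accepted.append(t)               # token and discard the rest of the block
            break
        accepted.append(d)
    return prefix + accepted

# Tiny stand-in "models" over token ids, for illustration only.
draft = lambda prefix, k: [(prefix[-1] + i + 1) % 10 for i in range(k)]
target = lambda prefix, proposal: [(prefix[-1] + i + 1) % 10 for i in range(len(proposal))]

print(speculative_step([3], draft, target))  # drafter agrees here, so all 4 tokens land
```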