Google DeepMind’s April 2, 2026 X thread introduced Gemma 4 as a new open model family built for reasoning and agentic workflows. Google says the lineup spans E2B, E4B, 26B MoE, and 31B Dense, and adds native function calling, structured JSON output, and longer context windows.
A LocalLLaMA post drew attention to PokeClaw, an open-source Android prototype that runs Gemma 4 locally through LiteRT-LM and lets the model tap, swipe, type, open apps, send messages, and manage auto-replies without cloud inference.
HN picked up Nanocode, an open JAX project that packages tokenizer training, pretraining, synthetic data generation, agentic SFT, and DPO into an end-to-end recipe for building a coding model on TPU infrastructure.
A Show HN thread highlighted Gemma Gem, a Chrome extension that runs Gemma 4 locally via WebGPU and exposes page-reading, clicking, typing, scrolling, screenshot, and JavaScript tools without API keys or server-side inference.
GitHub’s April 5 X post pointed developers to Squad, an open-source project built on GitHub Copilot that initializes a preconfigured AI team inside a repository. GitHub says the approach routes tasks through a thin coordinator, stores shared decisions in versioned repo files, and lets specialist agents operate in parallel with separate context windows.
In an April 4 X post, GitHub put fresh attention on Agentic Workflows, a technical-preview system that lets teams describe repository chores in Markdown and run them in GitHub Actions with coding agents. The underlying documentation says workflows default to read-only access and rely on reviewable safe outputs for write actions such as opening pull requests or posting issue comments.
A LocalLLaMA demo pointed to Parlor, which runs speech and vision understanding with Gemma 4 E2B and uses Kokoro for text-to-speech, all on-device. The README reports roughly 2.5-3.0 seconds end-to-end latency and about 83 tokens/sec decode speed on an Apple M3 Pro.
A LocalLLaMA explainer argues that Gemma 4 E2B/E4B gain their efficiency from Per-Layer Embeddings. The key point is that many of those parameters behave more like large token lookup tables than always-active compute-heavy layers, which changes the inference trade-off.
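To make the lookup-table point concrete, here is a minimal sketch. It is a toy illustration, not Gemma 4's actual architecture: the sizes, the `make_table` helper, and the `per_layer_vectors` function are all hypothetical. It only shows why a large table indexed by token id can hold many parameters while touching very few of them per token.

```python
from typing import List

# Hypothetical toy sizes, chosen only for illustration.
VOCAB_SIZE, NUM_LAYERS, PLE_DIM = 1000, 4, 8


def make_table(layer: int) -> List[List[float]]:
    """Build one per-layer table: one small vector per token id."""
    return [
        [float(layer + 1) * (token_id % 7)] * PLE_DIM
        for token_id in range(VOCAB_SIZE)
    ]


# One lookup table per layer (assumed layout, not the real model's).
tables = [make_table(layer) for layer in range(NUM_LAYERS)]


def per_layer_vectors(token_id: int) -> List[List[float]]:
    """Fetch one row per layer: an index operation, not a matrix multiply."""
    return [tables[layer][token_id] for layer in range(NUM_LAYERS)]


stored_params = NUM_LAYERS * VOCAB_SIZE * PLE_DIM  # parameters that must be stored
touched_per_token = NUM_LAYERS * PLE_DIM           # parameters read for one token
vectors = per_layer_vectors(token_id=10)
```

In this toy setup the tables hold 32,000 values but a single token reads only 32 of them, which is the kind of storage-versus-compute gap the explainer describes.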
Bankai, highlighted in LocalLLaMA, proposes post-training adaptation for true 1-bit LLMs by applying sparse XOR patches directly to binary weights. According to the GitHub repo and paper, patches around 1 KB changed Bonsai 8B behavior with zero inference overhead, fixed 4 of 17 held-out failures without breaking 13 already-correct cases, and could be applied or reverted with the same XOR operation in microseconds.
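The apply-and-revert property follows from XOR being its own inverse. The sketch below is not Bankai's code or patch format: the bit-packed `bytearray` layout and the `(byte_offset, xor_mask)` patch representation are assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical patch format: sparse (byte_offset, xor_mask) pairs.
Patch = List[Tuple[int, int]]


def apply_xor_patch(weights: bytearray, patch: Patch) -> None:
    """Flip the bits selected by each mask, in place."""
    for offset, mask in patch:
        weights[offset] ^= mask


# Toy "binary weights": 32 bits packed into 4 bytes.
weights = bytearray([0b10110010, 0b00001111, 0b11110000, 0b01010101])
original = bytes(weights)

# The patch touches only two of the four bytes.
patch: Patch = [(1, 0b00000101), (3, 0b10000000)]

apply_xor_patch(weights, patch)  # apply the patch
patched = bytes(weights)

apply_xor_patch(weights, patch)  # the same operation reverts it
restored = bytes(weights)
```

Because the patch only flips stored bits, the patched weights have the same shape and format as the originals, which is consistent with the zero-inference-overhead claim.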
Andrej Karpathy's April 4, 2026 "LLM Wiki" gist proposes replacing one-shot retrieval with an interlinked wiki that an agent continuously maintains. Hacker News focused on the three-layer design of raw sources, wiki, and schema, plus the ingest, query, and lint loop that lets knowledge compound instead of being rediscovered from scratch for every prompt.
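The three layers and the ingest, query, and lint loop can be sketched in a few lines. This is a hypothetical toy, not code from the gist: the `[[link]]` syntax, the `schema` contents, and all function signatures are assumptions, and plain functions stand in for the agent that would write the summaries.

```python
import re
from typing import Dict, List

LINK = re.compile(r"\[\[([^\]]+)\]\]")  # assumed wiki-link syntax

raw_sources: Dict[str, str] = {}     # layer 1: source documents, kept as-is
wiki: Dict[str, str] = {}            # layer 2: maintained, interlinked pages
schema = {"title_prefix": "# "}      # layer 3: conventions every page must follow


def ingest(source_id: str, text: str, page: str, summary: str) -> None:
    """Store the raw source and fold a summary into a wiki page."""
    raw_sources[source_id] = text
    existing = wiki.get(page, schema["title_prefix"] + page)
    wiki[page] = f"{existing}\n{summary} (source: {source_id})"


def query(term: str) -> List[str]:
    """Answer from the wiki rather than re-reading raw sources."""
    return sorted(name for name, body in wiki.items() if term.lower() in body.lower())


def lint() -> List[str]:
    """Report schema violations and links to pages that do not exist."""
    problems = []
    for name, body in wiki.items():
        if not body.startswith(schema["title_prefix"]):
            problems.append(f"{name}: missing title")
        for target in LINK.findall(body):
            if target not in wiki:
                problems.append(f"{name}: broken link to {target}")
    return problems


ingest("src-1", "raw notes A", "Retrieval", "Retrieval builds on [[Embeddings]].")
first_lint = lint()    # Embeddings page does not exist yet
ingest("src-2", "raw notes B", "Embeddings", "Vectors used by retrieval.")
second_lint = lint()   # the link now resolves
hits = query("retrieval")
```

The lint pass is what lets the wiki improve over time: the first run flags the missing page, and the next ingest closes the gap instead of leaving it to be rediscovered per prompt.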
Sebastian Raschka's April 4, 2026 article argues that coding-agent quality is shaped as much by the harness as by the base model. He breaks the stack into six components: live repo context, prompt and cache reuse, structured tools, context reduction, session memory, and bounded subagents. Hacker News treated it as a practical framework for understanding why products like Codex and Claude Code feel stronger than plain chat.
Together Research says LLMs can patch faulty database query plans instead of regenerating them from scratch, and claims up to 4.78x speedups on some TPC-H and TPC-DS workloads. The tweet points to DBPlanBench, a DataFusion-based harness that exposes a physical operator graph to an LLM and uses iterative search to refine plan edits.
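The patch-versus-regenerate idea can be illustrated with a toy plan graph. This is a sketch under stated assumptions, not DBPlanBench or DataFusion code: the dictionary plan format, the `cost` function, and the hard-coded candidate patches (which stand in for LLM proposals) are all hypothetical.

```python
import copy
from typing import Dict, List, Tuple

Plan = Dict[str, dict]
Patch = List[Tuple[str, str, object]]  # (node, field, new_value) edits

# Toy physical plan: a hash join that builds on the larger input.
plan: Plan = {
    "scan_orders": {"op": "Scan", "rows": 1_000_000},
    "scan_cust": {"op": "Scan", "rows": 1_000},
    "join": {"op": "HashJoin", "build": "scan_orders", "probe": "scan_cust"},
}


def cost(p: Plan) -> int:
    """Toy cost model: building costs 2 per row, probing costs 1 per row."""
    join = p["join"]
    return 2 * p[join["build"]]["rows"] + p[join["probe"]]["rows"]


def apply_patch(p: Plan, patch: Patch) -> Plan:
    """Return a copy of the plan with a few fields edited; the rest is untouched."""
    patched = copy.deepcopy(p)
    for node, field, value in patch:
        patched[node][field] = value
    return patched


def refine(p: Plan, candidates: List[Patch]) -> Plan:
    """Iteratively keep each candidate patch only if it lowers the cost."""
    best = p
    for patch in candidates:
        trial = apply_patch(best, patch)
        if cost(trial) < cost(best):
            best = trial
    return best


# Stand-ins for edits an LLM might propose.
candidates: List[Patch] = [
    [("join", "build", "scan_cust"), ("join", "probe", "scan_orders")],
    [("join", "op", "SortMergeJoin")],
]

before = cost(plan)
best_plan = refine(plan, candidates)
after = cost(best_plan)
```

Here the first patch swaps the build and probe sides and is kept, while the second leaves the toy cost unchanged and is rejected. A real harness would measure actual execution rather than use a hand-written cost function.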