A community developer achieved 100+ t/s decode speed and 585 t/s aggregate throughput for 8 simultaneous requests running Qwen3.5 27B on a dual RTX 3090 setup with NVLink, using vLLM with tensor parallelism and MTP optimization.
LLM
llmfit is an open-source CLI tool that automatically detects your system's RAM, CPU, and GPU specs to recommend the optimal LLM model and quantization level, dramatically lowering the barrier to running local AI.
Following President Trump's order barring federal agencies from using Anthropic products, Claude surged to the top of the US App Store's free apps chart, with daily signups hitting all-time records and free users growing over 60% since January.
A remarkable 13-month comparison: running frontier-level DeepSeek R1 at ~5 tokens/second cost $6,000 in early 2025. Today, you can run a significantly stronger model at the same speed on a $600 mini PC — and get 17-20 t/s with even more capable models.
A developer with a Mac Mini M4 used Claude to reverse engineer Apple's private Neural Engine APIs, bypassed CoreML, and successfully trained a 110M parameter Microgpt model entirely on the ANE — opening new possibilities for NPU-based AI training.
Alibaba's Qwen team has released Qwen 3.5 Small, a new small dense model in their flagship open-source series. The announcement topped r/LocalLLaMA with over 1,000 upvotes, reflecting the local AI community's enthusiasm for capable small models.
A deep-dive into why XML tags work better than other delimiters with Claude — rooted in how Anthropic structured Claude's training data and the model's extensive exposure to XML-structured prompts throughout fine-tuning.
growingSWE has created an interactive walkthrough of Andrej Karpathy's 200-line pure Python GPT implementation, letting you tokenize names, watch softmax convert scores to probabilities, step through backpropagation, and explore attention heatmaps.
The r/LocalLLaMA community is buzzing over Qwen 3.5-35B-A3B, which users report outperforms GPT-OSS-120B while being only one-third the size, making it an excellent local daily driver for development tasks.
The Financial Times reports that DeepSeek V4 is set to launch next week, featuring image and video generation capabilities that position it as a direct competitor to multimodal AI models from OpenAI and Google.
Andrej Karpathy highlights the fundamental memory+compute trade-off challenge in LLMs: fast but small on-chip SRAM versus large but slow off-chip DRAM. He calls optimizing this the most intellectually rewarding puzzle in AI infrastructure today, pointing to NVIDIA's $4.6T market cap as proof.
An r/MachineLearning project post (score 71, 12 comments) introduced <code>Micro Diffusion</code>, a minimal implementation inspired by <code>Microgpt</code>. The author released three versions (143-line NumPy, 292-line NumPy, 413-line PyTorch) that share the same diffusion loop while swapping denoisers.
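
To illustrate what "the same diffusion loop while swapping denoisers" can look like, here is a minimal pure-Python sketch. It is not the Micro Diffusion author's code. It assumes a DDPM-style reverse update with a linear beta schedule, which the post summary does not specify, and every name in it (`make_schedule`, `sample`, `zero_denoiser`, `make_oracle_denoiser`) is hypothetical. The loop only requires that a denoiser map a noisy vector and a timestep to a noise estimate, so any model with that signature can be dropped in.

```python
import math
import random
from typing import Callable, List, Tuple

# A denoiser maps (noisy sample x_t, timestep t) to a predicted noise vector.
Denoiser = Callable[[List[float], int], List[float]]
Schedule = Tuple[List[float], List[float], List[float]]


def make_schedule(steps: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> Schedule:
    """Linear beta schedule (an assumption); returns betas, alphas, cumulative alpha products."""
    betas = [beta_start + (beta_end - beta_start) * i / (steps - 1) for i in range(steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)
    return betas, alphas, alpha_bars


def sample(denoiser: Denoiser, schedule: Schedule, dim: int, rng: random.Random) -> List[float]:
    """Shared reverse-diffusion loop: start from Gaussian noise, denoise step by step."""
    betas, alphas, alpha_bars = schedule
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in reversed(range(len(betas))):
        eps = denoiser(x, t)  # the only model-specific call in the loop
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * ei) / math.sqrt(alphas[t]) for xi, ei in zip(x, eps)]
        if t > 0:
            sigma = math.sqrt(betas[t])
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean  # no noise is added on the final step
    return x


def zero_denoiser(x: List[float], t: int) -> List[float]:
    """Trivial stand-in denoiser that always predicts zero noise."""
    return [0.0] * len(x)


def make_oracle_denoiser(target: float, schedule: Schedule) -> Denoiser:
    """Toy denoiser for a dataset whose every coordinate equals `target`."""
    _, _, alpha_bars = schedule

    def denoise(x: List[float], t: int) -> List[float]:
        scale = math.sqrt(alpha_bars[t])
        return [(xi - scale * target) / math.sqrt(1.0 - alpha_bars[t]) for xi in x]

    return denoise


schedule = make_schedule(steps=50)
# Same loop, two different denoisers.
baseline = sample(zero_denoiser, schedule, dim=4, rng=random.Random(0))
recovered = sample(make_oracle_denoiser(3.0, schedule), schedule, dim=4, rng=random.Random(0))
```

With the toy oracle denoiser, the final step reconstructs the constant target exactly (up to floating-point error), which makes the loop easy to sanity-check before swapping in a learned model.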