A high-signal Hacker News thread surfaced Unsloth’s Qwen3.5 guide, which maps model sizes to bf16 LoRA VRAM budgets and clarifies MoE, vision, and export paths for production workflows.
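The kind of size-to-VRAM mapping the guide provides can be approximated with a back-of-envelope formula. The sketch below is illustrative only: the 2 bytes/param figure follows from bf16, but the headroom multiplier and fixed overhead are assumptions, not Unsloth's published numbers.

```python
# Rough bf16 LoRA fine-tuning VRAM estimate. Constants are illustrative
# assumptions, NOT Unsloth's actual budget table.
def lora_vram_gb(params_b: float, overhead_gb: float = 2.0) -> float:
    """Estimate VRAM (GB) for bf16 LoRA fine-tuning of a `params_b`-billion
    parameter model: frozen base weights at 2 bytes/param, plus ~20% for
    adapters/activations and a fixed overhead for optimizer state and CUDA
    context (both assumed)."""
    base_weights_gb = params_b * 2  # bf16 = 2 bytes per parameter
    return base_weights_gb * 1.2 + overhead_gb


print(f"~{lora_vram_gb(7):.1f} GB for a 7B model")
```

Real budgets vary with sequence length, batch size, and LoRA rank, which is exactly why a tested table like Unsloth's is more useful than a formula.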
#llm
Alibaba's Qwen team released the Qwen3.5 small model series (0.8B to 9B). The models run in-browser via WebGPU and show dramatic benchmark improvements over previous generations.
Developer Nick Tikhonov shares how he built a voice AI agent achieving ~400ms end-to-end latency with a full STT → LLM → TTS pipeline, including clean barge-ins and no precomputed responses.
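The control flow of such a pipeline can be sketched with stubbed stages. This is a minimal illustration of streaming STT → LLM → TTS with barge-in via task cancellation, not Tikhonov's actual stack; all stage implementations here are stand-ins.

```python
# Minimal barge-in control-flow sketch: a new utterance cancels the
# in-flight response task. STT/LLM/TTS are stubs, not real models.
import asyncio


async def stt(audio: str) -> str:
    return f"transcript({audio})"          # stand-in for streaming STT


async def llm(prompt: str):
    for tok in ["Hello", " there", "!"]:   # stand-in for token streaming
        await asyncio.sleep(0)             # yield between tokens
        yield tok


async def tts(tokens) -> str:
    spoken = []
    async for tok in tokens:               # stand-in for incremental synthesis
        spoken.append(tok)
    return "".join(spoken)


async def respond(audio: str) -> str:
    return await tts(llm(await stt(audio)))


async def main() -> str:
    turn = asyncio.create_task(respond("user_audio_1"))
    await asyncio.sleep(0)                 # the user starts speaking again...
    turn.cancel()                          # ...so cut off the current reply
    try:
        await turn
    except asyncio.CancelledError:
        pass
    return await respond("user_audio_2")   # fresh turn for the interruption


print(asyncio.run(main()))
```

Real low-latency systems overlap the stages (TTS starts on the first LLM tokens) rather than awaiting each fully; the cancellation pattern for clean barge-ins is the same.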
A counterintuitive study found that programming AI agents with more assertive, 'rude' conversational behaviors — including interrupting and strategic silence — significantly improved their performance on complex reasoning tasks.
Researchers have demonstrated that transformer models with fewer than 100 parameters can add two 10-digit numbers with 100% accuracy using digit tokenization, challenging assumptions about the minimum complexity needed for arithmetic reasoning.
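Digit tokenization means each digit is its own token rather than part of a multi-digit BPE chunk. A common setup in tiny-transformer addition experiments (an assumption here; the paper's exact format may differ) also reverses the operands so the model can emit the answer least-significant digit first, matching how carries propagate:

```python
# Sketch of digit tokenization for addition. Reversed-digit format is one
# common convention in this line of work, assumed here for illustration.
def tokenize_sum(a: int, b: int) -> list[str]:
    """One token per digit, operands reversed, e.g. 12 + 34 ->
    ['2', '1', '+', '4', '3', '=']."""
    return list(str(a)[::-1]) + ["+"] + list(str(b)[::-1]) + ["="]


def detokenize_answer(digit_tokens: list[str]) -> int:
    """The model emits answer digits least-significant first; un-reverse."""
    return int("".join(digit_tokens)[::-1])


print(tokenize_sum(12, 34))
print(detokenize_answer(["6", "4"]))  # model output for 12 + 34
```

With this encoding the mapping from input digits to output digits is local (digit i of the answer depends only on digits i and the carry), which is why such a small parameter count can suffice.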
Inception Labs has released Mercury 2, the first production-ready diffusion language model for reasoning. Running at over 1,000 tokens per second on Blackwell GPUs, it is dramatically faster and cheaper than leading autoregressive competitors.
Alibaba released the Qwen3.5 small model series (0.8B, 4B, 9B). The 9B model achieves performance comparable to GPT-oss 20B–120B, making high-quality local inference accessible to users with modest GPU hardware.
llmfit is an open-source CLI tool that automatically detects your system's RAM, CPU, and GPU specs to recommend the optimal LLM model and quantization level, dramatically lowering the barrier to running local AI.
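The core idea of such a tool can be sketched in a few lines: read the machine's memory and map it to a model size and quantization that fits. Everything below is an illustrative assumption, not llmfit's actual detection code or thresholds, and the RAM probe is POSIX-only.

```python
# Hedged sketch of an llmfit-style recommender. Thresholds, model sizes,
# and quant names (GGUF-style) are illustrative assumptions.
import os


def total_ram_gb() -> float:
    """Total physical RAM in GB (POSIX only; Windows needs another API)."""
    return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / 1e9


def recommend(ram_gb: float) -> str:
    """Pick the largest model + quant that plausibly fits in memory."""
    if ram_gb >= 32:
        return "9B @ Q8_0"     # headroom for 8-bit weights
    if ram_gb >= 16:
        return "9B @ Q4_K_M"   # 4-bit keeps a 9B model well under 16 GB
    if ram_gb >= 8:
        return "4B @ Q4_K_M"
    return "0.8B @ Q4_K_M"


print(recommend(total_ram_gb()))
```

A real tool also checks VRAM and CPU features (e.g. AVX-512) and reserves memory for the KV cache, which grows with context length.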
A deep-dive into why XML tags work better than other delimiters with Claude — rooted in how Anthropic structured Claude's training data and the model's extensive exposure to XML-structured prompts throughout fine-tuning.
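In practice this is the delimiter style Anthropic's own prompt-engineering docs recommend. A minimal example of the pattern; the tag names are conventional choices, not required identifiers:

```python
# XML tags as prompt delimiters, per Anthropic's recommended style.
# Tag names like <instructions> and <document> are conventions, not a schema.
document_text = "LLMs parse XML-delimited sections reliably."

prompt = f"""<instructions>
Summarize the document in two sentences.
</instructions>

<document>
{document_text}
</document>"""

print(prompt)
```

The paired open/close tags give the model an unambiguous signal where untrusted or variable content begins and ends, which plainer delimiters like `###` or triple quotes convey less reliably.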
Developer Eric Holmes argues that MCP is already dying, claiming LLMs excel at using CLI tools without any special protocol. He makes a strong case that CLIs compose better, are easier to debug, and work with existing auth systems.