Qwen3.6 pelican test turned HN into a benchmark argument
Original: Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7
A joke benchmark with real heat
Hacker News gave Simon Willison's Qwen3.6 pelican post 399 points and 83 comments because it hit a familiar nerve in LLM evaluation. The setup was deliberately silly: ask models to draw an SVG of a pelican riding a bicycle. This time, a 20.9GB quantized Qwen3.6-35B-A3B model running locally on a MacBook Pro M5 produced a more satisfying image than Claude Opus 4.7. Willison stressed that the task is not a robust benchmark, but that did not stop the thread from turning into a sharper argument about what demos do and do not show.
The interesting part is not that Qwen won a bird-on-bike contest. It is that the result arrived on the same day as major model releases, with local-model users looking for concrete signs that open and quantized systems are closing gaps. A single SVG prompt is easy to share, easy to inspect, and easy to argue over. That makes it powerful community fuel even when everyone knows it is not a full evaluation suite.
HN pushed back quickly
Community discussion noted that the backup flamingo test was more ambiguous than the headline made it sound. Some commenters thought the Opus image followed physical structure better, while others focused on Qwen's style and charm. Another thread of pushback was more technical: coding-oriented comparisons still put Opus far ahead on harder programming task sets, and the pelican result should not be read as evidence that a 35B local model is generally stronger.
That split is the point. The post showed how visually persuasive outputs can pull attention away from task fit. Image-like SVG generation, instruction following, spatial reasoning, coding, and multi-turn repair are different skills. A model can produce a delightful first shot and still struggle when the user asks for a precise edit. HN users kept circling that gap between toy and tool.
Why it matters
The pelican test works as a community thermometer. It measures excitement about local inference, skepticism toward polished model cards, and frustration with formal benchmarks that do not always match user experience. Qwen3.6-35B-A3B getting this kind of attention also shows how quickly a quantized model can become part of the practical conversation when it runs on enthusiast hardware.
The sober read is simple: Qwen scored a memorable demo win, not a general victory over Opus 4.7. But the reaction matters because developers increasingly judge models through small, repeated, personal tests. Those tests are messy, biased, and sometimes funny. They are also where a lot of trust gets formed.
Related Articles
LocalLLaMA treated Claude identity verification as more than account policy; it became another argument for local models, privacy control, and fewer gates between users and tools.
A Reddit post in r/LocalLLaMA introduces a GGUF release of Qwen3.5-122B-A10B Uncensored (Aggressive) alongside new K_P quants. The author claims 0/465 refusals and zero capability loss, but those results are presented as the author’s own tests rather than independent verification.
Claude said on April 10, 2026 that Claude for Word is now in beta for Team and Enterprise plans. The add-in drafts, edits, and revises Word files from a sidebar while preserving formatting and returning reviewable tracked changes.