Qwen3.6 pelican test turned HN into a benchmark argument
Original: Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7
A joke benchmark with real heat
Hacker News gave Simon Willison's Qwen3.6 pelican post 399 points and 83 comments because it hit a familiar nerve in LLM evaluation. The setup was deliberately silly: ask models to draw an SVG of a pelican riding a bicycle. This time, a 20.9GB quantized Qwen3.6-35B-A3B model running locally on a MacBook Pro M5 produced a more satisfying image than Claude Opus 4.7. Willison stressed that the task is not a robust benchmark, but that did not stop the thread from turning into a sharper argument about what demos do and do not show.
The interesting part is not that Qwen won a bird-on-bike contest. It is that the result arrived on the same day as major model releases, with local-model users looking for concrete signs that open and quantized systems are closing gaps. A single SVG prompt is easy to share, easy to inspect, and easy to argue over. That makes it powerful community fuel even when everyone knows it is not a full evaluation suite.
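Part of what makes an SVG prompt easy to inspect is that the output is plain XML. As a minimal sketch, assuming a model's reply has already been captured as a string, a few lines of Python can check that the drawing is well-formed and count the shapes it uses. The function name `summarize_svg` and the sample drawing are hypothetical and do not come from Willison's post. Nothing here judges whether the picture looks like a pelican.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard SVG namespace, as ElementTree prefixes it onto tag names.
SVG_NS = "{http://www.w3.org/2000/svg}"


def summarize_svg(svg_text: str) -> dict:
    """Report whether an SVG string parses as XML and which elements it contains."""
    try:
        root = ET.fromstring(svg_text)
    except ET.ParseError as exc:
        # Malformed markup: report the parser error instead of raising.
        return {"well_formed": False, "error": str(exc), "elements": {}}
    # Strip the namespace so tags read as 'circle', 'path', and so on.
    counts = Counter(el.tag.replace(SVG_NS, "") for el in root.iter())
    return {"well_formed": True, "error": None, "elements": dict(counts)}


# Hypothetical stand-in for a model reply: two wheels and a frame.
sample = (
    '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">'
    '<circle cx="30" cy="70" r="15"/>'
    '<circle cx="70" cy="70" r="15"/>'
    '<path d="M30 70 L50 40 L70 70"/>'
    "</svg>"
)

if __name__ == "__main__":
    print(summarize_svg(sample))
```

A check like this only confirms structure. Whether the bird is actually on the bike still takes a human look, which is the part of the test people argue about.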
HN pushed back quickly
Community discussion noted that the backup flamingo test was more ambiguous than the headline made it sound. Some commenters thought the Opus image followed physical structure better, while others focused on Qwen's style and charm. Another thread of pushback was more technical: coding-oriented comparisons still put Opus far ahead on harder programming task sets, and the pelican result should not be read as evidence that a 35B local model is generally stronger.
That split is the point. The post showed how visually persuasive outputs can pull attention away from task fit. Image-like SVG generation, instruction following, spatial reasoning, coding, and multi-turn repair are different skills. A model can produce a delightful first shot and still struggle when the user asks for a precise edit. HN users kept circling that gap between toy and tool.
Why it matters
The pelican test works as a community thermometer. It measures excitement about local inference, skepticism toward polished model cards, and frustration with formal benchmarks that do not always match user experience. Qwen3.6-35B-A3B getting this kind of attention also shows how quickly a quantized model can become part of the practical conversation when it runs on enthusiast hardware.
The sober read is simple: Qwen scored a memorable demo win, not a general victory over Opus 4.7. But the reaction matters because developers increasingly judge models through small, repeated, personal tests. Those tests are messy, biased, and sometimes funny. They are also where a lot of trust gets formed.