Qwen3.6 pelican test turned HN into a benchmark argument
Original: Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7
A joke benchmark with real heat
Hacker News gave Simon Willison's Qwen3.6 pelican post 399 points and 83 comments because it hit a familiar nerve in LLM evaluation. The setup was deliberately silly: ask models to draw an SVG of a pelican riding a bicycle. This time, a 20.9GB quantized Qwen3.6-35B-A3B model running locally on a MacBook Pro M5 produced a more satisfying image than Claude Opus 4.7. Willison stressed that the task is not a robust benchmark, but that did not stop the thread from turning into a sharper argument about what demos do and do not show.
The interesting part is not that Qwen won a bird-on-bike contest. It is that the result arrived on the same day as major model releases, with local-model users looking for concrete signs that open and quantized systems are closing gaps. A single SVG prompt is easy to share, easy to inspect, and easy to argue over. That makes it powerful community fuel even when everyone knows it is not a full evaluation suite.
HN pushed back quickly
Community discussion noted that the backup flamingo test was more ambiguous than the headline made it sound. Some commenters thought the Opus image followed physical structure better, while others focused on Qwen's style and charm. Another thread of pushback was more technical: coding-oriented comparisons still put Opus far ahead on harder programming task sets, and the pelican result should not be read as evidence that a 35B local model is generally stronger.
That split is the point. The post showed how visually persuasive outputs can pull attention away from task fit. Image-like SVG generation, instruction following, spatial reasoning, coding, and multi-turn repair are different skills. A model can produce a delightful first shot and still struggle when the user asks for a precise edit. HN users kept circling that gap between toy and tool.
Why it matters
The pelican test works as a community thermometer. It measures excitement about local inference, skepticism toward polished model cards, and frustration with formal benchmarks that do not always match user experience. Qwen3.6-35B-A3B getting this kind of attention also shows how quickly a quantized model can become part of the practical conversation when it runs on enthusiast hardware.
The sober read is simple: Qwen scored a memorable demo win, not a general victory over Opus 4.7. But the reaction matters because developers increasingly judge models through small, repeated, personal tests. Those tests are messy, biased, and sometimes funny. They are also where a lot of trust gets formed.
Related Articles
LocalLLaMA treated Claude identity verification as more than account policy; it became another argument for local models, privacy control, and fewer gates between users and tools.
A Reddit post in r/LocalLLaMA introduces a GGUF release of Qwen3.5-122B-A10B Uncensored (Aggressive) alongside new K_P quants. The author claims 0/465 refusals and zero capability loss, but those results are presented as the author’s own tests rather than independent verification.
Claude said on April 10, 2026 that Claude for Word is now in beta for Team and Enterprise plans. The add-in drafts, edits, and revises Word files from a sidebar while preserving formatting and returning reviewable tracked changes.