simonw 4 hours ago

I've been running this on my laptop with the Unsloth 20.9GB GGUF in LM Studio: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/blob/mai...

It drew a better pelican riding a bicycle than Opus 4.7 did! https://simonwillison.net/2026/Apr/16/qwen-beats-opus/

jubilanti 3 hours ago | parent | next [-]

I wonder when pelican riding a bicycle will be useless as an evaluation task. The point was that it was something weird nobody had ever really thought about before, not in the benchmarks or even something a team would run internally. But now I'd bet internally this is one of the new Shirley Cards.

abustamam 3 hours ago | parent | next [-]

Simon has an article on this

https://simonwillison.net/2025/Nov/13/training-for-pelicans-...

amelius 2 hours ago | parent | prev | next [-]

Yeah try it with something else, or e.g. add a tiger to the back seat.

rafaelmn 3 hours ago | parent | prev | next [-]

I mean look at the result where he asked about a unicycle - the model couldn't even keep the spokes inside the wheels - would be rudimentary if it "learned" what it means to draw a bicycle wheel and could transfer that to unicycle.

duzer65657 2 hours ago | parent [-]

It's the frame that's surprisingly - and consistently - wrong. You'd think two triangles would be pretty easy to repro; once you get that, the rest is easy. It's not like he's asking "draw a pelican on a four-bar linkage suspension mountain bike..."

Reddit_MLP2 2 hours ago | parent [-]

This is older, but even humans don't have a great concept of how a bicycle works... https://twistedsifter.com/2016/04/artist-asks-people-to-draw...

yndoendo an hour ago | parent [-]

Wouldn't this be more about being capable of mentally remembering how a bicycle looks versus how it works?

This reminds me of Pictionary. [0] Some people are good and some are really bad.

I am really bad at remembering how items look in my head and fail at drawing in Pictionary. My drawing skills are tied to being able to copy what I see.

[0] https://en.wikipedia.org/wiki/Pictionary

MagicMoonlight 2 hours ago | parent | prev [-]

They’ll hardcode it in 4.8, just like they do when they need to “fix” other issues

kelnos an hour ago | parent | prev | next [-]

I'm not sure how you can give the flamingo win to Qwen:

* It's sitting on the tire, not the seat.

* Is that weird white and black thing supposed to be a beak? If so, it's sticking out of the side of its face rather than the center.

* The wheel spokes are bizarre.

* One of the flamingo's legs doesn't extend to the pedal.

* If you look closely at the sunglasses, they're semi-transparent, and the flamingo only has one eye! Or the other eye is just on a different part of its face, which means the sunglasses aren't positioned correctly. Or the other eye isn't.

* (subjective) The sunglasses and bowtie are cute, but you didn't ask for them, so I'd actually dock points for that.

* (subjective) I guess flamingos have multiple tail feathers, but it looks kinda odd as drawn.

In contrast, Opus's flamingo isn't as detailed or fancy, but more or less all of it looks correct.

culi 2 hours ago | parent | prev | next [-]

the more I look at these images the more convinced I become that world models are the major missing piece and that these really are ultimately just stochastic sentence machines. Maybe Chomsky was right

bertili 4 hours ago | parent | prev | next [-]

It's fascinating that a $999 Mac Mini (M4 32GB), drawing roughly the same wattage as a human brain, gets us this far.

bwv848 31 minutes ago | parent | prev | next [-]

I've been trying the Q4_K_M version, and sometimes it gets stuck in a loop. Gemma 4 doesn’t have this issue.

monksy 32 minutes ago | parent | prev | next [-]

Hey, I really enjoy your blog. On some things I end up finding a blog post of yours that's a year+ old, and at other times you and I are investigating similar things. I just pulled Qwen3.6-35B-A3B (can't believe that's an A3B coming from 35B).

I'm impressed by the reach of your blog, and I'm hoping to get into blogging similar things. I currently have a lot on my backlog to blog about.

In short, keep up the good work with an interesting blog!

rdslw 2 hours ago | parent | prev | next [-]

Interesting, I just tried this very model (Unsloth, Q8, so in theory more capable than Simon's Q4) and got these three "pelicans". Definitely NOT Opus quality. Setup: LM Studio, via Simon's llm, but not Apple/MLX. Of course the same short prompt.

Simon, any ideas?

https://ibb.co/gFvwzf7M

https://ibb.co/dYHRC3y

https://ibb.co/FLc6kggm (here I tried temperature 0.7 instead of pure defaults)

cyclopeanutopia 4 hours ago | parent | prev | next [-]

But that you also gave a win to Qwen on flamingo is pretty outrageous! :)

The right one looks much better, plus adding sunglasses without prompting is not that great. Hopefully it won't add some backdoor to the generated code without asking. ;)

simonw 3 hours ago | parent [-]

I love how the Chinese models often have an unprompted predilection to add flair.

GLM-5.1 added a sparkling earring to a north Virginia opossum the other day and I was delighted: https://simonwillison.net/2026/Apr/7/glm-51/

monksy 34 minutes ago | parent [-]

You're running 5.1 locally or hosted?

prirun 3 hours ago | parent | prev | next [-]

The flamingo on Qwen's unicycle is sitting on the tire, not the seat. That wins because of sunglasses?

akavel an hour ago | parent | next [-]

Well, maybe the flamingo is a really good unicyclist...

https://youtu.be/Rrpgd5oIKwI

evilduck 2 hours ago | parent | prev [-]

Can a benchmark meant as a joke not use a fun interpretation of results? The Qwen result has far better style points. Fun sunglasses, a shadow, a better ground, a better sky, clouds, flowers, etc.

If we want to get nitty gritty about the details of a joke, a flamingo probably couldn't physically sit on a unicycle's seat and also reach the pedals anyways.

MeteorMarc 2 hours ago | parent | prev | next [-]

Interesting, Qwen has the pelican riding in the left lane. Coincidence, or does it have something to do with the workers providing the RL data?

rubiquity 2 hours ago | parent [-]

Could be on a bike path where bikes are on the left and pedestrians to the right.

jamwise 4 hours ago | parent | prev | next [-]

I've had some really gnarly SVGs from Claude. Here's what I got after many iterations trying to draw a hand: https://imgur.com/a/X4Jqius

giantg2 3 hours ago | parent [-]

Probably because all the training material of humans drawing hands is garbage haha.

danielhanchen 4 hours ago | parent | prev | next [-]

Oh that is pretty good! And the SVG one!

slekker 4 hours ago | parent | prev [-]

How does it do with the "car wash" benchmark? :D