simonw 5 hours ago

I generated pelicans riding bicycles at both thinking level low and thinking level high:

https://gist.github.com/simonw/68560eddb0b268a8417f80ceb7304...

The high one is notably better - the bicycle frame is the correct shape, unlike thinking level low.

For comparison, here's Opus 4.7: https://gist.github.com/simonw/afcb19addf3f38eb1996e1ebe749c...

▲

GistNoesis 4 hours ago | parent | next [-]

> the bicycle frame is the correct shape

No, the handlebar is wrong. The handlebar is rotating the frame instead of the front wheel. The handlebar should be mounted on the same line as the front wheel.

Hopefully 4.9 will read my comments :)

	▲	loeg 4 hours ago | parent [-]
		Could be an extremely high angle stem that just happens to match the downtube angle.

▲

eminence32 2 hours ago | parent | prev | next [-]

I bet someone shares this link every time you post about bicycles, but since I didn't see anyone share it yet in this thread, I'll take the opportunity to do so:

https://www.gianlucagimini.it/portfolio-item/velocipedia/

Turns out even humans can be pretty bad at drawing bicycles :)

	▲	walthamstow an hour ago | parent | next [-]
		On a new model release, you can guarantee two things are in the replies to Simon. One is your link, the other is "surely the models are being trained on this now"
	▲	skydhash an hour ago | parent | prev [-]
		But if you need to draw a bicycle, you wouldn't pick a random person in the street. You would hire an artist, and you'd be guaranteed at least a believable bicycle, if not a perfect rendering. The lack of guarantees is why using an LLM is akin to gambling. Every new context is essentially picking someone out of the crowd.

▲

simonw 3 hours ago | parent | prev | next [-]

Here's pelicans in all of the thinking levels - low, medium, high, xhigh, max

https://tools.simonwillison.net/markdown-svg-renderer#url=ht...

▲

ionwake an hour ago | parent | next [-]

I like the way the max pelican has a stern look on his face

▲

stratos123 3 hours ago | parent | prev [-]

Is the output on the max level meant to be missing?

	▲	simonw 2 hours ago | parent [-]
		I just fixed that (force refresh). It hit my default 8,000 output token limit; it worked once I bumped that up. For max I used 25 input tokens and 17,167 output tokens, which cost me 43 cents! https://www.llm-prices.com/#it=25&ot=17167&ic=5&oc=25&sel=cl...
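The 43 cent figure can be checked by hand. A minimal sketch, assuming the `ic=5` and `oc=25` parameters in the link above are prices in US dollars per million input and output tokens (the comment does not spell that out); `cost_usd` is a hypothetical helper name:

```python
from decimal import Decimal

# Assumption: prices are USD per million tokens, read off the
# ic=5 / oc=25 query parameters in the llm-prices.com link.
INPUT_PRICE_PER_MILLION = Decimal("5")
OUTPUT_PRICE_PER_MILLION = Decimal("25")


def cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Return the cost in USD for one request at the assumed prices."""
    total = (input_tokens * INPUT_PRICE_PER_MILLION
             + output_tokens * OUTPUT_PRICE_PER_MILLION)
    return total / Decimal(1_000_000)


# Token counts quoted in the comment: 25 input, 17,167 output.
cost = cost_usd(25, 17_167)
print(f"${cost:.4f}")  # $0.4293, which rounds to 43 cents
```

Almost all of the cost comes from output tokens; the 25 input tokens contribute about a hundredth of a cent.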

▲

jonas21 5 hours ago | parent | prev | next [-]

Glad to see that the "high thinking" level adds a helmet. Always a smart choice.

▲

spmartin823 5 hours ago | parent | prev | next [-]

You've peed in the pool Simon, this has to be a part of the internal evals by now! You got to try something new - maybe a panda in a canoe?

	▲	phainopepla2 4 hours ago | parent | next [-]
		If these were in the internal evals then the output would be much better. The 4.8 pelicans are pretty meh
	▲	HDThoreaun 4 hours ago | parent | prev [-]
		Click the link

▲

impalallama an hour ago | parent | prev | next [-]

I actually like the 4.7 one the most, interestingly enough. Not that you can "objectively" weigh artistic output like this.

▲

ceroxylon 5 hours ago | parent | prev | next [-]

I really like that thinking level high gave the pelican a helmet.

▲

Xunjin 5 hours ago | parent | prev | next [-]

Hey simonw, I love your test. Do you think using thinking level "max" makes sense for this test? I would love to see the results for it.

	▲	simonw 3 hours ago | parent [-]
		I don't think the API supports "max" as an option, that might just be a Claude Code harness thing. UPDATE: My mistake, the API does support max. I added a max one at the bottom of this page (cost 43 cents): https://tools.simonwillison.net/markdown-svg-renderer#url=ht...

▲

toastmaster11 4 hours ago | parent | prev | next [-]

I find the most miraculous thing about 4.7 to be that the pelican is facing left. I wonder why right-facing everything is so ubiquitous in these images.

▲

i000 4 hours ago | parent | next [-]

This happened to me in elementary school. We were doing fingerpaintings using plasticine. After all the bikes were hung on the wall, mine was racing the other way... Somehow it really stuck with me.

	▲	sunnybeetroot an hour ago | parent [-]
		What do you think it means?

▲

gboss 4 hours ago | parent | prev | next [-]

It's facing left but looking right...

	▲	toastmaster11 3 hours ago | parent [-]
		Profound political commentary?

▲

tancop 2 hours ago | parent | prev [-]

[dead]

▲

yanis_t 5 hours ago | parent | prev | next [-]

Simon, does your pelican test really capture differences among models, or should you at least try it like 10 times or something to average out the random effects?

▲

simonw 5 hours ago | parent [-]

I've been meaning to do a "run 3 times and pick the best" version for quite a while, I should really pull the trigger on that one. Currently it's one-shot only.

▲

xiphias2 5 hours ago | parent [-]

Best-of-3 would be cheating and would ruin the test; middle-of-3 makes more sense.

▲

nik736 4 hours ago | parent [-]

Why would you need the 3rd run if you pick the "one in the middle"?

	▲	jmaw 3 hours ago | parent [-]
		Middle as in not the best and not the worst, as opposed to the second one generated in sequence. But "not the best, not the worst" is somewhat subjective, so I'm not sure how well that would work.
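To make the best-of-3 versus middle-of-3 distinction concrete, here is a minimal sketch. It assumes each run has already been given a numeric quality score by some judge, which is exactly the subjective part noted above; the function names, file names, and scores are hypothetical and invented for illustration only.

```python
def middle_of_three(scored_runs):
    """Return the run whose score is neither the best nor the worst."""
    if len(scored_runs) != 3:
        raise ValueError("expected exactly three (score, run) pairs")
    # Sort by score only, then take the element in the middle.
    ranked = sorted(scored_runs, key=lambda pair: pair[0])
    return ranked[1][1]


def best_of_three(scored_runs):
    """Return the highest-scoring run (the 'cheating' variant)."""
    if len(scored_runs) != 3:
        raise ValueError("expected exactly three (score, run) pairs")
    return max(scored_runs, key=lambda pair: pair[0])[1]


# Hypothetical scores, listed in generation order.
runs = [(6.5, "run-1.svg"), (8.0, "run-2.svg"), (4.0, "run-3.svg")]
print(middle_of_three(runs))  # run-1.svg
print(best_of_three(runs))    # run-2.svg
```

In this made-up example the middle run by score is the first one generated, not the second, which is the distinction the reply above is drawing.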

▲

silisili 4 hours ago | parent | prev | next [-]

The vast majority (if not all) of these make it impossible to turn, among other fun things. Only out of curiosity, have you tried prompting further with how a bike must operate to see if it does the right thing?

▲

timsuchanek 4 hours ago | parent | prev | next [-]

Thanks for always providing this so promptly. I'm wondering what the next, harder challenge could be. Maybe some animated SVG?

▲

fragmede 2 hours ago | parent | prev | next [-]

For comparison, what's GPT-5.5 producing today?

	▲	simonw 35 minutes ago | parent [-]
		The reasoning xhigh one is pretty solid: https://simonwillison.net/2026/Apr/23/gpt-5-5/#and-some-peli...

▲

1attice 5 hours ago | parent | prev | next [-]

That little red hat on hard mode is sending me. 4.8 has whimsy

▲

nickvec 5 hours ago | parent | prev | next [-]

Is the "opossum riding an e-scooter" benchmark in the works for Opus 4.8? ;)

	▲	simonw 5 hours ago | parent | next [-]
		Good call, it's cute: https://gist.github.com/simonw/68560eddb0b268a8417f80ceb7304... - but nothing like GLM-5.1: https://static.simonwillison.net/static/2026/glm-possum-esco...
	▲	3738384848 5 hours ago | parent | prev [-]
		[flagged]

▲

whalesalad 4 hours ago | parent | prev | next [-]

Eventually the frontier model folks are going to pick up on your pelican-on-a-bike test and bake in flawless results for that particular request.

▲

highwaylights 4 hours ago | parent | prev | next [-]

Am I allowed to say that pelican's little helmet is adorable? I can't provide a strong computational proof, or even a shred of anecdata...

...but that pelican's little helmet is adorable.

▲

onlyrealcuzzo 5 hours ago | parent | prev [-]

4.7 reigns supreme IMO.