| ▲ | simonw 2 days ago |
| I ran these in LM Studio and got unrecognizable pelicans out of the 2B and 4B models and an outstanding pelican out of the 26b-a4b model - I think the best I've seen from a model that runs on my laptop. https://simonwillison.net/2026/Apr/2/gemma-4/ The gemma-4-31b model is completely broken for me - it just spits out "---\n" no matter what prompt I feed it. I got a pelican out of it via the AI Studio API hosted model instead. |
|
| ▲ | entropicdrifter 2 days ago | parent | next [-] |
| Your posting of the pelican benchmark is honestly the biggest reason I check the HackerNews comments on big new model announcements |
| |
|
| ▲ | yags a day ago | parent | prev | next [-] |
| We (LM Studio) found the bug with the 31B model and a fix will be going out hopefully tonight |
| |
| ▲ | c0wb0yc0d3r a day ago | parent [-] |
| I am not deep in this world. What does it mean when you (LM Studio) fixed a bug in a model Google released? |
| ▲ | airspresso a day ago | parent | next [-] |
| There is a surprising amount of code needed in each of the inference frameworks (LM Studio, llama.cpp, etc.) to support each new model release: for example, to format the input correctly using a chat template, to parse the output properly given the model-specific tokens the model provider standardized on, and more. This particular instance was a fix to the output parsing [1] in LM Studio, described like this: "Adds value type parsers that use <|\"|> as string delimiters instead of JSON's double quotes, and disables json-to-schema conversion for these types." [1]: https://github.com/ggml-org/llama.cpp/pull/21326/commits/a50... edit: formatting |
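The per-model glue described above can be sketched roughly like this. This is an illustrative Python sketch, not LM Studio's or llama.cpp's actual code: the `<start_of_turn>`/`<end_of_turn>` turn markers follow Gemma's published chat template, and `parse_custom_string` is a hypothetical helper built around the `<|"|>` delimiter quoted in the comment.

```python
# Illustrative sketch of per-model inference glue; not actual LM Studio
# or llama.cpp code. Turn markers follow Gemma's published chat template.

def apply_chat_template(messages):
    """Wrap each message in model-specific control tokens."""
    parts = []
    for m in messages:
        parts.append(f"<start_of_turn>{m['role']}\n{m['content']}<end_of_turn>\n")
    parts.append("<start_of_turn>model\n")  # cue the model to answer
    return "".join(parts)

def parse_custom_string(raw):
    """Extract a string value delimited by the model's <|"|> token
    instead of JSON double quotes (the kind of fix described above)."""
    tok = '<|"|>'
    start = raw.find(tok)
    end = raw.rfind(tok)
    if start == -1 or end <= start:
        raise ValueError("missing string delimiters")
    return raw[start + len(tok):end]
```

If a frontend parses such output with a plain JSON parser instead, the model's replies look malformed even though the weights are fine, which is why a "model bug" can be fixed entirely on the client side.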
| ▲ | why_only_15 a day ago | parent | prev | next [-] |
| I am in this world, but am not familiar with this specifically. My guess is that they found a bug in their implementation of the model using the weights Google released. These bugs are often difficult to track down because the only indication is that the model performs worse with your implementation than with someone else's. |
| ▲ | khimaros a day ago | parent | prev [-] |
| llama.cpp also fixed some chat template issues this afternoon. Could be related. |
|
|
|
| ▲ | wordpad 2 days ago | parent | prev | next [-] |
| Do you think it's just part of their training set now? |
| |
| ▲ | alexeiz 2 days ago | parent | next [-] |
| It's time to do "frog on a skateboard" now. |
| ▲ | lysace 2 days ago | parent | prev | next [-] |
| Seems very likely, even if Google has behaved ethically. Simon and YC/HN have published/boosted these gradual improvements and evaluations for quite some time now. There is a https://simonwillison.net/robots.txt but it allows pretty much everything, AI-wise. |
| ▲ | simonw 2 days ago | parent | prev [-] |
| If it's part of their training set, why do the 2B and 4B models produce such terrible SVGs? |
| ▲ | vessenes 2 days ago | parent | next [-] |
| We were promised full SVG zoos, Simon. I want to see SVG pangolins, please. |
| ▲ | wolttam 2 days ago | parent | prev | next [-] |
| Because it is in their training set, but it's unrealistic to expect a 2B or 4B model to be able to perfectly reproduce everything it's seen before. The training no doubt contributed to their ability to (very) loosely approximate an SVG of a pelican on a bicycle, though. Frankly, I'm impressed. |
| ▲ | nickpsecurity a day ago | parent | prev | next [-] |
| Larger models better understand and reproduce what's in their training set. For example, I used to get verbatim quotes and answers from copyrighted works when I used GPT-3.5. That's what clued me in to the copyright problem. The smallest models, by contrast, often produced nonsense about the same topics, as small models tend to do. You might need to do a new test each time to avoid your old ones being scraped into the training sets. Maybe a new one for each model produced after your last one. Totally unrelated to the last one, too. |
| ▲ | retinaros 2 days ago | parent | prev [-] |
| Because generating a nice-looking SVG requires handling code, shapes, long context, and reasoning, and at 2B you will most likely break the syntax of the file 9 times out of 10 even if you train for that. Or you will need to go for simpler pelicans. It might not be worth it to fine-tune a 2B model, but on their top-tier open model it is definitely worth it. Even without doing it directly, just crawling GitHub would make it train on your pelicans. |
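For a concrete sense of how little it takes to "break the syntax of the file": an SVG must be well-formed XML before any question of drawing quality arises. A minimal sketch, with shapes and coordinates made up for illustration (not taken from any model's output):

```python
import xml.etree.ElementTree as ET

# A minimal hand-written SVG; the circle and ellipse are illustrative
# stand-ins, not any model's actual pelican.
svg = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">'
    '<circle cx="40" cy="50" r="20" fill="white" stroke="black"/>'
    '<ellipse cx="68" cy="50" rx="12" ry="4" fill="orange"/>'
    '</svg>'
)

# Parsing succeeds only if every tag is closed and every attribute quoted;
# one slip rejects the whole file, which is the failure mode a small model
# hits most often.
root = ET.fromstring(svg)
```

Dropping the closing `</svg>` tag, or a single attribute quote, makes `ET.fromstring` raise a `ParseError` for the entire document.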
|
|
|
| ▲ | culi 2 days ago | parent | prev | next [-] |
| Do you have a single gallery page where we can see all the pelicans together? I'm thinking something similar to https://clocks.brianmoore.com/ but static. |
| |
|
| ▲ | nateb2022 2 days ago | parent | prev | next [-] |
| I'd recommend using the instruction-tuned variants; the pelicans would probably look a lot better. |
|
| ▲ | Havoc a day ago | parent | prev | next [-] |
| Same experience on the 31B - something’s wrong. The MoE works as expected though. |
|
| ▲ | hypercube33 2 days ago | parent | prev | next [-] |
| Mind if I ask what your laptop is and how it's configured, hardware-wise? |
| |
| ▲ | simonw a day ago | parent [-] |
| 128GB M5, but the largest of these models still only uses about 20GB of RAM, so I'd expect them to work OK on 32GB and up. |
|
|
| ▲ | Forgeties79 2 days ago | parent | prev | next [-] |
| Love your work, thank you! |
|
| ▲ | HarHarVeryFunny 14 hours ago | parent | prev [-] |
| It seems unreasonable to expect an LLM to have an accurate "mental model" of a bicycle, since most humans don't either, and it's our written descriptions the LLM is learning from. A multi-modal model trained on captioned pictures isn't much better off, since what would induce it to memorize the details that we also abstract away ("a frame connecting it all together")? Even possessing AGI, most humans still can't reason their way to a functional bicycle. Comparing bicycles between LLMs doesn't really tell us much, since how do you differentiate an AI with a good model of a bicycle that does a poor job of drawing one with SVG from one that has a much worse model but is doing a great job of rendering it? I suppose you could say the same for the pelican, although it does seem more reasonable to guess that most models could accurately describe the body plan of an animal even if they can't do a good job of drawing one with SVG. |
| |
| ▲ | HarHarVeryFunny 13 hours ago | parent [-] |
| For anyone who downvoted this due to thinking that humans, hence LLMs, DO have a good model of a bicycle, I challenge you to draw one. No cheating and looking at pictures. Pen and paper. Do the easy bit first and draw the wheels, seat, handlebars, pedals, and chain. Add a stick figure riding it if that helps. Now draw the frame. Now google a photo of a bicycle. |
|