Jimmc414 16 hours ago

We are getting to the point that it's not unreasonable to think "Generate an SVG of a pelican riding a bicycle" could be included in some training data. It would be a great way to ensure an initial thumbs up from a prominent reviewer. It's a good benchmark, but it seems like it would be a good idea to also run a random or unannounced similar test to catch any benchmaxxing.
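A toy sketch of what an unannounced variant could look like (the animal/vehicle pools and the helper are made up for illustration; swap in your own):

    import random

    # Arbitrary pools; the point is that a lab can't pre-train on a combo
    # that was never announced anywhere.
    ANIMALS = ["pelican", "bumblebee", "otter", "heron", "walrus"]
    VEHICLES = ["bicycle", "kayak", "unicycle", "skateboard", "canoe"]

    def surprise_prompt() -> str:
        """Pick a fresh animal/vehicle combo for an unannounced test."""
        return f"Generate an SVG of a {random.choice(ANIMALS)} riding a {random.choice(VEHICLES)}"

    print(surprise_prompt())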

simonw 16 hours ago | parent | next [-]

I wrote about that possibility here: https://simonwillison.net/2025/Nov/13/training-for-pelicans-...

th0ma5 16 hours ago | parent [-]

[flagged]

vanschelven 15 hours ago | parent | next [-]

Whatever you think Jimmc414's _concerns_ are (they merely state a possibility), Simon enumerates a number of concerns in the linked article and then addresses them. So I'm not sure why you think that's the case.

vnvnff an hour ago | parent [-]

It's a pattern: https://news.ycombinator.com/item?id=44725190

dugidugout 16 hours ago | parent | prev [-]

Condescending and disrespectful to whom? Everybody, wholesale? That doesn't seem reasonable. Please elaborate.

bravetraveler 14 hours ago | parent | next [-]

Not sure if I'd use the same descriptions so pointedly, but I can see what they mean.

It's perfectly fine to link for convenience, but it does feel a little disrespectful/SEO-y not to 'continue the conversation'. A summary at the very least, or how exactly it pertains. Sell us.

In a sense, link-dropping [alone] is saying: "go read this and establish my rhetorical/social position, I'm done here"

Imagine meeting an author/producer/whatever you liked. You'd want to talk about their work, how they created it, the impact it had, and so on. Now imagine they did exactly that... or just waved a hand vaguely at their catalog.

simonw 13 hours ago | parent | next [-]

I've genuinely been answering the question "what if the labs are training on your pelican benchmark" 3-4 times a week for several months at this point. I wrote that piece precisely so I didn't have to copy and paste the same arguments into dozens of different conversations.

bravetraveler 12 hours ago | parent [-]

Oh, no. Does this policing job pay well? /s Seriously: less is more, trust the process, any number of platitudes work here. Who are you defending against? Readers, right? You wrote your thing, defended it with more of the thing. It'll permeate. Or it won't. Does it matter?

You could be done, nothing is making you defend this (sorry) asinine benchmark across the internet. Not trying to (m|y)uck your yum, or whatever.

Remember, I did say linking for convenience is fine. We're belaboring the worst reading in comments. Inconsequential, unnecessary heartburn. Link the blog posts together and call it good enough.

Barbing 10 hours ago | parent | next [-]

Surprised to see snark re: what I thought was a standard practice (linking FAQs, essentially).

I hadn't seen the post. It was relevant, and I just read it. The Lucky Ten Thousand can read it next time, even though I won't need to.

Simon has never seemed annoying, so unlike other comments that might worry me (even an "Opus made this", which is cool, though I'm concerned someone astroturfed), that comment would never have raised my eyebrows. He's also dedicated, and I love that he devotes his time to a new field like this, where it's great to have attempts at benchmarks, folks cutting through chaff, etc.

bravetraveler 10 hours ago | parent [-]

The specific 'question' is a promise to catch training on more publicly available data, and to expect more blog links copied 'into dozens of different conversations'... Jump for joy. Stop the presses. Oops, snarky again :)

Yes, the LLM people will train on this. They will train on absolutely everything [as they have]. The comments/links prioritize engagement over awareness. My point, I suppose, if I had one, is that this blogosphere can add to the chaff. I'm glad to see Simon here often/interested.

Aside: all this concern about over-fitting just reinforces my belief these things won't take the profession any time soon. Maybe the job.

simonw 12 hours ago | parent | prev [-]

You don't have to convince me the pelican riding a bicycle SVG benchmark is asinine. That's kind of the point!

bravetraveler 12 hours ago | parent [-]

Having read the followup post being linked, I'm even more confused. Commenting or, really, anything seems even less worthwhile. That's my point.

You brought the benchmark and anticipated their... cheesing, with a promise to catch them at it. Cool, an announcement of an announcement. Just do that [or don't]. In a hippy sense, this is no longer yours. It's out there, like everything else anyone wrote.

Let the LLM people train on your test. Catch them as claimed. Publish again. Huzzah, industry without overtime in the comments. It makes sense/cents to position yourself this way :)

Obviously they're going to train on anything they can get. They did. Mouse, meet cat. Some of us in the house would love it if y'all would keep it down! This is 90s rap beef all over again

charcircuit 11 hours ago | parent | prev | next [-]

If you want a summary, you can have your AI assistant summarize the link.

bravetraveler 11 hours ago | parent [-]

Woooooosh, please see if an LLM can help you. I'm not getting paid for this

tomrod 14 hours ago | parent | prev [-]

Hell, I would consider myself graced that simonw, yes, THAT simonw, the LLM whisperer, took time out of his busy schedule to send me to a discussion I might have expressed interest in.

bravetraveler 13 hours ago | parent [-]

> send me to a discussion I might have expressed interest in

No, no, remember? Points to the blog you were already reading! Working diligently to build a brand: podcast, paid newsletter, the works.

tomrod 7 hours ago | parent [-]

I wasn't speaking to this interaction, and my point is genuine: simonw has done fantastic work in the LLM space.

th0ma5 16 hours ago | parent | prev [-]

No, when did I say that?

dugidugout 15 hours ago | parent | next [-]

It isn't clear what you said.

You asserted a pattern of conduct by the user simonw:

> I think constantly replying to everybody with some link which doesn't address their concerns

Then claimed that conduct was:

> condescending and disrespectful.

I am asking you to elaborate: to whom is simonw being condescending and disrespectful? I don't see how that follows.

15 hours ago | parent | prev [-]
[deleted]
Workaccount2 11 hours ago | parent | prev | next [-]

It would be easy to out models that train on the bike pelican, because they would probably suck at the kayaking bumblebee.

So far though, the models good at bike pelican are also good at kayak bumblebee, or whatever other strange combo you can come up with.

So if they are trying to benchmaxx by making SVG generation stronger, that's not really a miss, is it?
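The check is cheap in principle, too. A toy sketch with made-up scores and an arbitrary gap threshold (nothing here is a real measurement):

    # Hypothetical judge scores (0-10) for the announced prompt vs. a surprise one.
    scores = {
        "model-a": {"bike pelican": 9, "kayak bumblebee": 8},  # strong across the board
        "model-b": {"bike pelican": 9, "kayak bumblebee": 3},  # suspicious gap
    }

    for model, s in scores.items():
        gap = s["bike pelican"] - s["kayak bumblebee"]
        verdict = "possible benchmaxxing" if gap >= 4 else "skill looks general"
        print(f"{model}: gap={gap} -> {verdict}")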

majormajor 10 hours ago | parent [-]

That depends on whether "SVG generation" is a particularly useful LLM/coding-model skill outside of benchmarking. I.e., if they make it stronger with some params that might otherwise have been used for "Rust type system awareness" or some such, it might be a net loss outside the benchmarks.

0cf8612b2e1e 11 hours ago | parent | prev | next [-]

I assume all of the models also have variations on, “how many ‘r’s in strawberry”.

thatwasunusual 11 hours ago | parent | prev | next [-]

> We are getting to the point that its not unreasonable to think that "Generate an SVG of a pelican riding a bicycle" could be included in some training data.

I may be stupid, but _why_ is this prompt used as a benchmark? I mean, pelicans _can't_ ride a bicycle, so why is it important for "AI" to show that they can (at least visually)?

The "wine glass problem"[0] - and probably others - seems to me to be a lot more relevant...?

[0] https://medium.com/@joe.richardson.iii/the-curious-case-of-t...

simonw 11 hours ago | parent | next [-]

The fact that pelicans can't ride bicycles is pretty much the point of the benchmark! Asking an LLM to draw something that's physically impossible means it can't just "get it right" - seeing how different models (especially at different sizes) handle the problem is surprisingly interesting.

Honestly though, the benchmark was originally meant to be a stupid joke.

I only started taking it slightly more seriously about six months ago, when I noticed that the quality of the pelican drawings really did correspond quite closely to how generally good the underlying models were.

If a model draws a really good picture of a pelican riding a bicycle there's a solid chance it will be great at all sorts of other things. I wish I could explain why that was!

If you start here and scroll through the progression of pelican-on-bicycle images, it's honestly spooky how well they match the vibes of the models they represent: https://simonwillison.net/2025/Jun/6/six-months-in-llms/#ai-...

So ever since then I've continued to get models to draw pelicans. I certainly wouldn't suggest anyone make serious decisions about model usage based on my stupid benchmark, but it's a fun first-day initial-impression thing, and it appears to be a useful signal for which models are worth diving into in more detail.
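If you want to try it at home, here's a minimal sketch using the OpenAI Python client as one possible runner; the model name is a placeholder, and an OPENAI_API_KEY is assumed to be set:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder: point this at whichever model you're testing
        messages=[{"role": "user",
                   "content": "Generate an SVG of a pelican riding a bicycle"}],
    )

    # Save the raw output; models often wrap the SVG in prose or code fences,
    # so expect to trim it by hand before opening the file in a browser.
    with open("pelican.svg", "w") as f:
        f.write(response.choices[0].message.content)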

thatwasunusual 9 hours ago | parent [-]

> If a model draws a really good picture of a pelican riding a bicycle there's a solid chance it will be great at all sorts of other things.

Why?

If I hired a worker who was really good at drawing pelicans riding a bike, it wouldn't tell me anything about their other qualities?!

suspended_state 4 hours ago | parent | next [-]

Your comment is funny, but please note: it's not drawing a pelican riding a bike, it's describing a pelican riding a bike in SVG. Your candidate would at least display some knowledge of the SVG spec.
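For a sense of the task, here's a crude, hand-rolled example of the kind of markup involved; every coordinate has to be chosen "blind", with no visual feedback:

    # A much-simplified example of what a model must emit sight unseen.
    svg = """<svg xmlns="http://www.w3.org/2000/svg" width="220" height="140">
      <circle cx="60" cy="105" r="25" fill="none" stroke="black"/>   <!-- rear wheel -->
      <circle cx="160" cy="105" r="25" fill="none" stroke="black"/>  <!-- front wheel -->
      <path d="M60 105 L105 70 L160 105" fill="none" stroke="black"/><!-- frame -->
      <ellipse cx="105" cy="50" rx="22" ry="14" fill="white" stroke="black"/><!-- body -->
      <polygon points="125,46 155,52 125,58" fill="orange"/>         <!-- beak -->
    </svg>"""

    with open("crude-pelican.svg", "w") as f:
        f.write(svg)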

falcor84 an hour ago | parent | prev | next [-]

For better or worse, a lot of job interviews actually do use contrived questions like this, such as the infamous "how many golf balls can you fit in a 747?"

vikramkr 6 hours ago | parent | prev | next [-]

The difference is that the worker you hire would be a human being, not a large matrix multiplication whose parameters were optimized by a gradient descent process and which embeds concepts in a high-dimensional vector space, resulting in all sorts of weird things like subliminal learning (https://alignment.anthropic.com/2025/subliminal-learning/).

It's not a human intelligence - it's a totally different thing, so why would the same test that you use to evaluate human abilities apply here?

Also, more directly: the "all sorts of other things" we want LLMs to be good at often involve writing code, spatial reasoning, and world understanding, all of which creating an SVG of a pelican riding a bicycle very directly evaluates, so it's not even that surprising?

simonw 8 hours ago | parent | prev | next [-]

I wish I knew why. I didn't think it would be a useful indicator of model skills at all when I started doing it, but over time the pattern has held that performance on pelican riding a bicycle is a good indicator of performance on other tasks.

jtbaker 8 hours ago | parent | prev [-]

A posteriori knowledge. The pelican isn't the point; it's just amusing. The point is that Simon has seen a correlation between this skill and a model's general capabilities.

wisty 11 hours ago | parent | prev [-]

It's not necessarily the best benchmark, but it is a popular one, probably because it's funny.

Yes, it's like the wine glass thing.

Also, it's kind of got depth. Does it draw the pelican and the bicycle? Can the pelican reach the pedals? How?

I can imagine a really good AI finding a funny or creative or realistic way for the pelican to reach the pedals.

A slightly worse AI will do an OK job, maybe just making the bike small or the legs too long.

An OK AI will draw a pelican on top of a bicycle and just call it a day.

It's not as binary as the wine glass example.

thatwasunusual 9 hours ago | parent [-]

> It's not necessarily the best benchmark, but it is a popular one, probably because it's funny.

> Yes, it's like the wine glass thing.

No, it's not!

That's part of my point: the wine glass scenario is a _realistic_ scenario; the pelican riding a bike is not. That's a _huge_ difference. Why should we measure intelligence (...) against something unrealistic rather than something realistic?

I just don't get it.

Fnoord 5 hours ago | parent | next [-]

> the wine glass scenario is a _realistic_ scenario

It is unrealistic: if you go to a restaurant, you don't get served a glass like that. Filling a wine glass that full is frowned upon (alcohol is a drug, after all) and impractical (wine stains are annoying).

A pelican riding a bike, on the other hand, is a plausible scenario thanks to children's TV. See, for example, a 1950s animation/comic involving a pelican [1].

[1] https://en.wikipedia.org/wiki/The_Adventures_of_Paddy_the_Pe...

vikramkr 6 hours ago | parent | prev [-]

If the thing we're measuring is the ability to write code, visually reason, and extrapolate to out-of-sample prompts, then why shouldn't we evaluate it by asking it to write code that generates a strange image it wouldn't have seen in its training data?

th0ma5 16 hours ago | parent | prev | next [-]

If this had any substance then it could be criticized, which is what they're trying to avoid.

Etheryte 13 hours ago | parent [-]

How? There's no way for you to verify whether or not they put synthetic data for that into the dataset.

16 hours ago | parent | prev [-]
[deleted]