simonw 18 hours ago

  llm install llm-mistral
  llm mistral refresh
  llm -m mistral/devstral-2512 "Generate an SVG of a pelican riding a bicycle"
https://tools.simonwillison.net/svg-render#%3Csvg%20xmlns%3D...

Pretty good for a 123B model!

(That said I'm not 100% certain I guessed the correct model ID, I asked Mistral here: https://x.com/simonw/status/1998435424847675429)

Jimmc414 16 hours ago | parent | next [-]

We are getting to the point where it's not unreasonable to think that "Generate an SVG of a pelican riding a bicycle" could be included in some training data. It would be a great way to ensure an initial thumbs up from a prominent reviewer. It's a good benchmark, but it seems like it would be a good idea to include an additional random or unannounced similar test to catch any benchmaxxing.
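
A randomized, unannounced variant is easy to script. A minimal sketch (the animal and vehicle pools, and the model ID in the comment, are illustrative assumptions, not anything the labs or Simon actually use):

```python
import random

# Hypothetical pools of subjects and vehicles; any odd pairing works --
# the point is that the exact combination is unlikely to be in training data.
ANIMALS = ["pelican", "bumblebee", "walrus", "hedgehog", "octopus"]
VEHICLES = ["bicycle", "kayak", "unicycle", "skateboard", "pogo stick"]

def surprise_prompt(rng=None):
    """Build an unannounced variant of the pelican-on-a-bicycle prompt."""
    rng = rng or random.Random()
    return f"Generate an SVG of a {rng.choice(ANIMALS)} riding a {rng.choice(VEHICLES)}"

if __name__ == "__main__":
    prompt = surprise_prompt()
    print(prompt)
    # Then feed it to a model, e.g. via the llm CLI:
    #   llm -m mistral/devstral-2512 "<prompt>"
```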

simonw 16 hours ago | parent | next [-]

I wrote about that possibility here: https://simonwillison.net/2025/Nov/13/training-for-pelicans-...

th0ma5 16 hours ago | parent [-]

[flagged]

vanschelven 15 hours ago | parent | next [-]

Whatever you think Jimmc414's _concerns_ are (they merely state a possibility), Simon enumerates a number of concerns in the linked article and then addresses them. So I'm not sure why you think this is so.

vnvnff an hour ago | parent [-]

It's a pattern: https://news.ycombinator.com/item?id=44725190

dugidugout 16 hours ago | parent | prev [-]

Condescending and disrespectful to whom? Everybody, wholesale? That doesn't seem reasonable. Please elaborate.

bravetraveler 14 hours ago | parent | next [-]

Not sure if I'd use the same descriptions so pointedly, but I can see what they mean.

It's perfectly fine to link for convenience, but it does feel a little disrespectful/SEO-y to not 'continue the conversation'. A summary at the very least, on how exactly it pertains. Sell us.

In a sense, link-dropping [alone] is saying: "go read this and establish my rhetorical/social position, I'm done here"

Imagine meeting an author/producer/whatever you liked. You'd want to talk about their work, how they created it, the impact it had, and so on. Now imagine if they did that... or if they waved their hand vaguely at a catalog.

simonw 13 hours ago | parent | next [-]

I've genuinely been answering the question "what if the labs are training on your pelican benchmark" 3-4 times a week for several months at this point. I wrote that piece precisely so I didn't have to copy and paste the same arguments into dozens of different conversations.

bravetraveler 12 hours ago | parent [-]

Oh, no. Does this policing job pay well? /s Seriously: less is more, trust the process, any number of platitudes work here. Who are you defending against? Readers, right? You wrote your thing, defended it with more of the thing. It'll permeate. Or it won't. Does it matter?

You could be done, nothing is making you defend this (sorry) asinine benchmark across the internet. Not trying to (m|y)uck your yum, or whatever.

Remember, I did say linking for convenience is fine. We're belaboring the worst reading in comments. Inconsequential, unnecessary heartburn. Link the blog posts together and call it good enough.

Barbing 10 hours ago | parent | next [-]

Surprised to see snark re: what I thought was a standard practice (linking FAQs, essentially).

I hadn’t seen the post. It was relevant. I just read it. Lucky Ten Thousand can read it next time even though I won’t.

Simon has never seemed annoying, so unlike other comments that might worry me (even "Opus made this", which is cool, though I'm concerned someone astroturfed), that comment would never have raised my eyebrows. He's also dedicated, and I love that he devotes his time to a new field like this, where it's great to have attempts at benchmarks, folks cutting through chaff, etc.

bravetraveler 10 hours ago | parent [-]

The specific 'question' is a promise to catch training on more publicly available data, and to expect more blog links copied 'into dozens of different conversations'... Jump for joy. Stop the presses. Oops, snarky again :)

Yes, the LLM people will train on this. They will train on absolutely everything [as they have]. The comments/links prioritize engagement over awareness. My point, I suppose, if I had one is that this blogosphere can add to the chaff. I'm glad to see Simon here often/interested.

Aside: all this concern about over-fitting just reinforces my belief these things won't take the profession any time soon. Maybe the job.

simonw 12 hours ago | parent | prev [-]

You don't have to convince me the pelican riding a bicycle SVG benchmark is asinine. That's kind of the point!

bravetraveler 12 hours ago | parent [-]

Having read the followup post being linked, I'm even more confused. Commenting or, really, anything seems even less worthwhile. That's my point.

You brought the benchmark and anticipated their... cheesing, with a promise to catch them at it. Cool announcement of an announcement. Just do that [or don't]. In a hippy sense, this is no longer yours. It's out there. Like everything else anyone wrote.

Let the LLM people train on your test. Catch them as claimed. Publish again. Huzzah, industry without overtime in the comments. It makes sense/cents to position yourself this way :)

Obviously they're going to train on anything they can get. They did. Mouse, meet cat. Some of us in the house would love it if y'all would keep it down! This is 90s rap beef all over again

charcircuit 11 hours ago | parent | prev | next [-]

If you want a summary you can have your ai assistant summarize the link.

bravetraveler 11 hours ago | parent [-]

Woooooosh, please see if an LLM can help you. I'm not getting paid for this

tomrod 14 hours ago | parent | prev [-]

Hell, I would consider myself graced that simonw, yes, THAT simonw, the LLM whisperer, took time out of his busy schedule to send me to a discussion I might have expressed interest in.

bravetraveler 13 hours ago | parent [-]

> send me to a discussion I might have expressed interest in

No, no, remember? Points to the blog you were already reading! Working diligently to build a brand: podcast, paid newsletter, the works.

tomrod 7 hours ago | parent [-]

I wasn't speaking to this interaction, and my point is genuine. Simonw has done fantastic work in the LLM space

th0ma5 15 hours ago | parent | prev [-]

No, when did I say that?

dugidugout 15 hours ago | parent | next [-]

It isn't clear what you said.

You asserted a pattern of conduct on the user simonw:

> I think constantly replying to everybody with some link which doesn't address their concerns

Then claimed that conduct was:

> condescending and disrespectful.

I am asking you to elaborate to whom simonw is condescending and disrespecting. I don't see how it follows.

15 hours ago | parent | prev [-]
[deleted]
Workaccount2 11 hours ago | parent | prev | next [-]

It would be easy to out models that train on the bike pelican, because they would probably suck at the kayaking bumblebee.

So far though, the models good at bike pelican are also good at kayak bumblebee, or whatever other strange combo you can come up with.

So if they are trying to benchmaxx by making SVG generation stronger, that's not really a miss, is it?

majormajor 10 hours ago | parent [-]

That depends on if "SVG generation" is a particularly useful LLM/coding model skill outside of benchmarking. I.e., if they make that stronger with some params that otherwise may have been used for "rust type system awareness" or somesuch, it might be a net loss outside of the benchmarks.

0cf8612b2e1e 11 hours ago | parent | prev | next [-]

I assume all of the models also have variations on, “how many ‘r’s in strawberry”.

thatwasunusual 11 hours ago | parent | prev | next [-]

> We are getting to the point that its not unreasonable to think that "Generate an SVG of a pelican riding a bicycle" could be included in some training data.

I may be stupid, but _why_ is this prompt used as a benchmark? I mean, pelicans _can't_ ride a bicycle, so why is it important for "AI" to show that they can (at least visually)?

The "wine glass problem"[0] - and probably others - seems to me to be a lot more relevant...?

[0] https://medium.com/@joe.richardson.iii/the-curious-case-of-t...

simonw 11 hours ago | parent | next [-]

The fact that pelicans can't ride bicycles is pretty much the point of the benchmark! Asking an LLM to draw something that's physically impossible means it can't just "get it right" - seeing how different models (especially at different sizes) handle the problem is surprisingly interesting.

Honestly though, the benchmark was originally meant to be a stupid joke.

I only started taking it slightly more seriously about six months ago, when I noticed that the quality of the pelican drawings really did correspond quite closely to how generally good the underlying models were.

If a model draws a really good picture of a pelican riding a bicycle there's a solid chance it will be great at all sorts of other things. I wish I could explain why that was!

If you start here and scroll through and look at the progression of pelican on bicycle images it's honestly spooky how well they match the vibes of the models they represent: https://simonwillison.net/2025/Jun/6/six-months-in-llms/#ai-...

So ever since then I've continued to get models to draw pelicans. I certainly wouldn't suggest anyone make serious decisions on model usage based on my stupid benchmark, but it's a fun first-day initial impression thing, and it appears to be a useful signal for which models are worth diving into in more detail.

thatwasunusual 9 hours ago | parent [-]

> If a model draws a really good picture of a pelican riding a bicycle there's a solid chance it will be great at all sorts of other things.

Why?

If I hired a worker that was really good at drawing pelicans riding a bike, it wouldn't tell me anything about his/her other qualities?!

suspended_state 4 hours ago | parent | next [-]

Your comment is funny, but please note: it's not drawing a pelican riding a bike, it's describing a pelican riding a bike in SVG. Your candidate would at least display some knowledge of the SVG spec.

falcor84 an hour ago | parent | prev | next [-]

For better or worse, a lot of job interviews actually do use contrived questions like this, such as the infamous "how many golf balls can you fit in a 747?"

vikramkr 6 hours ago | parent | prev | next [-]

The difference is that the worker you hire would be a human being, not a large matrix multiplication that had its parameters optimized by a gradient descent process and embeds concepts in a high-dimensional vector space, which results in all sorts of weird things like subliminal learning (https://alignment.anthropic.com/2025/subliminal-learning/).

It's not a human intelligence - it's a totally different thing, so why would the same test that you use to evaluate human abilities apply here?

Also, more directly: the "all sorts of other things" we want LLMs to be good at often involve writing code, spatial reasoning, and world understanding, all of which creating an SVG of a pelican riding a bicycle very directly evaluates, so it's not even that surprising.

simonw 8 hours ago | parent | prev | next [-]

I wish I knew why. I didn't think it would be a useful indicator of model skills at all when I started doing it, but over time the pattern has held that performance on pelican riding a bicycle is a good indicator of performance on other tasks.

jtbaker 8 hours ago | parent | prev [-]

A posteriori knowledge. The pelican isn't the point; it's just amusing. The point is that Simon has seen a correlation between this skill and the model's general capabilities.

wisty 11 hours ago | parent | prev [-]

It's not necessarily the best benchmark; it's a popular one, probably because it's funny.

Yes it's like the wine glass thing.

Also, it's kind of got depth. Does it draw the pelican and the bicycle? Can the pelican reach the pedals? How?

I can imagine a really good AI finding a funny or creative or realistic way for the pelican to reach the pedals.

A slightly worse AI will do an OK job, maybe just making the bike small or the legs too long.

An OK AI will draw a pelican on top of a bicycle and just call it a day.

It's not as binary as the wine glass example.

thatwasunusual 9 hours ago | parent [-]

> It's not necessarily the best benchmark; it's a popular one, probably because it's funny.

> Yes it's like the wine glass thing.

No, it's not!

That's part of my point: the wine glass scenario is a _realistic_ scenario. The pelican riding a bike is not. That's a _huge_ difference. Why should we measure intelligence (...) against something unrealistic rather than something realistic?

I just don't get it.

Fnoord 5 hours ago | parent | next [-]

> the wine glass scenario is a _realistic_ scenario

It is unrealistic because if you go to a restaurant, you don't get served a glass like that. It is frowned upon (alcohol is a drug, after all) and impractical (wine stains are annoying) to fill a wine glass that full.

A pelican riding a bike, on the other hand, is realistic in a scenario because of TV for children. Example from 1950's animation/comic involving a pelican [1].

[1] https://en.wikipedia.org/wiki/The_Adventures_of_Paddy_the_Pe...

vikramkr 6 hours ago | parent | prev [-]

If the thing we're measuring is the ability to write code, visually reason, and handle extrapolating to out-of-sample prompts, then why shouldn't we evaluate it by asking it to write code to generate a strange image that it wouldn't have seen in its training data?

th0ma5 16 hours ago | parent | prev | next [-]

If this had any substance then it could be criticized, which is what they're trying to avoid.

Etheryte 13 hours ago | parent [-]

How? There's no way for you to verify if they put synthetic data for that into the dataset or not.

16 hours ago | parent | prev [-]
[deleted]
baq 17 hours ago | parent | prev | next [-]

but can it recreate the spacejam 1996 website? https://www.spacejam.com/1996/jam.html

aschobel 16 hours ago | parent | next [-]

in case folks are missing the context

https://news.ycombinator.com/item?id=46183294

16 hours ago | parent | prev | next [-]
[deleted]
lagniappe 16 hours ago | parent | prev [-]

That is not a meaningful metric given that we don't live in 1996 and neither do our web standards.

tarsinge 16 hours ago | parent | next [-]

In what year was it meaningful to have pelicans riding bicycles?

lagniappe 16 hours ago | parent [-]

SVG is a current standard. Do not be coy just to satisfy your urge to disagree.

tarsinge 15 hours ago | parent | next [-]

The website is live and renders correctly on my Safari mobile: https://www.spacejam.com/1996/

I may have missed something, but where are we saying the website should be recreated with 1996 tech or specs? The model is free to use any modern CSS; there are no technical limitations. So yes, I genuinely think it is a good generalization test, because it is indeed not in the training set, and yet it is an easy task for a human developer.

locallost 16 hours ago | parent | prev [-]

The point stands. Whether or not the standard is current has no relevance for the ability of the "AI" to produce the requested content. Either it can or can't.

lagniappe 16 hours ago | parent [-]

https://news.ycombinator.com/item?id=46183673

locallost 6 hours ago | parent [-]

> Ergo, models for the most part will only have a cursory knowledge of a spec that your browser will never be able to parse because that isn't the spec that won.

Browsers are able to parse a webpage from 1996. I don't know what the argument in the linked comment is about, but in this one we discuss the relevance of recreating a 1996 page vs. a pelican on a bicycle in SVG.

Here is Gemini when asked how to build a webpage in 1996; it seems pretty correct. In general I dislike grand statements that are difficult to back up: in your case, the claim that models have only a cursory knowledge of something (what does that even mean in the context of LLMs?), or claims about what exactly they were trained on, etc.

The shortened Gemini answer, the detailed version you can ask for yourself:

Layout via Tables: Without modern CSS, layouts were created using complex, nested HTML tables and invisible "spacer GIFs" to control white space.

Framesets: Windows were often split into independent sections (like a static sidebar and a scrolling content window) using Frames.

Inline Styling: Formatting was not centralized; fonts and colors were hard-coded individually on every element using the <font> tag.

Low-Bandwidth Design: Visuals relied on tiny tiled background images, animated GIFs, and the limited "Web Safe" color palette.

CGI & Java: Backend processing was handled by Perl/CGI scripts, while advanced interactivity used slow-loading Java Applets.

utopiah 14 hours ago | parent | prev | next [-]

> neither do our web standards

I'd be curious about that, actually. I feel like W3C specifications (I don't mean browser support of them) rarely deprecate anything and deliberately try to keep the Web running.

baq 16 hours ago | parent | prev | next [-]

Yes, now please prepare an email template which renders fine in Outlook using modern web standards. Write it up if you succeed; front page of HN guaranteed!

tomashubelbauer 16 hours ago | parent | prev [-]

The parent comment is a reference to a different story that was on the HN home page yesterday where someone attempted that with Claude.

lagniappe 16 hours ago | parent [-]

Yes, and I had a lengthier response in that thread explaining why this isn't a useful metric.

https://news.ycombinator.com/item?id=46183673

MLgulabio 2 hours ago | parent [-]

It was a joke reference...

willahmad 18 hours ago | parent | prev | next [-]

I think this benchmark could be slightly misleading for assessing a coding model. But still a very good result.

Yes, SVG is code, but not in the sense of an executable with verifiable inputs and outputs.

jstummbillig 17 hours ago | parent | next [-]

I love that we are earnestly contemplating the merits of the pelican benchmark. What a timeline.

andrepd 14 hours ago | parent [-]

It's not even halfway up the list of inane things of the AI hype cycle.

hdjrudni 8 hours ago | parent | prev [-]

But it does have a verifiable output, no more or less than HTML+CSS. Not sure what you mean by "input" -- it's not a function that takes in parameters if that's what you're getting at, but not every app does.
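
The output is at least mechanically checkable. A minimal sketch of a well-formedness check using only the Python standard library (real scoring would still need rendering plus human or vision-model judgment):

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

def is_plausible_svg(text: str) -> bool:
    """Check that a model's output parses as XML with an <svg> root element."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    # Accept both namespaced and bare <svg> roots.
    return root.tag in (f"{{{SVG_NS}}}svg", "svg")

sample = '<svg xmlns="http://www.w3.org/2000/svg"><circle cx="5" cy="5" r="4"/></svg>'
print(is_plausible_svg(sample))        # True
print(is_plausible_svg("not an svg"))  # False
```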

iberator 16 hours ago | parent | prev | next [-]

Where did you get the llm tool from?!

fauigerzigerk 15 hours ago | parent [-]

He made it: https://github.com/simonw/llm

techsystems 12 hours ago | parent [-]

Cool! I can't find it in the README, but can it run Qwen locally?

simonw 12 hours ago | parent [-]

The best way to do that at the moment is using the llm-ollama plugin.

cpursley 18 hours ago | parent | prev | next [-]

Skipped the bicycle entirely and upgraded to a sweet motorcycle :)

aorth 18 hours ago | parent | next [-]

Looks like a Cybertruck actually!

BudaDude 17 hours ago | parent [-]

I was thinking a Warthog

https://www.halopedia.org/Warthog

lubujackson 15 hours ago | parent | prev [-]

The Batman motorcycle!

troyvit 14 hours ago | parent [-]

I'm Pelicanman </raspy voice>

taneq 3 hours ago | parent [-]

The Dark Noot.

felixg3 18 hours ago | parent | prev | next [-]

Is it really an SVG if it's just embedded base64 of a JPG?

joombaga 16 hours ago | parent [-]

You were seeing the base64 image tag output at the bottom. The SVG input is at the top.

breedmesmn 16 hours ago | parent | prev [-]

Impressive! I'm really excited to leverage this in my gooning sessions!

16 hours ago | parent [-]
[deleted]