Remix.run Logo
nickandbro 6 hours ago

What we have all been waiting for:

"Create me a SVG of a pelican riding on a bicycle"

https://www.svgviewer.dev/s/FfhmhTK1

Thev00d00 6 hours ago | parent | next [-]

That is pretty impressive.

So impressive it makes you wonder if someone has noticed it being used a benchmark prompt.

burkaman 6 hours ago | parent | next [-]

Simon says if he gets a suspiciously good result he'll just try a bunch of other absurd animal/vehicle combinations to see if they trained a special case: https://simonwillison.net/2025/Nov/13/training-for-pelicans-...

jmmcd 5 hours ago | parent | next [-]

"Pelican on bicycle" is one special case, but the problem (and the interesting point) is that with LLMs, they are always generalising. If a lab focussed specially on pelicans on bicycles, they would as a by-product improve performance on, say, tigers on rollercoasters. This is new and counter-intuitive to most ML/AI people.

BoorishBears 2 hours ago | parent [-]

The gold standard for cheating on a benchmark is SFT and ignoring memorization. That's why the standard for quickly testing for benchmark contamination has always been to switch out specifics of the task.

Like replacing named concepts with nonsense words in reasoning benchmarks.

ddalex 5 hours ago | parent | prev [-]

https://www.svgviewer.dev/s/TVk9pqGE giraffe in a ferrari

rixed 5 hours ago | parent | prev | next [-]

I have tried combinations of hard to draw vehicle and animals (crocodile, frog, pterodactly, riding a hand glider, tricycle, skydiving), and it did a rather good job in every cases (compared to previous tests). Whatever they have done to improve on that point, they did it in a way that generalise.

6 hours ago | parent | prev [-]
[deleted]
bitshiftfaced 6 hours ago | parent | prev [-]

It hadn't occurred to me until now that the pelican could overcome the short legs issue by not sitting on the seat and instead put its legs inside the frame of the bike. That's probably closer to how a real pelican would ride a bike, even if it wasn't deliberate.

xnx 6 hours ago | parent [-]

Very aero