mentalgear 8 hours ago

I understand the 'fun factor' but at this point I really wonder what this pelican still proves? I mean, providers certainly could have adapted for it if they wanted, and if you want to test how well a model adapts to potential out-of-distribution contexts, it might be more worthwhile to mix different animals with different activity types (a whale on a skateboard) than always the same.

simonw 8 hours ago | parent | next [-]

That's why I did the flamingo on a unicycle.

For a delightful moment this morning I thought I might have finally caught a model provider cheating by training for the pelican, but the flamingo convinced me that wasn't the case.

furyofantares 7 hours ago | parent | next [-]

It is completely wild to me that you prefer Qwen's flamingo. I think it's really bad and Opus' is pretty good.

simonw 7 hours ago | parent [-]

The Opus one doesn't even have a bowtie.

furyofantares 7 hours ago | parent | next [-]

The Opus one looks like a flamingo, and looks like it's riding the unicycle. Sitting on the seat. Feet on the pedals.

The Qwen one looks like a 3-tailed, broken-winged, beakless (I guess? Is that offset white thing a beak? Or is it chewing on a pelican feather like it's a piece of straw?) monstrosity not sitting on the seat, with its one foot off the pedal (the other chopped off at the knee) of a malmanufactured wheel that has bonus spokes that are longer than the wheel.

But yeah, it does have a bowtie and sunglasses that you didn't ask for! Plus it says "<3 Flamingo on a Unicycle <3", which perhaps resolves all ambiguity.

bigyabai 5 hours ago | parent [-]

Let's not oversell Opus' output. The Qwen flamingo is flawed but could be easily fixed with 1-2 prompts if you're really upset with it. The Opus SVG is not any better than something that I could make in Inkscape with 3 minutes and sufficient motivation. Calling Opus' flamingo "programmer art" would be an insult to programmers.

monksy 6 hours ago | parent | prev [-]

Game over opus

solarkraft 2 hours ago | parent | prev | next [-]

If I (commercially) made models I’d put specific care into producing SVGs of various animals doing (riding) various things ... I find it interesting how confident you seem to be that they’re not.

akavel 7 hours ago | parent | prev | next [-]

r/LocalLlama is now doing a horse in a racing car:

https://redd.it/1slz38i

prodigycorp 8 hours ago | parent | prev | next [-]

To me the opus flamingo is waaaay better than the qwen one. qwen has the better pelican, though.

dude250711 8 hours ago | parent | prev [-]

Is a flamingo on a unicycle not merely a special case of a pelican on a bicycle?

luyu_wu 4 hours ago | parent | prev | next [-]

Consider reading the article, which addresses all of the points you raise.

It's directly stated in the post that the entire test is meant to be humorous, not taken seriously, only that it has vaguely followed model performance to date. The author also writes that this new result shows that trend has broken.

stephbook 6 hours ago | parent | prev | next [-]

They're certainly aware of the test, but a turtle doing a kickflip on a skateboard? I seriously doubt they train their models for that.

https://x.com/JeffDean/status/2024525132266688757

If anything, the disastrous Opus 4.7 pelican shows us they don't pelicanmaxx.

bitwize 6 hours ago | parent [-]

I think I found the leaked Claude Mythos version of the turtle benchmark: https://www.youtube.com/watch?v=l82XWTKLZuk

BoorishBears 6 hours ago | parent | prev [-]

This is a gag that's long outlived its humor, but we're in a space so driven by hype there are people who will unironically take some signal from it. They'll swear up and down they know it's for fun, but let a great pelican come out and see if they don't wave it as proof the model is great alongside their carwash test.