Wowfunhappy 6 hours ago

Aww, I don’t like the new pelican benchmark as much. I liked that the old prompt was vague and we could see how the AI interpreted it.

ahmedfromtunis 5 hours ago

Yeah. The new challenge seems easier to solve, since it basically hand-holds the LLMs toward what the result should look like.

I think a more challenging, well, challenge, would be to offer an even more absurd scenario and see how the model handles it.

Example: generate an SVG of a pelican and a mongoose eating popcorn inside a pyramid-shaped vehicle flying around Jupiter. Result: https://imgur.com/a/TBGYChc

simonw 5 hours ago

I like the hand-holding because it's a better test of how well models can follow more detailed instructions.

I was inspired by Max Woolf's nano banana test prompts: https://minimaxir.com/2025/11/nano-banana-prompts/

ahmedfromtunis 5 hours ago

That's a valid point, but I'd argue the new test would then be interesting to pair with the original one, not to replace it.

Do you think it would be reasonable to include both in future reviews, at least for the sake of backward compatibility (and comparability)?

simonw 5 hours ago

Yeah, I'm going to keep using the old one as well.