| ▲ | simonw 2 hours ago |
| Pelican for Fable 5 on default settings is a clear improvement on Opus 4.8. Fable 5 default: https://gist.github.com/simonw/036bee5a703e7ec84e34efa974438... Opus 4.8 (the "max" one is closest to Fable): https://simonwillison.net/2026/May/28/claude-opus-4-8/#and-s... Now here are the Fable pelicans for all five of the thinking effort levels (low, medium, high, xhigh, max): https://tools.simonwillison.net/markdown-svg-renderer#url=ht... Low used 25 input tokens and 1,929 output tokens, costing 9.67 cents: https://www.llm-prices.com/#it=25&ot=1929&sel=claude-fable-5 Max used 25 input tokens and 14,430 output tokens, costing 72.175 cents! https://www.llm-prices.com/#it=25&ot=14430&sel=claude-fable-... |
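The two cost figures above can be sanity-checked with a few lines of arithmetic. This is a rough sketch, not the pricing page's own calculation: the comment does not state per-token rates, so the $10 per million input tokens and $50 per million output tokens used here are assumptions, chosen because they reproduce both quoted totals.

```python
# ASSUMPTION: per-million-token prices in USD. These are not stated in the
# comment; they are back-derived because they reproduce both quoted totals.
INPUT_USD_PER_MTOK = 10
OUTPUT_USD_PER_MTOK = 50


def cost_cents(input_tokens: int, output_tokens: int) -> float:
    """Return the cost of one request in US cents."""
    # tokens * (USD per million tokens) gives millionths of a dollar;
    # dividing by 10_000 converts millionths of a dollar to cents.
    micro_usd = (input_tokens * INPUT_USD_PER_MTOK
                 + output_tokens * OUTPUT_USD_PER_MTOK)
    return micro_usd / 10_000


low = cost_cents(25, 1_929)    # "low" effort run from the comment
high = cost_cents(25, 14_430)  # "max" effort run from the comment

print(f"low: {low:.3f} cents")
print(f"max: {high:.3f} cents")
print(f"max / low: {high / low:.2f}x")
```

Under those assumed rates, nearly all of the cost comes from output tokens; the 25 input tokens contribute only 0.025 cents to either run.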
|
| ▲ | sempron64 2 hours ago | parent | next [-] |
| The pelican has looked very same-y across all frontier models, same color bike, same camera angle, etc. I suspect this challenge is already too embedded in the training data to be a good signal when it succeeds, and maybe even when it fails in pathological ways mirroring existing AI pelicans on the internet. |
| |
| ▲ | tripleee 2 hours ago | parent | next [-] | | I'd say it's working great for its intended purpose. Keeps Simon on top of all these threads and funnels traffic to his site. | | | |
| ▲ | h4ny 19 minutes ago | parent | prev | next [-] | | Was it ever a good test? How do you even objectively assess what a good pelican on a bike is anyway? | | |
| ▲ | fwipsy 14 minutes ago | parent [-] | | SVG generation is a good test because it's extremely easy to subjectively assess with visual reasoning where humans are strong. However, pelican on a bike specifically may be overused at this point. |
| |
| ▲ | quantumwoke an hour ago | parent | prev [-] | | Variations of this comment have been posted for over a year. The pelican has now morphed into part of HN culture rather than a legitimate benchmark, but it's still valuable as a meme. |
|
|
| ▲ | sarreph 2 hours ago | parent | prev | next [-] |
| I'm beginning to wonder how much of a useful metric the pelican is because surely the frontier labs must be training their models on pelican-artistry because of how well known your test is now? |
| |
| ▲ | bensyverson 2 hours ago | parent | next [-] | | Simon has addressed this on virtually every new model release. He also has unpublished alternate prompts. But the larger point is: this is a fun experiment, not a serious and objective benchmark. | | |
| ▲ | refulgentis an hour ago | parent [-] | | It's silly and a joke and a surprisingly good benchmark and don't take it seriously but don't take not taking it seriously seriously and if it's too good we use another prompt but don't actually because then it's not the pelican post and there's obvious ways to better it and it's not worth doing because it's not serious. Only coherent move at this point: hit the minus button immediately. There's never anything about the model in the thread other than simon's post. | | |
| ▲ | stasomatic 37 minutes ago | parent [-] | | But what if they are better at flamingos? Are they optimized for pelicans? How about “draw me a four headed owl”? The meme, I get it, but I’d settle for a working bash script, tbh. |
|
| |
| ▲ | wongarsu 2 hours ago | parent | prev | next [-] | | I just run my own benchmark for "draw an SVG with $animal driving $vehicle". I won't post my choice of animal and mode of transport, but there are plenty of uncommon combinations to choose from. So far it's a fun and visually intuitive benchmark that does seem to correlate with model capabilities | |
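A private "$animal driving $vehicle" benchmark like the one described above could be generated along these lines. This is only an illustrative sketch: the word lists, the function names, and the seed are all hypothetical, since the commenter deliberately does not share their choices.

```python
import itertools
import random

# HYPOTHETICAL word lists; the commenter keeps their real choices private.
ANIMALS = ["an otter", "a heron", "an armadillo"]
VEHICLES = ["a unicycle", "a kayak", "a tractor"]


def all_prompts(animals, vehicles):
    """Build one prompt string for every animal/vehicle pairing."""
    return [
        f"Draw an SVG with {animal} driving {vehicle}"
        for animal, vehicle in itertools.product(animals, vehicles)
    ]


def private_subset(prompts, k, seed):
    """Pick k prompts reproducibly so the same set can be reused per model."""
    return random.Random(seed).sample(prompts, k)


prompts = all_prompts(ANIMALS, VEHICLES)
for prompt in private_subset(prompts, k=3, seed=2026):
    print(prompt)
```

Fixing the seed keeps the chosen prompts stable across model releases, while keeping the lists and seed unpublished makes it less likely that the exact combinations appear in training data.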
| ▲ | modriano 2 hours ago | parent | prev | next [-] | | I don't know. Just looking at the bike frames (specifically the fact that the AI generated bikes have rather unsteerable front forks), it's clear to me that frontier labs aren't spending much time tuning models to make bikes look coherent, which I assume is an easier task than making a pelican riding a bike look coherent. | |
| ▲ | iLoveOncall 29 minutes ago | parent | prev | next [-] | | It was a completely useless test even before the labs trained for it. | |
| ▲ | HaZeust 2 hours ago | parent | prev [-] | | I've seen this reply to Simon's benchmark for 2 years running now, and yet you still see improvements and objectively-bad results over time from new releases, even when I'm sure every frontier AI team has/had a person at least partially dedicated to better bicycle-pelican SVG outputs. Alas. | | |
| ▲ | sarreph 2 hours ago | parent | next [-] | | I had intended to caveat that: I'm sure I'm not the first person to ask about this! > you still see improvements This is expected if they are training their models on it, right? > objectively-bad results Keen to learn when this has been the case, i.e. across version increments in major models. | | | |
| ▲ | llm_nerd 2 hours ago | parent | prev [-] | | I honestly assumed their comment was tongue in cheek humour, because positively no one actually cares how these models generate an SVG pelican riding a bicycle. It's some meme thing that this stuff always appears here. | | |
| ▲ | BrokenCogs 2 hours ago | parent [-] | | Yeah this is not a real benchmark, it's just a fun tradition every time a new model is released | | |
|
| ▲ | ealready_value 2 hours ago | parent | prev | next [-] |
| This is the reply I look for in all the new model announcements. It's fun to tell people that I judge models based on pelicans. |
| |
| ▲ | upcoming-sesame 21 minutes ago | parent | next [-] | | I also look for this reply because I like seeing the follow-up reply saying that this is not a benchmark anymore because labs have gotten it in their training data. That reply never fails to come; it's basically a meme at this point. | |
| ▲ | pixel_popping 2 hours ago | parent | prev | next [-] | | This is all we need. The moment the pelican puts its leg behind the frame, we are all doomed. | |
| ▲ | chorkpop 2 hours ago | parent | prev [-] | | Now someone post the link about how it’s impossible for humans to draw a bike from memory. |
|
|
| ▲ | ethanlipson 2 hours ago | parent | prev | next [-] |
| How much money do you think they spent fine-tuning on pelican SVG generation? |
| |
|
| ▲ | leecommamichael 2 hours ago | parent | prev | next [-] |
| Looks like Fable produced a pelican that looks like the previous model's "max" one while using about the previous model's "xhigh" output token count. |
|
| ▲ | rkuska 2 hours ago | parent | prev | next [-] |
| Is it possible to use the credits from subscription (https://support.claude.com/en/articles/15036540-use-the-clau...) for Fable? |
|
| ▲ | jerryliu12 an hour ago | parent | prev | next [-] |
| Personally feel like it could be more ambitious with what it creates. |
|
| ▲ | 382hi 2 hours ago | parent | prev | next [-] |
| I'm pretty sure they're optimizing the models around these sorts of tests. |
|
| ▲ | redox99 2 hours ago | parent | prev | next [-] |
| It's interesting that they still get the head tube / handle bar part wrong. |
| |
|
| ▲ | gavinray 43 minutes ago | parent | prev | next [-] |
| Fable 5 xhigh actually looks the best to me. |
|
| ▲ | makingstuffs 2 hours ago | parent | prev | next [-] |
| I could be tripping but I’m sure that is very similar to the DeepSeek one from not long ago. Clearly I am too lazy to go and find it for verification. |
|
| ▲ | mercacona 2 hours ago | parent | prev | next [-] |
| Why always sunny days? |
| |
|
| ▲ | csomar 2 hours ago | parent | prev | next [-] |
| Where is the clear improvement on Fable 5? The tail is misplaced. |
|
| ▲ | david_shi 2 hours ago | parent | prev | next [-] |
| that's a great looking pelican |
|
| ▲ | ge96 2 hours ago | parent | prev | next [-] |
| need more Alex Moulton style bikes |
|
| ▲ | simunskxcsckss 2 hours ago | parent | prev | next [-] |
| [flagged] |
| |
| ▲ | minimaxir 2 hours ago | parent | next [-] | | You can't tell someone to "get a life" while taking the effort to create a burner account for the sole purpose of insulting someone. | |
| ▲ | rvz 2 hours ago | parent | prev | next [-] | | I don't really consider that a great benchmark anyway. We really need better, objective ones instead of these, which are mostly performative, cheatable, and already in the training set. | |
| ▲ | ilaksh 2 hours ago | parent | prev [-] | | Simon's pelicans are an institution. Are you trying to get banned? Lmao. | | |
| ▲ | rob 2 hours ago | parent | next [-] | | I think it's a clever thing he did to basically guarantee he continues to get major traffic to his blog here every time a model is released, especially since he's taking sponsorships with a static banner at the top of every page now. I think he's trying to go the Daring Fireball route. | |
| ▲ | 2 hours ago | parent | prev | next [-] | | [deleted] | |
| ▲ | brazukadev 2 hours ago | parent | prev [-] | | For me it is like if crypto bros were allowed to shill their DAOs and tokens during the crypto/NFT phase. He is the only person not getting rate-limited for shilling AI all the time. | | |
| ▲ | simonw 2 hours ago | parent [-] | | Pointing out how much the models still suck at drawing pelicans is a funny way to shill them. | | |
| ▲ | toraway 2 hours ago | parent [-] | | Tbf the first line of your first comment is: > Pelican for Fable 5 on default settings is a clear improvement on Opus 4.8
And doesn't contain any actual criticism within the comment (your blog post might, but just referring to what was posted on HN, which is a bit booster-y on its own). | | |
| ▲ | simonw an hour ago | parent [-] | | The entire pelican benchmark is a joke. The joke is that, for all of the billions of dollars poured into these things and the claims of PhD level intelligence, they still draw pelicans not-much-better than a five year-old would. I don't spell that joke out in every comment I post here because that wouldn't be very funny. |
|
| ▲ | kylehotchkiss 2 hours ago | parent | prev [-] |
| How many barrels of oil are burned per pelican at Fable levels? |