thatwasunusual 9 hours ago

> It's not necessarily the best benchmark, it's a popular one, probably because it's funny.

> Yes it's like the wine glass thing.

No, it's not! That's part of my point: the wine glass scenario is a _realistic_ scenario. The pelican riding a bike is not. That's a _huge_ difference. Why should we measure intelligence (...) in terms of something realistic versus something unrealistic? I just don't get it.
Fnoord 5 hours ago | parent

> the wine glass scenario is a _realistic_ scenario

It is unrealistic, because if you go to a restaurant you don't get served a glass like that. Filling a wine glass to the brim is frowned upon (alcohol is a drug, after all) and impractical (wine stains are annoying). A pelican riding a bike, on the other hand, is a realistic scenario because of children's TV. For example, a 1950s animation/comic involving a pelican [1].

[1] https://en.wikipedia.org/wiki/The_Adventures_of_Paddy_the_Pe...
vikramkr 6 hours ago | parent

If the thing we're measuring is the ability to write code, visually reason, and extrapolate to out-of-sample prompts, then why shouldn't we evaluate it by asking it to write code that generates a strange image it wouldn't have seen in its training data?