thorio 2 hours ago

I challenged Gemini to answer this too, and it also got the correct answer.

What came to my mind was: couldn't all LLM vendors easily fund teams that only track these interesting edge cases and quickly deploy filters for these questions, selectively routing to more expensive models?
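The routing idea could be as simple as pattern-matching prompts against a curated list of known edge cases. A minimal sketch of what such a filter might look like (all names and patterns here are made up for illustration, not any vendor's actual system):

```python
# Hypothetical edge-case router: match incoming prompts against tracked
# patterns and, on a hit, escalate to a more expensive model.
import re

# Patterns an "edge case" team might maintain (assumed examples only)
EDGE_CASE_PATTERNS = [
    re.compile(r"how many .* in the word", re.IGNORECASE),    # letter-counting traps
    re.compile(r"faster.*walk|walk.*faster", re.IGNORECASE),  # viral riddle variants
]

def route(prompt: str) -> str:
    """Return which model tier should handle the prompt."""
    if any(p.search(prompt) for p in EDGE_CASE_PATTERNS):
        return "expensive-model"  # escalate tracked edge cases
    return "cheap-model"          # default path for everything else
```

A real deployment would presumably be fuzzier (embedding similarity rather than regexes), but the economics are the same: a tiny classifier up front, an expensive model only for the queries people screenshot.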

Isn't that how they probably game benchmarks too?

moffkalast 2 hours ago

Yes, that's potentially why it's already fixed in some models: it's been about a week since this originally went viral on r/localllama. I wouldn't be surprised if most vendors run some kind of swappable LoRA for quick fixes at this point. It's an endless game of whack-a-mole with edge cases showing that most LLMs generalize far less than investors would like people to believe.
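The "swappable LoRA" mechanism amounts to keeping the base weights frozen and adding or removing a low-rank delta (B @ A) at serve time. A toy numpy sketch of that idea (purely illustrative, not any vendor's serving code):

```python
# Toy LoRA hot-swap: the base weight W_base stays frozen; a low-rank
# patch B @ A is added or dropped per request without retraining.
import numpy as np

rng = np.random.default_rng(0)
d, rank = 8, 2
W_base = rng.standard_normal((d, d))       # frozen base weight
A = rng.standard_normal((rank, d)) * 0.01  # hotfix adapter, low-rank factors
B = rng.standard_normal((d, rank)) * 0.01

def forward(x, use_hotfix):
    # Swap the patch in or out by adding the rank-2 delta to the base weight
    W = W_base + (B @ A if use_hotfix else 0)
    return W @ x

x = rng.standard_normal(d)
y_plain = forward(x, use_hotfix=False)  # original behavior
y_fixed = forward(x, use_hotfix=True)   # patched behavior
```

Because the adapter is just an additive delta, a vendor can ship, stack, or roll back such patches far faster than retraining, which fits the timeline of a fix landing within a week of a post going viral.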

And unlike the strawberry nonsense, this isn't an architectural problem; it's some dumb kind of overfitting to a standard "walking is better" answer.