| ▲ | toraway 5 hours ago | |||||||
That problem went viral weeks ago so is no longer a valid test. At the time it was consistently tripping up all the SOTA models at least 50% of the time (you also have to use a sample > 1 given huge variation from even the exact same wording for each attempt). The large hosted model providers always "fix" these issues as best as they can after they become popular. It's a consistent pattern repeated many times now, benefitting from this exact scenario seemingly "debunking" it well after the fact. Often the original behavior can be replicated after finding sufficient distance of modified wording/numbers/etc from the original prompt. | ||||||||
| ▲ | waisbrot 4 hours ago | parent | next [-] | |||||||
For example, I just asked ChatGPT "The boat wash is 50 meters down the street. Should I drive, sail, or walk there to get my yacht detailed?" and it recommended walking. I'm sure with a tiny bit more effort, OpenAI could patch it to the point where it's a lot harder to confuse with this specific flavor of problem, but it doesn't alter the overall shape. | ||||||||
| ||||||||
| ▲ | reducesuffering 4 hours ago | parent | prev [-] | |||||||
I don't understand what occasional hiccups prove. The models can pass college acceptance tests in advanced educational topics better than 99% of the human population, and because they occasionally have a shortcoming, it means they're worse than humans somehow? Those edge cases are quickly going from 1% -> 0.01% too... "any human can instantly grok the right answer." When asking a human about general world knowledge, they don't have the generality to give good answers for 90% of it. Even very basic questions humans like this, humans will trip up on many many more than the frontier LLMs. | ||||||||