tveita a day ago
You want to know whether a new model is actually better, which you can't tell if they just added the specific example to the training set. It's like handing a dev on your team some failing test cases, and they just keep adding special cases to make the tests pass. How many examples does OpenAI train on now that are just variants of counting the Rs in strawberry? I guess they have a bunch of different wine glasses in their image set now, since that was a meme, but they still completely fail to draw an open book with the cover side up.
gwern a day ago
> How many examples does OpenAI train on now that are just variants of counting the Rs in strawberry?

Well, that's easy: zero. Because even a single training example would have 'solved' it by memorizing the simple, easy answer within weeks of 'strawberry' first going viral, which was like a year and a half ago at this point - and dozens of minor and major model upgrades since. And yet, the strawberry example kept working for most (all?) of that time. So you can tell that, if anything, OA probably put in extra work to filter all those variants out of the training data...
fennecbutt 18 hours ago
I always point out that the strawberry thing is a semi-pointless exercise anyway. The word gets tokenised, so of course a model could never count the Rs - it never sees individual letters, only token IDs. But I suppose if we want these models to be capable of anything, then these things need to be accounted for.
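To make the tokenisation point concrete, here's a minimal sketch using OpenAI's tiktoken library (assumes `pip install tiktoken`; the exact token split shown in the comments is illustrative and depends on the encoding):

    # Why letter-counting is hard for an LLM: the model receives integer
    # token IDs, not a sequence of characters.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-era tokenizer
    tokens = enc.encode("strawberry")
    print(tokens)                             # a short list of integer token IDs
    print([enc.decode([t]) for t in tokens])  # subword chunks, e.g. ['str', 'aw', 'berry']
    # The model is fed the IDs above; the letter sequence
    # 's-t-r-a-w-b-e-r-r-y' never appears in its input, so answering
    # "how many r's?" means recalling the spelling of each token.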