I don't think the author of the article found "good evidence". He found a specific case where there was a regression. This could be due to:
- models actually getting worse in general
- his specific style of prompting working well with older models and less well with newer models
- the behavior his test measures no longer being a priority for big AI labs
From the article:
> GPT-4 gave a useful answer every one of the 10 times that I ran it. In three cases, it ignored my instructions to return only code, and explained that the column was likely missing from my dataset, and that I would have to address it there.
Here, ignoring the instructions in order to give a "useful answer" (as judged by the author) is counted as a good thing. That means a model trained to follow instructions more faithfully would lose points on this test.
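To make that scoring tension concrete, here's a minimal toy sketch (my own illustration, not the author's actual harness or data) of how the same set of transcripts can score differently depending on whether the grader rewards "usefulness" or strict instruction following:

```python
# Hypothetical example: each entry represents one of the runs described
# in the article. "useful" = the author judged the answer helpful;
# "followed_instructions" = the model returned only code, as instructed.
responses = [
    {"useful": True,  "followed_instructions": False},  # explained the missing column
    {"useful": True,  "followed_instructions": True},
    {"useful": False, "followed_instructions": True},   # obeyed, but the code didn't help
]

usefulness_score = sum(r["useful"] for r in responses)
instruction_score = sum(r["followed_instructions"] for r in responses)

print(f"graded on usefulness:            {usefulness_score}/{len(responses)}")
print(f"graded on instruction following: {instruction_score}/{len(responses)}")
# A model tuned to follow instructions more strictly can go up on the
# second metric while going down on the first -- which would look like
# a "regression" under the article's grading.
```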
To me this article feels a bit like saying "this new gun that shoots straight 100% of the time is worse than the older gun that shot straight only 50% of the time, because sometimes I shoot at something I don't actually want to shoot at!". And in a way it's true: if you're used to being able to shoot at things without them getting hurt, the new gun is worse from that point of view. But to spin up a whole theory about garbage in/garbage out from that? Or to conclude that all models are getting worse, rather than that you're maybe no longer the target audience? That seems weird to me.