OtherShrezzing 5 days ago

That the answers have been available to them in the environment and they’re still not hitting 100% on this benchmark is a damning indictment of SOTA model performance.

raincole 5 days ago

It really isn't. Do you expect SOTA models to answer any answered question on the internet with 100% accuracy? Congrats, you just compressed the whole internet (at least a few zettabytes) into a model (a few TB at most?).

OtherShrezzing 5 days ago

The linked ticket isn’t suggesting the commit is in the training data. It’s demonstrating that models run ‘git log’, find the exact code to fix the issue against which they’ll be scored, and then they implement that code as-is.

The test environment contains the answers to the questions.

imiric 4 days ago

Well, we're dealing with (near) superintelligence here, according to the companies that created the models. Not only would I expect them to regurgitate the answers they were trained on, which includes practically the entire internet, but I would expect them to answer questions they weren't trained on. Maybe not with 100% accuracy, but certainly much higher than they do now.

It's perfectly reasonable to expect a level of performance concordant with the marketing of these tools. Claiming this is superintelligence while also excusing its poor performance is dishonest, and it amounts to false advertising.

Tanjreeve 4 days ago

Why does this matter if these models are a superintelligence with reasoning, etc., and don't need the answers sucked off the internet?

aurareturn 5 days ago

Are you going to rail on humans for making this mistake in the first place?

themafia 5 days ago

No, because that's the baseline. It's what you do when you have no other choice. Railing against that would be pointless.

ares623 5 days ago

I mean, if a human claimed they could do that, successfully received billions to attempt it, and then failed to deliver, I'd be railing against that particular human too.