dartos 4 days ago

It’s difficult to test OpenAI models reproducibly, since they can change out from under you and you don’t have control over every hyperparameter.

It would’ve been nice to see one of the larger Llama models, though.

og_kalu 4 days ago

The results are there, they're just hidden away in the appendix. They show that those models don't actually suffer drops on 4 of the 5 modified benchmarks. The one benchmark that does see actual drops that aren't explained by margin of error is the one that adds "seemingly relevant but ultimately irrelevant information to problems".

Those results are absent from the conclusion because the conclusion falls apart otherwise.