og_kalu 4 days ago
The results are there, they're just hidden away in the appendix. The result is that those models don't actually suffer drops on 4/5 of their modified benchmarks. The one benchmark that does see actual drops that aren't explained by margin of error is the benchmark that adds "seemingly relevant but ultimately irrelevant information to problems". Those results are absent from the conclusion because the conclusion falls apart otherwise.