| ▲ | Lerc 2 hours ago | |
Characterising it as cheating seems unfair. The goal of a benchmark is to evaluate actual capability. Following instructions is a capability, so you can measure that with a benchmark. Already knowing the answer also provides capability, and you can measure that too. Making a benchmark that claims to check for coding ability but actually checks memorized cases is simply measuring the wrong thing. It diminishes the meaningfulness of the benchmark's entire results. Making a good benchmark is hard. You have to design it specifically to measure what you want to show. When benchmarking the performance of optimising compilers, you have to dynamically use the result so that the compiler doesn't eliminate the entire calculation. If it can eliminate it, just providing the answer is the correct response. That the case does not represent general performance outside the benchmark is not cheating; it is the benchmark failing. Training a model to target a specific benchmark renders the benchmark useless. You could characterise training the model to do that as cheating, but that is a property of the trainers, not the model itself. The model isn't cheating; it's just asymmetrically good in a way that means the benchmark is no longer relevant to overall ability. | ||
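To make the memorized-cases point concrete, here is a rough sketch, not a real benchmark. Everything in it is made up for illustration: the toy task (adding two integers), `FIXED_CASES`, and the two solvers. It assumes the benchmark publishes a fixed set of cases and compares that against cases generated at evaluation time.

```python
import random

# Hypothetical published benchmark: the same (input, expected) pairs every run.
FIXED_CASES = [((2, 3), 5), ((10, -4), 6), ((0, 0), 0)]


def real_add(a, b):
    """A solver that actually has the capability the benchmark claims to test."""
    return a + b


# A "solver" that has only memorised the published cases.
_MEMORISED = {args: expected for args, expected in FIXED_CASES}


def memorised_add(a, b):
    """Returns the stored answer, or None for any input it has not seen."""
    return _MEMORISED.get((a, b))


def score_fixed(solver):
    """Fraction of the fixed, published cases answered correctly."""
    hits = sum(solver(*args) == expected for args, expected in FIXED_CASES)
    return hits / len(FIXED_CASES)


def score_fresh(solver, n=100, seed=0):
    """Fraction correct on cases generated at evaluation time.

    The input range is chosen so that it never overlaps FIXED_CASES.
    """
    rng = random.Random(seed)
    cases = [(rng.randint(1000, 9999), rng.randint(1000, 9999)) for _ in range(n)]
    hits = sum(solver(a, b) == a + b for a, b in cases)
    return hits / n


if __name__ == "__main__":
    for name, solver in [("real", real_add), ("memorised", memorised_add)]:
        print(name, score_fixed(solver), score_fresh(solver))
```

Both solvers get a perfect score on the fixed cases, so that score cannot tell them apart. Only the freshly generated cases separate them, which is the sense in which the fixed benchmark is measuring the wrong thing rather than being cheated.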
| ▲ | adamkinney 40 minutes ago | parent [-] | |
[dead] | ||