| ▲ | alienbaby 19 hours ago | |
It doesn't test the model's ability to make good decisions on its own; it tests the model's ability to make something that 'works'. Often you look inside and it does a whole load of questionable things that mostly work, sure, but if you sat down and designed it properly yourself you would likely come up with something far more sane and maintainable. | ||
| ▲ | LoganDark 13 hours ago | |
That's only the fault of particular benchmarks, and it's also why it's important for a benchmark to publish the outputs that produced a given score. I'm not sure that all or even most benchmarks do this, but it matters when selecting a model. | ||