alienbaby 19 hours ago

It doesn't test the model's ability to make good decisions on its own; it tests the model's ability to make something that 'works'. Often you look inside and it does a whole load of questionable things that mostly work, sure, but if you sat down and designed it properly yourself you would likely come up with something far more sane and maintainable.

LoganDark 13 hours ago

That's only the fault of particular benchmarks, and it's also why it's important for a benchmark to publish the outputs that produced a particular score. I'm not sure that all or even most benchmarks do this, but it matters when selecting a model.