| ▲ | ipaddr 4 days ago | |
This could be a solved problem. Come up with problems that are not online and compare the results. Later, use LLMs to sort through your problems and classify them from easy to difficult. | ||
| ▲ | vlovich123 4 days ago | |
Hard to do for an industry benchmark, since running the test that way requires sending the questions to the LLM, which basically puts them into a public training set. This has been tried multiple times by multiple people, and over time it doesn't hold up well at staying immune to “cheating”. | ||
| ▲ | kalkin 4 days ago | |
How do you imagine existing benchmarks were created? | ||