smokel 5 hours ago
I'll bite. The benchmark is actually pretty good. It shows in an extremely comprehensible way how far LLMs have come. Someone not in the know has a hard time understanding what 65.4% means on "Terminal-Bench 2.0". Comparing some crappy pelicans on bicycles is a lot easier.
blibble 3 hours ago
it ceases to be a useful benchmark of general ability when you post it publicly for them to train against