behnamoh 4 hours ago
[flagged]
smokel 3 hours ago
I'll bite. The benchmark is actually pretty good. It shows in an extremely comprehensible way how far LLMs have come. Someone not in the know has a hard time understanding what 65.4% means on "Terminal-Bench 2.0". Comparing some crappy pelicans on bicycles is a lot easier.
quinnjh 3 hours ago
The field is advancing so fast that it's hard to do real science, as there will be a new SOTA by the time you're ready to publish results. I think this is a combination of that and people having a laugh. Would you mind sharing which benchmarks you think are useful measures for multimodal reasoning?