stephc_int13 · 4 hours ago
What I would do, if I were in the position of a large company in this space, is assemble an internal team to create an ARC replica covering very similar puzzles, and use that as part of the training data. Ultimately, most benchmarks can be gamed, so their real utility is short-lived. But I also think it's fair to use any means to beat them.
tylervigen · 4 hours ago
I agree that for any given test, you could build a specific pipeline to optimize for that test. I suppose that's why it's helpful to have many tests. However, many people have worked hard over the years to optimize tools specifically for ARC, and it has proven a particularly hard test to optimize for. That's why I find it so interesting that LLMs can do well on it at all, regardless of whether tests like it are included in the training data.

AstroBen · an hour ago
Is "good at benchmarks instead of real-world tasks" really something to optimize for? What does it achieve? Surely people would be initially impressed, try it out, be underwhelmed, and move on. That's not great for Google.

simpsond · 4 hours ago
Humans study for tests. They just tend to forget.