hmmmmmmmmmmmmmm · 4 hours ago
But it doesn't, except on certain benchmarks that likely involve overfitting. Open-source models are nowhere to be seen on ARC-AGI: nothing above 11% on ARC-AGI 1. https://x.com/GregKamradt/status/1948454001886003328
meffmadd · 3 hours ago
Have you ever used an open model for a bit? I am not saying they are not benchmaxxing, but they really do work well, and they are only getting better.
Zababa · 3 hours ago
Has the gap between performance on "regular benchmarks" and on ARC-AGI been a good predictor of how good models "really are"? If a model is great on regular benchmarks but terrible on ARC-AGI, does that tell us anything about the model beyond "it's maybe benchmaxxed" or "it's not ARC-AGI benchmaxxed"?
doodlesdev · 3 hours ago
GPT-4o was also terrible at ARC-AGI, yet it's one of the most loved models of the last few years. Honestly, I'm a huge fan of the ARC-AGI series of benchmarks, but I don't believe it corresponds directly to the qualities most people assess when using LLMs.