▲ meffmadd 3 hours ago
Have you ever used an open model for a bit? I am not saying they are not benchmaxxing, but they really do work well and are only getting better.
▲ Aurornis 2 hours ago
I have used a lot of them. They're impressive for open weights, but the benchmaxxing becomes obvious. They don't compare to the frontier models (yet), even when the benchmarks show them coming close.
|
|
▲ Zababa 3 hours ago
Has the difference between performance on "regular benchmarks" and ARC-AGI been a good predictor of how good models "really are"? If a model is great on regular benchmarks but terrible at ARC-AGI, does that tell us anything about the model other than "it's maybe benchmaxxed" or "it's just not ARC-AGI benchmaxxed"?
|
▲ doodlesdev 3 hours ago
GPT-4o was also terrible at ARC-AGI, yet it's one of the most loved models of the last few years. Honestly, I'm a huge fan of the ARC-AGI series of benchmarks, but I don't believe they correspond directly to the qualities most people assess when using LLMs.
▲ nananana9 an hour ago
It was terrible at a lot of things. It was beloved because when you said "I think I'm the reincarnation of Jesus Christ," it would tell you "You know what... I think I believe it! I genuinely think you're the kind of person who appears once every few millennia to reshape the world!"
▲ mrybczyn an hour ago
Because ARC-AGI involves de novo reasoning over a restricted and (hopefully) unpretrained territory, in 2D space. Not many people use LLMs as more than a better Wikipedia, Stack Overflow, or autocomplete...
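
To make that "2D territory" concrete: ARC-AGI tasks are small colored-grid puzzles where the solver must infer a transformation rule from a few input/output demonstration pairs and apply it to a held-out input. A minimal sketch in Python follows; the task data and the mirror_horizontal rule are invented for illustration (real tasks ship as JSON grids of integer colors 0-9 with "train" and "test" pairs), not an actual ARC task.

    # Minimal sketch of an ARC-AGI-style task. Grids are lists of rows,
    # each cell an integer color 0-9. Task data here is hypothetical.
    Grid = list[list[int]]

    task = {
        "train": [
            {"input": [[1, 0], [2, 3]], "output": [[0, 1], [3, 2]]},
            {"input": [[5, 5, 0]], "output": [[0, 5, 5]]},
        ],
        "test": [{"input": [[7, 0, 4]]}],
    }

    def mirror_horizontal(grid: Grid) -> Grid:
        # The rule a solver would have to infer from the train pairs
        # alone: every row is reversed left-to-right.
        return [list(reversed(row)) for row in grid]

    # Check the candidate rule against the demonstration pairs...
    assert all(
        mirror_horizontal(pair["input"]) == pair["output"]
        for pair in task["train"]
    )
    # ...then apply it to the held-out test input.
    print(mirror_horizontal(task["test"][0]["input"]))  # [[4, 0, 7]]

The point of the benchmark is that the rule is different for every task and (ideally) absent from pretraining data, so pattern-matching over memorized text doesn't help.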
|