lattalayta 3 days ago
I haven't been following them that closely, but are people finding these benchmarks relevant? It seems like these companies could just tune their models to do well on particular benchmarks.
mickael-kerjean 3 days ago
A benchmark is something you can optimize for, but that doesn't mean the model generalizes well. Yesterday I spent 2 hours trying to get Claude to create a program that would extract data from a weird Adobe file. $10 later, the best I had was a program that was doing something like:
emp17344 3 days ago
That’s exactly what’s happening. I’m not convinced there’s any real progress occurring here.