lattalayta 8 months ago
I haven't been following them that closely, but are people finding these benchmarks relevant? It seems like these companies could just tune their models to do well on particular benchmarks.
mickael-kerjean 8 months ago
A benchmark is something you can optimize for, but that doesn't mean the model generalizes well. Yesterday I spent 2 hours trying to get Claude to create a program that would extract data from a weird Adobe file. $10 later, the best I had was a program that was doing something like:
emp17344 8 months ago
That’s exactly what’s happening. I’m not convinced there’s any real progress occurring here.