YeGoblynQueenne a day ago

The best performance on GSM8K is currently 0.973, so less than perfect [1]. Given that GSM8K is a dataset of grade-school math questions and the leading LLMs still don't get all of its answers right, it's safe to assume they won't get all high-school questions right either, since those will be harder than grade-school questions. This means there has to be at least one example that GPT-5, as well as every other LLM, fails on [2].
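For a rough sense of scale, a 0.973 score still leaves a few dozen problems answered wrong. A minimal back-of-envelope sketch in Python, assuming the commonly used GSM8K test split of 1,319 problems (the leaderboard may compute its score on a different split):

    # Rough estimate of GSM8K test problems missed at 97.3% accuracy.
    # The 1,319-problem test split size is an assumption here.
    test_problems = 1319
    accuracy = 0.973
    missed = round(test_problems * (1 - accuracy))
    print(missed)  # -> 36, i.e. roughly 36 problems still answered incorrectly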

If you don't think that's the case, I think it's up to you to show that it's not.

___________________

[1] GSM8K leaderboard: https://llm-stats.com/benchmarks/gsm8k

[2] This is regardless of what GSM8K or any other benchmark is measuring.

simianwords a day ago | parent | next

“In many reasoning-heavy benchmarks, o1 rivals the performance of human experts. Recent frontier models do so well on MATH and GSM8K that these benchmarks are no longer effective at differentiating models. We evaluated math performance on AIME, an exam designed to challenge the brightest high school math students in America.”

https://openai.com/index/learning-to-reason-with-llms/

The benchmark was so saturated that they didn’t even bother running it on the newer models.

Which is interesting because it shows the rapid progress LLMs are making.

I’m also making a bigger claim: you can’t get GPT-5 Thinking to make a mistake in undergraduate-level maths. At the very least, its performance would be comparable to a good student’s.

simianwords a day ago | parent | prev

Sure, I didn’t say it was perfect. But I’m questioning the essence of the article.