zerobees 3 hours ago

I know that people with strong feelings one way or the other will comment here, but note that this is specifically about problems with known answers that can be inferred from existing literature (e.g., training data).

This is an interesting result, but as I understand it, it's not about solving frontier challenges (which LLMs can evidently do too, but that's not what's tested here). It's closer to "can a mathematician (blindly) write exercises you can't cheat on using an LLM?" "Blindly" in the sense that they can't keep adjusting the problem ahead of time until they get a model to fail.

The conclusion in the paper is: "The concept of writing exercise-style benchmark questions based on publicly accessible research has reached its limits when it comes to the best-performing available models."

christianstump 2 hours ago

Let me also add: there is zero chance of the problems being included in the training data. The results are quite impressive: leading experts struggled to write questions with well-defined unique answers on existing research that the models were not able to solve.

This should not be interpreted as "AI can solve mathematics": the ability to solve exercise-style questions based on existing research is vastly different from the creation of new mathematics.

But it is still impressive and not what we expected -- I rather expected that we would end up with 20-40 questions no current publicly available model can solve.

lightningspirit 2 hours ago

I think most of the value LLMs provide comes from connecting the dots between unsolved questions and patterns or structures that have already been demonstrated, which accelerates research.

Reasoning in the sense of making truly original discoveries, as Einstein did with the field equations, is a different story for current LLMs.