Case study: Creative math – How AI fakes proofs (tomaszmachnik.pl)
42 points by musculus 4 hours ago | 19 comments
simonw an hour ago
Somewhat ironic that the author calls out model mistakes and then presents https://tomaszmachnik.pl/gemini-fix-en.html - a technique they claim reduces hallucinations, which looks wildly superstitious to me. It involves spinning a whole yarn to the model about how it was trained to compete against other models, but now it's won, so it's safe for it to admit when it doesn't know something. I call this a superstition because the author provides no proof that all of that lengthy argument with the model is necessary. Does replacing that lengthy text with "if you aren't sure of the answer say you don't know" have the exact same effect?
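For what it's worth, the comparison is easy to run. A minimal sketch of that ablation (not the author's setup; `call_model` is a hypothetical stand-in for whatever chat client you use, and the question set is whatever the model reliably gets wrong):

```python
# Minimal sketch of the ablation, assuming a hypothetical call_model(system=..., user=...)
# that returns the model's reply as a string.
LONG_PREAMBLE = open("gemini-fix-preamble.txt").read()  # the full "you've already won" yarn
SHORT_PREAMBLE = "If you aren't sure of the answer, say you don't know."

def admit_rate(preamble, questions, call_model):
    """Fraction of questions where the model admits uncertainty instead of guessing."""
    admissions = 0
    for q in questions:
        reply = call_model(system=preamble, user=q).lower()
        if "don't know" in reply or "not sure" in reply:
            admissions += 1
    return admissions / len(questions)

# Run both preambles over the same question set; if the two rates come out close,
# the lengthy argument adds nothing over the one-liner.
```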
| ||||||||||||||||||||
v_CodeSentinal 2 hours ago
This is the classic 'plausible hallucination' problem. In my own testing with coding agents, I see this constantly—LLMs will invent a method that sounds correct but doesn't exist in the library. The only fix is a tight verification loop. You can't trust the generative step without a deterministic compilation/execution step immediately following it. The model needs to be punished/corrected by the environment, not just by the prompter.
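To make "tight verification loop" concrete, a minimal sketch (my own toy framing, not any particular agent framework; it assumes pytest, a tests/ directory, and a hypothetical `generate_patch` model call):

```python
import subprocess

def generate_and_verify(task, generate_patch, max_attempts=3):
    """Sketch of a generate -> deterministically verify -> feed back loop.
    `generate_patch` is a hypothetical model call returning candidate code."""
    feedback = ""
    for _ in range(max_attempts):
        code = generate_patch(task, feedback)
        with open("candidate.py", "w") as f:
            f.write(code)
        # Deterministic check: actually run the tests instead of trusting the model.
        result = subprocess.run(
            ["python", "-m", "pytest", "tests/", "-q"],
            capture_output=True, text=True,
        )
        if result.returncode == 0:
            return code                               # the environment accepted it
        feedback = result.stdout + result.stderr      # the environment corrects the model
    raise RuntimeError("No candidate survived verification")
```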
| ||||||||||||||||||||
tombert 12 minutes ago
I remember when ChatGPT first came out, I asked it for a proof of Fermat's Last Theorem, which it happily gave me. It was fascinating, because it was making a lot of the understandable mistakes that 7th graders make. For example, I don't remember the surrounding context, but it decided that you could break `sqrt(x^2 + y^2)` into `sqrt(x^2) + sqrt(y^2) => x + y`. It's interesting because it was one of those "ASSUME FALSE" proofs; if you can assume false, then mathematical proofs become considerably easier.
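For anyone who wants the one-line refutation of that step, the classic 3-4-5 triangle already breaks it:

```latex
\sqrt{3^2 + 4^2} = \sqrt{25} = 5
\qquad\text{but}\qquad
\sqrt{3^2} + \sqrt{4^2} = 3 + 4 = 7
```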
godelski 25 minutes ago
I thought it funny that a few weeks ago Karpathy shared a sample of Nano Banana solving some physics problems: despite producing the right output, it didn't actually get the right answers. I think it's quite illustrative of the problem even with coding LLMs.

Code and math proofs aren't so different; what matters is the steps taken to generate the output, and those steps matter far more than the output itself. The output is meaningless if the steps to get there aren't correct. You can't just jump to the last line of a proof to determine its correctness, and similarly you can't just look at a program's output to determine its correctness. Checking output is a great way to invalidate something, but it does nothing to validate it (a minimal example below).

Maybe what surprised me most is that the mistakes Nano Banana made are simple enough that I'm absolutely positive Karpathy could have caught them, even if his physics is very rusty. I'm often left wondering if people really are true believers becoming blind to the mistakes, or if they just don't care. It's fine to make mistakes, but I rarely see corrections, and let's be honest, these are mistakes that people of this caliber should not be making. I expect most people here can find multiple mistakes in the physics problem: one if you know what the derivative of e^x is, and another if you can count how many i's there are.

The AI cheats because it's focused on the output, not the answer. We won't solve this problem until we recognize that the output and the answer aren't synonymous.
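The promised minimal example (made-up functions, nothing from the article): two implementations agree on every spot-checked output, but only one is correct, so passing the output check validates nothing.

```python
def add(x, y):
    return x + y

def add_lucky(x, y):
    # Wrong in general, but happens to agree on the inputs spot-checked below.
    return x * y

# The output check passes on these inputs (2+2 == 2*2, 0+0 == 0*0)...
for x, y in [(2, 2), (0, 0)]:
    assert add(x, y) == add_lucky(x, y)

# ...yet all the check did was fail to invalidate add_lucky; it validated nothing:
print(add_lucky(2, 3))  # 6, not 5
```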
threethirtytwo 2 hours ago
You don't need a test to know this: we already know there's heavy reinforcement training done on these models, so they optimize for passing the training. Passing the training means convincing the person rating the answers that the answer is good. The keyword is convince. The model just needs to convince people that it's right; it is optimizing for convincing people. Out of all the answers that can convince people, some are actually correct and others are wrong.
| ||||||||||||||||||||
bwfan123 2 hours ago
I am actually surprised that the LLM came so close. I doubt it had examples in its training set for these numbers. This goes to the heart of "know-how". The LLM should have said "I am not sure", but instead it gets into rhetoric to justify itself. It actually mimics the human behavior of motivated reasoning. At orgs, management is impressed with this overconfident motivated reasoner because it mirrors themselves: to hell with the facts and the truth, persuasion is all that matters.
benreesman 3 hours ago
They can all write Lean 4 now; don't accept numbers that don't carry proofs. The CAS I use for builds has a coeffect discharge cert in the attestation header, a couple lines of code. Graded monads are a snap in CIC.
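At its smallest, "a number that carries a proof" looks like this in Lean 4 (a toy sketch, nothing to do with the poster's CAS or attestation setup); if the claim were false, the file simply wouldn't compile:

```lean
-- Toy sketch: the claim travels with a kernel-checked proof, not as a bare number.
-- `decide` evaluates the decidable equality; a wrong claim fails to compile.
example : 3 ^ 2 + 4 ^ 2 = 5 ^ 2 := by decide

-- A value bundled with a proof that it meets its spec.
def provenSum : { n : Nat // n = 2 + 2 } := ⟨4, rfl⟩
```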
| ||||||||||||||||||||
semessier 2 hours ago
that's not a proof | ||||||||||||||||||||
| ||||||||||||||||||||
fragmede 3 hours ago
> a session with Gemini 2.5 Pro (without Code Execution tools)

How good are you at programming on a whiteboard? How good is anybody? With code execution tools withheld from me, I'll freely admit that I'm pretty shit at programming. Hell, I barely remember the syntax in some of the more esoteric, unpracticed places of my knowledge. Thus, it's hard not to see case studies like this as dunking on a blindfolded free throw shooter and calling it analysis.
| ||||||||||||||||||||
rakmo 2 hours ago
Is this hallucination, or is this actually quite human (albeit a specific type of human)? Think of slimy caricatures like a used car salesman: isn't this the exact type of underhandedness you'd expect?