kypro 6 days ago

> LLMs get endlessly confused: they assume the code they wrote actually works; when tests fail, they are left guessing as to whether to fix the code or the tests; and when it gets frustrating, they just delete the whole lot and start over.

That's actually an interesting point, and something I've noticed a lot myself. I find LLMs are very good at hacking around test failures, but unless a test is failing for a trivial reason, it's often pointing at some more fundamental issue with the underlying logic of the application, which LLMs don't seem able to pick up on, likely because they don't have a comprehensive mental model of how the system should work.

I don't want to point fingers, but I've been seeing this quite a bit in the code of colleagues who heavily use LLMs. On the surface the code looks fine, and they've produced tests which pass, but when you think about it for more than a minute you realise it doesn't really capture the nuance of the requirements, and it misses it in a way a human who had a mental model of how the system works probably wouldn't have.
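To make that concrete with a made-up sketch (the requirement, names and numbers here are all invented, not from any real codebase):

    # Hypothetical requirement: free shipping on orders of $50 or more,
    # but only for domestic addresses.
    def shipping_cost(total, domestic):
        return 0 if total >= 50 else 5  # bug: ignores the domestic condition

    # A test like this passes, so on the surface everything looks fine:
    def test_free_shipping():
        assert shipping_cost(60, domestic=True) == 0

    # The nuance a human with a mental model of the requirement would check,
    # and which exposes the bug:
    def test_no_free_shipping_for_international():
        assert shipping_cost(60, domestic=False) == 5

The code and the first test are mutually consistent, so nothing flags the gap unless you already know what the behaviour is supposed to be.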

Sometimes humans miss things in the logic when they're writing code, but those tend to look like mistakes in a single line rather than a fundamental failure to comprehend and model the problem. And I know that isn't what's going on here, because when you talk to these developers they understand the problem perfectly well.

To know whether it's the code or the test that needs fixing, you need a very clear idea of what should be happening, and LLMs just don't have one. I don't know why that is. Maybe it's just that they're missing the context from hours of reading tickets and technical discussions, or maybe it's their failure to ask questions when they're unsure of what should be happening. I don't know if this is a fundamental limitation of LLMs (I'd suspect not, personally), but it is a problem when using LLMs to code today, and one that more compute alone probably can't fix.