libraryofbabel 6 days ago

And as others have pointed out, this issue of "how much should I check" is really just a subset of an old general problem in trust and knowledge ("epistemology" or what have you) that people have recognized since at least the scientific revolution. The Royal Society's motto on its founding in the 1660s was "Nullius in verba": take nobody's word for it.

Coding agents have gotten pretty good at checking themselves against reality, at least for things where they can run unit tests or a compiler to surface errors. That would catch the error in TFA. Of course there is still more checking to do down the line, in code reviews etc, but that goes for humans too. (This is not to say that humans and LLMs should be treated the same here, but neither do I treat an intern's code and a staff engineer's code the same.) It's a complex issue that we can't really collapse into "LLMs are useless because they get things wrong sometimes."

AllegedAlec 6 days ago | parent [-]

> Coding agents have now got pretty good at checking themselves against reality, at least for things where they can run unit tests or a compiler to surface errors.

YMMV. I've seen Claude go completely batshit insane, insisting that all the tests passed. Then I run them and see 50+ failures. I copy him the output and tell him to fix it; he launches into his sycophantic apologia, spins his wheels doing nothing, and then says all tests are back to green.