layer8 2 days ago

> We can just run the code and see if the output is what we expected

There is a vast gap between the output happening to be what you expect and the code actually being correct.

That is, in a way, also the fundamental issue with LLMs: They are designed to produce “expected” output, not correct output.

etra0 16 hours ago | parent | next [-]

That is exactly my point, though.

I didn't mean they get it right the first time, or that the output is correct; I meant that you can 'run' and 'test' the code to see whether it does what you want in the way you want.

The same cannot be said of other topics like medical advice, life advice, etc.

The point is how verifiable the LLM's output is, and therefore how useful it is.

layer8 9 hours ago | parent [-]

My point is that running and testing the code successfully doesn’t prove correctness; it doesn’t show that “it does what you want in the way you want” under all circumstances. You have to actually look at the code and convince yourself that it is correct by reasoning about it.
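A minimal sketch of this (hypothetical function and test, purely for illustration): the test passes, yet the code is wrong.

    # Hypothetical buggy absolute-value function.
    def absolute(n):
        # Bug: negates unconditionally, so positive inputs come out wrong.
        return -n

    # This test passes, yet the function is incorrect in general:
    assert absolute(-5) == 5
    # absolute(5) returns -5, but no test exercises that input.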

Verdex a day ago | parent | prev [-]

For example:

The output is correct but only for one input.

The output is correct for all inputs but only with the mocked dependency.

The output looks correct but the downstream processors expected something else.

The output is correct for all inputs with real-world dependencies and is in the correct structure for downstream processors, but it isn't registered with the schema, so it gets filtered out and all of it is deleted in prod.

While implementing the correct function, you fail to notice that the output, correct in every other way, doesn't conform to that thing Tom said, because you didn't write the code yourself but let the LLM do it. The system works flawlessly with itself, but the final output fails regulatory compliance.
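To make the first two failure modes above concrete, here's a hypothetical sketch (the fetch_price function, the client API, and the payload shapes are all invented for illustration):

    from unittest.mock import Mock

    def fetch_price(client, item_id):
        # Assumes the service returns a dict with a "price" key.
        return client.get(item_id)["price"]

    # "Correct for all inputs", but only against the mocked dependency:
    mock_client = Mock()
    mock_client.get.return_value = {"price": 9.99}
    assert fetch_price(mock_client, "sku-1") == 9.99

    # If the real service returns {"price_cents": 999} instead,
    # the same code raises KeyError in production.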