layer8 2 hours ago
> Once specs are captured as tests, the LLM can no longer hallucinate.

Tests are not a correctness proof. I can’t trust LLMs to correctly reason about their code, and tests are merely a sanity check; they can’t verify that the reasoning behind the code was correct.
survirtual an hour ago | parent
They do not need to be correctness proofs. With appropriate prompting and auditing, the tests let the LLM see whether the code functions as expected and iterate. They also serve as functionality documentation and audit documentation.

I also don't actually care whether it reasons properly. I care about results that eventually stabilize on a valid solution. Those results don't need to be based on "thinking"; they can be experimentally derived. Agents can own whatever domain they work in and acquire results with whatever methods they choose, given the constraints they are subject to.

I measure results by validating via e2e tests, penetration testing, and human testing. I also measure via architecture agents and code review agents that validate adherence to standards. If standards are violated, a deeper audit is conducted; if it becomes a pattern, the agent is modified until it stabilizes again.

This is more like numerical relaxation methods. You set the edge conditions / constraints, then iterate the system until it stabilizes on a solution. The solution in this case, however, is meta, because you are stabilizing on a set of agents that can stabilize on a solution.

Agents don't "reason" or "think", and I don't need to trust them. I trust only results.
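To make the relaxation analogy concrete, here is a minimal sketch of 1D Jacobi-style relaxation (illustrative only; the function name, tolerance, and values are made up, not anything from my actual setup). The endpoints are the fixed edge conditions; the interior is iterated until successive passes stop changing.

    # Minimal 1D relaxation sketch: boundary values are fixed constraints,
    # interior values are repeatedly averaged until the update is negligible.

    def relax(values, tol=1e-6, max_iters=10_000):
        """Jacobi-style relaxation on a 1D list; endpoints are the fixed edge conditions."""
        for _ in range(max_iters):
            new = values[:]
            for i in range(1, len(values) - 1):
                new[i] = 0.5 * (values[i - 1] + values[i + 1])  # local averaging update
            delta = max(abs(a - b) for a, b in zip(new, values))
            values = new
            if delta < tol:  # stabilized: further iteration changes nothing meaningful
                break
        return values

    # Edge conditions: 0.0 on the left, 1.0 on the right; interior starts arbitrary.
    print(relax([0.0, 0.3, 0.9, 0.2, 1.0]))

The agent setup is the same loop one level up: the tests and standards are the fixed boundary, the agents' output is the interior, and you keep iterating until nothing changes.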