They do not need to be correctness proofs. With appropriate prompting and auditing, the tests allow the LLM see if the code functions as expected and iterates. It also serves as functionality documentation and audit documentation.

I also actually do not care if it reasons properly. I care about results that eventually stabilizes on a valid solution. These results do not need to based on "thinking," it can be experimentally derived. Agents can own whatever domain they work in, and acquire results with whatever methods they choose given constraints they are subject to. I measure results by validating via e2e tests, penetration testing, and human testing.

I also measure via architecture agents and code review agents that validate adherence to standards. If standards are violated a deeper audit is conducted, if it becomes a pattern, the agent is modified until it stabilizes again.

This is more like numerical methods of relaxation. You set the edge conditions / constraints, then iterate the system until it stabilizes on a solution. The solution in this case, however, is meta, because you are stabilizing on a set of agents that can stabilize on a solution.

Agents don't "reason" or "think", and I don't need to trust them. I trust only results.

▲

layer8 3 hours ago | parent [-]

The point is that tests generally only test specific inputs and circumstances. They are a heuristic, but don’t generalize to all possible states and inputs. It’s like probing a mathematical function on some points, where the results being correct on the probed points doesn’t mean the function will yield the desired result on all points of its domain. If the tests are the only measure, they become the target.

The value of a good developer is that they generalize over all possible inputs and states. That’s something current LLMs can’t be trusted to do (yet?).

	▲	survirtual 6 minutes ago \| parent [-]
		Not relevant. Hallucinations don't matter if the mechanics of the pipeline mitigate them. In other words, at a systems level, you can mitigate hallucinations. The agent level noise is not a concern. This is no different from CPU design or any other noisy system. Transistors are not perfect and there is always error, so you need error correction. At a transistor level, CPUs are unreliable. At a systems level, they are clean and reliable. This is no different. The stochastic noisiness of individual agents can be mitigated with redundancy, constraints, and error correction at a systems level.