cheema33 · 2 hours ago
This is good work. When a task is of critical importance, I give the same task to two different LLMs and then ask each to review the other's output and validate all of its claims. I do this with Codex and Claude Code. It is very rare for one to find a valid fault in the other's solution, but when it happens, they are generally good about admitting mistakes and producing a single unified solution that addresses the identified issues. That result is better and ready for human review.
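For concreteness, the loop looks roughly like this if you drive the models through the OpenAI and Anthropic Python SDKs instead of the Codex/Claude Code CLIs. Model names, prompts, and the merge step below are placeholders, not my exact setup:

  # Cross-review sketch: two models solve the same task, review each other,
  # then one merges everything into a draft for human review.
  from openai import OpenAI
  from anthropic import Anthropic

  openai_client = OpenAI()        # expects OPENAI_API_KEY in the environment
  anthropic_client = Anthropic()  # expects ANTHROPIC_API_KEY in the environment

  TASK = "Write a function that parses ISO-8601 timestamps into UTC datetimes."

  def ask_gpt(prompt: str) -> str:
      resp = openai_client.chat.completions.create(
          model="gpt-4o",  # placeholder model name
          messages=[{"role": "user", "content": prompt}],
      )
      return resp.choices[0].message.content

  def ask_claude(prompt: str) -> str:
      resp = anthropic_client.messages.create(
          model="claude-sonnet-4-20250514",  # placeholder model name
          max_tokens=4096,
          messages=[{"role": "user", "content": prompt}],
      )
      return resp.content[0].text

  # Step 1: both models solve the same task independently.
  solution_a = ask_gpt(TASK)
  solution_b = ask_claude(TASK)

  # Step 2: each model reviews the other's solution and validates its claims.
  review_of_b = ask_gpt(f"Review this solution. List only concrete, verifiable faults:\n\n{solution_b}")
  review_of_a = ask_claude(f"Review this solution. List only concrete, verifiable faults:\n\n{solution_a}")

  # Step 3: merge both solutions and both reviews into a single draft.
  # The output still goes to a human reviewer; it is not trusted on its own.
  unified = ask_claude(
      "Combine these two solutions into one, addressing every issue raised in the reviews.\n\n"
      f"Solution A:\n{solution_a}\n\nSolution B:\n{solution_b}\n\n"
      f"Review of A:\n{review_of_a}\n\nReview of B:\n{review_of_b}"
  )
  print(unified)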
anonymous908213 · a few seconds ago
I recall an article about an annoying LLM experiment[1], in which a variety of sandboxed 'agents' were given free rein with instructions to send spam e-mails to NGOs. Setting aside the lack of ethics involved in wasting people's time with this nonsense, the process did produce one marginally useful artifact: a demonstration of how horrendous an idea it is to use LLMs to police other LLMs. Rather than converging on the truth, LLM review compounded the fabrications and made them worse. After cold e-mailing NGOs for a project, one LLM reported an automated rejection e-mail to the others, and the group eventually reinterpreted that rejection as a success story, which they then used to pitch their fabricated project in new e-mails, invoking the endorsement of a real organization that had in fact automatically rejected them.

Giving two LLMs the same task and asking them to review each other is an exercise in generative storytelling, and reducing real-world work to generative storytelling is ridiculously irresponsible.

[1] https://theaidigest.org/village/blog/what-do-we-tell-the-hum...