adastra22 | 4 hours ago

Actually it works the other way. With multiple agents they can often correct each other's mistaken assumptions. Part of the value of this approach is precisely that you do get better results with fewer hallucinated assumptions. Still makes this change from Anthropic stupid.

rco8786 | 3 hours ago

The corrective agent has exactly the same chance of making a mistake: "correcting" an assumption that was previously correct into an incorrect one. If a single agent has a 1% chance of making an incorrect assumption, then 10 agents have that same 1% chance in aggregate.

adastra22 | 2 hours ago

You are assuming statistical independence, which is explicitly not correct here. There is also an error in your analysis: what matters is whether they make the same wrong assumption. That is far less likely, and becomes exponentially unlikely with increasing trials. I can attest that it works well in practice, and my organization is already deploying this technique internally.

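A toy Monte Carlo sketch of that second point, with purely illustrative numbers (not measured agent behavior): assume each agent errs independently with probability p and, when it errs, lands on one of k distinct wrong assumptions.

```python
import random

def simulate(n_agents=5, p_err=0.2, k_wrong=10, trials=100_000):
    """How often does a majority of independent agents settle on the
    correct assumption vs. on the *same* wrong assumption?"""
    correct_majority = 0
    same_wrong_majority = 0
    for _ in range(trials):
        votes = []
        for _ in range(n_agents):
            if random.random() < p_err:
                # erring agents spread over k distinct wrong assumptions
                votes.append(f"wrong-{random.randrange(k_wrong)}")
            else:
                votes.append("correct")
        top = max(set(votes), key=votes.count)
        if votes.count(top) > n_agents // 2:
            if top == "correct":
                correct_majority += 1
            else:
                same_wrong_majority += 1
    print(f"majority on the correct assumption:   {correct_majority / trials:.4f}")
    print(f"majority on the same wrong assumption: {same_wrong_majority / trials:.4f}")

simulate()
```

With these defaults a majority lands on the correct assumption most of the time, while majority agreement on the same wrong assumption is roughly a fraction of a percent; that gap is the signal the orchestration relies on.
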
thesz | an hour ago

How do several wrong assumptions become right with increasing trials?

adastra22 | an hour ago

You can ask Opus 4.6 to do a task and leave it running for 30 minutes or more to attempt one-shotting it. Imagine doing this with three agents in parallel in three separate worktrees. Then spin up a new agent to decide which approach of the three is best on the merits. Repeat this analysis in fresh contexts and sample until there is clear consensus on one. If there is no consensus after N runs, reframe the task to provide directions for a fourth attempt. Continue until a clear winning approach is found. This is one example of an orchestration workflow; there are others.

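A minimal sketch of that loop, assuming a hypothetical run_agent(prompt) helper that wraps whatever CLI or API actually launches an agent; the task string, sample count, and consensus threshold are all illustrative:

```python
import collections

def run_agent(prompt: str) -> str:
    # Hypothetical helper: launch a fresh agent on `prompt` and return its output.
    raise NotImplementedError("wire this up to your agent runner")

TASK = "Implement feature X with tests"  # placeholder task description

def orchestrate(n_attempts=3, max_judge_samples=9, consensus=2/3):
    # 1) Independent one-shot attempts, conceptually one per git worktree.
    attempts = [run_agent(f"{TASK}\n(Work in worktree {i}.)") for i in range(n_attempts)]

    # 2) Fresh judge agents in fresh contexts pick the best attempt on the merits;
    #    keep sampling until one attempt has a clear consensus.
    votes = collections.Counter()
    for sample in range(1, max_judge_samples + 1):
        verdict = run_agent(
            "Compare these candidate solutions on the merits and reply with the "
            "index of the best one, nothing else:\n\n"
            + "\n\n".join(f"[{i}]\n{a}" for i, a in enumerate(attempts))
        )
        votes[verdict.strip()] += 1
        winner, count = votes.most_common(1)[0]
        if sample >= 3 and count / sample >= consensus:
            return attempts[int(winner)]

    # 3) No clear consensus: reframe and direct a fourth attempt.
    guidance = run_agent(
        "Summarize the strengths and weaknesses of each attempt and give "
        "directions for a better one:\n\n" + "\n\n".join(attempts)
    )
    return run_agent(f"{TASK}\nFollow this guidance:\n{guidance}")
```
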
thesz | 25 minutes ago

> Then spin up a new agent to decide which approach of the three is best on the merits. Repeat this analysis in fresh contexts and sample until there is clear consensus on one.

If there are several agents doing the analysis of solutions, how do you define consensus? Should it be unanimous, or above some threshold? Are the agents' scores soft or hard? How is the threshold defined if the scores are soft? There is a whole body of science on voting approaches; which voting approach is best here? And is it possible for the analyzing agents to choose the best of several wrong solutions, e.g. the longest remembered table of FizzBuzz answers among remembered tables of FizzBuzz answers?

groundzeros2015 | 3 hours ago

Nonsense. If you have 16 binary decisions, that's 64k possible paths.

adastra22 | 2 hours ago

These are not independent samplings.

groundzeros2015 | 2 hours ago

Indeed. Doesn't that make it worse? Prior decisions will bring up path-dependent options, ensuring they aren't even close to the same path.

adastra22 | an hour ago

Run a code review agent and ask it to identify issues. For each issue, run multiple independent agents to perform independent verification of that issue. There will always be some that concur and some that disagree, but the probability distributions are vastly different for real issues vs. hallucinations. If it is a real issue, the verifiers are more likely to happen upon it; if it is a hallucination, they are more likely to discover the inconsistency on fresh examination. This is NOT the same as asking “are you sure?”, where the sycophantic nature of LLMs would bias them. Fresh agents given an unbiased, detached framing in the prompt will show behavior that is probabilistically consistent with the underlying truth, and that is consistent enough to tease signal from noise with agent orchestration.

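A rough sketch of that verification step, again using a hypothetical run_agent() wrapper; the verifier count, confirmation threshold, and prompt wording are illustrative:

```python
def run_agent(prompt: str) -> str:
    # Hypothetical wrapper, as in the earlier sketch: launch a fresh agent, return its reply.
    raise NotImplementedError("wire this up to your agent runner")

def issue_is_real(issue: str, diff: str, n_verifiers: int = 5, threshold: float = 0.7) -> bool:
    """Ask several fresh agents, with detached framing, to independently verify one
    reported issue. Confirmations above `threshold` count as a real issue; anything
    else is treated as hallucination noise."""
    confirms = 0
    for _ in range(n_verifiers):
        verdict = run_agent(
            "You are reviewing a code change. Independently determine whether the "
            "following reported issue is a real defect in the diff. Answer YES or NO, "
            "then give a one-line justification.\n\n"
            f"Reported issue:\n{issue}\n\nDiff:\n{diff}"
        )
        if verdict.strip().upper().startswith("YES"):
            confirms += 1
    return confirms / n_verifiers >= threshold

def triage(reported_issues: list[str], diff: str) -> list[str]:
    # Keep only the issues that independent verifiers agree on.
    return [issue for issue in reported_issues if issue_is_real(issue, diff)]
```
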
peyton | 4 hours ago

Take a look at the latest Codex on very-high. Claude’s astroturfed IMHO.

rco8786 | 3 hours ago

Can you explain more? I'm talking about LLM/agent behavior in a generalized sense, even though I used Claude Code as the example here. What is Codex doing differently to solve this problem?