You only read the first line of the summary. This is the juicy bit:
> The LLM alone scored 16 percentage points (95% CI, 2-30 percentage points; P = .03) higher than the conventional resources group.
Basically they setup the experiment as a control group and a LLM-assisted group. There was no difference between the two groups and that is what was reported in the top level finding that you quote.
Then they went back and said “wait, what if we just blindly trusted the LLM? What if we had a third group that had no doctor involved — just let the LLM do the diagnosis?” This retroactively synthesized group did significantly better than either of the actual experimental groups:
> The LLM alone scored 16 percentage points (95% CI, 2-30 percentage points; P = .03) higher than the conventional resources group … The LLM alone demonstrated higher performance than both physician groups, indicating the need for technology and workforce development to realize the potential of physician-artificial intelligence collaboration in clinical practice.