A process is described here: https://arxiv.org/pdf/2506.22405

>A physician or AI begins with a short case abstract and must iteratively request additional details from a gatekeeper model that reveals findings only when explicitly queried. Performance is assessed not just by diagnostic accuracy but also by the cost of physician visits and tests performed.

▲

apical_dendrite 2 days ago | parent | next [-]

I believe that dataset was built off of cases that were selected for being unusual enough for physicians to submit to the New England Journal of Medicine. The real-world diagnostic accuracy of physicians in these cases was 100% - the hospital figured out a diagnosis and wrote it up. In the real world these cases are solved by a team of human doctors working together, consulting with different specialists. Comparing the model's results to the results of a single human physician - particularly when all the irrelevant details have been stripped away and you're just left with the clean case report - isn't really reflective of how medicine works in practice. They're also not the kind of situations that you as a patient are likely to experience, and your doctor probably sees them rarely if ever.

	▲	atleastoptimal 2 days ago \| parent [-]
		Either way, the AI model performed better than the humans on average, so it would be reasonable to infer that AI would be a net positive collaborator in a team of internists.

▲

sorcerer-mar 3 days ago | parent | prev [-]

Okay you have a point. AI probably would do really well when short case abstracts start walking into clinics.

▲

atleastoptimal 3 days ago | parent [-]

How else would a study scientifically determine the accuracy of an AI model in diagnosis? By testing it on real people before they know how good it is?

	▲	rafaelmn 3 days ago \| parent [-]
		Why not ? Have AI do it then have human doctor do a follow-up/review ? I might not be a fan of this for urgent care but for general visits I wouldn't mind spending a bit extra time if they it was followed by an expert exam.