lukko 8 hours ago
I'm surprised at both the article and the paper - both seem very hyperbolic. This is LLMs competing against doctors in a way that is heavily weighted in the LLMs' favour, and it does not represent clinical practice. These reasoning cases are not benchmarks for doctors; they are learning tools.

It's important to note that diagnosis also relies on an accurate description of the patient in the first place, and the information you gather depends on the differential diagnosis. Part of the skill of being a doctor is gathering information from lots of different sources and filtering out what is important. That information may come from the patient - who may not be able to communicate clearly, or may be non-verbal - or from carers and next of kin. History-taking is a skill in itself, as is examination. Here, those data are simply given.

For pattern recognition over plain text, especially on questions that may be in o1's training data, I'm not surprised at all that it would outperform doctors, but it doesn't seem to be a clinically useful comparison. Deciding which investigations to order, what imaging to do, and which parts of the history to discard is a skill in itself, and can't really be separated from forming the diagnosis.
lokar 7 hours ago (parent)
Also, you need to see an analysis of the incorrect calls. The goal of a human doctor is not to get the highest accuracy; it's to limit total harm to the patient. There can be cases where the odds favour picking X (though perhaps not by much), but the safe thing to do is to rule out some other option first, or to start a safe treatment that covers several other possibilities. Simply getting the "high score" on this evaluation is not necessarily good medical practice.