Calavar 7 hours ago

> But it’s getting harder and harder to define a task that humans beat LLMs on. On pretty much any easily quantifiable test of knowledge or reasoning, the machines win.

Quite the contrary: I think it's trivially easy to find a task where humans beat LLMs.

For all the money that's been thrown at agentic coding, LLMs still produce substantially worse code than a senior dev. See my own prior comments on this for a concrete example [1].

These trivial failure cases show that there are dimensions to task proficiency - significant ones - that benchmarks fail to capture.

> Is medical diagnosis one of these high judgement tasks?

Situational. I would break diagnosis into three types:

1. The diagnosis comes from objective criteria - laboratory values, vital signs, visual findings, family history. I think LLMs are likely already superior to humans in this case.

2. The diagnosis comes from "chart lore" - reading notes from prior physicians and realizing that new context now points to a different diagnosis. (That new context can be the benefit of hindsight into what was already tried and failed, and/or new objective data.) LLMs do pretty well at this when you point them at datasets where all the prior notes were written by humans, which means that those humans did a nontrivial part of the diagnostic work. What if the prior notes were written by LLMs as well? Will they propagate their own mistakes forward? That has yet to be studied in depth.

3. The diagnosis comes from human interaction - knowing the difference between a patient who's high as a kite on crack and one who's delirious from infection; noticing that a patient hesitates slightly before assuring you that they've been taking all their meds as prescribed; etc. I doubt that LLMs will ever beat humans at this, but if LLMs can be proven good at point 2, then point 3 alone will not save human physicians.

[1] https://news.ycombinator.com/threads?id=Calavar#47891432