baxtr 11 hours ago
I'd argue real judges are unreliable as well. The real question for me is: are they less reliable than human judges? Probably yes. But I'd favor a measurement relative to humans over a plain statement like that.
Alupis 11 hours ago
I think the main difference is an AI judge may provide three different rulings if you just ask it the same thing three times. A human judge is much less likely to be so "flip-floppy". You can observe this using any of the present-day LLMs: ask it an architectural/design question, provide it with your thoughts, reasoning, constraints, etc., and see what it tells you. Then click the "Retry" button and see how similar (or dissimilar) the answer is. Sometimes you'll get a complete 180 from the prior response.
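A rough way to put a number on the "Retry" experiment described above: ask the same question several times and report how often the answers match the most common one. This is only a sketch. `retry_agreement`, `make_fake_judge`, and the `ask` callable are hypothetical names, and the fake judge stands in for a real LLM call, which is not shown here.

```python
from collections import Counter
from typing import Callable, List


def retry_agreement(ask: Callable[[str], str], prompt: str, retries: int = 3) -> float:
    """Return the fraction of retries that match the most common answer.

    1.0 means every retry gave the same ruling; lower values mean more flip-flopping.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    # Normalize lightly so trivial whitespace/case differences do not count as disagreement.
    answers: List[str] = [ask(prompt).strip().lower() for _ in range(retries)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / retries


def make_fake_judge(rulings: List[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM: cycles through canned rulings on each call."""
    state = {"i": 0}

    def ask(prompt: str) -> str:
        ruling = rulings[state["i"] % len(rulings)]
        state["i"] += 1
        return ruling

    return ask
```

Exact string matching only works when the answer is reduced to a short label (for example "guilty" / "not guilty"); free-form answers would need some other similarity measure.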
Terr_ 11 hours ago
> The real question for me is: are they less reliable than human judges?
I'd caution that it's never just about ratios: We must also ask whether the "shape" of their performance is knowable and desirable. A chess robot's win-rate may be wonderful, but we are unthinkingly confident a human wouldn't "lose" by disqualification for ripping off an opponent's finger.
Would we accept a "judge" that is fairer on average... but gives ~5% lighter sentences to people with a certain color shirt, or sometimes issues the death penalty for shoplifting? Especially when we cannot diagnose the problem or be sure we fixed it? (Maybe, but hopefully not without a lot of debate over the risks!)
In contrast, there's a huge body of... of stuff regarding human errors, resources we deploy so pervasively it can escape our awareness: Your brain is a simulation and diagnostic tool for other brains, battle-tested (sometimes literally) over millions of years; we intuit many kinds of problems or confounding factors to look for, often because we've made them ourselves; and there are thousands of years of cultural practice for detection, guardrails, and error-compensating actions. Only a small minority of that toolkit can be reused for "AI."
andrewla 11 hours ago
I do think you've hit the heart of the question, but I don't think we can answer the second question.
We can measure how unreliable they are, or how susceptible they are to specific changes, just because we can reset them to the same state and run the experiment again. At least for now [1] we do not have that capability with humans, so there's no way to run a matching experiment on humans.
The best we can do is probably to run the limited experiments we can do on humans: comparing different judges' cross-referenced reliability to get an overall measure and some weak indicator of the reliability of a specific judge based on intra-judge agreement. But when running this on LLMs they would have to keep the previous cases in their context window to get a fair comparison.
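For the cross-judge comparison described above, one standard statistic is Cohen's kappa, which corrects the raw agreement rate between two judges for the agreement expected by chance. The comment does not name a metric, so this is a swapped-in choice, and the function name and label values below are illustrative assumptions.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(judge_a: Sequence[str], judge_b: Sequence[str]) -> float:
    """Cohen's kappa for two judges ruling on the same ordered list of cases.

    1.0 is perfect agreement, 0.0 is what chance alone would produce.
    """
    if len(judge_a) != len(judge_b) or len(judge_a) == 0:
        raise ValueError("both judges must rule on the same non-empty set of cases")
    n = len(judge_a)
    # Observed agreement: share of cases where both rulings are identical.
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # Chance agreement: probability both pick the same label independently,
    # given each judge's own label frequencies.
    counts_a, counts_b = Counter(judge_a), Counter(judge_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both judges used one identical label for every case.
        return 1.0
    return (observed - expected) / (1 - expected)
```

The same function can compare two runs of one LLM on identical cases, which gives the reset-and-rerun measurement that is not available for human judges.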
bunderbunder 11 hours ago
> The real question for me is: are they less reliable than human judges?
I've spent some time poking at this. I can't go into details, but the short answer is, "Sometimes yes, sometimes no, and it depends A LOT on how you define 'reliable'."
My sense is that, the more boring, mechanical and closed-ended the task is, the more likely an LLM is to be more reliable than a human. Because an LLM is an unthinking machine. It doesn't get tired, or hangry, or stressed out about its kid's problems at school. But it's also a doofus with absolutely no common sense whatsoever.
resource_waste 11 hours ago
I don't study domestic law enough, but I asked a professor of law: "With anything gray, does the stronger/bigger party always win?"
He said: "If you ask my students, nearly all of them would say Yes."
nimitkalra 11 hours ago
There are technical quirks that make LLM judges particularly high variance, sensitive to artifacts in the prompt, and positively/negatively-skewed, as opposed to the subjectivity of human judges. These largely arise from their training distribution and post-training, and can be contained with careful calibration.
not_maz 8 hours ago
I know the answer and I hate it.
AIs are inferior to humans at their best, but superior to humans as they actually behave in society, due to decision fatigue and other constraints. When it comes to moral judgment in high stakes scenarios, AIs still fail (or can be made to fail) in ways that are not socially acceptable.
Compare an AI to a real-world, overworked corporate decision maker, though, and you'll find that the AI is kinder and less biased. It still sucks, because GI/GO, but it's slightly better, simply because it doesn't suffer emotional fatigue, doesn't take as many shortcuts, and isn't clouded by personal opinions since it's not a person.
andrepd 10 hours ago
Judges can reason according to principles, and explain this reasoning. LLMs cannot (but they can pretend to, and this pretend chain-of-thought can be marketed as "reasoning"! See https://news.ycombinator.com/item?id=44069991)
th0ma5 11 hours ago
Can we stop with the "AI being unreliable like people" line? It is demonstrably false at best and cult-like thought termination at worst.
DonHopkins 11 hours ago
At least LLMs don't use penis pumps while on the job in court.
https://www.findlaw.com/legalblogs/legally-weird/judge-who-u...