Alupis 13 hours ago
I think the main difference is that an AI judge may give three different rulings if you just ask it the same thing three times. A human judge is much less likely to be so "flip-floppy". You can observe this with any of the present-day LLMs: ask it an architectural/design question, provide your thoughts, reasoning, constraints, etc., and see what it tells you. Then click the "Retry" button and see how similar (or dissimilar) the answer is. Sometimes you'll get a complete 180 from the prior response.
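(A minimal sketch of the "retry" experiment described above, assuming a hypothetical ask_llm() wrapper around whichever chat API you use; with any nonzero sampling temperature the run-to-run variation tends to show up quickly.)

    import collections

    def ask_llm(prompt: str) -> str:
        """Hypothetical wrapper around your chat-completion API of choice.
        Returns the model's answer as plain text."""
        raise NotImplementedError("plug in your own LLM client here")

    prompt = (
        "We need to pick between a message queue and direct RPC for "
        "service-to-service calls. Constraints: small team, bursty load, "
        "strict ordering not required. Which would you choose and why?"
    )

    # Ask the identical question several times, as if clicking "Retry".
    answers = [ask_llm(prompt) for _ in range(3)]

    # Crude comparison: bucket the answers by their opening line.
    leads = collections.Counter(a.splitlines()[0] for a in answers if a.strip())
    for lead, count in leads.items():
        print(count, "x", lead)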
bunderbunder 12 hours ago
Humans flip-flop all the time. This is a major reason the Myers-Briggs Type Indicator does such a poor job of assigning the same person the same type on successive tests.

It can be difficult to observe this in practice because, unlike an LLM, a human can't be asked the exact same question three times in five seconds and give three different answers - we have memory. But as someone who works with human-labeled data, it's something I have to contend with on a daily basis. For the things I'm working on, if you give the same annotator the same item to label twice, spaced far enough apart for them to forget they've seen it before, the chance of them making the same call both times is only about 75%.

If I do that with a prompted LLM annotator, I'm used to seeing more like 85%, and for some models you can get even better consistency than that with the right conditions and enough time spent fussing with the prompt. I still prefer the human labels when I can afford them, because LLM labeling has plenty of other problems. But being more flip-floppy than humans is not one that I have been able to empirically observe.
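(A rough sketch of how that consistency number can be computed, assuming you have two labeling passes over the same items; "agreement" here is just the fraction of items given the same label both times. The item ids and labels below are made up for illustration.)

    def intra_annotator_agreement(first_pass: dict, second_pass: dict) -> float:
        """Fraction of items labeled identically on two passes.
        Both dicts map item id -> label; only items present in both count."""
        shared = first_pass.keys() & second_pass.keys()
        if not shared:
            raise ValueError("no overlapping items between the two passes")
        matches = sum(first_pass[i] == second_pass[i] for i in shared)
        return matches / len(shared)

    # Toy example: same annotator (or same prompted LLM), same items, two passes.
    pass_1 = {"doc1": "spam", "doc2": "ham", "doc3": "spam", "doc4": "ham"}
    pass_2 = {"doc1": "spam", "doc2": "spam", "doc3": "spam", "doc4": "ham"}
    print(intra_annotator_agreement(pass_1, pass_2))  # 0.75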
acdha 10 hours ago
I’d think there’s also a key adversarial problem: a human judge has a conceptual understanding of the case, so you aren't going to be able to slightly tweak your wording and get wildly different outcomes, the way you can with an LLM.