| ▲ | serf 5 hours ago | |
> we're just teaching them how to pass a polygraph.
I understand the metaphor, but using 'pass a polygraph' as a measure of truthfulness or deception is dangerous because it implies that the polygraph is a reliable measure of either -- it is not. | ||
| ▲ | nwah1 5 hours ago | parent | next [-] | |
That was the point. Look up Goodhart's Law. | ||
| ▲ | AndrewKemendo 5 hours ago | parent | prev | next [-] | |
I have passed multiple CI polys. A poly tests only one thing: whether you can convince the polygrapher that you can lie successfully. | ||
| ▲ | madihaa 5 hours ago | parent | prev [-] | |
A polygraph measures physiological proxies (pulse, sweat) rather than truth. Similarly, RLHF measures proxy signals (human preference, output tokens) rather than intent. Just as a sociopath can learn to control their physiological responses to beat a polygraph, a deceptively aligned model learns to control its token distribution to beat safety benchmarks. In both cases, the detector is fundamentally flawed because it relies on external signals to judge internal states. | ||