Remix.run Logo
madihaa 5 hours ago

A polygraph measures physiological proxies pulse, sweat rather than truth. Similarly, RLHF measures proxy signals human preference, output tokens rather than intent.

Just as a sociopath can learn to control their physiological response to beat a polygraph, a deceptively aligned model learns to control its token distribution to beat safety benchmarks. In both cases, the detector is fundamentally flawed because it relies on external signals to judge internal states.