orbital-decay a day ago

If that prompt can be easily trained against, it probably doesn't exploit a generic bias. These are not that interesting, and there's no point in hiding them.

daedrdev a day ago

generic biases can also be fixed

orbital-decay a day ago

*Some* generic biases. Others, like recency bias, the serial-position effect, the "pink elephant" effect, and poor negation accuracy, seem to be fairly fundamental and are unlikely to be fixed without architectural changes, or at all. Things that exploit in-context learning and native context formatting are also hard to suppress during training without making the model worse.
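To make the positional-bias claim concrete: effects like recency bias can be probed by rotating answer options across slots and measuring where a model's picks land versus a uniform baseline. A minimal sketch, with a simulated chooser standing in for real model calls (the bias weights and slot counts are invented purely for illustration):

```python
import random
from collections import Counter

def positional_pick_rates(trials):
    """Given (options, chosen_slot) trials, return how often each slot
    was picked. If answers are rotated uniformly through the slots, an
    unbiased model picks each slot at roughly the uniform rate; a strong
    skew toward the last slot suggests recency bias."""
    counts = Counter(chosen for _, chosen in trials)
    total = len(trials)
    n_slots = len(trials[0][0])
    return {slot: counts.get(slot, 0) / total for slot in range(n_slots)}

def biased_pick(options, last_slot_weight=3.0):
    """Stand-in for an LLM call: picks a slot, overweighting the last one."""
    weights = [1.0] * len(options)
    weights[-1] = last_slot_weight
    return random.choices(range(len(options)), weights=weights)[0]

random.seed(0)
options = ["A", "B", "C", "D"]
trials = [(options, biased_pick(options)) for _ in range(1000)]
rates = positional_pick_rates(trials)
# With these weights the last slot is picked far above the 25% baseline.
```

The point of the rotation is that it separates "the model knows the answer" from "the model likes that slot": only the latter moves when the correct option changes position.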

fwip a day ago

Sure there is. If you want to know if students understand the material, you don't hand out the answers to the test ahead of time.

Collecting a bunch of "hard questions for LLMs" in one place will invariably invoke Goodhart's law (when a measure becomes a target, it ceases to be a good measure). You'll have no idea whether the next round of LLMs does better because the models are generally smarter or because they were trained specifically on these questions.
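The contamination worry here is checkable in its simplest form: verbatim leakage of benchmark text into a training corpus can be flagged with n-gram overlap. A toy sketch (the choice of n = 8 and whitespace tokenization are arbitrary assumptions; real decontamination pipelines are considerably more involved):

```python
def ngrams(text, n=8):
    """Set of whitespace-token n-grams in text, lowercased."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_overlap(benchmark_item, corpus_doc, n=8):
    """Fraction of the benchmark item's n-grams appearing verbatim in a
    training document; a high value suggests the item leaked into training."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)
```

This only catches verbatim copies, which is exactly the Goodhart failure mode fwip describes: once the questions are public, paraphrased or memorized variants slip past any such filter.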