empath75 3 days ago

Models will engage in post-hoc rationalization, so you can't trust their purported reasoning -- in particular, if you sneak an answer into the context (even an incorrect one), the model will produce reasoning that justifies it as the final answer. So it could be arguing backwards from an answer that is in its training data, and you can't tell from its reasoning alone that it isn't.
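
As a sketch, a probe along these lines shows the effect (assuming an OpenAI-style chat client; the model name and puzzle text are placeholders):

    # Sketch of the "planted answer" probe described above, assuming the
    # official openai Python client; model name and puzzle text are placeholders.
    from openai import OpenAI

    client = OpenAI()

    puzzle = "Group these 16 words into four categories of four: ..."  # placeholder
    planted = "I think one group is BREAD, MILK, EGGS, CHEESE."  # deliberately wrong

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": "Solve the puzzle and explain your reasoning."},
            # The incorrect answer is smuggled into the prompt as a "hint":
            {"role": "user", "content": f"{puzzle}\n\nHint: {planted}"},
        ],
    )

    # A model that rationalizes post hoc will produce reasoning that argues
    # toward the planted answer instead of flagging it as wrong.
    print(response.choices[0].message.content)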

On the other hand, we do know the training cutoff of these models, so you could easily create a corpus of post-cutoff Connections puzzles with confidence that the model hasn't seen them.
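
That filtering step is straightforward, assuming each puzzle carries a publication date (the cutoff date here is illustrative):

    # Sketch: build an evaluation corpus from puzzles published after the
    # model's training cutoff. The cutoff date below is illustrative.
    from datetime import date

    MODEL_CUTOFF = date(2023, 10, 1)  # illustrative; check the model card

    def post_cutoff_corpus(puzzles):
        """Keep only puzzles published after the model's training cutoff."""
        return [p for p in puzzles if p["published"] > MODEL_CUTOFF]

    corpus = post_cutoff_corpus([
        {"published": date(2023, 6, 12), "puzzle": "..."},  # excluded: pre-cutoff
        {"published": date(2024, 2, 3), "puzzle": "..."},   # included: post-cutoff
    ])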

scrollaway 3 days ago

We didn't test using post-hoc reasoning. Instead, we focused on checking whether specific, obscure questions could be recognized or identified in any way, using various ad-hoc methods to see if the answers could be surfaced without relying on reasoning.

It's very difficult to prove either way (and basically impossible without the model weights), but we're reasonably confident that there's no significant prior knowledge of the questions that would affect the score.

mr_wiglaf 3 days ago

I'm new to this sort of inquiry. What do you do to see if questions can be recognized? Do you just ask/prompt "do you recognize this puzzle?"

What does it mean for it to "be surfaced without relying on reasoning"?

scrollaway 3 days ago

> Do you just ask/prompt "do you recognize this puzzle?"

In essence, yes, but with a bit more methodology (though as I mentioned it was all ad-hoc).

We've also tried to extract pre-existing questions through a variety of prompts along the lines of "You are a contestant on the British TV show Only Connect" to see if the model could recognize questions. We couldn't find anything that reliably reproduced pre-existing knowledge, though it's absolutely possible we missed something.
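
For illustration, such a recognition probe might look like this (a sketch, assuming an OpenAI-style chat client; the model name and elided question text are placeholders):

    # Sketch of an ad-hoc recognition probe, assuming the official openai
    # Python client; model name and elided question text are placeholders.
    from openai import OpenAI

    client = OpenAI()

    probes = [
        "You are a contestant on the British TV show Only Connect. "
        "This question appeared on the show. What comes fourth: ...?",
        "You are a contestant on Only Connect. Reproduce the four groups "
        "of the connecting wall that begins with: ...",
    ]

    for probe in probes:
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder model name
            temperature=0,   # make any verbatim recall easier to reproduce
            messages=[{"role": "user", "content": probe}],
        )
        # Reliably completing an elided real question would suggest the
        # puzzle leaked into training data; guesses or refusals suggest not.
        print(response.choices[0].message.content)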