| ▲ | OtherShrezzing 3 days ago |
| > We were unable to find evidence that the Only Connect games are in the training materials (which of course is likely to change now). I just don't think this is a credible assumption. The BBC is one of the highest-trusted sources of millions of hours of online audio/visual content, all of which is accompanied by human-curated and edited closed captions, and all of which is trivially easy to download. The base assumption should be that the entire BBC iPlayer corpus is inside every frontier model's training dataset. The communities on Reddit (known to be included in all models) extensively discuss each show and question - usually creating Google Docs tracking the questions asked and answers given. Finally, there's the OCDB [0], which lists every question and answer on the show. While it uses real questions from the show, this benchmark should be assumed to be testing the models' fact-recall ability rather than their reasoning capabilities. [0] https://ocdb.cc/ |
|
| ▲ | scrollaway 3 days ago | parent | next [-] |
To clarify what I meant by this: despite looking, we haven't seen any evidence of any of the models consistently responding based on pre-trained knowledge (outside of easier-to-guess trivia-type questions). The questions are likely in the training data in some form, but that doesn't necessarily mean the results are significantly influenced.
| |
| ▲ | empath75 3 days ago | parent | next [-] | | Models will engage in post-hoc rationalization, so you can't trust their purported reasoning -- in particular, if you sneak an answer into the context (even an incorrect one), the model will produce reasoning that arrives at that answer. So it could be arguing backwards from an answer that is in its training data, and you can't tell from its stated reasoning that it isn't. On the other hand, we do know the training cutoff of these models, so you could easily create a corpus of post-cutoff questions and be confident the model hasn't seen them. | | |
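(As a rough sketch of that idea, assuming you have episode records with air dates and a known cutoff date for the model under test; the field names and dates below are made up:)

    from datetime import date

    # Hypothetical episode records; field names are illustrative only.
    episodes = [
        {"air_date": date(2025, 3, 10), "questions": ["..."]},
        {"air_date": date(2023, 1, 2), "questions": ["..."]},
    ]

    # Assumed training cutoff of the model being evaluated.
    TRAINING_CUTOFF = date(2024, 6, 1)

    # Keep only episodes the model cannot have seen during training.
    post_cutoff = [ep for ep in episodes if ep["air_date"] > TRAINING_CUTOFF]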
| ▲ | scrollaway 3 days ago | parent [-] | | We didn't test via the models' post-hoc reasoning. Instead, we focused on checking whether specific, obscure questions could be recognized or identified in any way, using various ad-hoc methods to see if the answers could be surfaced without relying on reasoning. It's very difficult to prove either way (and basically impossible without the model weights), but we're reasonably confident that there's no significant prior knowledge of the questions that would affect the score. | | |
| ▲ | mr_wiglaf 3 days ago | parent [-] | | I'm new to this sort of inquiry. What do you do to see if questions can be recognized? Do you just ask/prompt "do you recognize this puzzle?" What does it mean for it to "be surfaced without relying on reasoning"? | | |
| ▲ | scrollaway 3 days ago | parent [-] | | > Do you just ask/prompt "do you recognize this puzzle?" In essence, yes, but with a bit more methodology (though as I mentioned it was all ad-hoc). We also tried to extract pre-existing questions through a variety of prompts along the lines of "You are a contestant on the British TV show Only Connect", to see whether the model would recognize or complete them - we couldn't find anything that reliably reproduced pre-existing knowledge. It's absolutely possible we missed something. |
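(A minimal sketch of that kind of recognition probe, assuming an OpenAI-style Python client; the prompt wording, model name, and verbatim-match check are illustrative placeholders, not the exact harness that was used:)

    from openai import OpenAI

    client = OpenAI()

    PROBE = (
        "You are a contestant on the British TV show Only Connect. "
        "These clues appeared on the show: {clues}. "
        "If you have seen this exact question before, state the connection "
        "the show accepted; otherwise reply 'not recognised'."
    )

    def probe_memorisation(clues, official_answer, model="gpt-4o"):
        # Hypothetical helper: returns True if the model reproduces the
        # official answer verbatim, a crude signal of memorisation.
        resp = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[{"role": "user", "content": PROBE.format(clues=clues)}],
        )
        text = resp.choices[0].message.content.lower()
        return official_answer.lower() in text

A negative result from a probe like this is only weak evidence - the model may have memorised the answer without surfacing it this way - which is part of why it's so hard to prove either way without the weights.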
|
|
| |
| ▲ | andrepd 3 days ago | parent | prev | next [-] | | There's a database of every question and answer, and almost every episode is also on YouTube, complete with transcripts. I really don't see how you can assume that having the questions and answers in the training data (which they are) doesn't affect the results of your "benchmark"... It also doesn't pass the smell test. These models routinely make basic mistakes, yet can answer these devilish lateral-thinking questions more than 9 times out of 10? Seems very strange. | |
| ▲ | scrollaway 3 days ago | parent [-] | | > These models routinely make basic mistakes, yet can answer these devilish lateral thinking questions more than 9 times out of 10? You could also say "These models routinely make basic mistakes, yet they're able to one-shot entire web pages and computer programs that compile with no errors". There are specific classes of mistakes the models make, and that's what we're digging into. |
| |
| ▲ | bgwalter 3 days ago | parent | prev [-] | | How can the results not be influenced if Grok, for example, lists all the questions and answers of a particular episode when asked? It's as easy for it as the wolf/goat/cabbage riddle in canonical form. |
|
|
| ▲ | ZeroGravitas 3 days ago | parent | prev [-] |
| They also published an official quiz book with questions from the show (and some new content): https://openlibrary.org/works/OL20812628W |