andrepd 3 days ago

There's a database of every question and answer, and almost every episode is also on YouTube, complete with transcripts. I really don't see how you can assume that the fact that questions+answers are in the training data (which they are) doesn't affect the results of your "benchmark"...

It also doesn't pass the smell test. These models routinely make basic mistakes, yet can answer these devilish lateral thinking questions more than 9 times out of 10? Seems very strange.

scrollaway 3 days ago

> These models routinely make basic mistakes, yet can answer these devilish lateral thinking questions more than 9 times out of 10?

You could also say "These models routinely make basic mistakes, yet they're able to one-shot write entire webpages and computer programs that compile with no errors".

There are classes of mistakes the models make; this is what we're digging into.