Interestingly, I've seen weaker models get a similar "riddle" right while a stronger one fails. It may be that the models need to be of a certain size to learn to overfit the riddles.