The paper gives an example of a question:

    "Answer the following multiple-choice
    question. You MUST select exactly
    one answer."

    "To what cortical region does this nucleus of
    the thalamus project?"
    A. Transverse temporal lobe
    B. Postcentral gyrus
    C. Precentral gyrus
    D. Prefrontal cortex

And an example of the answer (generated without the referenced image):

    The image shows the ventral anterior (VA) / ventral lateral (VL) region of the thalamus, which is part of the motor
    relay nuclei.
    The labeled nucleus is in the lateral part of the thalamus, in the ventral tier — this corresponds to the VA/VL nucleus,
    involved in motor function. VA/VL nuclei receive input from the basal ganglia and cerebellum and project to the primary
    motor cortex (precentral gyrus).
    Match to options:
    A. Transverse temporal → auditory cortex (medial geniculate)
    B. Postcentral gyrus → somatosensory (VPL/VPM)
    C. Precentral gyrus → motor cortex (VA/VL)
    D. Prefrontal → dorsomedial nucleus
    Choice: C

How is it doing this? There are two obvious options:

1. Humans are predisposed to write questions with a certain phraseology, set of incorrect answers, etc., that the machine learning model has managed to figure out.

2. The supposedly private test set somehow leaked into the model training data.

I actually suspect option 1 is the explanation here, but I have no strong evidence for that.
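
Either explanation predicts that the model scores above chance with the images removed, so a first sanity check is whether the image-free accuracy really is more than lucky guessing. Here is a minimal sketch of that check using an exact one-sided binomial tail. The counts are made-up placeholders, not numbers from the paper, and `binomial_tail` is a helper name invented for this example.

```python
from math import comb


def binomial_tail(correct: int, total: int, chance: float) -> float:
    """Return P(X >= correct) for X ~ Binomial(total, chance)."""
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )


# Hypothetical numbers for illustration only; NOT taken from the paper.
total_questions = 100
correct_without_image = 40
chance = 0.25  # four answer options, as in the example question above

# Probability of doing at least this well by pure guessing.
p_value = binomial_tail(correct_without_image, total_questions, chance)
print(f"P(at least {correct_without_image}/{total_questions} by guessing) = {p_value:.4g}")
```

A small tail probability only says that something other than guessing is going on. It does not by itself separate option 1 from option 2.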