To clarify, a more accurate description would be "Testing how well LLMs can follow the rules of Magic", right? There is no actual evaluation of how "well" they play?