I agree. I do wonder if what I'm seeing is a limitation of the reasoning power of LLMs or if it's just replicating the patterns (or lack thereof) in the training data.