I haven’t read the paper yet, but I’d imagine the issue is converting the natural language generated by the reasoner into a form where a formal verifier can be applied.