Our analysis shows that current LLMs are unreliable delegates:
Who knew that a tool that relies on probability could make such a mess?