It's surprisingly capable. One tricky problem is getting it to solve CAPTCHAs.
Multimodal LLMs can solve CAPTCHAs easily if they're allowed to, so the difficulty is permission rather than capability.