I find this extremely surprising.
When you think of everything it takes for an AI to use what the article calls a "vision agent", it seems as if using a purpose-made API ought to be MANY orders of magnitude faster.