Wall clock time tells me everything I need to know. The vision model took almost 20 minutes to do the thing that Sonnet did in 20 seconds.
The only reason you wouldn’t choose an API is if it wasn’t viable.