The commercial services likely also have frontier model dependencies...

The opening of the actual paper is quite explicit that (i) other studies have already tested commercial apps, with unimpressive results, and (ii) a popular open source app for carb counting relies directly on API calls to these frontier models, and this study tested its images using the exact same models and prompts as that app.

azakai 16 hours ago | parent [-]

A carb counting app might use API calls to these frontier models and then do some kind of analysis on the results. It could check whether different models, or repeated calls to the same model, agree with each other, and how much the answers vary.

So it would be more accurate to test the apps rather than the APIs, unless the goal is to warn people who just open ChatGPT and ask there.
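
To make that concrete, here is a minimal sketch of the kind of agreement check I mean. It is purely illustrative: `summarize_estimates`, `repeated_estimates`, and the 15% threshold are names and numbers I made up, not anything taken from a real app or from the paper, and `estimate_fn` stands in for whatever model call an app would actually make.

```python
import statistics
from typing import Callable, Sequence


def summarize_estimates(estimates: Sequence[float], max_cv: float = 0.15) -> dict:
    """Summarize repeated carb estimates (in grams) for a single image.

    max_cv is a hypothetical cutoff on the coefficient of variation
    (stdev / mean); above it the answers are treated as inconsistent.
    """
    if len(estimates) < 2:
        raise ValueError("need at least two estimates to measure spread")
    mean = statistics.mean(estimates)
    stdev = statistics.stdev(estimates)  # sample standard deviation
    cv = stdev / mean if mean else float("inf")
    return {"mean": mean, "stdev": stdev, "cv": cv, "consistent": cv <= max_cv}


def repeated_estimates(estimate_fn: Callable[[], float], n: int) -> list:
    """Call a (hypothetical) estimator n times and collect its answers."""
    return [estimate_fn() for _ in range(n)]


# Made-up numbers: one tight set of answers and one scattered set.
tight = summarize_estimates([40.0, 42.0, 41.0, 39.0])
scattered = summarize_estimates([20.0, 60.0, 45.0, 90.0])
print(tight["consistent"], scattered["consistent"])
```

An app doing something like this could refuse to show a number, or show a range, when the answers are scattered. Testing the raw API would not capture that behaviour.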

	notahacker 14 hours ago | parent [-]
		The open source app could in theory do that, but the paper's authors could tell whether it did by reading its code, which they evidently did in order to replicate its API calls with their own script. (And of course it would be far more tedious to submit each picture 500 times by hand through an app and manually log each response than to use a script designed to collect the data automatically, as fast as API rate limits permit.)