endymion-light 19 hours ago

There's an incredibly serious lack of education about how LLMs & carb counting work. This entire article would be better suited to astrology.com than Hacker News.

When I opened it up, I assumed the author would have at least attempted a calculation service, maybe even fed something like the size of the meal into an actual model, integrating pre-existing tools that are (slightly more) accurate. Hell - most food is literally required to carry calorie information, and you can query open source data for the rest!

But the author just took pictures of food & expected a realistic response? Is this genuinely what amounts to a study in AI?

This is akin to the Instagram reels where people talk to ChatGPT and ask it to time how long their run is. Except those are treated as funny jokes rather than being turned into studies.

I'd like to see this study done with some kind of actual grounding knowledge, examining what mistakes AI makes when attempting to query ground truth from picture analysis - there would at least be an interesting methodology and result in that.

kalleboo 18 hours ago | parent | next [-]

> But the author just took pictures of food & expected a realistic response?

There are very popular apps on the App Store right now that are going viral among non-techie people that do exactly this, and they have no concept of how AI works. My wife was talking about one and I had to give her a reality check that the AI had no idea what ingredients were used to make the food. And she's a licensed nutritionalist.

Studies like this create something to point at for people who are confused and serve as a springboard for a conversation in the media.

whazor 2 hours ago | parent | next [-]

The real benchmark should be comparing the amounts with a human guess. And as far as I know, with diabetes, if your carb guess is within 30% then you should be fine.
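A minimal sketch of that benchmark in Python (the function name and sample guesses are purely illustrative; 30% is the tolerance mentioned above):

```python
# What fraction of carb estimates land within ±30% of the true value?
def within_tolerance(true_carbs, estimates, tol=0.30):
    """Fraction of estimates within ±tol of the ground-truth carbs."""
    hits = [e for e in estimates if abs(e - true_carbs) <= tol * true_carbs]
    return len(hits) / len(estimates)

# A meal with 60 g of carbs and five guesses (human or model);
# anything from 42 to 78 g counts as a hit:
print(within_tolerance(60, [45, 50, 62, 80, 100]))  # 0.6
```

The same function could score a human panel and a model on identical photos, which is the comparison the comment is asking for.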

endymion-light 18 hours ago | parent | prev | next [-]

That's true - I suppose I'm just disappointed that this study doesn't seem to include those in any analysis. Being able to point out that the top 100 calorie-counting apps on the App Store return similar results to simple frontier models would be of interest.

I think I'm just disappointed that this study doesn't go deep enough, and stays at a surface-level statistical analysis of frontier models.

dpark 15 hours ago | parent [-]

I think it’s a very useful study specifically to debunk the apps that support this flow.

None of those apps have magic. They cannot do better than the frontier models.

asdfasgasdgasdg 17 hours ago | parent | prev | next [-]

To be fair these kinds of apps also existed before LLMs. They just used OpenCV or similar instead of the LLM APIs.

inerte 16 hours ago | parent | prev | next [-]

To be fair, my expectation is that those apps have done the prompt engineering, and schema, and tools (to query nutrition databases), etc... and although they're not 100% consistent, the margin of error should be narrow enough that it barely matters, and they should do a bit better than a random ChatGPT chat session.

Centigonal 15 hours ago | parent [-]

The problem isn't one that can be solved with prompts. If I gave a panel of food and nutrition experts (human or machine) a bunch of pictures of food, they still wouldn't be able to tell if, say, a slice of cake was made with whole milk or skim.

The "pic of packaged food --> LLM --> nutrition DB call" pipeline is workable, but many users of these apps are using them for fresh prepared foods, which is just an unworkable problem without either an understanding of the preparation process or a bomb calorimeter.

xnx 15 hours ago | parent | prev | next [-]

Even simpler examples make the limitations obvious. A photo can't distinguish Diet Coke from regular Coke.

senordevnyc 17 hours ago | parent | prev [-]

licensed nutritionalist

Nutritionist?

kalleboo 6 hours ago | parent | next [-]

Haha oops. English is hard...

Insanity 17 hours ago | parent | prev [-]

[flagged]

busssard 16 hours ago | parent [-]

[flagged]

furyofantares 19 hours ago | parent | prev | next [-]

From the text of the article I believe the author is implying there are apps doing exactly this, and so this is why it was studied that way.

Had the author written the article themselves rather than an LLM their motivation probably would have been clearer.

Brendinooo 19 hours ago | parent | next [-]

> there are apps doing exactly this

Yeah, for sure there are. And people will just ask ChatGPT as well.

The funny thing is that for people who are just trying to lose weight without managing any health issues precisely, this type of extreme variance doesn't really matter, because the mere act of consciously quantifying food consumption is, based on my experience counting calories, the single biggest factor in success with weight loss.

criley2 18 hours ago | parent [-]

I actually think "just asking ChatGPT" is fine, because A) the data in these apps is suspect at best and B) the data behind calories is also pretty suspect (but we all play along because we can adjust other variables to make it all "work" well enough).

Once or twice a year I spend a few weeks meticulously measuring ingredients/cooked foods and recording calories, and on complex recipes apps are next to useless at getting accurate data. You're trying to input five or ten relevant ingredients, and then weighing your cooked outcome to divide the ingredients by proportion. Frankly, it's a mess; most people aren't doing it for home-cooked meals, and they're getting very lossy outcomes (weighing cooked chicken and marking it as raw chicken, etc.)

With reasoning and tool calling (combined with me meticulously weighing before and after), it's producing fine data for my purposes.
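The bookkeeping described above boils down to simple proportion math: sum the raw ingredients, weigh the cooked result, then scale by portion weight (a hedged sketch; all numbers are made up, not real nutrition data):

```python
# Price one weighed portion of a finished dish by the proportion of
# total cooked weight it represents.
def portion_kcal(ingredients_kcal, cooked_weight_g, portion_weight_g):
    """kcal in one weighed portion of the finished dish."""
    total_kcal = sum(ingredients_kcal)
    return total_kcal * portion_weight_g / cooked_weight_g

# e.g. a stew: raw ingredients total 1800 kcal, the pot weighs 1500 g
# after cooking (water loss), and you serve yourself 300 g:
print(round(portion_kcal([900, 600, 300], 1500, 300)))  # 360
```

Note this also shows why the "cooked chicken logged as raw chicken" mistake is lossy: cooking drives off water, so kcal per gram rises even though total kcal doesn't.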

ijk 17 hours ago | parent | next [-]

I was complaining about AI generated clothes being misleading marketing, deceiving customers as to whether the garment even exists.

And then I learned that the pre-AI norms weren't any less fictional: they made an exemplar garment and did photoshoots, sure, but then they sent the pictures and patterns to the lowest-bidder factories with permission to make whatever edits were necessary to make it cheap and manufacturable. The whole thing was already a simulacrum.

smoe 17 hours ago | parent | prev [-]

I honestly think that, given the sorry state of the pre-GenAI internet, with all the SEO optimization nonsense, clickbait, and supplement peddling everywhere, LLMs are for now actually better than Google for “doing your own research” on many things.

At least at the entry level. Once you want to go in depth, the outcome in my experience is the same as with LLM use on any topic: it depends heavily on the domain knowledge of the prompter and their ability to steer it.

ozgung 18 hours ago | parent | prev [-]

The author uses the prompts and method from an open-source app that connects to an insulin pump, a medical device. I think AI food identification is an experimental feature in the app.

> The prompt was adapted from the one used in the iAPS open-source automated insulin delivery system — it’s a real production prompt, not a toy example.

https://github.com/Artificial-Pancreas/iAPS

I think these are the prompts in the app: https://github.com/Artificial-Pancreas/iAPS/tree/5eabe22e7e2...

sjhatfield 17 hours ago | parent | next [-]

Exactly. This is not paid software. We assume full responsibility for outcomes when using it. There's a reason it's not on any app store. I'm glad features like this are being experimented with. Not how I would use AI to estimate carbs...

Ancapistani 16 hours ago | parent | prev [-]

True, but I'm working on a product that's "adjacent to" this sort of thing, and we also have a "food recognition" feature that's marked as experimental. Our users are using it, and now I plan to push fairly hard on at least measuring the accuracy and hopefully exposing those results to our users regardless of how well it performs.

andrewvc 18 hours ago | parent | prev | next [-]

One of the biggest gaps is that people don't understand that the FDA allows food labels to be off by up to 20% from the actual calorie count!

In the real world you need to calibrate your behavior with the results. Are you gaining weight? You'll need to eat less if you want to lose any. You can do all the math with nutrition labels and macros you want but that's all theoretical.

See the study below for the 20% figure, as well as its experimental results on real food items (some even exceeded this threshold, though most were within it). https://pmc.ncbi.nlm.nih.gov/articles/PMC3605747/?st_source=...
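The "calibrate your behavior with the results" loop above can be sketched as a simple feedback rule. The 7700 kcal/kg figure is the common rule of thumb, and the function and numbers are purely illustrative:

```python
# Nudge the daily calorie target toward a desired weekly weight trend,
# so systematic label errors wash out over time.
KCAL_PER_KG = 7700  # rough energy content of 1 kg of body mass

def adjust_target(current_target, kg_change_per_week, kg_goal_per_week):
    """Shift the daily kcal target based on the observed weekly trend."""
    error_kg = kg_change_per_week - kg_goal_per_week
    return current_target - error_kg * KCAL_PER_KG / 7  # per-day correction

# Aiming to lose 0.5 kg/week but actually holding flat at 2200 kcal/day:
print(round(adjust_target(2200, 0.0, -0.5)))  # 1650
```

The point of a rule like this is exactly the comment's: the scale, not the label, is the ground truth, so 20% label error barely matters.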

bonoboTP 18 hours ago | parent | next [-]

A large part of the effectiveness in counting calories is that you pay more attention and make more conscious decisions and are less likely to "cheat" if you have to enter it in your food log.

It's indeed like astrology. Simply thinking about personality traits and thinking through your life and your desires and goals and current situation is already beneficial to take charge and navigate your life.

smallmancontrov 17 hours ago | parent | next [-]

I'd take the opposite point of view: just thinking about life, desires, and goals is how people wind up paying $10 to Whole Foods for a slice of "healthy pizza" that is nutritionally identical to any other pizza but comes out of a cute stone oven and is displayed on a wooden platform next to green leafy plants. Vibes astrology is notoriously easy to exploit, both by your own sugar/salt/fat-seeking instincts and by unscrupulous commercial forces, let alone the two working together. The unique thing about calorie counting is that it cannot be exploited like vibes astrology. Not even with a 20% error margin, which is (probably not coincidentally) the caloric deficit targeted by standard dieting advice.

ijk 17 hours ago | parent | prev [-]

That's an interesting bit, where reducing friction too much can eliminate the side effect that is actually driving the desired results.

Do you want to count calories, or do you want to lose weight? Sounds like it's possible to hyper-optimize calorie counting to the point that it becomes counter-productive...

johsole 16 hours ago | parent | prev | next [-]

Thank you for linking the study.

Some good news from it: if you weigh the food instead of depending on the package's serving size, the labels become much more accurate!

"Serving size, by weight, exceeded label statements by 1.2% [median] (25th percentile −1.4, 75th percentile 4.3, p=0.10). When differences in serving size were accounted for, metabolizable calories were 6.8 kcal (0.5, 23.5, p=0.0003) or 4.3% (0.2, 13.7, p=0.001) higher than the label statement."

If you look at the table "Deviation of metabolizable calories from label calories" [https://pmc.ncbi.nlm.nih.gov/articles/PMC3605747/figure/F1/] you'll see that most labels, even for serving size, are pretty good, though some are really bad.

If you look at one of the worst offenders, Tostitos, the label says "Tostitos Tortilla Chips - serving size 24 chips", but chips vary a lot in size, so you could have a huge variance in weight. If instead you weighed them, which I do with my chips, I bet the calories come much closer to the label.

Body composition comes down to routine. I've found I love to eat, but I pretty much eat the same meals week over week, and that makes it extremely easy for me to lose or gain weight depending on my goals.
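The chips example is just scaling the label by weight (the numbers below are illustrative, not Tostitos' actual label):

```python
# Convert a count-based label serving ("about 24 chips") into a
# weight-based calorie figure for what you actually ate.
def kcal_by_weight(weight_eaten_g, serving_weight_g, kcal_per_serving):
    """Calories in a weighed portion, scaled off the label's serving."""
    return weight_eaten_g / serving_weight_g * kcal_per_serving

# Hypothetical label: 28 g serving ("about 24 chips") at 140 kcal;
# you weigh out 42 g of chips:
print(kcal_by_weight(42, 28, 140))  # 210.0
```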

adrian_b 16 hours ago | parent [-]

Moreover, the calorie numbers for raw ingredients are much more accurate than for snack foods, where the amount of each ingredient may vary from nominal, even when the total mass is nominal.

So when you cook yourself and you weigh the ingredients used for cooking, you can know the real calorie content with far more accuracy than when buying ready-to-eat food.

CGMthrowaway 14 hours ago | parent | prev | next [-]

The errors either cancel out (if they go in both directions) or they skew one way ("bad" foods like junk systematically underestimate calories, "good" foods like protein powder systematically overestimate calories).

Either way, if you count calories and compare to your weight gain/loss over a few weeks and adjust your calorie target as warranted, assuming the types of food you are eating do not change drastically (e.g. you calibrated on regular diet and now have started an elimination diet), the error bars can be basically ignored.

andrewvc 13 hours ago | parent [-]

Exactly! My point was that despite the apparent precision of calorie numbers, we should really think of them as ballpark estimates.

strken 17 hours ago | parent | prev | next [-]

Realistically the labels are going to be much closer for staples like long-grain Jasmine rice or olive oil, if they're measured by weight.

It's just not that easy to change the nutritional content of a kilogram of a known cultivar of dry rice when it's passed all the standard checks for moisture content, protein content, etc.

ndisn 17 hours ago | parent | prev | next [-]

20% doesn’t seem so bad?

And the more natural a food is, the more inaccurate the results will be, because of natural fluctuations. Think of how much fat a chicken can have. So making this percentage stricter would only benefit foods that are all chemicals.

The usual ultra-processed slop made of cheap ingredients allows for very low tolerances. Mayo is some lab-grade soy oil, lab-grade yolk, and perhaps some lab-grade starch as a thickener. Yippee, we have a tolerance of 0.1% in calories. But how do you reach that level of accuracy with a roast chicken with nothing added?

ludicrousdispla 17 hours ago | parent | prev | next [-]

How many calories are in a strawberry?

nekusar 16 hours ago | parent | prev | next [-]

It's even weirder.

What has more calories: 1 lb of peanuts, or 1 lb of peanuts ground into peanut butter?

I can't find the study, but the peanut butter has more calories, since it's pre-ground and more bioavailable. Peanuts get chomped up, but larger pieces still remain and are not captured by the body.

stogot 17 hours ago | parent | prev [-]

Does this only apply to calories or to other categories too?

throwaw12 19 hours ago | parent | prev | next [-]

I feel like you didn't understand the goal of this study

> The DTN-UK stated earlier this year that generic LLMs must never be used as autonomous advisory calculators for insulin delivery. This data is the quantitative evidence base for that statement.

This study exists to prove that you should not rely on LLMs for this.

lukeschlather 17 hours ago | parent | next [-]

The thing is, it doesn't really prove LLMs can't do this; it proves no existing frontier LLM can do this.

The part where they talk about sampling multiple runs is interesting - it suggests to me that in the next few years as the reasoning process is improved the models may be able to do that autonomously.

My mind really goes to using dedicated object-detection models fine-tuned with nutrition information, but I don't think there's a fundamental reason LLMs can't eventually manage this use case, except perhaps the size of the needed weights being prohibitively large.

tsimionescu 16 hours ago | parent [-]

Per some people, LLMs of the future can do literally anything that's possible to do. They could create quantum computers powered by fusion power.

That has nothing to do with the question being asked, can you rely on an LLM today to help you track carbs as a diabetic?

This is very explicitly what the article is all about. Potential future LLMs are entirely irrelevant.

lukeschlather 13 hours ago | parent [-]

This isn't something as fanciful as fusion power; this is plausibly within the capabilities of object-detection transformers. Whether a different prompt/fine-tuning with a good dataset could make this work is very relevant here.

The_Blade 18 hours ago | parent | prev | next [-]

that is good to know. presented this way i find LLM behavior to be a feature, not a bug. then again i think everything is value add over pen and paper / notepad / spreadsheet and maybe a friend or doctor (or specialized equipment if you need more than calorie in, calorie out). just go exercise and don't be a lard lad

devilbunny 16 hours ago | parent | next [-]

> just go exercise and don't be a lard lad

You can out-exercise almost any diet, but it takes 3-4 hours a day of a hard workout.

If calories in, calories out was useful advice rather than a banal statement of physics, nobody would be fat.

genewitch 11 hours ago | parent [-]

> If calories in, calories out was useful advice rather than a banal statement of physics

it's also wrong, or at least imprecise. fat and bone and muscle all weigh different amounts, at the same volume.

the only way i've been able to explain the science to laypersons is thus:

if you and a friend both weigh 200lbs, but you once weighed 250lbs and your friend has never weighed more than 200lbs; all else equal: you must ingest less calories than your friend to maintain 200lb body weight.

your body will try to "outlast the famine" if you had to lose a lot of weight (or lost a lot of weight for any reason).

That absolutely does not comport with "calories in, calories out". It's also why the people who were never fat have no problem "just eating a donut."

no, i won't cite; this has been published many times over the last decade or so.

devilbunny 10 hours ago | parent [-]

Eh, it totally does comport with calories in, calories out. We just don't hook people up to metabolic carts in day-to-day life, but that's really the only way to measure the calories out part.

The physics of CICO are undeniable. Humans don't photosynthesize. The biology of CICO is useless.

- former fat guy. I'm deeply, personally aware of how useless CICO is as dietary advice.

tsimionescu 16 hours ago | parent | prev [-]

> just go exercise and don't be a lard lad

This is about people suffering from diabetes tracking their insulin needs. You can outrun any diet, but not insulin shots.

fabian2k 19 hours ago | parent | prev | next [-]

The paper itself is a lot clearer about the purpose. The blog post reads very clickbaity and doesn't really explain the context well.

Aurornis 18 hours ago | parent [-]

I disagree, it clearly explains that AI carb counting apps are a problem and shouldn’t be used.

They’re writing in a neutral way that reaches their audience without lecturing or being condescending. They lead the reader to the conclusion rather than shoving it at them. I think that’s why it’s triggering so many angry comments on HN, but it’s effective for the audience they’re writing for (non technical people who may need convincing but don’t like being preached at)

snapcaster 19 hours ago | parent | prev [-]

But it's stupid. If I smack myself in the head with a hammer, is that proof hammers shouldn't be relied on?

fc417fc802 19 hours ago | parent | next [-]

If you smack yourself in the head with a hammer and it injures you that's evidence that smacking people in the head with hammers is bad and shouldn't be done, right?

jkestner 19 hours ago | parent | prev | next [-]

Here we’re at the origin of the tool and get to watch how many people hit themselves in the head before we learn this collective wisdom.

There’s a gap between what the tool will allow you to tell it to do, and what it’s good at. The feedback mechanism to tell the difference is deficient compared to a hammer.

coldtea 18 hours ago | parent | prev | next [-]

No, but it would be proof you didn't get the point of the paper.

19 hours ago | parent | prev | next [-]
[deleted]
jmye 19 hours ago | parent | prev [-]

Are there start-ups led by idiots suggesting that smacking yourself in the head with a hammer will help treat your diabetes?

If not, then perhaps there's a problem in your analogy.

Aurornis 19 hours ago | parent | prev | next [-]

> But the author just took pictures of food & expected a realistic response? Is this genuinely what amounts to a study in AI?

The article explains this: There are apps targeting people with diabetes that claim to count your carbs with AI.

> If you’re using AI carb counting in a diabetes app

Before you dismiss a study, try to understand where it’s coming from.

The authors of the study weren’t stupid. They knew the LLMs would provide poor results. They ran the study to quantify it and create a resource to spread the information in response to the rise of AI carb counting apps.

ijk 17 hours ago | parent | next [-]

> The authors of the study weren’t stupid. They knew the LLMs would provide poor results. They ran the study to quantify it and create a resource to spread the information in response to the rise of AI carb counting apps.

Yeah. I think it is under-appreciated that much of science is intended for debugging purposes. Sure, you and I know that X is positive, but what's its actual value? Can we find the causes that make it that way? Et cetera.

endymion-light 18 hours ago | parent | prev | next [-]

I don't believe the authors of this study are stupid.

If there are apps targeting people with diabetes that claim to count your carbs with AI, why haven't those been analysed? That would be a far more effective claim.

I based that on the clickbait article they wrote about the study - I'll read through the study to see whether they analyse that, but it would be far more effective to see whether the 'carb-counting' AI apps return similar results to the frontier models - that's an interesting result that could actually move the discussion forward.

Aurornis 18 hours ago | parent | next [-]

> If there are apps targeting people with diabetes that claims to count your carbs with AI, why haven't those been analysed? That would be a far more effective claim.

Because the apps aren’t going to let you submit 29,000 automated requests for statistical analysis.

And if you did, the authors of those apps would just release an update saying they changed models and try to dismiss the study.

The vitriol against this article on HN is sad. Commenters who agree with the article and its conclusions are grasping for reasons to be angry about it anyway

endymion-light 18 hours ago | parent [-]

You can perform statistical analysis on frontier models and still use commercial applications as a point of comparison.

Criticism is not vitriol - it's possible to make a wider point about being taken aback that the lack of education around AI has reached the point where a critical mass of people use it for calorie counting; but there are many studies on the effects of LLMs on psychology etc. that are far more effective.

But for me - this is like creating a study showing that LLMs are inaccurate at algebra & calculus. That should be common knowledge.

hrimfaxi 18 hours ago | parent | next [-]

It is not uncommon to study things that are considered common knowledge.

YeGoblynQueenne 17 hours ago | parent | prev [-]

Well, for me, the comments insisting we don't need to study X because everybody knows LLMs can't do that are a very good justification to study exactly X.

Not to mention that this is now a standard thought-terminating cliché, where someone points out a use case where LLMs don't work at all well and irate responses protest that LLMs aren't meant to be used that way. Says who? If you ask an LLM a question and it answers it, then that's an LLM use case. If you can ask the same question many times and evaluate the results, then that's an evaluation that is perfectly fine to make.

endymion-light 17 hours ago | parent [-]

Yes - my original claim is not that we shouldn't study it, it's that we should study it deeper than just the surface level, which is my impression based on what I've read from the linked site.

tsimionescu 16 hours ago | parent | prev [-]

The linked "click bait" article explains this very clearly as well. It clearly explains the methodology: they took the prompt sent to an LLM by a popular open source carb counting iOS app and sent it, together with five different pictures of food that a typical person might take, to all of the frontier models, and checked the responses. They also explain the purpose: to check the possible accuracy of this approach taken by a real app that real people use.

The fact that you somehow perceived this as an attack on LLMs as a technology is a failure entirely on your part. There is nothing in the article that suggests that people shouldn't use LLMs for other purposes - just a statistical verification of the fact that they shouldn't be used for this one particular thing.

endymion-light 12 hours ago | parent [-]

I didn't take anything as an attack on LLMs. I took it as a severe misunderstanding of how technology works. I specifically outline that I would like to see the margin of error even when integrating actual apps that claim to achieve results, rather than using tools that don't.

Nothing in my comment treats this as an attack on LLMs, which shows a mischaracterisation of my entire point on your part.

ilivethere 18 hours ago | parent | prev [-]

Typical case of the "curse of knowledge". We deal with AI on a daily basis on the technical level, so it's very easy to forget that the "common" folk really still believe that AI can replace dieticians, gym coaches, etc

coldtea 18 hours ago | parent | prev | next [-]

>But the author just took pictures of food & expected a realistic response? Is this genuinely what amounts to a study in AI?

If there are commercial services where you take pictures of food and are promised a realistic (paid for) response, then yes. And there are.

dahart 18 hours ago | parent | next [-]

And what’s the variance & accuracy of their responses? Isn’t comparing the models’ variance to baseline human variance what matters here? It seems like they didn’t do that, and I agree with parent’s call for that kind of baseline.

Having counted calories for years, I don’t think I could reliably estimate the calories or carbs in the example picture of a cheese sandwich. I can make assumptions about the bread and the cheese, but I might easily be off by 2-3x. Calorie counting apps that use text descriptions also have huge variance for the same thing. The problem might be the belief that a picture or description is enough, regardless of who or what is guessing…?

Edit: Ah, I see from sibling thread you meant commercial services are LLMs, I thought you meant there were human-backed services to compare to. Anyway, I totally agree there’s a problem if people rely on AI for safety, but I’m not sure LLMs are the core issue here, it seems like using vague information and guessing is the core issue.

swiftcoder 18 hours ago | parent [-]

> Isn’t comparing the models’ variance to baseline human variance what matters here?

You seem to be missing the context that this isn't just about diet apps - this is about apps claiming to be able to track carbs sufficiently accurately to be used in a medical context to dose insulin (a substance which can be lethal if incorrectly dosed)

dahart 17 hours ago | parent [-]

No I understand apps are making dubious claims and implications; obviously claiming LLMs can accurately estimate carbs from a photo is just wrong. But that doesn’t necessarily change my question. Should people use photos to estimate carbs? Can people looking at photos do any better?

The presence of variance in the LLM output doesn't actually prove anything; in fact, I would expect and hope for variance when confidence is less than 1.0. I'm more curious about the accuracy of the mean of guesses for different models, for example.
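For what it's worth, the comparison asked for here is straightforward to state: for repeated guesses from one source (a model, an app, or a human), report the spread and the error of the mean against ground truth. A sketch with invented numbers, not data from the study:

```python
from statistics import mean, stdev

def summarize(guesses, true_value):
    """Spread of repeated guesses and the error of their mean."""
    m = mean(guesses)
    return {
        "mean": m,
        "stdev": stdev(guesses),  # the guesser's variance
        "mean_error_pct": 100 * (m - true_value) / true_value,
    }

# A sandwich with 60 g of carbs, five repeated runs of one model:
print(summarize([40, 55, 70, 85, 100], 60))
```

Running this for each model and for a human panel on the same photos would give exactly the baseline comparison the comment calls for.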

But should any diabetic expect photos to be reliable, regardless of whether it’s an app or an LLM or a human? I know some diabetics, and the people I know do not rely on photos for their safety. They don’t even rely on food labels either (which are far more accurate than photos), they measure their insulin.

It’s probably useful to raise awareness, and useful to scare app makers away from making bogus medical claims - products and scams that make bogus medical claims is of course a practice as old as history. But we can still hold the studies and PR around this up to high standards, right? Even assuming this article & the paper behind it are right, there are reasonable questions here about how to demonstrate the problem and what the baselines are.

It’s worth keeping in mind that trying to prove the bogus apps wrong with a flawed methodology or questionable reasoning or just an overly heavy handed style can cause backlash and do damage to the cause. We’re already seeing that effect play out with respect to vaccinations.

endymion-light 18 hours ago | parent | prev [-]

But I don't see them using those commercial services in this study - instead, they're using frontier model companies? Is Gemini advertising that you get a realistic calorie count from a picture? Maybe so - in which case I'd take it back!

notahacker 18 hours ago | parent | next [-]

The commercial services likely also have frontier model dependencies...

The opening to the actual paper is quite explicit that (i) other studies have already tested commercial apps, with unimpressive results, and (ii) a popular open source app for carb counting directly relies on API calls to these frontier models, and this research batch-tested the images using the exact same models and prompts as that app.

azakai 16 hours ago | parent [-]

A carb-counting app might use API calls to these frontier models and then do some kind of analysis on top. It could check whether different models (or multiple calls) agree, and with how much variance.

So it would be more accurate to test the apps rather than the APIs, unless the goal is to warn people who just open ChatGPT and ask there.
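A minimal sketch of that ensemble idea (the threshold, function name, and sample numbers are all invented for illustration; real apps would be making model API calls where the guesses come from):

```python
from statistics import median

def ensemble_estimate(guesses, max_spread_frac=0.3):
    """Median carb estimate, or None if the guesses disagree too much."""
    med = median(guesses)
    if med and (max(guesses) - min(guesses)) / med > max_spread_frac:
        return None  # too uncertain: fall back to a human count
    return med

print(ensemble_estimate([48, 52, 55]))   # 52  (models roughly agree)
print(ensemble_estimate([20, 55, 120]))  # None (wild disagreement)
```

Whether any commercial app actually does this is exactly the open question; the study's point is that each underlying call is itself high-variance.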

notahacker 14 hours ago | parent [-]

The open source app could in theory do that, but the paper's authors would be able to determine whether it did or not by reading its code, which they evidently did to replicate the API calls it made with their own script.

(And of course it would also be far more tedious to submit each picture 500 times manually using an app and manually log the response than using a script which is designed to collect the data automatically as fast as API rate limits permit)

coldtea 18 hours ago | parent | prev [-]

Are commercial services anything more than just UI facades on top of frontier model APIs?

endymion-light 18 hours ago | parent [-]

Great point - and I'd love a study to address that. If the study showed that X services fall squarely within the analysis here, that would be a fantastic result, enlightening & useful to show.

swiftcoder 18 hours ago | parent [-]

The app the study is based on is open-source, so you yourself can verify that it does indeed just call a frontier model with the same prompts used in the study

endymion-light 17 hours ago | parent [-]

That's not really the same thing as what I'm saying - which is to investigate the applications specifically advertising AI calorie counting capabilities

notahacker 14 hours ago | parent [-]

They investigated an open source application specifically advertising carb counting capabilities, replicated its prompts and API calls in a way optimised to collect data from 26000 queries (which is a lot to do using a GUI!). They also note other people have already done [necessarily] smaller scale studies of the commercial AI carb counting apps and been similarly unimpressed by the responses.

This is all in the first few paragraphs of a preprint paper describing the research in considerably more detail which is linked at the bottom of TFA

Meta: enjoying nearly half this HN thread being arguments that surely people who care about what's in their food don't ask ChatGPT instead of looking it up properly, and most of the rest of it being people who apparently care what's in a research paper asking HN for comment instead of looking it up :)

swalsh 19 hours ago | parent | prev | next [-]

It amazes me how much people try to build AI systems relying on nothing more than the model's knowledge. I suspect a great deal of the "failed" AI experiments we keep reading about come down to people having no idea how to use AI for what it's good at.

nextlevelwizard 19 hours ago | parent | prev | next [-]

As someone who used to do this: OpenAI models refuse to look up calories unless you explicitly tell them to, and even then it's hit and miss, even when you tell them exactly what the product is. The easiest way to get a good calculation is to just take a photo of the nutrition label, or feed that info in by hand.

Funny thing is, 4o did look up calories, but I guess it was too good for this world.

the_duke 19 hours ago | parent [-]

I exclusively use thinking mode, which is slower but much more likely to double-check things with web search etc.

nextlevelwizard 19 hours ago | parent [-]

Maybe. I stopped using OpenAI a while ago. But taking pictures of the nutrition labels was good enough

datsci_est_2015 18 hours ago | parent | prev | next [-]

The obvious meme to invoke here is:

  - AI will solve all of our problems 
  - No not like that!
Are the trillion dollars sloshing around the AI economy well-invested if the refrain is always “you’re holding it wrong”?

So we’re trying to define, through trial and error, what problems “AI” will actually solve, and this paper is one of the many cobblestones on that road.

endymion-light 17 hours ago | parent [-]

I mean, it's more like

"AI can solve this one problem, but it needs X, Y, Z, because it's not an omnipotent god entity"

"I tried it without any of those things and it didn't work - this is worthless tech!"

I don't know if more accurate calorie counting using AI exists - but this is like being upset that a screwdriver won't glue wood. AI is far more than frontier LLMs.

datsci_est_2015 16 hours ago | parent | next [-]

There are plenty of positions on the spectrum from "omnipotent God entity" to "Casio SL-300SV". What does the current valuation of LLMaaS companies represent, though?

LLMs are certainly not worthless, that’s a strawman in the same way my statement “AI will solve all of our problems” is a strawman. The question of their worth is being explored.

“AI is a black box that can solve problems”. Which problems? How consistently? At what cost? How quickly?

muwtyhg 16 hours ago | parent | prev | next [-]

In this case, what is the 'X, Y, Z' that the apps are providing that the model is not?

endymion-light 12 hours ago | parent [-]

Theoretically:

- A queryable large vector database containing calorie counts for specific meals.

- A vision model specifically trained on food images with labelled data containing approximate calorie counts.

- OCR model allowing reading of barcodes + calorie information.

- A model trained to ask for additional context & information (e.g. for pasta: please provide a photo of the original sauce tin, etc.; please approximate the weight of X meat).

I don't know how accurate integrating all of those aspects would be - and you could argue the end user would probably be incredibly annoyed and it wouldn't be a good app - but I'd argue you'd at least need that if you're developing an app for diabetes management.
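Wired together, that pipeline might look something like the sketch below. Every component is a hypothetical stub (none of this is from the study or any real app); the point is only the fallback ordering from most to least reliable source:

```python
def estimate_meal_carbs(image, read_label, lookup_db, vision_estimate, ask_user):
    """Hypothetical pipeline tying the components listed above together.

    read_label: OCR of a barcode/nutrition label -> grams of carbs, or None
    lookup_db: query a labelled meal database -> carbs per 100 g, or None
    vision_estimate: food-trained vision model -> grams of carbs (last resort)
    ask_user: request missing context from the user (e.g. portion weight)
    """
    label_g = read_label(image)
    if label_g is not None:
        return label_g                    # best case: read the packet
    per_100g = lookup_db(image)
    if per_100g is not None:
        weight_g = float(ask_user("approximate weight of the portion in grams?"))
        return per_100g * weight_g / 100
    return vision_estimate(image)         # worst case: pure vision guess

# Stubs for illustration: no label, a DB hit of 30 g/100 g, user says 200 g
carbs = estimate_meal_carbs(
    "pasta.jpg",
    read_label=lambda img: None,
    lookup_db=lambda img: 30.0,
    vision_estimate=lambda img: 55.0,
    ask_user=lambda q: "200",
)
print(carbs)  # 60.0
```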

sleepybrett 16 hours ago | parent | prev [-]

> "AI can solve this one problem, but it needs X, Y, Z, because it's not a omnipotent god entity"

Zero advertisements from OpenAI or Anthropic say this. They all sell you an omnipotent god entity.

pertymcpert 16 hours ago | parent [-]

Skill issue in thinking.

giancarlostoro 19 hours ago | parent | prev | next [-]

> But the author just took pictures of food & expected a realistic response? Is this genuinely what amounts to a study in AI?

Reminds me of that one YouTube video (I forget who it is, so I have no idea how to pull it up) where he turns his phone's camera on for ChatGPT and asks what everything it sees weighs, then puts each item on a scale. ChatGPT was never right, ever. Which makes sense; I couldn't tell you what most things weigh on sight alone either, but ChatGPT was often dramatically off. I got the feeling he thought this made it terrible AI, but I don't think a model looking at an image and trying to guess its weight / calories / etc. is a reason to call the model bad...

wat10000 18 hours ago | parent [-]

People do not understand this stuff at all. So many times I've seen people post the output from one of these services with full confidence that it has to be correct because it came from a computer.

Izkata 18 hours ago | parent [-]

Between the marketing and the hype, plenty of people believe we already achieved true sci-fi-style superhuman general AI a year or two ago.

jvanderbot 18 hours ago | parent | prev | next [-]

I did this too! For months (almost a year) I used descriptions, pictures, and measurements of food to get rough calorie counts. My diet is pretty simple and repetitive.

I would occasionally check the estimates, maybe once every few days for meals I wasn't already pretty sure of, and it was generally accurate. Where it was extremely inaccurate was on portions, and anyone who has dealt with computer vision could tell you, you can't get scale from a picture. So I'd have to weigh some meals or ingredients, which would generally make things more accurate again.

So I think it's possible, but you need multimodal data, grounded with regular checks.

thechao 18 hours ago | parent [-]

About once a week I ask ChatGPT to give me a reasonable diet for recomp with weight loss. It consistently insists I have at least 7 meals of at least 30g of protein each, but that the protein source can't be whey or casein. When I ask "why", it cites a bunch of studies ... but most of those "studies" are N=1 of a college- or Olympic-level athlete. If, instead, I grab a large-scale lateral analysis, it says "3 meals" with about half the protein.

It'll defend both sides (mutually contradictory) to the death. NOTHING will budge it from its initial stance.

ifwinterco 18 hours ago | parent [-]

To be fair this is a reflection of the general state of nutritional science and the actual answer seems to be "it depends on your genes".

Some people do well on 6 small meals, others do well on no breakfast and two large ones. Studies can't tell you anything useful about that, you have to experiment and find out what works best for you

machomaster 16 hours ago | parent [-]

The answer does not really depend on genes. There are personal preferences, there are sex differences (women prefer more carbs), and the biggest component is where you are and which direction you want to go.

But in terms of physiology the answer is quite clear:

1. The protein is the most important macro to get, no matter if bulking or cutting. It is the building block.

2. Whatever the amounts (0.8g-1.8g/kg of bodyweight, depends a bit on a situation and the willingness to lose some potential marginal gains), try to divide your daily protein somewhat evenly between meals.

3. Pareto principle, you get the most benefits by having 3 meals. 4 if you really care about small differences and want to optimize. 5 meals give negligible additional benefits, for professional athletes who want to be anal.

4. So basically eat at least 3 meals and up to whatever works for you practically speaking.

It's not that difficult or ambiguous.

ifwinterco 11 hours ago | parent [-]

Optimum meal timing in particular I believe is heavily influenced by genes - I have friends who never eat breakfast, survive on black coffee until 1PM, then eat a lot in the evening and feel good doing it. If I do that I feel terrible.

So yes, eat 2g/kg protein, but the best way to time that across meals, the best specific foods to eat, etc., is definitely influenced by your genes.

layer8 18 hours ago | parent | prev | next [-]

The author is doing what a non-sophisticated user would be doing, or would want to be able to do, and estimating calories from a photo has been an often-cited potential or promised AI use case in recent years. It makes a lot of sense to test current general-purpose AI's performance on it as a reality check.

It also exemplifies how current AI offerings are still quite limited in their capabilities: one would expect them to do the intelligent thing on their own, instead of the user having to come up with a working methodology.

InsideOutSanta 19 hours ago | parent | prev | next [-]

There are apps in the app store right now that pretend to do this kind of thing, so having somebody actually show that it doesn't work is valuable, even if we already knew the outcome ahead of time.

endymion-light 18 hours ago | parent [-]

I suppose I'd much rather a study analyse the apps in the app store that are attempting and claiming to do that kind of thing - rather than the base model they might be using.

something765478 18 hours ago | parent | prev | next [-]

> The prompt I used asks each model to return a confidence score (0 to 1) for every food item it identifies. All four models dutifully returned confidence scores for 100% of items. Surely we can use those to filter out bad estimates?

This is a problem with the companies selling the AI models, not the customers. It is their responsibility to inform consumers about the limits of their services, and to train the models to say "I don't know, there is not enough information".
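For what it's worth, the filtering the article alludes to is essentially one line; the catch is that it only helps if the self-reported scores are calibrated, which is exactly what the models don't guarantee. A minimal sketch (field names are assumptions, not any real API's schema):

```python
def filter_by_confidence(items, threshold=0.8):
    """Keep only food items whose self-reported confidence clears threshold."""
    return [item for item in items if item["confidence"] >= threshold]

# Illustrative model output, invented for this example:
items = [
    {"food": "rice", "carbs_g": 45, "confidence": 0.95},
    {"food": "sauce", "carbs_g": 12, "confidence": 0.40},
]
kept = filter_by_confidence(items)
print([item["food"] for item in kept])  # ['rice']
```

If the model reports 0.9+ for everything regardless of accuracy, this filter keeps everything and tells you nothing.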

toasty228 17 hours ago | parent | prev | next [-]

> But the author just took pictures of food & expected a realistic response?

There are dozens of iOS/Android apps with 100-300k+ ratings and god knows how many millions of installs which do exactly this.

"Cal AI - Food Calorie Tracker: Just snap a photo and our smart AI calorie tracker analyzes your meal instantly."

308k ratings on iOS, 264k ratings on Android; easily 5-10m installs across both platforms.

nclin_ 14 hours ago | parent | prev | next [-]

He was directly addressing apps that claim to do so, and proving that they can't, with laypeople who might have diabetes as the target audience.

It's more a data-driven pub test, I think it explains itself well.

The question is - are those apps actually so simplistic? Or is this a strawman.

endymion-light 12 hours ago | parent [-]

I guess that's partially my original point summarised; I'd like to see this criticism levelled at the apps advertising it - rather than at something that doesn't.

zipy124 19 hours ago | parent | prev | next [-]

Honestly it's scary how misunderstood this is by the general public, the media and EVEN scientists.

There is a shocking number of computer vision tasks where the scientists claim you can get X info from a picture of Y, and even with ML/AI you can't extract data where there isn't any. The fact that I can add an arbitrary amount of high-calorie fat to a meal without changing its appearance shows by definition that it's pointless. A 1000-calorie and a 100-calorie milkshake can look identical, and you'd have no way of telling them apart from an image even if you were a superintelligent system.

Similarly, I see serious research papers on things like extracting an object's material from an image of it, which for the same reason cannot be done, since how an object looks has very little to do with what it's made of; else painting and other art would clearly be impossible. The information is just not there in the data.

mortenjorck 19 hours ago | parent [-]

It’s like CSI “enhance!” AI image upscaling. People will do it, see it fabricated details, and then draw the wrong lesson from it, that “AI fabricates things!” when that is exactly what they asked the model to do and there is no magic math that would extract ground truth that was never in the image to begin with.

whstl 16 hours ago | parent | prev | next [-]

> But the author just took pictures of food & expected a realistic response?

You say this (and I agree), but I know of quite a few companies in this area, including a couple accelerated by YCombinator, and that's pretty much 100% of what they do in their backend.

chromacity 17 hours ago | parent | prev | next [-]

> There's an incredibly serious lack of education with how LLMs & carb-counting works

Oh! Do the vendors offer trainings to make sure the users understand how LLMs work? If not, surely, the LLM itself is trained to know its limitations and politely decline in situations like that?...

The #1 use case for this tech is "here's a problem I don't feel like solving, let's have a computer do magic". It's how it's advertised on TV, it's how it promoted in the software I already use. Food preparation? Travel planning? Shopping? Tutoring your children? You can do anything now!

I just talked to a realtor who will make a killing on a real estate transaction. Instead of offering human insights, they sent me "AI reviews" of several properties. The AI has never been to any of these properties and has no idea what they actually look like. But I guess that's how we operate now as a society.

If you go to eBay, every other listing description for used items is AI-generated. This is an official platform feature for sellers. The AI doesn't know the condition of the item or what's included or missing. Doesn't matter, it's magic. It's AGI, it will figure it out.

Most of the uses of AI I encounter as a consumer are like that, and the companies selling this tech are 100% complicit.

endymion-light 17 hours ago | parent [-]

Yes - the vast majority of labs offer a whole host of training material for users to understand how LLMs work. There are entire course websites created by each of the major vendors specifically for understanding how LLMs work. Here are a couple of examples:

https://academy.openai.com/public/content

https://www.commonsense.org/education/articles/practical-tip...

Quote

>Getting the most out of generative AI depends on what you put in. To quote our Outreach team, "It's a tool, not magic!" As the technology evolves, more and more chatbots are designed for specific purposes.

https://www.anthropic.com/learn https://anthropic.skilljar.com/ai-fluency-framework-foundati...

https://grow.google/ai

All of the above are completely free; you could start 3 courses today that specifically teach you how AI tools work in practice. Yes, this is different from the marketing that some of these tools use, but these resources are there, free and available.

Maybe we need to create a form of driving license for responsible AI use, but saying the resources don't exist is not accurate

pertymcpert 16 hours ago | parent [-]

I expect crickets to your response.

joelthelion 15 hours ago | parent | prev | next [-]

In my opinion there is also a deficiency in the models, which should be able to say "I don't know" when asked for something unreasonable.

mathgradthrow 16 hours ago | parent | prev | next [-]

a realistic response? What's a realistic response to "how many calories are in an avocado?"

If you are counting calories, you don't want the answer to "how many calories are in the average avocado?", you want to know how many calories are in this avocado. Remember that bodyweight is roughly linear with BMR, so a 10% error in calorie counting is an extra 10% of bodyweight.

macleginn 18 hours ago | parent | prev | next [-]

It doesn't really matter whether the model can make a good educated guess about the calories in the food if it can't even give a consistent response to the same input.

winddude 18 hours ago | parent | prev | next [-]

As a T1 diabetic, this is exactly what we do for nearly every meal we eat, especially in restaurants: look at it and try to estimate the number of carbs.

SirMaster 17 hours ago | parent | prev | next [-]

How about instead of blaming the user for not understanding how AI works, the AI makers stop letting their chatbots so confidently answer questions they clearly can't answer...

If I ask the AI about some health issue, it says something along the lines of "warning: I'm not a doctor", etc. So if I show it a picture and ask it to tell me the carbs, how about a warning telling me it can try, but that it probably won't be very accurate.

ilivethere 19 hours ago | parent | prev | next [-]

> But the author just took pictures of food & expected a realistic response?

Outside our tech-enabled bubble, there are folks who have been sold the idea that ChatGPT et al. are miracle workers capable of replacing dieticians, gym coaches, psychologists, etc.

So it's VERY plausible to believe that there are folks out there snapping pics of their meals and asking GPT to spit out nutritional values.

endymion-light 18 hours ago | parent [-]

That's a good point - and I think there's a wider core lack of knowledge outside of the bubble.

I suppose I just expected this study to be a little less 'water is wet', which made me disappointed, but that may be coming at it from a more technical perspective.

larodi 18 hours ago | parent | prev | next [-]

> This entire article would be better suited to astrology.com than hackernews.

I laughed, but you nailed it. Sadly, so many people lack even a basic understanding of LLMs and the ViT tower that makes one a vLLM that I expect a whole industry, similar to fortune telling, to emerge out of it.

black6 18 hours ago | parent | prev | next [-]

> There's an incredibly serious lack of education with how LLMs & carb-counting works

The public's education comes from the incessant marketing from AI companies that their models are the panacea for everything.

14 hours ago | parent | prev | next [-]
[deleted]
cyanydeez 15 hours ago | parent | prev | next [-]

So, do you think his methodology is closer to a computer scientist's, or to someone on Instagram's?

jrm4 16 hours ago | parent | prev | next [-]

This strikes me as a good "meta" article, though. As in, yes, people here probably don't need this. But perhaps a lot of other people do.

slumpt_ 16 hours ago | parent | prev | next [-]

This is how a lot of regular people are engaging with AI, whether you consider it silly or not.

sleepybrett 16 hours ago | parent | prev | next [-]

> There's an incredibly serious lack of education with how LLMs & carb-counting works. This entire article would be better suited to astrology.com than hackernews.

This is because the people who promote these technologies, and the companies that sell these technologies, engage in a massive amount of puffery (aka hyperbolizing aka just straight telling lies).

These technologies are painted as the magical solution to whatever problem you have (all it costs you is a few tens of thousands of tokens, aka your water supply). There is literally nothing they CAN'T do, if you will just let us build these gigantic small-town-destroying, noise-polluting, water- and electricity-hungry "AI datacenters". So that we can use those datacenters to sell you more tokens to put into their slot machines.

YeGoblynQueenne 17 hours ago | parent | prev | next [-]

>> But the author just took pictures of food & expected a realistic response? Is this genuinely what amounts to a study in AI?

The aim of the study was to understand the variation in results returned by models and how that could cause risks for patients using those models. The main result was measuring within-model variation.

From the pre-print (https://www.diabettech.com/wp-content/uploads/2026/04/diabet...):

We aimed to characterise the within-image reproducibility of carbohydrate estimates from four large language model (LLM) vision APIs and to quantify the clinical risk for insulin dosing, stratifying accuracy by reference value quality.

Methods

Thirteen food photographs were each submitted 495–561 times to four LLM vision APIs (GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Pro, Gemini 3.1 Pro Preview) using an identical structured prompt adapted from the iAPS automated insulin delivery system (26,904 total queries, temperature 0.01). The primary outcome was within-image variation (coefficient of variation [CV], range, distributional normality). Secondary outcomes included accuracy against reference values for nine images, stratified by quality tier (packet label, weighed/measured, portioned, or visual estimate). Clinical risk was translated at an insulin-to-carbohydrate ratio of 1:10.

>> I'd like to see this study done using any kind of actual grounding knowledge, seeing what mistakes AI makes when attempting to query ground truth from picture analysis - there would at least be an interesting result methodology in that

The ground truth was established by the author. There's an appendix in the pre-print (Appendix I) that describes the methodology. Methods are described in page 4 of the pre-print:

Reference values for accuracy analysis

For nine of the thirteen images, the author estimated the carbohydrate content using methods described in Appendix 1. Reference quality was categorised into four tiers:

Tier 1 (packet label): Carbohydrate values derived from manufacturer nutrition labelling. Two images (cheese sandwich, soup with bread) used bread with labelled carbohydrate content of 20 g per slice.

Tier 2 (weighed/measured): Portions directly weighed and cross-referenced with established composition data. Three images (Bakewell tart, bakery cookie, breakfast burrito).

Tier 3 (portioned): Portions estimated by the author (not weighed) and combined with USDA composition data. Three images (roast dinner, chilli con carne with rice, stuffed pork loin).

Tier 4 (visual estimate): Portions and composition estimated from visual inspection. One image (churros).

For the four restaurant dishes (pizza capricciosa, eggs benedict, crema catalana, paella), no reference value was established. These images were used for the primary reproducibility analysis only.

Carbohydrate values follow the EU convention with dietary fibre excluded.
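The paper's two per-image headline figures (the CV and the translation of carb spread into insulin units at a 1:10 ratio) are straightforward to reproduce from any set of repeated estimates. A minimal sketch, with made-up numbers rather than the paper's data:

```python
import statistics

def within_image_stats(estimates_g, icr=10.0):
    """Summarise repeated carb estimates (in grams) for one image.

    CV = sample stdev / mean; dose_range_u converts the spread of carb
    estimates into insulin units at an insulin-to-carb ratio of 1:icr.
    """
    mean = statistics.mean(estimates_g)
    cv = statistics.stdev(estimates_g) / mean
    dose_range_u = (min(estimates_g) / icr, max(estimates_g) / icr)
    return {"mean_g": mean, "cv": cv, "dose_range_u": dose_range_u}

# Made-up repeated estimates for one photo:
stats = within_image_stats([40, 55, 70, 45, 60])
print(round(stats["cv"], 2))   # 0.22
print(stats["dose_range_u"])   # (4.0, 7.0) units at a 1:10 ratio
```

The point of the dose translation: a model that returns anywhere from 40 g to 70 g for the same photo implies a 3-unit swing in the insulin dose, which is clinically meaningful.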

17 hours ago | parent | prev | next [-]
[deleted]
jmyeet 18 hours ago | parent | prev | next [-]

> But the author just took pictures of food & expected a realistic response?

If someone sent me a picture of a meal and asked me what the macros were or how many carbs this is, I would say "I can't tell from a photo. Nobody can". The problem is that current LLM chatbots don't seem to have a concept of telling you "I don't know", "you can't do that" or even "you're wrong".

You can say that somebody shouldn't trust an LLM for this, but it's going to be a problem that LLMs give nonsensical answers. What I find particularly amusing is that there are still technical people (generally, not anyone specifically) who seem unable to acknowledge that LLMs hallucinate and lie.

There was a post on here recently that I couldn't find with some quick searching but the premise basically was that chatbots were trained like neurotypical people: A lot of affirmation and basically lying. Separately someone else characterized this NT style of communication as "tone poems" [1]. I keep thinking about that because to me that's so accurate.

Dunning-Kruger is a common refrain on HN, for good reason. Another way to put this is how often people are confidently wrong. I really wonder if this is an inevitable consequence of NT communication because most neurodivergent ("ND") people I know are incredibly intentional in what they say and mean.

[1]: https://news.ycombinator.com/item?id=47832952

sarusso 18 hours ago | parent | prev | next [-]

[dead]

throwaway613746 19 hours ago | parent | prev [-]

[dead]