freedomben 9 hours ago

> Despite having access to my weight, blood pressure and cholesterol, ChatGPT based much of its negative assessment on an Apple Watch measurement known as VO2 max, the maximum amount of oxygen your body can consume during exercise. Apple says it collects an “estimate” of VO2 max, but the real thing requires a treadmill and a mask. Apple says its cardio fitness measures have been validated, but independent researchers have found those estimates can run low — by an average of 13 percent.

There's plenty of blame to go around, but for at least some of it (such as the above) I think the blame rests more on Apple for misrepresenting the quality of its product (and TFA seems pretty clearly to be blasting OpenAI for this, not others like Apple).

What would you expect the AI's behavior to be? Should it always assume the data is bad, or potentially bad? If so, that seems to defeat the point of having data at all, since you could never draw any conclusions from it. Even setting aside statistical outliers, it's not at all clear which parts of the data are "good" vs. "unreliable," especially when the company that collected the data claims it's good.
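
For illustration, a toy sketch (my own, not from the article): rather than treating a reading as simply good or bad, you can carry a known average bias through before drawing a conclusion. The 13% figure is the average underestimate TFA cites.

  # Toy sketch: adjust a sensor reading for a known average bias before
  # classifying it, rather than trusting it raw or discarding it entirely.
  MEAN_UNDERESTIMATE = 0.13  # average low bias reported for Apple Watch VO2 max

  def corrected_vo2max(watch_reading: float) -> tuple[float, float]:
      """Return (raw, bias-corrected) rather than pretending to one true value."""
      return watch_reading, watch_reading / (1.0 - MEAN_UNDERESTIMATE)

  raw, corrected = corrected_vo2max(35.0)
  print(f"raw: {raw:.1f}, bias-corrected: {corrected:.1f} ml/kg/min")
  # raw: 35.0, bias-corrected: 40.2 -- enough to cross a fitness-grade boundary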

brandonb 9 hours ago | parent | next [-]

FWIW, Apple has published validation data showing the Apple Watch's estimate is within 1.2 ml/kg/min of a lab-measured VO2 max.

Behind the scenes, it's using a pretty cool algorithm that combines deep learning with physiological ODEs: https://www.empirical.health/blog/how-apple-watch-cardio-fit...
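
Roughly, and this is just a toy illustration of the general idea rather than Apple's actual model (the blog post has the real details): a physiological ODE describes how oxygen uptake responds to heart rate, and the learned part supplies per-user parameters.

  # Toy illustration of "physiological ODE + learned parameters" -- not
  # Apple's actual algorithm. First-order oxygen-uptake kinetics: VO2
  # relaxes toward a heart-rate-dependent steady state with time constant tau.
  import numpy as np

  def simulate_vo2(hr, slope, intercept, tau=30.0, dt=1.0):
      vo2 = np.empty(len(hr))
      vo2[0] = intercept + slope * hr[0]
      for t in range(1, len(hr)):
          steady_state = intercept + slope * hr[t]
          vo2[t] = vo2[t - 1] + dt * (steady_state - vo2[t - 1]) / tau
      return vo2

  # In a real pipeline, slope/intercept/tau would be fit per user (that's
  # where the deep learning would come in); VO2 max is then the model's
  # extrapolation to maximal heart rate.
  hr = np.full(600, 150.0)  # ten minutes of brisk walking at 150 bpm
  print(f"{simulate_vo2(hr, slope=0.2, intercept=3.5)[-1]:.1f} ml/kg/min")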

itchyouch 8 hours ago | parent | next [-]

The trick with the VO2 max measurement on the Apple Watch, though, is that the person cannot waste any time during their outdoor walk and needs to maintain a brisk pace.

Then there are confounders, like altitude and elevation gain, that can sully the numbers.

It can be pretty great, but it needs a bit of control in order to get a proper reading.

ignoramous 8 hours ago | parent | prev [-]

The paper itself: https://www.apple.com/healthcare/docs/site/Using_Apple_Watch...

Apple's paper claims high accuracy, but the independent validation below found a mean underestimate of about 13%, in line with TFA's figure:

  Thirty participants wore an Apple Watch for 5-10 days to generate a VO2 max estimate. Subsequently, they underwent a maximal exercise treadmill test in accordance with the modified Åstrand protocol. The agreement between measurements from Apple Watch and indirect calorimetry was assessed using Bland-Altman analysis, mean absolute percentage error (MAPE), and mean absolute error (MAE).

  Overall, Apple Watch underestimated VO2 max, with a mean difference of 6.07 mL/kg/min (95% CI 3.77–8.38). Limits of agreement indicated variability between measurement methods (lower -6.11 mL/kg/min; upper 18.26 mL/kg/min). MAPE was calculated as 13.31% (95% CI 10.01–16.61), and MAE was 6.92 mL/kg/min (95% CI 4.89–8.94).

  These findings indicate that Apple Watch VO2 max estimates require further refinement prior to clinical implementation. However, further consideration of Apple Watch as an alternative to conventional VO2 max prediction from submaximal exercise is warranted, given its practical utility.
https://pmc.ncbi.nlm.nih.gov/articles/PMC12080799/
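
For anyone curious, the agreement statistics quoted above are straightforward to compute; a sketch with made-up numbers (not the study's actual data):

  # Bland-Altman bias/limits of agreement, MAPE, and MAE for paired measurements.
  import numpy as np

  lab   = np.array([42.0, 38.5, 51.0, 33.2, 45.8])  # treadmill (criterion)
  watch = np.array([37.1, 35.0, 43.5, 31.0, 40.2])  # wearable estimate

  diff = watch - lab
  bias = diff.mean()                      # mean difference
  spread = 1.96 * diff.std(ddof=1)        # 95% limits of agreement
  mae  = np.abs(diff).mean()              # mean absolute error
  mape = (np.abs(diff) / lab).mean() * 100  # mean absolute percentage error

  print(f"bias {bias:.2f} [{bias - spread:.2f}, {bias + spread:.2f}] ml/kg/min")
  print(f"MAE {mae:.2f} ml/kg/min, MAPE {mape:.1f}%")
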
aeonfox 9 hours ago | parent | prev | next [-]

> I think the blame more rests on Apple for falsely representing the quality of their product

There was plenty of other concerning stuff in that article, and from a quick read it wasn't suggested or implied that the VO2 max issue was the deciding factor in the original F score the author received. The article did suggest, many times over, that ChatGPT is really not equipped for the task of health diagnosis.

> There was another problem I discovered over time: When I tried asking the same heart longevity-grade question again, suddenly my score went up to a C. I asked again and again, watching the score swing between an F and a B.

brandonb 9 hours ago | parent [-]

The lack of self-consistency does seem like a sign of a deeper reliability issue. In most fields of machine learning, robustness to noise is something you need to "bake in" (often through data augmentation using knowledge of the domain) rather than something you get for free from training.
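
As a sketch of what "baking it in" can look like (illustrative only, not anyone's production code): jitter each training sample within the sensor's known error band so the model can't learn to treat a noisy estimate as exact.

  # Domain-specific augmentation sketch: add multiplicative noise matching
  # the ~13% error reported for the wearable's VO2 max estimate, so scores
  # learned downstream stay stable under that level of noise.
  import numpy as np

  rng = np.random.default_rng(0)

  def augment_vo2max(batch: np.ndarray, rel_error: float = 0.13) -> np.ndarray:
      return batch * rng.normal(loc=1.0, scale=rel_error, size=batch.shape)

  batch = np.array([35.0, 42.0, 28.5])
  print(augment_vo2max(batch))  # same subjects, sensor-plausible readings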

jayd16 8 hours ago | parent | prev | next [-]

Well, if it doesn't know the quality of the data, and especially if it would be dangerous to guess, then it should probably say it doesn't have an answer.
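
Concretely, that could look something like this (hypothetical cutoffs, purely for illustration): abstain whenever the sensor's known error band spans more than one grade.

  # Hypothetical abstention logic: refuse to grade when the input's known
  # error band is wide enough to change the conclusion.
  def classify(v: float) -> str:
      return "A" if v >= 45 else "C" if v >= 35 else "F"  # made-up cutoffs

  def grade_heart_longevity(vo2max: float, rel_error: float = 0.13) -> str:
      low, high = vo2max * (1 - rel_error), vo2max * (1 + rel_error)
      if classify(low) != classify(high):
          return "insufficient data: sensor error spans multiple grades"
      return classify(vo2max)

  print(grade_heart_longevity(36.0))  # band [31.3, 40.7] spans F..C -> abstains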

AndrewKemendo 8 hours ago | parent | prev | next [-]

> Should it always assume bad data or potentially bad data? If so, that seems like it would defeat the point of having data at all as you could never draw any conclusions from it.

Yes. You, and every other reasoning system, should always challenge the data and assume it’s biased at a minimum.

In its formal form, this is called “critical thinking.”

You could also call it skepticism.

That impossibility of drawing conclusions assumes there's a correct answer; it's known as the “problem of induction.” I promise you a machine is better at avoiding it than a human.

Many people freeze up or fail when faced with too much data: put someone with no experience in front of 500 people to give a speech if you want to watch this happen live.

hmokiguess 9 hours ago | parent | prev | next [-]

I have been sitting and waiting for the day these trackers get exposed as just another health fad, optimized to deliver shareholder value rather than to be serious enough for medical-grade applications.

NoPicklez 9 hours ago | parent [-]

I don't see how they're a health fad; they're extremely useful and accurate enough. There are plenty of studies and real-world data showing Garmin VO2 max readings coming within 1-2 points of an actual lab test.

There is constant debate about how accurately VO2 max is measured, and the watch's estimate depends heavily on actually doing the exercise needed to determine it. But yes, if you want a lab-grade, medically precise measure, you need a test that measures your actual oxygen uptake.

miltonlost 9 hours ago | parent | prev [-]

> What would you expect the behavior of the AI to be? Should it always assume bad data or potentially bad data? If so, that seems like it would defeat the point of having data at all as you could never draw any conclusions from it.

Well, I would expect the AI to provide the same response a real doctor would from the same information, which the article showed the doctors were able to do.

I also would expect the AI to provide the same answer every time for the same data, unlike what it did (swinging from F to B over multiple attempts in the article).

OpenAI is entirely to blame here for putting out faulty products (hallucinations even on accurate data are their fault).

jdub 7 hours ago | parent [-]

Why do you have those expectations?