rvz 5 hours ago

I expect almost no-one to read the Gemini 3 model card. But here is a damning excerpt from the early leaked model card from [0]:

> The training dataset also includes: publicly available datasets that are readily downloadable; data obtained by crawlers; licensed data obtained via commercial licensing agreements; user data (i.e., data collected from users of Google products and services to train AI models, along with user interactions with the model) in accordance with Google’s relevant terms of service, privacy policy, service-specific policies, and pursuant to user controls, where appropriate; other datasets that Google acquires or generates in the course of its business operations, or directly from its workforce; and AI-generated synthetic data.

So your Gmail messages are being read by Gemini and put into the training set for future models. Oh dear, and Google is being sued over using Gemini to analyze users' data, which potentially includes Gmail by default [1].

Where is the outrage?

[0] https://web.archive.org/web/20251118111103/https://storage.g...

[1] https://www.yahoo.com/news/articles/google-sued-over-gemini-...

inkysigma 5 hours ago

Isn't Gmail covered under the Workspace privacy policy, which forbids using it for training data? So I'm guessing that's excluded by the "in accordance" clause.

andrewinardeer 3 hours ago

The real question is, "For how long?"

recitedropper 4 hours ago

I'm pretty sure they mention in their various TOSes that they don't train on user data in places like Gmail.

That said, LLMs are the most data-greedy technology of all time, and it wouldn't surprise me if companies building them feel so much pressure to top each other that they "sidestep" their own TOSes. There are already plenty of signals that they are changing their terms to train where they previously said they wouldn't--see Anthropic's update in August regarding Claude Code.

If anyone ever starts caring about privacy again, this might be a way to bring down the crazy AI capex and tech valuations. It is probably possible, if you are a sufficiently funded and motivated actor, to tease out evidence of training data that shouldn't be there under a vendor's TOS. There is already evidence that some IP owners (like the NYT) have done this for copyright claims, but you could get a lot more pitchforks out if it turned out Jane Doe's HIPAA-protected information in an email had been trained on.

stefs 5 hours ago

i'm very doubtful gmail messages are used to train the model by default, because emails contain private data, and as soon as that private data shows up in model output, gmail is done.

"gmail being read by gemini" does NOT mean "gemini is trained on your private gmail correspondence". it can mean gemini loads your emails into a session context so it can answer questions about your mail, which is quite different.

Yizahi 4 hours ago

By the year 2025, I think most HN regulars and IT people in general are so jaded about privacy that it doesn't even surprise anyone. I suspect all Gmail messages have been analyzed and read since the beginning of the Google age, so nothing has really changed; they might as well just admit it.

Google is betting that moving your email and cloud storage elsewhere is such a giant hassle that almost no one will do it, and that ditching YouTube and Maps is simply impossible.

aoeusnth1 5 hours ago

This seems like a dubious conclusion. I think you missed this part of the excerpt:

> in accordance with Google’s relevant terms of service, privacy policy