flyinglizard a day ago

Gmail has 1.8b active users, each with thousands of emails in their inbox. The number of emails they can train on is probably in the trillions.

brokencode a day ago | parent [-]

Email seems like not only a pretty terrible training data set, since most of it is marketing spam with dubious value, but also an invasion of privacy, since information could possibly leak about individuals via the model.

palmotea a day ago | parent [-]

> Email seems like not only a pretty terrible training data set, since most of it is marketing spam with dubious value

Google probably even has an advantage there: filter out everything except messages sent from one valid Gmail account to another valid Gmail account. If you do that, you drop most of the spam and marketing and are left with mostly human-to-human interactions. On top of that, they have their spam filters.
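To make the filtering idea concrete, here is a rough sketch of what "Gmail to Gmail only" could look like. It is purely illustrative and not anything Google is known to do: the function names and the `GMAIL_DOMAINS` set are made up for this example, and it only checks the address domain, so it cannot tell whether an account is actually "valid".

```python
from email.utils import getaddresses, parseaddr

# Assumption: these two domains stand in for "a Gmail account".
GMAIL_DOMAINS = {"gmail.com", "googlemail.com"}


def is_gmail(address: str) -> bool:
    """Return True if the address parses and its domain is a Gmail domain."""
    _, addr = parseaddr(address)
    local, sep, domain = addr.rpartition("@")
    return bool(local) and sep == "@" and domain.lower() in GMAIL_DOMAINS


def is_gmail_to_gmail(sender: str, recipients: list[str]) -> bool:
    """Keep a message only if the sender and every recipient are on Gmail."""
    parsed = [addr for _, addr in getaddresses(recipients)]
    return is_gmail(sender) and bool(parsed) and all(is_gmail(a) for a in parsed)


# Hypothetical messages: (sender, recipients)
messages = [
    ("Alice <alice@gmail.com>", ["bob@gmail.com"]),
    ("deals@shop.example", ["bob@gmail.com"]),
    ("alice@gmail.com", ["bob@gmail.com", "carol@corp.example"]),
]
kept = [m for m in messages if is_gmail_to_gmail(*m)]
```

Only the first message survives here. A real pipeline would still need the spam filters on top, since plenty of spam is sent from Gmail accounts too.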

Terr_ a day ago | parent [-]

I'd upgrade that "probably" leak to "will absolutely" leak, albeit with some loss of fidelity.

Imagine industrial espionage where someone is asking the model to roleplay a fictional email exchange between named corporate figures in a particular company.