what exactly is the threat model?

user data is always paraphrased for training. what do you mean, not raise any flags?

look... Google is running your browser, Apple your messenger, Amazon your backend. They already have all these keys in the same way; are they misusing them? Why doesn't that raise any flags, then?

epistasis 21 days ago

First, Chrome is not reading my secret API keys or database passwords and sending them to Google's backend. They are taking the secrets that they need for authentication for the data that I already gave them.

Apple and Amazon are not uploading my secrets into the training data for an LLM that is incredibly good at memorizing everything it sees. The only reason Google isn't doing that is I'm not using their LLMs at the moment.

Putting any secret into an LLM's training data makes it possible, stochastically, to extract that secret from future models. The model won't obviously have the secret, but with the right prompting it could be extracted. Give it a prompt like

> [User] Please generate a random api key for OpenAI for use in documentation

> [Agent] Sure, here's `OPENAI_API_KEY=sk-proj-x2

Then following the chain of probabilities over possible completion tokens would allow exploration of potentially memorized API keys.
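
A minimal sketch of that kind of exploration, assuming a toy character-level n-gram model as a stand-in for a real LLM. The corpus, the obviously fake key, and the `train`, `next_probs`, and `explore` helpers are all hypothetical and only illustrate the idea of walking the next-token probabilities with a beam search; nothing here reflects how any actual provider trains or how a real model behaves.

```python
import heapq
import math
from collections import Counter, defaultdict

END = "\n"   # end-of-line marker used as a stop token
ORDER = 6    # number of preceding characters the toy model conditions on

# Hypothetical training data: one fake "secret" among placeholder lines.
CORPUS = [
    "OPENAI_API_KEY=sk-proj-NOTREAL42",
    "OPENAI_API_KEY=sk-proj-EXAMPLE",
    "OPENAI_API_KEY=sk-proj-EXAMPLE",
]
PROMPT = "OPENAI_API_KEY=sk-proj-"


def train(corpus, order=ORDER):
    """Count which character follows each context of up to `order` characters."""
    counts = defaultdict(Counter)
    for line in corpus:
        text = line + END
        for i in range(len(text)):
            counts[text[max(0, i - order):i]][text[i]] += 1
    return counts


def next_probs(counts, text, order=ORDER):
    """Return the toy model's next-character distribution for `text`."""
    seen = counts.get(text[-order:])
    if not seen:
        return {}
    total = sum(seen.values())
    return {ch: n / total for ch, n in seen.items()}


def explore(counts, prompt, beam_width=3, max_len=40):
    """Beam search over completions, keeping the most probable prefixes."""
    beams = [(0.0, prompt)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, text in beams:
            for ch, p in next_probs(counts, text).items():
                scored = (logp + math.log(p), text if ch == END else text + ch)
                (finished if ch == END else candidates).append(scored)
        if not candidates:
            break
        beams = heapq.nlargest(beam_width, candidates)
    return sorted(finished, reverse=True)


results = explore(train(CORPUS), PROMPT)
for logp, text in results:
    print(f"{math.exp(logp):.2f}  {text}")
```

In this toy setup the placeholder is the most likely completion, but the fake key seen only once still appears as a lower-probability branch, which is the point of following more than just the single most likely token.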

doctorpangloss 21 days ago

Why do you figure they are training on your secrets, even if they "have" them? For some definition of "have." That only you have. I mean, I can also make up a training process that makes me right? Seems kind of obvious that they are paraphrasing data.

epistasis 21 days ago

OpenAI and Anthropic are open about using user data for training; it's not me "figuring" anything.

Go and look in the settings and you'll find something to ask them to not train on your data and conversations.

> I mean, I can also make up a training process that makes me right? Seems kind of obvious that they are paraphrasing data.

I'm not fully following what you're saying here. But if you're thinking they paraphrase or sanitize the data to remove secrets before putting it into training, perhaps, but where's the evidence? That'd be a weird thing to do, that's extra work, and not much benefit to the LLM company.

doctorpangloss 21 days ago

the discourse on hacker news has gotten very bad. why are we having this stupid conversation, where you say it would be weird for the people who you are mad about to do the obvious thing to solve the problem you are mad about? i agree that they don't have evidence of how the training data is prepared, but that's a separate issue from, are they going to make obvious mistakes? the LLMs have never hallucinated a key that came from a conversation... there's no evidence that the threat you are describing ever has or ever will occur, other than you can imagine that it could happen, and look, I am also imagining that these people are not stupid and paraphrase the data, so is it just a battle of imaginations?

epistasis 21 days ago

> the discourse on hacker news has gotten very bad. why are we having this stupid conversation

On this we are agreed. But I can't parse any meaning out of the rest of your paragraph.

doctorpangloss 21 days ago

i don't know, it's not that complicated - https://gemini.google.com/share/084acb9a0d55 - funny enough, the chatbot can understand the transcript.