gooodvibes 6 days ago
This behavior comes from the later stages of training that turn the model into an assistant; you can't blame the original training data. (ChatGPT doesn't sound like Reddit or like Wikipedia even though it has both in its original data.)
morpheos137 3 days ago | parent
It is shocking to me that 99% of people on YC news don't understand that LLMs encode tokens, not verbatim training data. This is why I don't understand the NYT lawsuit against OpenAI: I can't see ChatGPT reproducing any text verbatim. Rather, it is a fine-grained encoding of style across a multitude of domains. Again, LLMs do not contain their training data; they are a lossy compression of what the training data looks like.
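The "lossy compression" point can be sketched with back-of-envelope arithmetic: the model's weights take up meaningfully less space than the raw training text, so byte-for-byte storage of everything the model saw is impossible. The figures below are rough public estimates for a GPT-3-scale model (~175B parameters, ~300B training tokens), used purely for illustration:

```python
# Back-of-envelope: compare weight storage to training-text volume.
# All figures are rough public estimates, not exact numbers.

params = 175e9              # ~175B parameters (GPT-3 scale)
bytes_per_param = 2         # fp16 storage
weight_bytes = params * bytes_per_param

train_tokens = 300e9        # ~300B training tokens
bytes_per_token = 4         # ~4 bytes of text per token, rough average
train_bytes = train_tokens * bytes_per_token

ratio = train_bytes / weight_bytes
print(f"weights:  {weight_bytes / 1e9:.0f} GB")
print(f"training: {train_bytes / 1e9:.0f} GB")
print(f"ratio:    {ratio:.1f}x more training text than weight storage")
```

Even under these loose assumptions the training corpus is several times larger than the weights, so something has to be thrown away; what survives is statistics and style rather than a verbatim archive (which is not to say short memorized passages can never surface).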