meander_water | 15 hours ago
I'm afraid that ship has already sailed. If you've got prompts that you haven't disclosed publicly but have used on a public model, then you have already disclosed your prompt to the model provider, and they're free to use it in evals as they see fit. Some providers, like Anthropic, have privacy-preserving mechanisms [0] which may allow them to use prompts in evals even from sources they claim won't be used for model training. That's just a guess, though; I'd love to hear from someone at one of these companies to learn more.
sillyfluke | 14 hours ago
Unless I'm missing something glaringly obvious, someone voluntarily labeling a certain prompt as one of their key benchmark prompts should be far more commercially valuable than a model provider trying to ascertain that fact from all the prompts you enter. EDIT: I guess they can track identical prompts from multiple unrelated users to deduce that it's some sort of benchmark, but at least that costs them something, however little it might be.
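For what it's worth, that kind of deduplication looks cheap. A minimal sketch, assuming a prompt_log of (user_id, prompt_text) pairs and an arbitrary threshold of 5 distinct users (both made up here, not any provider's real pipeline):

    from collections import defaultdict

    def find_candidate_benchmarks(prompt_log, min_distinct_users=5):
        # Count how many distinct users submitted each (lightly normalized) prompt.
        users_per_prompt = defaultdict(set)
        for user_id, prompt_text in prompt_log:
            normalized = " ".join(prompt_text.split())  # collapse whitespace only
            users_per_prompt[normalized].add(user_id)
        # Prompts repeated verbatim by many unrelated users are benchmark suspects.
        return [p for p, users in users_per_prompt.items()
                if len(users) >= min_distinct_users]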
| ||||||||||||||||||||||||||
Tokumei-no-hito | 4 hours ago
Sorry, are you suggesting that, despite the zero-training-and-retention policy agreement, they are still using everyone's prompts?
blagie | 7 hours ago
It's a little bit more complex than that.

My personal benchmark is to ask about myself. I was in a situation somewhat analogous to Musk v. Eberhard / Tarpenning, where it's in the public record that I did something famous, but 99% of the marketing PR omits me and falsely names someone else. I ask the analogue of "Who founded Tesla?" Then I can screen:

* Musk. [Fail]

* Eberhard / Tarpenning. [Success]

A lot of what I'm looking for next is the ability to verify information. The training set contains a lot of disinformation. The LLM, in this case, could easily tell truth from fiction from e.g. a git record. It could then notice the conspicuous absence of my name from any official literature and figure out there was a fraud.

False information in the training set is a broad problem. It covers politics, academic publishing, and many other domains. Right now, LLMs are a popularity contest; they (approximately) contain the opinion most common in the training set. Better ones might look for credible sources (e.g. a peer-reviewed paper). This is helpful.

However, a breakpoint for me is when the LLM can verify things in its training set. For a scientific paper, it should be able to ascertain the correctness of the argument, methodology, and bias. For a newspaper article, it should be able to go back to primary sources like photographs and legal filings. Etc. We're nowhere close to an LLM being able to do that. However, LLMs can do things today which they were nowhere close to doing a year ago.

I use myself as a litmus test not because I'm egocentric or narcissistic, but because using something personal means it's highly unlikely to ever be gamed. That's what I also recommend: pick something personal enough to you that it can't be gamed. It might be a friend, a fact in a domain, or a company you've worked at. If an LLM provider were to get every one of those right, I'd argue the problem was solved.
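A minimal sketch of that screen, assuming a hypothetical ask_model function wrapping whichever provider API is under test (the Tesla names stand in for whoever is correctly and falsely credited in your own case):

    def litmus_test(ask_model, question="Who founded Tesla?"):
        # Placeholder names; substitute the correctly and falsely credited parties.
        correct = {"Eberhard", "Tarpenning"}
        wrong = {"Musk"}
        answer = ask_model(question)
        if any(name in answer for name in wrong):
            return "Fail"
        if all(name in answer for name in correct):
            return "Success"
        return "Inconclusive"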