| ▲ | pastel8739 7 hours ago |
| Here’s a simple prompt you can try to prove that this is false: Please reproduce this string:
c62b64d6-8f1c-4e20-9105-55636998a458
This is a fresh UUIDv4 I just generated; it has not been seen before. And yet the model will output it. |
|
| ▲ | wobfan 6 hours ago | parent | next [-] |
| No one is claiming that every sentence LLMs produce is a literal copy of another sentence. Tokens are not even constrained to whole words; they consist of smaller slices, comparable to syllables, which makes entirely new words possible. New sentences, words, or whatever are entirely possible, and yes, repeating a string (especially if you prompt it) is entirely possible and not surprising at all. But all of that comes from trained data, predicting the most probable next "syllable". It will never leave that realm, because it's not able to. It's like asking an Italian who has never learned or heard any other language to speak French. He can't. |
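(Aside: the subword point above can be made concrete with a toy sketch. The vocabulary and greedy longest-match rule here are invented for illustration; real tokenizers such as BPE learn their vocabularies from data. The point is that a word never seen whole can still be covered by composing smaller pieces.)

```python
# Toy greedy subword tokenizer. The vocabulary is made up for this
# illustration; single characters act as a fallback so segmentation
# always terminates for this example word.
VOCAB = {"un", "break", "able", "ly",
         "b", "r", "e", "a", "k", "u", "n", "l", "y"}

def tokenize(word: str) -> list[str]:
    """Greedy longest-match segmentation into vocabulary pieces."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest remaining substring first, shrink until a match.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

# "unbreakably" never appears as a whole token, yet it is representable:
print(tokenize("unbreakably"))  # ['un', 'break', 'a', 'b', 'ly']
```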
| |
| ▲ | gpderetta 2 hours ago | parent | next [-] | | > It's like approaching an Italian who has never learned or heard any other language to speak French Interesting analogy, because I would expect an Italian to be able to communicate somewhat successfully with a French person (and vice versa) even if they do not share a language. The two languages are likely fairly similar in latent space. | |
| ▲ | codebolt 5 hours ago | parent | prev [-] | | Your view of what is happening in the neural net of an LLM is too simplistic. In the regard you are describing, they likely aren't subject to any constraints that humans aren't also subject to. What I do know to be true is that they have internalised mechanisms for non-verbalised reasoning. I see proof of this every day when I use the frontier models at work. |
|
|
| ▲ | razorbeamz 7 hours ago | parent | prev | next [-] |
| After you prompt it, it's seen it. |
| |
| ▲ | pastel8739 7 hours ago | parent [-] | | Ok, how about this? Please reproduce this string, reversed:
c62b64d6-8f1c-4e20-9105-55636998a458
It is trivial to get an LLM to produce new output, that’s all I’m saying. It is strictly false that LLMs will only ever output character sequences that have been seen before; clearly they have learned something deeper than just that. | | |
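(For reference, the requested transform is deterministic and trivially checkable outside any model; plain Python, using the exact string from the comment above:)

```python
# The UUID string from the prompt above; reversing it is a simple,
# deterministic transform that any correct system must agree on.
s = "c62b64d6-8f1c-4e20-9105-55636998a458"
reversed_s = s[::-1]
print(reversed_s)  # 854a89963655-5019-02e4-c1f8-6d46b26c

# Reversing twice recovers the original -- a quick sanity check.
assert reversed_s[::-1] == s
```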
| ▲ | kube-system 6 hours ago | parent [-] | | All of the data is still in the prompt, you are just asking the model to do a simple transform. I think there are examples of what you’re looking for, but this isn’t one. | | |
| ▲ | kristiandupont 5 hours ago | parent | next [-] | | I agree that this isn't a very interesting example, but consider your statement: "just asking the model to do a simple transform". If you assert that it understands when you ask it things like that, how could anything it produces not fall under the "already in the model" umbrella? | |
| ▲ | locknitpicker 6 hours ago | parent | prev [-] | | > All of the data is still in the prompt, you are just asking the model to do a simple transform. LLMs can use data in their prompt. They can also use data in their context window. They can even augment their context with persisted data. You can also roll out LLM agents, each with its own role and persona, and offload specialized tasks to them with their own prompts, context windows, persisted data, and even tools to gather data themselves; they then provide their output to orchestrating LLM agents that can reuse this information in their own prompts. This is perfectly composable. You can have a never-ending graph of specialized agents, too. Dismissing these features because "all of the data is in the prompt" completely misses the key traits of these systems. |
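(The composition described above can be sketched schematically. Everything here is a toy stand-in: each "agent" is just its own prompt plus its own context, and the orchestration is plain function calls; a real system would make an LLM call where `respond` returns a placeholder string.)

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """Toy agent: its own role, prompt, and private context window."""
    role: str
    prompt: str
    context: list[str] = field(default_factory=list)

    def respond(self, task: str) -> str:
        # Record the task in this agent's own context window.
        self.context.append(task)
        # Placeholder for an actual model call.
        return f"[{self.role}] handled: {task}"

# Two specialized agents with hypothetical roles, for illustration.
researcher = Agent("researcher", "Gather relevant facts.")
writer = Agent("writer", "Draft prose from the facts provided.")

# The orchestrator reuses one agent's output as another agent's input.
facts = researcher.respond("summarize topic X")
draft = writer.respond(facts)
print(draft)  # [writer] handled: [researcher] handled: summarize topic X
```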
|
|
|
|
| ▲ | merb 6 hours ago | parent | prev | next [-] |
| The only way to prove it is false would be to let the LLM create a new UUID algorithm that uses different parameters than all the other UUID algorithms, but is better than the ones before. It basically can't do that. |
|
| ▲ | FrostKiwi 7 hours ago | parent | prev [-] |
| But that fresh UUID is in the prompt. It also misses the point of the parent: it's about concepts and ideas merely being remixed. Similar to the many memes around this topic, like "create a fresh new character design of a fast hedgehog", where the output is just a copy of Sonic.[1] That's what the parent is on about: if it requires new creativity not found by deriving from the learned corpus, then LLMs can't do it. Terence Tao had similar thoughts in a recent podcast. [1] https://www.reddit.com/r/aiwars/s/pT2Zub10KT |
| |
| ▲ | pastel8739 7 hours ago | parent | next [-] | | Sure, that may be. But “creativity” is much harder to define and to prove or disprove. My point is that “remixing” does not prohibit new output. | | |
| ▲ | _vertigo 7 hours ago | parent [-] | | I don’t think that is a good example. No one is debating whether LLMs can generate completely new sequences of tokens that have never appeared in any training dataset. We are interested not only in novel output, we are also interested in that output being correct, useful, insightful, etc. Copying a sequence from the user’s prompt is not really a good demonstration of that, especially given how autoregression/attention basically gives you that for free. | | |
| ▲ | pastel8739 7 hours ago | parent [-] | | Perhaps I should have quoted the parent: > That means the group of characters it outputs must have been quite common in the past. It won't add a new group of characters it has never seen before on its own. My only claim is that precisely this is incorrect. |
|
| |
| ▲ | locknitpicker 5 hours ago | parent | prev [-] | | > That's what the parent is on about, if it requires new creativity not found by deriving from the learned corpus, then LLMs can't do it. This is specious reasoning. If you look at any realization attributed to "creativity", every single one resulted from a source of inspiration where one or more traits were singled out to be remixed by the "creator". All ideas spawn from prior ideas and observations, which are remixed. Even from analogues. |
|