crazygringo 3 days ago

> the overwhelming majority of input it has in-fact seen somewhere in the corpus it was trained on.

But it thinks just great on stuff it wasn't trained on.

I give it code I wrote that is not in its training data, based on new concepts I've come up with for an academic paper I'm writing, and ask it to extend the code in accordance with those concepts. It does a great job.

This isn't regurgitation. Even if a lot of LLM usage is, the whole point is that it also does fantastically with stuff that is brand new. It genuinely creates new, valuable things it has never seen before, assembling them in ways that require thinking.

rustystump 3 days ago

I think you may think too highly of academic papers, or rather overlook that often only about 1% of what's in them is genuinely new.

crazygringo 3 days ago

I think you're missing the point. This is my own paper and these are my own new concepts. It doesn't matter if the definitions of the new concepts are only 1% of the paper; the point is that they are the concepts I'm asking the LLM to use, and they are not in its training data.

Terr_ 3 days ago

How would one prove the premise that a concept is not present in the training data?

Given how much data is being shoveled in there, our default assumption should be that significant components of any given idea are already present.

crazygringo 2 days ago

That would be a weird default assumption. It's not hard to come up with new ideas. In fact, it's trivial.

And if you want to know if a specific concept is known by the LLM, you can literally ask it. It generally does a great job of telling you what it is and is not familiar with.

zeroonetwothree 3 days ago

I think it would be hard to prove that a concept is truly so novel that nothing similar is present in the training data. I've certainly seen in research that it's quite easy to miss related work even with extensive searching.