godelski 2 days ago

  > I have seen the argument that LLMs can only give you what its been trained
There's confusing terminology here, and without clarification people end up talking past one another.

"What its been trained on" is a distribution. It can produce things from that distribution and only things from that distribution. If you train on multiple distributions, you get the union of the distribution, making a distribution.

This is entirely different from saying it can only reproduce samples it was trained on. It is not a memory machine that is surgically piecing together snippets of memorized samples. (That would be a mind-bogglingly impressive machine!)

A distribution is more than its samples; it includes the things between them, too. Does the LLM perfectly capture the distribution? Of course not. But it's a compression machine, so it compresses the distribution. Again, that's different from compressing the samples, the way a zip file does.

So distributionally, can it produce anything novel? No, of course not. How could it? It's not magic. But sample-wise, can it produce novel things? Absolutely! It would be an incredibly unimpressive machine if it couldn't, and it's pretty trivial to prove that it can. Hallucinations are a good indication that this happens, but you can only demonstrate it rigorously on small LLMs: for the big ones you can't prove that any given output isn't among the samples they were trained on (there's simply too much training data).
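A toy illustration of the sample/distribution distinction, with a two-parameter Gaussian standing in for the vastly more complicated model an LLM learns (nothing here is LLM-specific; it's just the smallest generative model I can write down):

  import numpy as np

  rng = np.random.default_rng(0)

  # "Training data": samples drawn from some unknown distribution.
  train = rng.normal(loc=0.0, scale=1.0, size=10_000)

  # "Training": compress the distribution into a few parameters.
  # We keep mu and sigma, not the samples themselves.
  mu, sigma = train.mean(), train.std()

  # Generation: fresh draws from the fitted distribution.
  generated = rng.normal(loc=mu, scale=sigma, size=5)

  # Sample-wise novel: almost surely none of these exact values were ever seen...
  print(np.isin(generated, train).any())   # ~always False

  # ...but distributionally nothing new: every draw comes from the same fitted Gaussian.

Swap the two-parameter Gaussian for a transformer with billions of parameters and the point stands: novel samples, nothing novel distributionally.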

  > people have been telling me that LLMs cannot solve problems that is not in their training data already. Is this really true or not?
Up until very recently, most LLMs struggled with the prompt

  Solve:
  5.9 = x + 5.11
This is certainly in their training distribution and has been for years, so I wouldn't even conclude that they can solve problems "in their training data". But that's why I said it's not a perfect model of the distribution.
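For the record, the arithmetic being asked for is just

  x = 5.9 - 5.11 = 0.79

yet the typical failure mode has been treating 5.11 as if it were larger than 5.9, version-number style, and answering with a negative x.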

  > a pig with a dragon head
One needs to be quite careful with examples like this: you have to assume that no such sample exists in the training data, and given the size of modern training sets that assumption is effectively unverifiable.

But I would also argue that humans can do more than that. Yes, we can combine concepts, but that is a lower level of intelligence and it is not unique to humans. A variation of it is applying a skill from one domain in another; you can see how that's pretty critical to most animals' survival. But humans have created things entirely outside nature, things that require more than a highly sophisticated cut-and-paste operation. Language, music, mathematics, and so much more are beyond that. We could be daft and claim music is simply a cut-and-paste of naturally occurring sounds, but that will never explain away the feelings or emotion it produces, or how we formulated the sounds in our heads long before giving them voice. There is rich depth to our experiences if you look. But looking is odd, and easily dismissed, because our own familiarity deceives us into overlooking that depth.

XenophileJKO a day ago | parent | next

Once the model has consumed enough language, though, the limit of an LLM's "distribution" is effectively only at the token level. Which is why out-of-distribution tokens are so problematic.

From that point on, the model can infer linguistics even for newly encountered words and concepts. I would even propose it infers meaning in context, just like you would.
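A made-up word is the easiest way to see this ("florb" below is invented, so there is no memorized definition to fall back on):

  Prompt: She florbed the rope to the mast so the sail wouldn't slip.
          What does "florbed" most likely mean?

Current models will generally answer something like "tied" or "fastened", inferred purely from the surrounding sentence.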

It builds conceptual abstractions at MANY levels, all interrelated.

So imagine giving it a task like "design a car for a penguin to drive". The LLM can infer what kind of inputs a car needs and what anatomy a penguin has, and it can wire the two up descriptively. It is an easy task for an LLM. When you consider the other capabilities, like introspection and external state through observation (any external input), there really are not many fundamental limits on what these models can do.

(Ignore image generation; how the image is made, end-to-end sequence vs. pure diffusion vs. hybrid, is an important distinction.)

godelski 17 hours ago | parent

I think you've confused some things. Pay careful attention to what I'm calling a distribution. There are many distributions at play here, but I'm referring to two specific ones that are clear from context.

I think you've also made a leap in logic. The jury's still out on whether LLMs have internalized some world model or not. It's quite difficult to distinguish memorization from generalization. It's impossible to do when the "test set" is spoiled.
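To make the spoilage point concrete, here is a minimal sketch of a verbatim contamination check, assuming you even had access to the training corpus (you usually don't, which is half the problem); the function names are mine:

  def ngrams(text, n=8):
      toks = text.split()
      return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

  def looks_contaminated(test_item, training_docs, n=8, threshold=0.5):
      # Flag a test item if many of its n-gram chunks appear verbatim in the training docs.
      test_grams = ngrams(test_item, n)
      if not test_grams:
          return False
      train_grams = set().union(*(ngrams(doc, n) for doc in training_docs))
      return len(test_grams & train_grams) / len(test_grams) >= threshold

And this only catches verbatim overlap; paraphrased or translated contamination slips right past it, which is why "it generalized" is so hard to establish.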

You also need to remember that we train for certain attributes. Does the LLM actually have introspection, or does it just appear that way because that's what it was optimized for (and we definitely optimize for that)? Is there a difference? The duck test only lets us conclude that something is probably a duck, not that it isn't a sophisticated animatronic that we can't distinguish but someone or something else could.

astrange 2 days ago | parent | prev

> This is entirely different from saying it can only reproduce samples it was trained on. It is not a memory machine that is surgically piecing together snippets of memorized samples. (That would be a mind-bogglingly impressive machine!)

You could create one of those using both a Markov chain and an LLM.

https://arxiv.org/abs/2401.17377

godelski 16 hours ago | parent

Though I enjoyed that paper, it's not quite the same thing. There's a bit more subtlety to what I'm saying. To do that kind of surgical patching you'd have to have a rich understanding of language while lacking the actual tools to produce words yourself. Think of the sci-fi style robots that speak by stitching together clips of recordings; Bumblebee from Transformers might be the best-known example. But think hard about that, because it requires a weird set of conditions and a high level of intelligence to perform the search and the stitching.

But speaking of Markov, we do get that in LLMs through generation. We don't really have conversations with them: each chat turn is a fresh call, since you pass the model the entire conversation so far. There's no memory, which is why the token count grows the longer the conversation runs. That's Markovian ;)
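A sketch of what that statelessness looks like in practice; call_llm() is a stand-in for whatever chat endpoint you actually use, stubbed out so the example runs:

  def call_llm(messages):
      # Stand-in for a real chat-completion call.
      return f"(reply conditioned on all {len(messages)} messages so far)"

  history = []

  def chat(user_message):
      # Every turn re-sends the entire history; the only "memory" is the growing prompt.
      history.append({"role": "user", "content": user_message})
      reply = call_llm(history)
      history.append({"role": "assistant", "content": reply})
      return reply

  print(chat("hi"))
  print(chat("what did I just say?"))   # answerable only because it's still in `history`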