heyjamesknight 3 days ago

But language is the input and the vector space within which their knowledge is encoded and stored. They don't have a concept of a duck beyond what others have described the duck as.

Humans got by for millions of years with our current biological hardware before we developed language. Your brain stores a model of your experience, not just the words other experiencers have shared with you.

embedding-shape 3 days ago | parent [-]

> But language is the input and the vector space within which their knowledge is encoded and stored. They don't have a concept of a duck beyond what others have described the duck as.

I guess if we limit ourselves to single-modality LLMs, yes, but nowadays we have multimodal ones, which could think of a duck in terms of language, visuals, or even audio.

deadbabe 3 days ago | parent [-]

You don’t understand. If humans had no words to describe a duck, they would still know what a duck is. Without words, LLMs would have no way to map an encounter with a duck to anything useful.

embedding-shape 2 days ago | parent [-]

Which makes sense for text LLMs, yes, but what about LLMs that deal with images? How can you tell they wouldn't work without words? Words just happen to be what we use for interfacing with them, because they're easy for us to understand, but internally the models might be conceptualizing things in a multitude of ways.

heyjamesknight 2 days ago | parent [-]

Multimodal models aren't really multimodal. The images are mapped to words and then the words are expanded upon by a single mode LLM.

If you didn't know the word "duck", you could still see the duck, hunt the duck, use the duck's feathers for your bedding, and eat the duck's meat. You would know it could fly and swim without having to know what either of those actions was called.

The LLM "sees" a thing, identifies it as a "duck", and then depends on a single modal LLM to tell it anything about ducks.

embedding-shape 2 days ago | parent [-]

> Multimodal models aren't really multimodal. The images are mapped to words and then the words are expanded upon by a single mode LLM.

I don't think you can generalize like that. It's a big category, and not all multimodal models work the same way. "Multimodal" is just a label for a model that handles multiple modalities, after all, not a specific machine learning architecture.