heyjamesknight | 12 hours ago
You misunderstand how the multimodal piece works. The fundamental unit of encoding here is still semantic. That's not how it works in your mind: you don't need to know the word for sunset to experience the sunset.
ninetyninenine | 6 hours ago | parent
No, you misunderstand the ground-truth reality. The LLM doesn't need words as input: it can output pictures from pictures, so semantic words don't have to be part of the equation at all. Note also that serialized, one-dimensional string encodings are universal. Anything on the face of the earth, and the universe itself, can be encoded as a string of just two characters: one and zero. That means anything can be translated into a linear series of symbols, and the LLM can be trained on it. The LLM can be trained on anything.
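A minimal sketch of that serialization claim, in Python: arbitrary bytes (pixels, audio frames, text, anything) flatten losslessly into a one-dimensional sequence of 0/1 tokens, which is exactly the kind of input a sequence model consumes. The function names are illustrative only; no particular LLM's tokenizer or training pipeline is implied.

```python
def to_bit_tokens(data: bytes) -> list[int]:
    """Flatten arbitrary bytes into a 1-D sequence of 0/1 tokens."""
    return [(byte >> i) & 1 for byte in data for i in range(7, -1, -1)]

def from_bit_tokens(bits: list[int]) -> bytes:
    """Invert the encoding: the serialization loses nothing."""
    return bytes(
        sum(bit << (7 - i) for i, bit in enumerate(bits[j:j + 8]))
        for j in range(0, len(bits), 8)
    )

# Any payload works, e.g. three 8-bit pixel values from an image:
payload = bytes([255, 0, 128])
tokens = to_bit_tokens(payload)   # [1,1,1,1,1,1,1,1, 0,0,0,0,0,0,0,0, 1,0,...]
assert from_bit_tokens(tokens) == payload
```

Whether a model trained on such raw bit streams learns anything useful is a separate question, but the encoding step itself is universal.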