Guys you realize that you can go to ChatGPT right now and it can generate an actual picture of the night sky because it has seen thousands of pictures and drawings of the actual night sky right?

Your logic is flawed because your knowledge is outdated. LLMs are encoding visual data, not just “language” data.

▲

heyjamesknight 14 hours ago | parent [-]

You misunderstand how the multimodal piece works. The fundamental unit of encoding here is still semantic. Not the same in your mind: you don’t need to know the word for sunset to experience the sunset.

	▲	ninetyninenine 8 hours ago \| parent [-]
		No you misunderstand the ground truth reality. The LLM doesn’t need words as input. It can output pictures from pictures. Semantic words don’t have to be part of the equation at all. Also you have to note that serialized one dimensional string encodings are universal. Anything on the face of the earth and the universe itself can be encoded into a sting of just two characters: one and zero. That’s means anything can be translated to a linear series of symbols and the LLM can be trained on it. The LLM can be trained on anything.