> Any thinking that happens with words is fundamentally no different from what LLMs do.

This is a wildly simplified and naive claim. "Thinking with words" happens inside a brain, not inside a silicon circuit with artificial neurons bolted in place. The brain is plastic; it is never the same from one moment to the next. It does not require structured input, labeled data, or predefined objectives in order to learn "thinking with words." It learns continuously and without supervision from chaotic sensory input. Its complexity and efficiency are orders of magnitude beyond those of LLM inference; current models barely scratch the surface.

> Do you have a concept of one-ness, or two-ness, beyond symbolic assignment?

Obviously we do. The human brain's idea of "one-ness" or "two-ness" is grounded in sensory experience — seeing one object, then two, and abstracting the difference. That grounding gives meaning to the symbol, something LLMs don't have.

▲

gkbrk 2 days ago | parent | next [-]

LLMs are increasingly trained on images for multi-modal learning, so they too would have seen one object, then two.

▲

gloosx 2 days ago | parent [-]

They never saw any kind of object; they only saw labeled groups of pixels, the basic units of a digital image, each representing a single point of color on a screen or in a digital file. An object is a material thing that can be seen and touched. Pixels are not objects.

▲

gkbrk 2 days ago | parent | next [-]

Okay, the goalpost has instantly moved from "seeing" to "seeing and touching". Once you feed in touch sensor data, where are you going to move the goalpost next?

Models see when photons hit camera sensors; you see when photons hit your retina. Both are some kind of sight.

▲

gloosx a day ago | parent [-]

The difference between photons hitting a camera sensor and photons hitting the retina is immense. With a camera sensor, the process ends in data: voltages in an array of photodiodes get quantized into digital values. There is no subject to whom the image appears. The sensor records, but it does not see.

When photons hit the retina, a comparable transduction happens (photochemical rather than photoelectric), but the signal does not stop at measurement. It flows through a living system that integrates it with memory, emotion, context, and self-awareness. The brain does not just register and store the light; it constructs an experience of seeing, a subjective phenomenon: qualia.

Once models start continuously learning from subjective visual experience, hit me up, and I'll tell you the models "see objects" now. Until a raw, unlabeled photovoltaic stream of information about the world around them can actually make a model learn anything, they are not even close to "seeing".

▲

madaxe_again 2 days ago | parent | prev [-]

My friend, you are blundering into metaphysics here: ceci n’est pas une pipe, the map is not the territory, and all that.

We are no more in touch with physical reality than an LLM, unless you are in the habit of pressing your brain against things. Everything is interpreted through a symbolic map.

▲

gloosx a day ago | parent [-]

When photons strike your retina, they are literally striking brain tissue that has been pushed outward into the skull's front window. The eyes are literally part of the brain, so yes, we are pressing it against things to "see" them.

▲

madaxe_again 2 days ago | parent | prev [-]

The instantiation of models in humans is not unsupervised, and language, for instance, absolutely requires labelled data and structured input. The predefined objective is “expand”.

See also: feral children.

▲

gloosx a day ago | parent [-]

Children are not shown pairs like "dog": [object of class Canine]. They infer meaning from noisy, ambiguous sensory streams. The labels are not explicit; they are discovered through correlation, context, and feedback. Caregivers sometimes point and name things, but that is a tiny and inconsistent fraction of linguistic input, and children generalize far beyond it.

Real linguistic input to a child is incomplete, fragmented, error-filled, and dependent on context. It is full of interruptions, mispronunciations, and slang. The brain extracts structure from that chaos. Calling that "structured input" confuses the output (the inherent structure of language) with the raw input (noisy speech and gestures).

The brain has drives: social bonding, curiosity, pattern-seeking. But it doesn't have a single optimisation target like "expand." Its objectives are not hardcoded loss functions; they are emergent and changing.

You're right that a lack of linguistic input prevents full language development, but that is not evidence of supervised learning. It just shows that exposure to some language stream is needed to trigger the innate capacity.

Both the complexity and the efficiency of human learning are on another level. Transformers are child's play by comparison. They are not going to gain consciousness, and no AGI will happen in the foreseeable future. It is all just marketing crap, and that is becoming more and more obvious as the dust settles.