Architectures have evolved significantly since then. DeepSeek v4 =/= GPT-3. Even then, a great deal of complexity lies in everything surrounding the architectures, e.g. how you implement them performantly on modern accelerators, how you distribute the model across a set of accelerators, how you post-train, etc. And pre-training itself is a dark art. If you legitimately think that frontier labs are doing something equivalent to whatever you wrote on your whiteboard, you’re clueless.

jumploops 5 hours ago

Those are all just optimizations.

We still don’t really know why they work, we just know how to build them.

trollbridge 4 hours ago

We don't really know why language works with humans, either. If you raise a baby from birth, you kind of observe how it is learning language, but the process is also rather mysterious. My eldest son's first "word" was actually an imitation of a cow mooing, followed by the motor noise of a tractor or truck, and then a meow. (His first complete sentence was "King Graham fell"...)

My next child took a completely different path to language, including skipping all the non-verbal imitations.

And then at some point, you can suddenly communicate two-way with them when you couldn't before, and after that, they can engage in reasoning.

jumploops 4 hours ago

Completely agree!

It’s interesting to me how similar attempting to understand LLMs is to neuroscience.

“When we turn this bit off, this other thing happens… if we change these weights the Eiffel Tower is now in Rome”

We’re basically just probing around and trying to reverse engineer an emergent system.

To your point, this system may be quite different from model to model (human to human) although some similarities likely occur.

The comment I was responding to tried to belittle the OP’s understanding of transformers by mentioning that running an LLM at scale is much harder than the simple whiteboard diagram.

My point was simply that we don’t know why they work, and all the extra optimizations aren’t the “thing” that makes it emergent.

Simply scaling the “GPT” is good enough to see it, so the OP’s awe should stand.

(On a side note, what other architectures can we scale to find similar emergent behavior?)

ai_slop_hater 3 hours ago

Human brain capabilities are truly amazing. Imagine if people didn’t treat their children as if they were stupid and didn’t constantly lie to them because, well, kids are stupid, right, they wouldn’t understand. What heights could be reached?

baq 3 hours ago

We don’t treat children like they’re stupid, we treat children like they’re children. A stupid adult is treated very differently than any child.

Adults are expected to have their world models approximately correct in terms of physical environment so they won’t accidentally kill themselves by falling off a cliff; then there are the social norms which adults are expected to conform to so everyone is kinda predictable to everyone else so adults don’t kill each other too often over food or mates. Understanding of neither is expected from children.

ai_slop_hater 3 hours ago

You may have been raised properly, since you don’t get what I mean. I really envy kids with “Chinese parents” who had them learn math early on instead of some bullshit like the idea that if you put your tooth under your pillow, a tooth fairy will come.

mejutoco 2 hours ago

I think those two are orthogonal. Math still works with Santa or the tooth fairy.

ai_slop_hater 2 hours ago

Maybe math works but critical thinking doesn’t. There are people who have lived for many decades without ever questioning insane b.s. they were taught as kids.

beezlewax 2 hours ago

It is possible to have learned both things, you know.

pmg101 3 hours ago

Because god forbid that childhood, the one time in your life when you don't have any responsibilities, should be fun.

ai_slop_hater 2 hours ago

Waste 22 years of life without learning anything and then slave away at a 9-5 job you hate. Brilliant strategy. At least you had “fun”. Then blame billionaires or something.

otabdeveloper4 3 hours ago

We do know how they work. They predict the next statistically most likely token.

The "bitter lesson" is that fake-it-till-you-make-it is a valid way of doing knowledge work.

(Or not make it, then people will just claim you're holding the LLM wrong and it's not the AI's fault.)

throw310822 2 hours ago

> statistically most likely token.

Statistically most likely in what context, given which preconditions? Each prompt sequence is unique, so the probability of any token following it is unknown.
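
To make the "in what context" question concrete, here is a minimal Python sketch, assuming a toy vocabulary and made-up embeddings. `VOCAB`, `EMBED`, `next_token_distribution` and `greedy_next` are all hypothetical names for illustration, not any real model's API. What it tries to show: the distribution is computed as a function of the whole context, so a prompt that has never been seen before still gets a well-defined probability for every token. That number is the model's own estimate, not an observed frequency.

```python
import math

# Hypothetical toy vocabulary and 2-d embeddings, invented for this sketch.
VOCAB = ["the", "cat", "sat", "mat"]
EMBED = {
    "the": (1.0, 0.0),
    "cat": (0.0, 1.0),
    "sat": (1.0, 1.0),
    "mat": (-1.0, 0.5),
}


def softmax(logits):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_distribution(context):
    """Stand-in for a trained network: p(token | context) for every token.

    The context is summarized as the mean of its token embeddings, and each
    candidate token is scored by a dot product with that summary.
    """
    dim = 2
    mean = [sum(EMBED[t][d] for t in context) / len(context) for d in range(dim)]
    logits = [sum(m * e for m, e in zip(mean, EMBED[tok])) for tok in VOCAB]
    return dict(zip(VOCAB, softmax(logits)))


def greedy_next(context):
    """Pick the single most likely next token under the toy model."""
    dist = next_token_distribution(context)
    return max(dist, key=dist.get)


if __name__ == "__main__":
    for ctx in (["the", "cat"], ["mat", "mat", "the"]):
        print(ctx, next_token_distribution(ctx), greedy_next(ctx))
```

A real LLM replaces the averaging and dot product with a deep network and usually samples from the distribution rather than always taking the top token, but the shape of the computation (context in, distribution over the vocabulary out) is the same.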

slopinthebag 4 hours ago

Hm, I wonder if it's more that we're shocked such a simple thing (relatively speaking) can work so well.