lossolo 2 days ago

> Since then we've had "sparsely-gated MoE", RLHF, BERT, "Scaling Laws", Dall-E, LoRA, CoT, AlphaFold 2, "Parameter-Efficient Fine-Tuning", and DeepSeek's training cost breakthrough.

OK, I will bite.

So "Sparsely-gated MoE" isn’t some new intelligence, it's a sharding trick. You trade parameter count for FLOPs/latency with a router. And MoE predates transformers anyway.

RLHF is packaging: supervised finetuning on instructions, learning a reward model, then nudging the policy toward it. That's a training-objective swap plus preference data. It's useful, but not a breakthrough.
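The reward-model half of that is literally a pairwise logistic loss; a toy numpy version (names are mine):

    import numpy as np

    def reward_model_loss(r_chosen, r_rejected):
        # Bradley-Terry preference loss used to fit the reward model:
        # maximize log sigmoid(r_chosen - r_rejected) over human preference pairs.
        z = r_chosen - r_rejected
        return -np.mean(np.log(1.0 / (1.0 + np.exp(-z))))

    # The "nudge the policy" step then maximizes, per prompt x and sample y:
    #   reward(x, y) - beta * KL(policy(y|x) || sft_policy(y|x))
    # i.e. an objective swap on top of the same pretrained transformer.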

CoT is a prompting hack that forces the same model to externalize intermediate tokens. The capability was already there; you're just sampling a longer trajectory. It's UX for sampling.
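The entire "technique" fits in the prompt string; the example wording below is mine:

    direct_prompt = "Q: A train leaves at 3pm and the trip takes 2.5 hours. When does it arrive?\nA:"

    cot_prompt = (
        "Q: A train leaves at 3pm and the trip takes 2.5 hours. When does it arrive?\n"
        "A: Let's think step by step."  # same weights, just a longer sampled trajectory
    )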

Scaling laws are an empirical fit telling you to "buy more compute and data." That's a budgeting guideline, not new math or architecture. https://www.reddit.com/r/ProgrammerHumor/comments/8c1i45/sta...
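The "fit" itself is a two-term power law; here's the Chinchilla-style functional form with placeholder coefficients (illustrative, not the published values):

    import numpy as np

    def predicted_loss(N, D, E=1.7, A=400.0, B=400.0, alpha=0.3, beta=0.3):
        # Chinchilla-style scaling law: loss as a function of params N and tokens D.
        # Coefficients here are made-up placeholders, not a real fit.
        return E + A / N**alpha + B / D**beta

    # The "budgeting guideline": for a compute budget C ~ 6*N*D, sweep N,
    # set D = C / (6*N), and pick the N that minimizes predicted_loss(N, D).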

LoRA is linear algebra 101: low-rank adapters to cut training cost and avoid touching the full weights. The base capability still comes from the giant pretrained transformer.
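i.e. the whole trick, sketched in numpy (my own shapes and scaling convention, roughly following the W + BA idea):

    import numpy as np

    def lora_forward(x, W, A, B, alpha=16.0):
        # W: frozen pretrained weight, (d_in, d_out).
        # A: (d_in, r), B: (r, d_out), with r << d_in -- only A and B are trained.
        # Effective weight is W + (alpha / r) * A @ B, never materialized.
        r = A.shape[1]
        return x @ W + (alpha / r) * ((x @ A) @ B)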

AlphaFold 2's magic is mostly attention + A LOT of domain data/priors (MSAs, structures, evolutionary signal). Again, attention core + data engineering.

"DeepSeek’s cost breakthrough" is systems engineering.

Agentic software dev/MCP is orchestration: middleware and protocols. It helps you use the model; it doesn't make the model smarter.

Video generation? Diffusion with temporal conditioning and better consistency losses. It’s DALL-E style tech stretched across time with tons of data curation and filtering.

Most headline "wins" are compiler and kernel wins: FlashAttention, paged KV-cache, speculative decoding, distillation, quantization (8/4 bit), ZeRO/FSDP/TP/PP... These only move the cost curve, not the intelligence.
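For a sense of how mundane these are: per-tensor symmetric int8 quantization, the toy version (real kernels do per-channel/group scaling and fused dequant):

    import numpy as np

    def quantize_int8(W):
        # Symmetric per-tensor int8 quantization: same weights, ~4x less memory,
        # approximately the same outputs. Moves the cost curve, not the model.
        scale = max(np.abs(W).max() / 127.0, 1e-12)
        q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize_int8(q, scale):
        return q.astype(np.float32) * scale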

The biggest single driver of the last few years has been the data: dedup, document quality scores, aggressive filtering, mixture balancing (web/code/math), synthetic bootstrapping, eval-driven rewrites, etc. You can swap out half a dozen training "tricks" and get similar results if your data mix and scale are right.
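A cartoon of that pipeline, with every threshold and field name invented for illustration:

    import hashlib
    import random

    def dedup(docs):
        # Exact dedup by content hash; real pipelines use MinHash / fuzzy matching.
        seen, out = set(), []
        for d in docs:
            h = hashlib.sha256(d["text"].encode()).hexdigest()
            if h not in seen:
                seen.add(h)
                out.append(d)
        return out

    def build_mix(pools, weights, n_docs, quality_cutoff=0.5, seed=0):
        # Toy mixture balancing: sample web/code/math pools at fixed ratios,
        # dropping anything below a quality score.
        rng = random.Random(seed)
        mix = []
        for name, w in weights.items():
            kept = [d for d in dedup(pools[name]) if d["quality"] >= quality_cutoff]
            mix += rng.sample(kept, min(int(w * n_docs), len(kept)))
        rng.shuffle(mix)
        return mix

    # e.g. build_mix(pools, {"web": 0.6, "code": 0.25, "math": 0.15}, n_docs=1_000_000)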

For me, a real post-attention "breakthrough" would be something like: training that learns abstractions with sample efficiency far beyond scaling laws, reliable formal reasoning, or causal/world-model learning that transfers out of distribution. None of the things you listed do that.

Almost everything since attention is optimization, ops, and data curation. I mean, give me the exact pretrain mix, filtering heuristics, and finetuning datasets for Claude/GPT-5, and without peeking at the secret-sauce architecture I could get close just by matching tokens, quality filters, and training schedule. The "breakthroughs" are mostly better ways to spend compute and clean data, not new ways to think.

kianN 2 days ago | parent | next [-]

This is a great summary of why, despite so much progress and so many tricks being discovered, so little headway is made on the core limitations of LLMs.

kragen 2 days ago | parent | prev | next [-]

I don't disagree with any of this, though it sounds like you know more about it than I do.

BobbyTables2 2 days ago | parent | prev [-]

Indeed. I’m shocked that we train “AI” pretty much as one would build a fancy auto-complete.

Not necessarily a bad approach, but it feels like something is missing for it to be "intelligent".

Should really be called “artificial knowledge” instead.

jofla_net 2 days ago | parent | next [-]

This and the parent are both getting at what I see as the main obstacle: we as a species don't know, in its entirety, how a human mind thinks (and it varies among people), so trying to "model" and reproduce it is reduced to a game of black-boxing. We black-box the mind in terms of what situations it has been seen in and how it performed; the millions of correlative inputs/outputs are the training data. Yet since we don't know the fullness of the interior and can only see its outputs, it becomes something of a Plato's cave situation. We believe it 'thinks' a certain way, but we cannot empirically say it performed a task that way, so unlike most other engineering problems, we are grasping at straws while trying to reconstruct it. This doesn't mean a human mind's inner workings can never be 100% reproduced, just not until we understand it further.

tempodox 2 days ago | parent [-]

And there is another important difference: Our environments have oodles of details that inform us, while LLM training data is just “everything humans have ever written”. Those are completely different things. And LLMs have no concept of facts, only statements about facts in their training data that may or may not be true.

kragen 2 days ago | parent | prev [-]

"What do you mean, they talk?"

"They talk by flapping their meat at each other!"