How LLMs work (0xkato.xyz)
294 points by 0xkato 3 days ago | 83 comments
malwrar 4 hours ago | parent | next [-]

Back when ChatGPT came out, I was so shocked by how _good_ it was for an “AI” product that I simply had to know how it worked. Over the next month I ended up drawing out a block diagram on a whiteboard I have in my office, with the math involved next to each step. I’d puzzle over each step along the way, and the triumph of completing the drawing came with a sense of deep understanding. I kept that drawing up for many months after, and would often gaze at it in wonder during meetings and idle moments.

This is to say: the autoregressive decoder-only transformer LLM architecture as pioneered by OpenAI is wildly simple for how revolutionary its results are. I was reading about non-learned classical SLAM systems (which use video + handcrafted math to produce 3D mappings of physical spaces while also locating the camera in those spaces) at the time, and comparatively speaking I’d say the math is about as complicated as ONE of the components in those complex formulations. The only reason frontier LLMs need 6-figure computers to run is that the model designers made the middle bit in those models REALLY BIG, dimensionally speaking. They just took the steam engine, made a few gargantuan versions of it, and are selling them as the ultimate source of power.

This was OpenAI’s entire breakthrough. Making this particular model architecture larger leads to emergent capabilities like being able to pick the best ending to a story/set of instructions or answer questions about broad factual knowledge. Meanwhile, I’ve been watching these AI companies attempt, successfully, to sell this capability as some sort of robot consciousness hand-crafted by supergeniuses. The fact that they are getting away with it is almost as shocking to me as the discovery itself.

Gmolomo 8 minutes ago | parent | prev | next [-]

Sooooo just because you are able to understand it, it's not worth anything?

It doesn't have any impact?

Ah wait it does. Mh weird.

Why are you not creating a startup and getting rich?

antirez 31 minutes ago | parent | prev | next [-]

There is a different way to look at this: the Transformer is actually a minimal complication of what the base model is. In theory the neural network could be just a huge FFN, which is anyway the part of the Transformer that does the heavy lifting. But this would be impossible to train both numerically and computationally, so the Transformer encodes enough priors for it to work: the causal attention, and the math tricks like the residuals and so forth. The bottom line of all this is that the Transformer works because of the incredible semantic power of simple/huge FFNs.
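
A rough sketch of the pieces named above (causal attention, residual connections, and the FFN), for readers who want to see how little machinery is involved. It uses plain Python lists and a single attention head, and it leaves out layer normalization. The names (`block`, `causal_attention`, `Wq`, `W1`, and so on) are made up for illustration and do not come from any library.

```python
import math

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(xs, Wq, Wk, Wv):
    """Single-head attention where token i only sees tokens 0..i."""
    qs = [matvec(Wq, x) for x in xs]
    ks = [matvec(Wk, x) for x in xs]
    vs = [matvec(Wv, x) for x in xs]
    scale = math.sqrt(len(qs[0]))
    out = []
    for i, q in enumerate(qs):
        # Scores against the current and earlier tokens only (the causal prior).
        scores = [sum(a * b for a, b in zip(q, ks[j])) / scale for j in range(i + 1)]
        weights = softmax(scores)
        out.append([sum(w * vs[j][d] for j, w in enumerate(weights))
                    for d in range(len(vs[0]))])
    return out

def ffn(x, W1, W2):
    """Two-layer feed-forward network with a ReLU in the middle."""
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)

def block(xs, Wq, Wk, Wv, W1, W2):
    """One simplified Transformer block: x + attention(x), then x + ffn(x)."""
    attended = causal_attention(xs, Wq, Wk, Wv)
    xs = [[a + b for a, b in zip(x, h)] for x, h in zip(xs, attended)]   # residual 1
    return [[a + b for a, b in zip(x, ffn(x, W1, W2))] for x in xs]      # residual 2
```

Because of the causal loop bound (`range(i + 1)`), changing a later token never changes the output for an earlier one.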

jfim 4 hours ago | parent | prev | next [-]

Indeed. It's pretty interesting to realize after implementing GPT-2 that the frontier models are scaled up versions of that, with various tweaks to improve performance, model-wise.

The secret sauce though is all the datasets, RL training, knowledge of what works from doing all kinds of ablation experiments, and a massive compute moat.

gobdovan 2 hours ago | parent | next [-]

The secret sauce is also having the necessary 'creativity' to not get ceased and desisted into oblivion and jail from all the copyrighted material you trained your model on. Btw, not making a moral judgement, [0] shows Michael and Dalton from YC discussing why Ilya Sutskever had to leave Google to pursue what's now ChatGPT.

[0] https://youtu.be/E8pvgN1j-Ck?t=748

achrono 3 hours ago | parent | prev [-]

How do we know that today's frontier models are merely scaled up versions of that? Genuine question, since the labs have narrowed what they share over the years to now almost nothing, in terms of how the model was trained and how it works under the hood.

gobdovan 2 hours ago | parent | next [-]

DeepSeek research:

- V3 https://arxiv.org/abs/2412.19437

- V2 https://arxiv.org/abs/2405.04434

- R1 https://arxiv.org/abs/2501.12948 (RL applied to ML models was well-known beforehand, but they show it in the open, at scale, on big models)

Then, there's the incentive analysis. If you can see that these models empirically get better with scale, why would you swap the main architecture? Those events will be pretty rare. I'm not saying there's no one cooking a new architecture, just that it is a pretty rare event. And it would have to come from some researchers that would be happy to not publish their findings, which is not really what a sizable portion of elite researchers (obviously not all) are incentivized to do.

Of course, it's a bit of a verbal compression to claim simply 'scaled up'. They are recognisably scaled-up transformers, and most new models come with a few tricks, but we're at the point where those usually are not an architectural rewrite; they are added to solve an explicit problem, like hallucination, not for big new capability gains.

matusp 2 hours ago | parent | prev | next [-]

There are thousands of people working in top level labs. Somebody would leak it

ai_slop_hater 2 hours ago | parent | prev [-]

No, they are clearly not just scaled up versions of GPT-2; there are different LLM architectures like mixture of experts etc. that appeared relatively recently. I am not an expert though, far from it.

otabdeveloper4 2 hours ago | parent [-]

MoE and such are basically performance enhancements, they don't make the model smarter.

yababa_y an hour ago | parent [-]

Separately trained experts can surpass performance in their activated regime, and that DOES result in a smarter model. The Claude system cards talk about this, and e.g. there is https://openreview.net/forum?id=iydmH9boLb to read...
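
For anyone following this sub-thread without knowing what mixture-of-experts routing looks like, here is a minimal sketch of top-k gating in the spirit of the sparsely-gated MoE layer: a router scores every expert, only the k best are run, and their outputs are mixed with softmax weights. The "experts" below are toy scalar functions and every name is hypothetical; real experts are FFNs and the router is learned.

```python
import math

def top_k_route(router_logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Softmax over the selected experts only, so the weights sum to 1.
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

# Toy stand-ins for expert networks (assumption: real experts are FFNs).
experts = [
    lambda x: 2 * x,
    lambda x: x + 1,
    lambda x: -x,
    lambda x: x * x,
]

def moe_forward(x, router_logits, k=2):
    """Run only the selected experts and mix their outputs."""
    weights = top_k_route(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())
```

Only k experts run per token, which is where the compute saving comes from. Whether the resulting specialization also makes the model "smarter" is the point under debate above, and this sketch does not settle it.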

ekunazanu 31 minutes ago | parent | prev | next [-]

> This was openai’s entire breakthrough. Making this particular model architecture larger leads to emergent capabilities

Basically, the bitter lesson: https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson...

GardenLetter27 30 minutes ago | parent | prev | next [-]

It's not just the architecture but also the data - the decoder-only approach lets you train in parallel over blocks of text (no RNN serial waiting), which allows you to train on much, much more data.
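
To illustrate the parallel-training point: a single block of text yields one (context, next token) training example per position, and a causal mask lets the model compute all of them in one pass instead of stepping through the sequence. A small sketch with made-up helper names:

```python
def causal_mask(n):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def training_pairs(tokens):
    """Every prefix of the block predicts the token that follows it."""
    inputs, targets = tokens[:-1], tokens[1:]
    return [(inputs[: i + 1], targets[i]) for i in range(len(inputs))]

pairs = training_pairs(["the", "cat", "sat", "down"])
mask = causal_mask(3)
```

Here `pairs` holds three examples taken from one four-token block. In an actual decoder-only model the prefixes are not materialized separately; the mask gives the same effect inside a single forward pass.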

wuschel 2 hours ago | parent | prev | next [-]

Could you perhaps cite the core papers for LLMs beyond „Attention is all you need“?

sigmoid10 2 hours ago | parent | next [-]

"Attention is all you need" is actually a bad paper if you want to learn about autoregressive LLMs specifically, because it describes a more complicated encoder-decoder architecture while modern LLMs are decoder only. So it's an unnecessarily hard way to get into the subject. "Language Models are Unsupervised Multitask Learners" is probably what you are looking for (aka the GPT-2 paper). This was the first time LLMs really showed what is possible, i.e. they can learn to generalize very well from unstructured data. So no more human labelling necessary, which until then was the primary bottleneck in ML. The paper also lists several key ingredients beyond transformers that are mostly still in place today. This also highlights that there was more to it than just "scaling the transformer algorithm" like many people claim. Most developments since then were about improving training data, until "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" drastically changed the architecture landscape again. Later big developments like thinking/reasoning/chain of thought/inference time compute (whatever you want to call it nowadays) are actually all about training again. They work using the exact same architecture.

blackbear_ 2 hours ago | parent | prev | next [-]

The GPT3 paper is a good starting point

Language Models are Few-Shot Learners https://arxiv.org/abs/2005.14165

I also enjoyed the papers for DeepSeek and GLM for an overview of all the tricks you need to make these things work

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models https://arxiv.org/abs/2512.02556

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models https://arxiv.org/abs/2508.06471

sharma-arjun an hour ago | parent | prev [-]

Not a core paper, but I found Formal Algorithms for Transformers [1] (a Google paper from 2022) to have a great pedagogical style.

[1] https://arxiv.org/abs/2207.09238

barrenko an hour ago | parent [-]

I'll add in here https://web.stanford.edu/~jurafsky/slp3/, "Speech and Language Processing", with chapters that deal specifically with LLMs and transformers.

10GBps 4 hours ago | parent | prev | next [-]

Yep. It's nearly identical to the neural nets we were using in the 90s. Back then even a supercomputer wasn't big enough or fast enough to do what we do today.

I have to wonder though. Is this all a human brain is? A similar thing to an LLM just scaled exponentially larger. I mean a brain is not just neurons with simple connections to each other. The neurons, axons, dendrites, <insert_unexplained_thing>, etc in a brain are all holding and processing information in different ways and doing it nearly 100% in parallel. That's a really big model.

The biological discoveries show how complex a biological brain actually is. Even the tiny brains in a bee or spider are able to solve puzzles and use tools. That's crazy.

ctolsen 2 hours ago | parent | next [-]

No, it’s definitely not what a human brain is. That makes very little sense. The ways we interact with language (and thus conceptual memory) are completely and fundamentally different.

rfv6723 an hour ago | parent [-]

Is it different though?

If we look beyond written languages, which are late inventions of human civilization, oral languages are continuous and built with blocks, not words.

The Chomskyan school misled the entire field of linguistics for decades by ignoring spoken languages.

uoaei 19 minutes ago | parent [-]

It is different, but there may be some universal principles that are relevant more abstractly among both cases. Of particular interest is the empirical notion that statistical models of a certain form will always tend to "average out noise" and "learn meaningful patterns" up to the capacity that those models have for representing said patterns. A parallel notion to this is the hypothesis dubbed "thermodynamic origins of life". The universal principle binding these two seemingly disparate topics is one that seems to underlie any sense of "learning" in physical systems: that semantics of those systems depend on their representational power, and the semantics they do come to represent are the results of adding up many pushes in one "direction" (phase space / state space / etc.) encoding a pattern, and adding up many random noise jiggles will cancel out but give you a first-order sense of variance of those semantic features as expressed by the environment.

As this description is so overly abstract, an exercise for the reader is to try to work through an explanation of how, say, a river delta comes to "learn" about its environment by "reacting" to the influences at its borders, and how it "encodes" whatever it is that it learns in the substrate that it inhabits.

spacebacon 2 hours ago | parent | prev | next [-]

LLMs are semiotic infrastructure. You won’t find a better analogy. The cognitive frame won’t hold.

bonoboTP 3 hours ago | parent | prev | next [-]

Attention layers were not used in the 90s.

otabdeveloper4 2 hours ago | parent | prev | next [-]

> I mean a brain is not just neurons with simple connections to each other.

No, it's not. There are many animals that have extremely complex and even learned behaviour that have literally zero neurons.

Clearly "neurons" is an oversimplified just-so story, not a scientific theory.

formerly_proven 31 minutes ago | parent [-]

Do you consider fungi animals or do you perhaps mean animals that don't have a brain/CNS?

foxes 3 hours ago | parent | prev [-]

Probably better to not simply reduce it by just saying X is Y then if it has all that extra complexity and capacity.

darksim905 4 hours ago | parent | prev | next [-]

For anyone who is curious about the first paragraph here, this is actually a great video overview of how LLMs work and of the tokenization part.

Tangentially related: This part always seemed fuzzy to me, especially when dealing with data scientists and how they talk about how 'ML' looks at problems. I had this issue when working at a SIEM vendor where they kept going on about use case development having to be designed a certain way to catch things. It was all very frustrating.

pkoird 4 hours ago | parent | prev | next [-]

aka "the bitter lesson"

faurroar 4 hours ago | parent | prev [-]

Architectures have evolved significantly since then. DeepSeek v4 =/= GPT-3. Even then, a great deal of complexity lies in everything surrounding the architectures e.g. how do you implement them performantly on modern accelerators, how do you distribute the model across a set of accelerators, how do you post-train, etc. And pre-training itself is a dark art. If you legitimately think that frontier labs are doing something equivalent to whatever you wrote on your whiteboard, you’re clueless.

jumploops 3 hours ago | parent [-]

Those are all just optimizations.

We still don’t really know why they work, we just know how to build them.

trollbridge 3 hours ago | parent | next [-]

We don't really know why language works with humans, either. If you raise a baby from birth, you kind of observe how it is learning language, but the process is also rather mysterious. My eldest son's first word was to actually imitate a cow mooing, and then after that to imitate a motor noise of a tractor or truck. And then after that a meow. (His first complete sentence was "King Graham fell"...)

My next child took a completely different path to language, including skipping all the non-verbal imitations.

And then at some point, you just suddenly can two-way communicate with them when you couldn't before, and then after that, they can engage in reasoning.

jumploops 2 hours ago | parent | next [-]

Completely agree!

It’s interesting to me how similar attempting to understand LLMs is to neuroscience.

“When we turn this bit off, this other thing happens… if we change these weights the Eiffel Tower is now in Rome”

We’re basically just probing around and trying to reverse engineer an emergent system.

To your point, this system may be quite different from model to model (human to human) although some similarities likely occur.

The comment I was responding to tried to belittle the OP’s understanding of transformers, by mentioning that running an LLM at scale is much harder than the simple white board diagram.

My point was simply that we don’t know why they work, and all the extra optimizations aren’t the “thing” that makes it emergent.

Simply scaling the “GPT” is good enough to see it, so the OP’s awe should stand.

(On a side note, what other architectures can we scale to find similar emergent behavior?)

ai_slop_hater 2 hours ago | parent | prev [-]

Human brain capabilities are truly amazing, imagine if people didn’t treat their children as if they are stupid and didn’t constantly lie to them, because kids are stupid right, they wouldn’t understand. What heights could be reached.

baq 2 hours ago | parent | next [-]

We don’t treat children like they’re stupid, we treat children like they’re children. A stupid adult is treated very differently than any child.

Adults are expected to have their world models approximately correct in terms of physical environment so they won’t accidentally kill themselves by falling off a cliff; then there are the social norms which adults are expected to conform to so everyone is kinda predictable to everyone else so adults don’t kill each other too often over food or mates. Understanding of neither is expected from children.

ai_slop_hater 2 hours ago | parent [-]

You may have been raised properly since you don’t get what I mean. I really envy kids with “Chinese parents” that had them learn math early on and not some bullshit like that if you put your tooth under your pillow, then a tooth fairy will come.

mejutoco an hour ago | parent | next [-]

I think those 2 are orthogonal. Math still works with Santa or the tooth fairy.

ai_slop_hater an hour ago | parent [-]

Maybe math works but critical thinking doesn’t. There are people who have lived for many decades without ever questioning insane b.s. they were taught as kids.

beezlewax an hour ago | parent | prev [-]

It is possible to have learned both things, you know.

pmg101 2 hours ago | parent | prev [-]

Because god forbid that childhood, the one time in your life when you don't have any responsibilities, should be fun.

ai_slop_hater an hour ago | parent [-]

Waste 22 years of life without learning anything and then slave away at a 9-5 job you hate. Brilliant strategy. At least you had “fun”. Then blame billionaires or something.

otabdeveloper4 2 hours ago | parent | prev | next [-]

We do know how they work. They predict the next statistically most likely token.

The "bitter lesson" is that fake-it-till-you-make-it is a valid way of doing knowledge work.

(Or not make it, then people will just claim you're holding the LLM wrong and it's not the AI's fault.)

throw310822 an hour ago | parent [-]

> statistically most likely token.

Statistically most likely in what context, given which preconditions? Because each prompt sequence is unique so the probability of any token following it is unknown.
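
One way to make this concrete: the probability is not looked up from counts of how often a token followed this exact prompt. The network computes a score (logit) for every vocabulary entry given the whole context, and a softmax turns those scores into a distribution. The vocabulary and logits below are invented for illustration; in a real model the logits come out of the final layer.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw scores into probabilities that sum to 1."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores a model might emit for the context "the cat sat on the".
vocab = ["cat", "dog", "mat", "the"]
logits = [1.0, 0.5, 3.0, -1.0]

probs = softmax(logits)

# Greedy decoding takes the single most likely token ...
greedy = vocab[probs.index(max(probs))]

# ... while sampling draws from the whole distribution.
rng = random.Random(0)
sampled = rng.choices(vocab, weights=probs, k=1)[0]
```

So "most likely" means most likely under the model's learned conditional distribution, which is defined even for prompts that never appeared in the training data.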

slopinthebag 3 hours ago | parent | prev [-]

Hm, I wonder if it's more that we're shocked such a simple thing (relatively speaking) can work so well.

helloplanets 6 minutes ago | parent | prev | next [-]

The part about positional encoding is not correct.

> The intuition: instead of adding position info to each token’s vector, RoPE rotates the vector by an angle that depends on its position

You can't rotate the token's entire vector (or all three vectors, whatever is being implied is unclear). If you change the Value vector of the token based on its position, you break everything. It's specifically avoided. You rotate token 1's Query vector and token 2's Key vector. The rotation applied to both is based on their position, so dot product can be used to compare their relative difference in position.

Either there should be a correction after the explanation of how the Query, Key and Value vectors work, or positional embedding should just be explained after the Query, Key and Value vectors.

If you've only explained how a token's Value vector works at that point in the article, there's no actual foundation that you could use to explain it.

When the article explains how the Query and Key vectors work only after that, the reader is building on a wrong intuition and it gets confusing. At no point does the article make the crucial point that the Query and Key vectors are the ones that change based on position; they are not static vectors that the dot product is just applied to as is. RoPE is specifically applied just before the comparison happens, so their positional difference can be accounted for.
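
A small numeric sketch of the point, assuming the usual RoPE formulation in which consecutive pairs of dimensions are rotated by an angle proportional to position. The query and key values are arbitrary and `rotate` is an illustrative helper, not a library function. The Value vector is never rotated.

```python
import math

def rotate(vec, pos, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by an angle that grows with pos."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Arbitrary example Query and Key vectors (even length).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same relative offset (query 3 positions after key), different absolute positions.
score_a = dot(rotate(q, 5), rotate(k, 2))
score_b = dot(rotate(q, 105), rotate(k, 102))

# A different relative offset gives a different score.
score_c = dot(rotate(q, 5), rotate(k, 4))
```

`score_a` and `score_b` agree up to floating-point error, because the two rotations combine into one rotation by the position difference. `score_c` differs. That is why only Q and K are rotated: the attention score becomes sensitive to relative position while the Value content stays untouched.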

AltruisticGapHN 5 minutes ago | parent | prev | next [-]

I don't like how most LLM explainer articles and videos say that essentially an LLM "predicts the next word".

I'm a developer but not very good at maths and I still don't understand any of it.

An LLM clearly has some "visual" capacity. You ask Gemini to build something with Canvas and it's able to reason about the shape of things. Like recently I wanted a checkbox that has like a gradient flowing around the edge. It figured out it could use a radial gradient from the center of the checkbox, and overlay that with a small inner div so you only see the edge that looks like the gradient is circling around the checkbox.

How is that "predicting the next word"?

Not saying AI is intelligent or conscious or anything like that, but the algorithm clearly is far more complex than "predicting words".

What I mean is, the LLM is able to represent things in space. That part I don't understand.

I also still don't understand the relationship between the chat-based LLM and the multimodal stuff. I think I read somewhere that when an image is generated it is also tokens?

stalfie 23 minutes ago | parent | prev | next [-]

This article describes how Transformers work, but not really how LLMs work. Explaining the underlying architecture gives you about as much insight into how a modern LLM behaves as a breakdown of neuronal biochemistry and a few pathways does for the brain. Meaning, almost no insight at all.

10GBps 5 hours ago | parent | prev | next [-]

I learned TCP/IP by watching and reading raw packets over packet radio at 1200 baud.

I've noticed the same thing is possible if you watch the output of a slow LLM. Eventually you start to see the machinery. input tokens = output tokens, it's math. I can't exactly predict the tokens generated but I can see how they are formed. It's a lot like chess. You can't see every possible move but the mechanism is understandable.

trollbridge 3 hours ago | parent | next [-]

Comment <-> username synergy.

Maledictus 2 hours ago | parent | prev | next [-]

How would I set this up?

barrenko an hour ago | parent [-]

I'd also recommend watching Karpathy's videos and focusing on the early parts where he specifically deals with tokenization / embeddings generation (which gets really overlooked); he does this in most of his videos.

fragmede 2 hours ago | parent | prev [-]

https://distill.pub/2019/activation-atlas/

I can only imagine what sort of visualizations are going on today inside of the AI labs.

vocram an hour ago | parent | prev | next [-]

Saying an article is of inferior quality just because editing was AI-assisted is like saying a book is lower quality just because it was printed rather than written by hand

lateral_cloud 38 minutes ago | parent | next [-]

AI assisted is a stretch. And that analogy isn't even close to being relevant

Laurel1234 15 minutes ago | parent | prev | next [-]

Rather interesting that clanker slop defenders downplay the clanker aspect and highlight the human by calling it "ai-assisted", which defeats their entire point.

I hope you do some introspection and start consciously recognizing that the human input and the clanker slop is just debasing it.

janalsncm 20 minutes ago | parent | prev | next [-]

Not just that, I think a lot of people are going to waste their time losing the battle (and make no mistake, they will lose) fighting against AI writing without ever asking themselves what makes writing good in the first place.

There’s good AI writing and bad organic writing. But it’s easier to point out a few LLM-isms than to actually identify the problems with text.

bspammer an hour ago | parent | prev [-]

No? One affects the actual text and the other doesn’t.

andai 6 hours ago | parent | prev | next [-]

I couldn't load the article directly due to an SSL issue, so here's the archive link:

https://archive.ph/aWtFG

melvinroest 2 hours ago | parent | prev | next [-]

I thought Karpathy’s microgpt explains how LLMs work.

disgruntledphd2 an hour ago | parent [-]

Microgpt is really good, if you want to understand exactly what happens. I still thought that this article was a good, higher-level complement to that article though.

aabdi 3 hours ago | parent | prev | next [-]

this is hard to read...

it goes all over the place.

i'm not actually sure who your target audience is.

there's too many side tangents.

just like, structure it plz.

1. customer feels bad cuz they don't understand how llms work

2. provide high level abstracted explanation (don't dive into concepts yet)

3. provide breakdown guide of overall set of components.

4. walk through each component. don't side track. no need to explain, ROPE,GQA etc... it just distracts.

i.e. customers don't know how llms work, leading them to feel bad about their own intelligence.

at a high level llms take in words, do some math on them, and then produce words, one by one.

inside llms have these different components. we walk through them step by step.

1. tokenizer

2. embedding

3. attention

4. heads

5. ffn

6. sampling

## tokenizer

barrenko an hour ago | parent [-]

It's just slop.

cubefox 2 hours ago | parent | prev | next [-]

We are living in a crazy science fiction world where on the top of the HN frontpage there is an article on how LLMs work which is likely itself LLM generated, and the only way to tell is its writing style rather than its factual accuracy.

lhd1 5 hours ago | parent | prev | next [-]

I find it difficult to engage with AI generated text. What am I getting here that I couldn't get from a chatbot?

blackoil 4 hours ago | parent | next [-]

Hopefully someone has asked the right questions and removed confusing answers/hallucinations.

dialsMavis 4 hours ago | parent | prev [-]

Is this text generated by AI? I couldn't tell but I'd believe it if it was.

I imagine if resources were spent writing this text then one benefit of using it is not using more resources or the pollution caused from a chatbot.

zemo 3 hours ago | parent | next [-]

Normal people talk and write with some notion of meter, the cadence of communicating where pauses are inserted at places that naturally suit the speaker (and listener) to pause for thought. LLMs don't really do that; they just write a bunch of sentences.

> Researchers have found that some neurons inside the FFN are strongly associated with specific concepts or facts. One neuron might activate strongly on Eiffel-Tower-related text. Another on programming languages. Another on past-tense verbs.

People don't really write like this and they don't really talk like this (and no, people don't necessarily write exactly how they talk because they don't read exactly how they listen; the written word can be backtracked while the heard cannot, and speakers/writers know this, either consciously or unconsciously). A person would probably structure this more like:

> Researchers have found that some neurons inside the FFN are strongly associated with specific concepts or facts. For example, there could be one neuron that activates strongly on Eiffel-Tower-related text, another that activates strongly on programming languages, a third neuron activating on past-tense verbs, and so on.

Usually people wouldn't write "Another on programming languages." as a standalone sentence like that because the periods introduce an unnatural pause like they're giving a TED talk, unless of course they were punctuating that way for effect, but you'd essentially never communicate with that effect full time.

mattnewton 3 hours ago | parent [-]

I don’t disagree with your conclusion that this is likely ai rewritten, but I do find it strange that you say “normal people don’t write like this” when it is mimicking how people write, and using patterns I have seen people write. I think models are at the point where style is not really reliable as an indicator anymore.

AgentMatt 2 hours ago | parent | next [-]

I'm sure there's plenty of writing in the above style to be found on the Internet, and hence in the LLM's training data. I'm also not a fan of this style, and in particular I'd say it's rarely or never found in scientific / technical writing meant to convey understanding rather than sell or hype. So here it's IMO more of a style mismatch.

thin_carapace 2 hours ago | parent | prev [-]

people sure do write like that, in novels. nobody writes scientific articles like novels, because scientific articles don't need to maximally capture audience attention. the purpose of a scientific article is to convey information - this pursuit is not assisted by punchy prose.

rippeltippel 4 hours ago | parent | prev [-]

The voice of several passages resembles ChatGPT very closely.

spacebacon 2 hours ago | parent | prev | next [-]

But how do they “think”? This is the only repo that can tell you that.

https://github.com/space-bacon/SRT

lateral_cloud an hour ago | parent | prev | next [-]

I don't understand how these AI written articles get so many votes.

lionkor an hour ago | parent | prev | next [-]

It sucks that this article is clearly LLM edited, with common phrases like "same shape as", "the intuition: ", and the "tiny explainer" which clearly generalized from a prompt accidentally.

Good article, but when sharing it I will have to preface "yes it's slop, but it's a good explanation".

Absolutely embarrassing that the author didn't catch that these LLM-isms are a (and here I'll use one) bad signal.

In fact, I would go so far as to say that publishing in this style stems from a lack of reading experience and writing experience, which does not bode well for someone pretending to be an expert. I gave this article to someone highly intelligent who doesn't know the first thing about how LLMs work internally, and she immediately called out that it reads like AI text.

janalsncm 26 minutes ago | parent [-]

I don’t think it’s absolutely embarrassing. First of all, the point of the author writing at all is to aid understanding, not produce prose. So from that standpoint, what would be embarrassing would be to include incorrect facts that suggest a fundamental misunderstanding of the topic.

From my read, it is fine. The brief history of LLMs is complicated since every single component has papers introducing enhancements. So it’s easy to ignore them or get bogged down with details.

The author appears to be a security researcher learning about LLMs for the purpose of defending against common attacks. So this piece is that person giving themselves a crash course on the topic. The fact that they cleaned up their notes with an LLM is frankly completely irrelevant.

codeakki 3 hours ago | parent | prev | next [-]

What's the point of this? I'm not here to engage with AI bots.

whateveracct 3 hours ago | parent | prev | next [-]

accidentally quadratic

singpolyma3 6 hours ago | parent | prev [-]

Next do "why LLMs work"

krackers 4 hours ago | parent | next [-]

See Tegmark's "Why does deep and cheap learning work so well?" (well, not so cheap anymore...)

https://www.youtube.com/watch?v=5MdSE-N0bxs is remarkably prescient given that it was written before LLMs

sheeshkebab 5 hours ago | parent | prev | next [-]

considering they work with any architecture/configuration given enough compute, just more or less efficiently - then maybe it's fundamental, in the same sense as why electricity works...

soupspaces 5 hours ago | parent | prev | next [-]

Universal approximation theorem, embeddings, self-attention, gradient descent. And empirically, scaling laws.

skydhash 5 hours ago | parent | prev [-]

Why does linear regression work? Why do computers work? Because it's about math and encoding information. If we can encode words as numbers, then why can't we encode their order as a relation? It's just that neural networks are very apt at finding that relation even if it's noisy.