| ▲ | sigmoid10 3 hours ago | |
"Attention is all you need" is actually a bad paper if you want to learn about autoregressive LLMs specifically, because it describes a more complicated encoder-decoder architecture while modern LLMs are decoder only. So it's an unnecessarily hard way to get into the subject. "Language Models are Unsupervised Multitask Learners" is probably what you are looking for (aka the GPT-2 paper). This was the first time LLMs really showed what is possible, i.e. they can learn to generalize very well from unstructured data. So no more human labelling necessary, which until then was the primary bottleneck in ML. The paper also lists several key ingredients beyond transformers that are mostly still in place today. This also highlights that there was more to it than just "scaling the transformer algorithm" like many people claim. Most developments since then were about improving training data, until "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" drastically changed the architecture landscape again. Later big developments like thinking/reasoning/chain of thought/inference time compute (whatever you want to call it nowadays) are actually all about training again. They work using the exact same architecture. | ||
| ▲ | redox99 an hour ago | parent [-] | |
Chain of Thought was kind of an obvious solution that everybody knew was necessary by the time chatgpt / gpt4 came out. It was just a matter of time that frontier labs actually shipped it. MoE was also pretty straightforward, just a bit surprising how well it worked (that you can get away with just 1/32 active parameters), but most researchers would have come up with it on their own probably. The true ground breaking papers are the first two you mentioned (transformers and gpt2), and InstructGPT was also very surprising that it worked so well. | ||