reasonableklout 4 days ago

I'd love to know more about how OpenAI (or Alec Radford et al.) even decided GPT-1 was worth investing more into. At a glance the output is barely distinguishable from Markov chains. If in 2018 you told me that scaling the algorithm up 100-1000x would lead to computers talking to people/coding/reasoning/beating the IMO I'd tell you to take your meds.

arugulum 4 days ago | parent | next [-]

GPT-1 wasn't used as a zero-shot text generator; that wasn't why it was impressive. The way GPT-1 was used was as a base model to be fine-tuned on downstream tasks. It was the first case of a (fine-tuned) base Transformer model just trivially blowing everything else out of the water. Before this, people were coming up with bespoke systems for different tasks (a simple example is that for SQuAD, a passage-question-answering task, people would have one LSTM to read the passage and another LSTM to read the question, because of course those are different sub-tasks with different requirements and should have different sub-models). Once GPT-1 came out, you just dumped all the text into the context, YOLO fine-tuned it, and trivially got state of the art on the task. On EVERY NLP task.
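
For concreteness, here's a rough sketch of that pretrain-then-fine-tune recipe using modern tooling; it assumes the Hugging Face transformers/datasets libraries and their "openai-gpt" (GPT-1) checkpoint, with an illustrative dataset and hyperparameters rather than anything from the original paper:

    # Load a pretrained GPT-1-style model, dump the raw task text into its
    # context, and fine-tune end to end with a small classification head.
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("openai-gpt")
    tokenizer.pad_token = tokenizer.unk_token      # GPT-1 ships without a pad token
    model = AutoModelForSequenceClassification.from_pretrained("openai-gpt", num_labels=2)
    model.config.pad_token_id = tokenizer.pad_token_id

    dataset = load_dataset("glue", "sst2")         # any text classification task works

    def tokenize(batch):
        return tokenizer(batch["sentence"], truncation=True,
                         padding="max_length", max_length=128)

    encoded = dataset.map(tokenize, batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="gpt1-sst2", num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=encoded["train"],
        eval_dataset=encoded["validation"],
    )
    trainer.train()
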

Overnight, GPT-1 single-handedly upset the whole field. It was somewhat overshadowed by BERT (and later T5), which came out shortly after and tended to perform even better in the pretrain-and-finetune format. Nevertheless, the success of GPT-1 definitely already warranted scaling up the approach.

A better question is how OpenAI decided to scale GPT-2 up to GPT-3. GPT-2 was an awkward in-between model. It generated better text for sure, but the zero-shot performance reported in the paper, while neat, was not great at all. On the flip side, its fine-tuned task performance paled in comparison to much smaller encoder-only Transformers. (The answer is: scaling laws allowed for predictable increases in performance.)
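
To unpack that parenthetical: "scaling laws" here means that validation loss falls roughly as a power law in compute, so you can fit the curve on a handful of cheap runs and extrapolate before committing to the big one. A toy sketch (the numbers are made up for illustration, not OpenAI's actual measurements):

    import numpy as np

    # Losses measured on a few cheap training runs (illustrative values).
    compute = np.array([1e18, 1e19, 1e20, 1e21])   # training FLOPs
    loss    = np.array([4.20, 3.72, 3.29, 2.92])   # validation loss

    # A power law L(C) = a * C**slope is a straight line in log-log space.
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)

    # Extrapolate to a budget 100x beyond anything actually trained.
    predicted = np.exp(intercept + slope * np.log(1e23))
    print(f"fitted exponent {slope:.3f}, predicted loss at 1e23 FLOPs: {predicted:.2f}")
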

gnerd00 4 days ago | parent [-]

> Transformer model just trivially blowing everything else out of the water

No, this is the winners rewriting history. Transformer-style encoders are now applied to lots and lots of disciplines, but they do not "trivially" do anything. The hype re-telling is obscuring the facts of history. Specifically, in human-language text translation, the "Attention Is All You Need" Transformer did "blow others out of the water", yes, for that application.

arugulum 3 days ago | parent [-]

My statement was

>a (fine-tuned) base Transformer model just trivially blowing everything else out of the water

"Attention is All You Need" was a Transformer model trained specifically for translation, blowing all other translation models out of the water. It was not fine-tuned for tasks other than what the model was trained from scratch for.

GPT-1/BERT were significant because they showed that you can pretrain one base model and use it for "everything".

hadlock 4 days ago | parent | prev | next [-]

There's a performance plateau with training time and number of parameters, and then once you get over "the hump" the error rate starts going down again almost linearly. Generative pre-training existed before OpenAI's GPT, but it was theorized that the plateau was a dead end. The sell to VCs in the early GPT-3 era was "with enough compute, enough time, and enough parameters... it'll probably just start thinking and then we have AGI". Sometime around the o3 era they realized they'd hit a wall and performance actually started to decrease as they added more parameters and time. But yeah, basically at the time they needed money for more compute, parameters, and time. I would have loved to have been a fly on the wall in those "AGI" pitches. Don't forget Microsoft's agreement with OpenAI specifically concludes with the invention of AGI. At the time, just getting over the hump, it really did look like we were gonna do AGI in a few months.

I'm really looking forward to the "The Social Network"-style movie treatment of OpenAI, whenever that happens.

whimsicalism 4 days ago | parent [-]

Source? I work in this field and have never heard of the initial plateau you are referring to.

reasonableklout 2 days ago | parent [-]

Maybe hadlock is thinking of double descent? https://openai.com/index/deep-double-descent/

muzani 4 days ago | parent | prev | next [-]

I don't have a source for this (there are probably no sources for anything from back then), but anecdotally, someone at an AI/ML talk said they just added more data and quality went up. Doubling the data doubled the quality. With other breakthroughs, people saw diminishing gains. It's sort of why Sam tweeted back then that he expected the amount of intelligence to double every N years.

I have the feeling they kept at this until GPT-4o (which was trained on a different kind of data).

robrenaud 4 days ago | parent [-]

The mapping from input size to output quality is not linear. This is why we are in the regime of "build nuclear power plants to power datacenters". Fixed-size improvements in loss require exponential increases in parameters/compute/data.
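
A quick back-of-the-envelope version of that claim, assuming a power-law loss curve L(C) = a * C^(-b) with a small exponent (the 0.05 below is illustrative, roughly the order of magnitude reported in the scaling-law literature, not an exact fit):

    # If L(C) = a * C**(-b), then halving the loss needs k*C compute where
    # k**(-b) = 1/2, i.e. k = 2**(1/b). With a small exponent that is enormous.
    b = 0.05
    k = 2 ** (1 / b)
    print(f"compute multiplier to halve the loss: {k:,.0f}x")   # ~1,048,576x
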

brookst 4 days ago | parent [-]

Most of the reason we are re-commissioning a nuclear power plant is demand for quantity, not quality. If demand for compute had scaled this fast in the 1970s, the sudden need for billions of CPUs would not have disproven Moore's law.

It is also true that merely doubling the quantity of training data does not double output quality, but that's orthogonal to power demand at inference time. Even if output quality did double in that case, it would just mean that much more demand, and therefore more power needed.

kevindamm 4 days ago | parent | prev | next [-]

Transformers can be trained at much larger parameter counts than other model architectures for the same amount of compute and time, so they have an evident advantage in terms of being able to scale. Whether scaling the models up to multi-billion parameters would eventually pay off was still a bet, but it wasn't a wild bet out of nowhere.

stavros 4 days ago | parent | prev | next [-]

I assume the cost was just very low? If it was $50-100k, maybe they figured they'd just try and see.

reasonableklout 4 days ago | parent [-]

Oh yes, according to [1], training GPT-2 1.5B cost $50k in 2019 (reproduced in 2024 for $672!).

[1]: https://www.reddit.com/r/mlscaling/comments/1d3a793/andrej_k...

stavros 4 days ago | parent [-]

That makes sense, and it was definitely impressive for $50k.

therein 4 days ago | parent | prev [-]

Probably prior DARPA research or something.

Also, slightly tangentially: people will tell me it's that it was new and novel and that's why we were impressed, but I almost think things went downhill after ChatGPT 3. I felt like 2.5 (or whatever they called it) was able to give better insights from the model weights itself. The moment tool use became a thing and we started doing RAG and memory and search-engine tool use, it actually got worse.

I am also pretty sure we are lobotomizing the things that would feel closer to critical thinking by training it to be sensitive to the taboo of the day. I suspect earlier ones were less broken due to that.

How would it decide between knowing something from its training and needing to use a tool to synthesize a response, anyway?