GPT-1 wasn't used as a zero-shot text generator; that wasn't why it was impressive. GPT-1 was used as a base model to be fine-tuned on downstream tasks. It was the first case of a (fine-tuned) base Transformer model just trivially blowing everything else out of the water. Before this, people were coming up with bespoke systems for different tasks (a simple example: for SQuAD, a passage question-answering task, people would have one LSTM to read the passage and another LSTM to read the question, because of course those are different sub-tasks with different requirements and should have different sub-models). Once GPT-1 came out, you just dumped all the text into the context, YOLO fine-tuned it, and trivially got state of the art on the task. On EVERY NLP task.
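
For anyone wondering what "dumped all the text into the context" means in practice, here is a minimal sketch, assuming a SQuAD-style passage/question pair. The special-token strings (`<start>`, `<delim>`, `<extract>`) and the `build_input` helper are hypothetical placeholders for illustration, not the exact format from any paper or library.

```python
# Hypothetical special-token strings (assumption for illustration only);
# real systems map such markers to learned token IDs in the vocabulary.
START, DELIM, EXTRACT = "<start>", "<delim>", "<extract>"


def build_input(passage: str, question: str) -> str:
    """Concatenate passage and question into ONE sequence for ONE model.

    This replaces the older pattern of feeding the passage and the
    question to two separate task-specific encoders.
    """
    return f"{START} {passage.strip()} {DELIM} {question.strip()} {EXTRACT}"


example = build_input(
    "The Transformer architecture was introduced in 2017.",
    "When was the Transformer architecture introduced?",
)
print(example)
```

The point is that the task-specific part shrinks to a string-formatting step; everything else is the same pretrained model plus a small output head trained during fine-tuning.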

Overnight, GPT-1 single-handedly upended the whole field. It was somewhat overshadowed by the BERT and T5 models that came out shortly after, which tended to perform even better in the pretrain-and-finetune setting. Nevertheless, the success of GPT-1 already warranted scaling up the approach.

A better question is how OpenAI decided to scale from GPT-2 to GPT-3. GPT-2 was an awkward in-between model. It generated better text for sure, but the zero-shot performance reported in the paper, while neat, was not great at all. On the flip side, its fine-tuned task performance paled compared to much smaller encoder-only Transformers. (The answer is: scaling laws allowed for predictable increases in performance.)

gnerd00 4 days ago

> Transformer model just trivially blowing everything else out of the water

No, this is the winners rewriting history. Transformer-style encoders are now applied to lots and lots of disciplines, but they do not "trivially" do anything. The hype retelling is obscuring the facts of history. Specifically in human-language text translation, yes, the "Attention is All You Need" Transformers did "blow others out of the water", for that application.

	arugulum 3 days ago
		My statement was

		> a (fine-tuned) base Transformer model just trivially blowing everything else out of the water

		The "Attention is All You Need" model was a Transformer trained specifically for translation, and it blew all other translation models out of the water. It was not fine-tuned for tasks other than the one it was trained from scratch for. GPT-1/BERT were significant because they showed that you can pretrain one base model and use it for "everything".