arugulum 3 days ago

My statement was

>a (fine-tuned) base Transformer model just trivially blowing everything else out of the water

"Attention is All You Need" was a Transformer model trained specifically for translation, blowing all other translation models out of the water. It was not fine-tuned for tasks other than what the model was trained from scratch for.

GPT-1/BERT were significant because they showed that you can pretrain one base model and use it for "everything".
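To make the contrast concrete, here's a minimal sketch of that pretrain-then-fine-tune recipe using the Hugging Face transformers library (the model name and task here are just illustrative): the same pretrained base weights get reused for a downstream task they were never trained on from scratch.

    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # One pretrained base model...
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    # ...reused for a downstream task (here, 2-class sentiment classification)
    # by attaching a small task head on top of the pretrained weights.
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    )

    inputs = tokenizer("The movie was surprisingly good.", return_tensors="pt")
    outputs = model(**inputs)  # classification logits from the fine-tunable head

The original translation Transformer had no equivalent of this: you trained it end-to-end for translation, and that's all it did.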