iNic 4 days ago
At the time, getting complete sentences was extremely difficult! N-gram models were essentially the best we had.
albertzeyer 4 days ago
No, it was not difficult at all. I really wonder why they have such a bad example here for GPT-1. See for example this popular blog post: https://karpathy.github.io/2015/05/21/rnn-effectiveness/

That was in 2015, with RNN LMs, which are all much, much weaker in that blog post than GPT-1. Even looking at those examples in 2015, you could maybe see the future potential. But no one was thinking that scaling up would work as effectively as it does.

2015 is also by far not the first time we had such LMs. Mikolov had been doing RNN LMs since 2010, and Sutskever in 2011. You might find even earlier examples of NN LMs. (Before that, the state of the art was mostly N-grams.)
| ||||||||
macleginn 4 days ago
N-gram models had been superseded by RNNs by that time. RNNs struggled with long-range dependencies, but useful n-grams were essentially capped at n=5 because of sparsity, and RNNs could do better than that.
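A minimal sketch of the sparsity point (my own toy example, not from the thread): counting n-grams over a tiny corpus shows that as n grows, nearly every n-gram occurs only once, so count-based probability estimates become useless without heavy smoothing.

```python
# Toy illustration of n-gram sparsity; corpus and function names are made up.
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

corpus = "the cat sat on the mat and the dog sat on the rug".split()

for n in (2, 5):
    counts = ngram_counts(corpus, n)
    singletons = sum(1 for c in counts.values() if c == 1)
    print(f"n={n}: {len(counts)} distinct n-grams, {singletons} seen only once")

# As n grows, almost every n-gram is unique, so maximum-likelihood estimates
# carry no useful signal -- the sparsity cap the comment above refers to.
```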