versteegen 4 days ago

Thanks for posting some of the history... "You might find even earlier examples" is pretty tongue-in-cheek, though. [1], expanded in 2003 into [2], has 12,466 citations, 299 of them by 2011 (according to Google Scholar, which seems to conflate the two versions). The abstract of [2] mentions that their approach, with "large models (with millions of parameters)", "significantly improves on state-of-the-art n-gram models, and... allows to take advantage of longer contexts." Progress between 2000 and 2017 (transformers) was slow, and models barely got bigger.
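
For concreteness, here's a minimal sketch of the [1]/[2] architecture, assuming PyTorch (the dimensions and names are illustrative, not from the paper): each context word is looked up in a shared embedding table, the embeddings are concatenated and pushed through a tanh hidden layer, and a softmax over the vocabulary predicts the next word.

    import torch
    import torch.nn as nn

    class NPLM(nn.Module):
        """Bengio-style neural LM sketch: embed -> tanh hidden -> softmax."""
        def __init__(self, vocab_size, embed_dim=60, hidden_dim=100, context=4):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)          # shared word features
            self.hidden = nn.Linear(context * embed_dim, hidden_dim)  # the hidden layer
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, context_ids):              # (batch, context)
            x = self.embed(context_ids).flatten(1)   # concatenate context embeddings
            h = torch.tanh(self.hidden(x))           # the layer word2vec later dropped
            return self.out(h)                       # logits over the next word

(The paper also has optional direct input-to-output connections, omitted here for brevity.)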

And what people forget about Mikolov's word2vec (2013) is that it actually took a huge step backwards from NNs like [1] that inspired it: it removed all the hidden layers in order to train fast on lots of data.
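
For contrast, a sketch of word2vec's CBOW variant under the same assumptions (PyTorch, illustrative dimensions; the real tool also uses hierarchical softmax or negative sampling for speed, omitted here). Note there is no hidden layer and no nonlinearity at all, just averaged embeddings feeding a linear output:

    import torch
    import torch.nn as nn

    class CBOW(nn.Module):
        """word2vec CBOW sketch: average embeddings -> linear output, no hidden layer."""
        def __init__(self, vocab_size, embed_dim=300):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.out = nn.Linear(embed_dim, vocab_size, bias=False)

        def forward(self, context_ids):              # (batch, context)
            x = self.embed(context_ids).mean(1)      # average context embeddings
            return self.out(x)                       # logits over the center word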

[1] Yoshua Bengio, Réjean Ducharme, Pascal Vincent (2000), "A Neural Probabilistic Language Model", NIPS.

[2] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin (2003), "A Neural Probabilistic Language Model", JMLR. https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf