Noam Shazeer was one of the lead authors of the seminal paper "Attention Is All You Need", which introduced the transformer architecture. (From Wikipedia)

▲ tmule 6 hours ago | parent [-]

This understates his criticality. The author list was randomized, but the critical idea was truly his. Wonder what this says about GDM …

▲ HarHarVeryFunny 5 hours ago | parent | next [-]

The architecture was Shazeer's, but the rough idea came from Jakob Uszkoreit who initiated the project.

Uszkoreit wanted to build a more efficient/scalable language/seq2seq model that could take advantage of GPU parallelism (replacing RNNs which were the main approach to sequence modelling at that time).

Uszkoreit's insight was that although language appears sequential, it is in fact part parallel, part hierarchical, as can be seen in linguists' sentence parse trees, where at each level there is parallelism/independence between the branches of the tree, with them getting combined at the next level up. This is what gave rise to the idea of a model that consisted of a stack of parallel processing layers (transformer layers). I believe that attention was also part of the plan from day one, as this had already been proven to be valuable (Bahdanau) with RNN seq2seq modelling.

So, this is what Uszkoreit wanted to build, but by his own account he failed to come up with an implementation that matched or outperformed the prevailing RNN approach that he wanted to replace. At this point, Uszkoreit mentioned the idea to Shazeer, who got on board and eventually arrived at a performant architecture, which was then pared back by an ablation process, resulting in the initial encoder-decoder Transformer architecture. Shazeer also co-developed the sparsely-gated mixture-of-experts layer (published in early 2017, shortly before the Transformer paper), and came up with other optimizations after he left to found character.ai.

▲ abixb 3 hours ago | parent | next [-]

Curious about others' contributions to the paper, such as those of Vaswani, Parmar, Jones and Gomez. What sucks about co-authorship in research papers is that you don't get a clean breakdown of who contributed what, and the distribution (more often than not) looks very much like a Pareto distribution.

I'm talking from plenty of group project experience here.

▲ senordevnyc 3 hours ago | parent | prev [-]

Can you expound on the ablation process? Is that referring to a stripping down of the data or weights or something? Or a stripping down of the transformer architecture structurally? Just curious

▲ tedd4u 3 hours ago | parent [-]

You train the model, then do a baseline evaluation. Then you evaluate many variants in which different layers or chunks of the model have been removed or nulled out. By comparing the performance of those mutated models to the baseline you can learn a lot about the model: which parts don't add much value and can be removed, where particular "functions" or "facts" are located, and so on. Google it.
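
A minimal sketch of that procedure, assuming a toy "model" that is just a sum of named components and a score defined as negative mean squared error. The component names, the data, and the helper functions are all hypothetical and only illustrate the compare-to-baseline loop; in published ablation studies the variants are often retrained as well, which this sketch does not do.

```python
# Toy ablation study (illustrative only, not the procedure from any paper).
# The "model" is a sum of named components; the score is negative MSE.

def make_model(components):
    """Return a predictor that sums the outputs of the given components."""
    def predict(x):
        return sum(fn(x) for fn in components.values())
    return predict

def evaluate(predict, data):
    """Negative mean squared error over (x, y) pairs: higher is better."""
    return -sum((predict(x) - y) ** 2 for x, y in data) / len(data)

def ablation_study(components, data):
    """Score the full model, then each variant with one component removed.

    Returns the baseline score and, per component, how much the score
    drops when that component is taken out.
    """
    baseline = evaluate(make_model(components), data)
    drops = {}
    for name in components:
        variant = {k: v for k, v in components.items() if k != name}
        drops[name] = baseline - evaluate(make_model(variant), data)
    return baseline, drops

# Hypothetical setup: the target is y = 2x + 1, so the "dead" component
# contributes nothing and should show no drop when removed.
components = {
    "linear": lambda x: 2 * x,
    "bias": lambda x: 1,
    "dead": lambda x: 0.0,
}
data = [(x, 2 * x + 1) for x in range(5)]

baseline, drops = ablation_study(components, data)
for name, drop in drops.items():
    print(f"removing {name!r} lowers the score by {drop}")
```

A component whose removal barely changes the score is a candidate for being pared away; a large drop marks a part the model depends on.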

▲ flebron 5 hours ago | parent | prev | next [-]

Source for this? The notion of attention dates to a content-addressable lookup during sequence alignment (as well as, concurrently, memory lookups in neural Turing machines). Attention had been used in other models, like GRUs and LSTMs with attention. The Vaswani et al. paper did not introduce attention, just removed everything _but_ attention (and FFW) from the network. Are you claiming the "critical idea" of removing the GRU and LSTM parts and just keeping attention was "truly" Noam's?

▲ daemonologist 5 hours ago | parent [-]

At some point in late 2017 the paper was updated with this additional detail:

    Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position representation and became the other person involved in nearly every detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating our research.

In any case, if the authors considered their contributions equal, that's good enough for me.
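
For reference, the scaled dot-product attention named in that note is defined in the paper as softmax(QK^T / sqrt(d_k)) V. A minimal pure-Python sketch of that formula follows, assuming small list-of-lists matrices and a single head with no masking; the function names are illustrative and are not taken from the paper's codebase or tensor2tensor.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for list-of-lists inputs.

    queries: n_q rows of length d_k
    keys:    n_k rows of length d_k
    values:  n_k rows of length d_v
    Returns n_q rows of length d_v.
    """
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs = []
    for q in queries:
        # Dot product of the query with every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value rows.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(d_v)
        ])
    return outputs

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]

# A zero query scores every key equally, so the output is the mean of the values.
uniform = scaled_dot_product_attention([[0.0, 0.0]], keys, values)

# A query strongly aligned with the first key returns (almost exactly) the first value.
peaked = scaled_dot_product_attention([[100.0, 0.0]], keys, values)
```

Each query is computed independently of the others, which is the property that lets the whole sequence be processed in parallel rather than step by step as in an RNN.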

▲ tmule 3 hours ago | parent [-]

Thanks - wanted to point to this, and indeed should have worded my claim more precisely. And yes, am aware of prior work on attention. (I need to look it up, but I recall Noam saying publicly that he wouldn’t have agreed to random ordering of contributions if he had known this was going to be this big.)

▲ mi_lk 5 hours ago | parent | prev | next [-]

I didn't know we could just say things now. Ah, we're on the internet.

▲ dyauspitr an hour ago | parent | prev | next [-]

That’s not true. Jakob, Ashish and Illia were responsible for the core idea and initial implementation, and Noam for several critical implementation details.

▲ d4rkp4ttern 5 hours ago | parent | prev | next [-]

Is this a generally well known thing?

▲ tmule 3 hours ago | parent [-]

Nope, but it’s not particularly unknown either. It shouldn’t be a surprise; he had remarkable research contributions before and after (separately, he was also an IMO gold medalist).

▲ markdown 6 hours ago | parent | prev [-]

Even more important, I wonder what it says about HBW...

▲ khazhoux 5 hours ago | parent [-]

Even if we knew, we’d still fail to understand GHO

▲ fastball 5 hours ago | parent [-]

But more importantly the impact this has on TLAs