For those interested, Wired ran a backstory about the Attention is All You Need paper 2 years ago: https://www.wired.com/story/eight-google-employees-invented-...

It gives some context on the contributions of each of the authors. About Shazeer, from the article:

Shazeer’s joining the group was critical. “These theoretical or intuitive mechanisms, like self-attention, always require very careful implementation, often by a small number of experienced ‘magicians,’ to even show any signs of life,” says Uszkoreit. Shazeer began to work his sorcery right away. He decided to write his own version of the transformer team’s code. “I took the basic idea and made the thing up myself,” he says. Occasionally he asked Kaiser questions, but mostly, he says, he “just acted on it for a while and came back and said, ‘Look, it works.’” Using what team members would later describe with words like “magic” and “alchemy” and “bells and whistles,” he had taken the system to a new level.

▲

SiempreViernes 4 hours ago | parent [-]

> Using what team members would later describe with words like “magic” and “alchemy” and “bells and whistles,”

Ok, these people have all gotten extensive training on how to hype for the non-technical crowd without saying anything of substance.

▲

ahmadyan 2 hours ago | parent | next [-]

As a hacker, I kinda like Noam's code. I had to implement a TC MoE kernel and stumbled upon his code from [tensor2tensor](https://github.com/tensorflow/tensor2tensor/blob/master/tens...), and I think "alchemy" is justified. Dude writes some beautiful kernels.

He also saw that LLMs would replace search before anyone else did, and it is really something to look at LaMDA's or GPT-1's output and think: yeah, this will answer all of our questions one day.

▲

jvican an hour ago | parent | next [-]

There's no doubt about Noam's abilities. But I read through that code, and I struggle to see its 'magic' or 'alchemy'. Can you elaborate on what you find especially good about that code? (You may assume GPU kernel programming knowledge on my end.)

▲

dekhn an hour ago | parent | next [-]

To me the magic Noam moment was when he came to my team and said "that cluster has a bad node in it, but this other one doesn't" and we had to spend like a week tracking down a single bad processor out of thousands.

▲

jeswin an hour ago | parent | prev [-]

Unrelated to the particular code above: there's a difference between writing code about or adjacent to a proven idea vs. writing code in uncharted territory. I suspect that is what happened here. It's the same thing with, say, music and art. A lot of people today can play Chuck Berry.

▲

jvican an hour ago | parent [-]

It's a good point. Though I do wonder if the magic he cast was more at the conceptual level (intense belief in a set of primitives that ought to work) than in the code itself. Even by 2018's standards, the TensorFlow code above doesn't really look that impressive. It's hard to judge based on those past standards, though. But I wonder if somebody who knows more than me can elaborate.

▲

eli_gottlieb an hour ago | parent | prev [-]

Also, evaluating complicated functions with numerical stability and automatic differentiation is hard.

▲

dang 3 hours ago | parent | prev | next [-]

"Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith."

https://news.ycombinator.com/newsguidelines.html

▲

anon_shill 2 hours ago | parent [-]

Does that apply to quotes from an article? They seemed to be criticizing a second- or third-degree source for being PR, which feels fair.

▲

dang 2 hours ago | parent [-]

Yes, in the sense that if there's nothing interesting to say about a quote then there's no reason to copy it into the thread.

▲

nrds 2 hours ago | parent [-]

And, of course, "interesting" means "interesting to dang"; whether apparently technical sources have apparently received PR training is therefore not "interesting". Drop the "my preferences are actually just objective truth" routine for once. Why is it so painful to admit you curate this site by preference?

▲

dang 41 minutes ago | parent [-]

It's not a question of painful - I'm happy to "admit" what's true, as best I can, and not what's not true. Let's see if we can sort that out a bit in the present case.

HN is certainly curated - I've been "admitting" that since the day I got outed as a mod here:

https://news.ycombinator.com/item?id=7494621 (March 2014)

https://news.ycombinator.com/item?id=7507229 (April 2014)

https://news.ycombinator.com/item?id=7962942 (June 2014)

https://news.ycombinator.com/item?id=8569117 (Nov 2014)

https://news.ycombinator.com/item?id=15556105 (Oct 2017)

But we try hard to do the curation by principle, not by personal whim. What principles? Really there's just one: intellectual curiosity—we try to feature what enhances that and dampen what degrades it [1]. From that starting point, though, you can derive lots of other principles. Probably the most important is that snark and indignation are bad for HN (especially in combination!) because they drown out curious conversation. That's all that you need to see why I posted that reply to the GP; no personal preference required.

[1] https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...

▲

rat9988 25 minutes ago | parent [-]

The current case still seems very heavy on personal preference. Applying principles is subjective, as we are all human. I found the comment as interesting as the quote it is answering.

▲

dang 23 minutes ago | parent [-]

It does seem more of a borderline case to me when I reread it, too.

▲

epihelix 2 hours ago | parent | prev [-]

The "bells and whistles" label sounds more dismissive / pejorative to me. An odd, and not a particularly nice, thing to say. Makes me wonder how the "magic" and "alchemy" terms were intended in this case, also.