| ▲ | mlmonkey 4 hours ago |
| For those interested, Wired ran a backstory about the Attention is All You Need paper 2 years ago: https://www.wired.com/story/eight-google-employees-invented-... It gives some context on the contributions of each of the authors.
About Shazeer, from the article: Shazeer’s joining the group was critical. “These theoretical or intuitive mechanisms, like self-attention, always require very careful implementation, often by a small number of experienced ‘magicians,’ to even show any signs of life,” says Uszkoreit. Shazeer began to work his sorcery right away. He decided to write his own version of the transformer team’s code. “I took the basic idea and made the thing up myself,” he says. Occasionally he asked Kaiser questions, but mostly, he says, he “just acted on it for a while and came back and said, ‘Look, it works.’” Using what team members would later describe with words like “magic” and “alchemy” and “bells and whistles,” he had taken the system to a new level. |
|
| ▲ | SiempreViernes 4 hours ago | parent [-] |
| > Using what team members would later describe with words like “magic” and “alchemy” and “bells and whistles,” Ok, these people have all gotten extensive training on how to hype for the non-technical crowd without saying anything of substance. |
| |
| ▲ | ahmadyan 2 hours ago | parent | next [-] | | As a hacker, I kinda like Noam's code. I had to implement a TC MoE kernel, and stumbled upon his code from [tensor2tensor](https://github.com/tensorflow/tensor2tensor/blob/master/tens...) and I think "alchemy" is justified. Dude writes some beautiful kernels. He also saw that LLMs would replace search before anyone else, and it is really something to look at LaMDA's or GPT-1's output and think: yeah, this will answer all of our questions one day. | | |
| ▲ | jvican an hour ago | parent | next [-] | | There's no doubt about Noam's abilities. But I read through that code, and struggle to see its 'magic' or 'alchemy'. Can you elaborate on what you find especially good about that code? (You may assume GPU kernel programming knowledge on my end.) | | |
| ▲ | dekhn an hour ago | parent | next [-] | | To me the magic Noam moment was when he came to my team and said "that cluster has a bad node in it, but this other one doesn't" and we had to spend like a week tracking down a single bad processor out of thousands. | |
| ▲ | jeswin an hour ago | parent | prev [-] | | Unrelated to the particular code above. There's a difference between writing code about or adjacent to a proven idea vs writing code in uncharted territory. I suspect that is what happened here. It's the same thing with say music and art. A lot of people today can play Chuck Berry. | | |
| ▲ | jvican an hour ago | parent [-] | | It's a good point. Though I do wonder if the magic he cast was more at the conceptual level (intense belief in a set of primitives that ought to work) than in the code itself. Even by 2018's standards, the TensorFlow code above doesn't really look that impressive. It's hard to judge based on those past standards, though. But I wonder if somebody who knows more than me can elaborate. |
|
| |
| ▲ | eli_gottlieb an hour ago | parent | prev [-] | | Also, evaluating complicated functions with numerical stability and automatic differentiation is hard. |
| |
| ▲ | dang 3 hours ago | parent | prev | next [-] | | "Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith." https://news.ycombinator.com/newsguidelines.html | | |
| ▲ | anon_shill 2 hours ago | parent [-] | | Does that apply to quotes from an article? They seemed to be criticizing a second or third degree source for being PR, which feels fair. | | |
| ▲ | dang 2 hours ago | parent [-] | | Yes, in the sense that if there's nothing interesting to say about a quote then there's no reason to copy it into the thread. | | |
| ▲ | nrds 2 hours ago | parent [-] | | And, of course, "interesting" means "interesting to dang"; whether apparently technical sources have apparently received PR training is therefore not "interesting". Drop the "my preferences are actually just objective truth" routine for once. Why is it so painful to admit you curate this site by preference? | | |
|
|
| |
| ▲ | epihelix 2 hours ago | parent | prev [-] | | The "bells and whistles" label sounds more dismissive / pejorative to me. An odd, and not a particularly nice, thing to say. Makes me wonder how the "magic" and "alchemy" terms were intended in this case, also. |
|