Forcing Flash Attention onto a TPU and Learning the Hard Way(archerzhang.me)
48 points by azhng 5 days ago | 12 comments
ColonelPhantom an hour ago | parent | next [-]

Interesting read! One remark though: I'm not too familiar with the architecture of a Google TPU, but comparing the TPU's VMEM with Nvidia's shared memory feels wrong to me.

Looking at the size, and its shared nature, it feels far more natural to compare with the L2 cache, which is also shared across the entire GPU and is in the same order of size (40MB on the listed A100).
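To make the size argument concrete, here is a rough back-of-the-envelope comparison (a sketch only: the A100 figures are public spec-sheet values, while the TPU VMEM size varies by generation, so the number used here is an illustrative assumption, not taken from the article):

```python
# Rough on-chip memory comparison behind the point above:
# NVIDIA shared memory is small and partitioned per SM, while the
# A100's L2 cache (and a TPU's VMEM) are large and chip-wide.

A100_SMS = 108                 # streaming multiprocessors on an A100
A100_SHARED_PER_SM_KIB = 164   # max configurable shared memory per SM (KiB)
A100_L2_MIB = 40               # chip-wide L2 cache (MiB)

# Assumption for illustration: VMEM capacity differs across TPU
# generations, so treat this figure as a placeholder.
TPU_VMEM_MIB = 128

total_shared_mib = A100_SMS * A100_SHARED_PER_SM_KIB / 1024

print(f"A100 shared memory (all SMs): {total_shared_mib:.1f} MiB, per-SM slices")
print(f"A100 L2 cache:                {A100_L2_MIB} MiB, shared chip-wide")
print(f"TPU VMEM (assumed):           {TPU_VMEM_MIB} MiB, shared chip-wide")
```

Even summed across every SM, shared memory is well under the L2's 40 MiB, and each kernel block only ever sees its own ~164 KiB slice; VMEM's size and chip-wide scope put it much closer to the L2.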

FL33TW00D 3 hours ago | parent | prev | next [-]

Why ruin good work by letting Claude write it all? Full of em dashes, riddled with Claudisms.

gdiamos 3 hours ago | parent | next [-]

I personally don't mind letting Claude write about work.

You could spend 80% doing the work and 20% writing about it, or 99% doing the work and 1% copy-pasting Claude's writeup about it into a blog.

There is nothing wrong with writing if you are into it, and yes you can probably do better than Claude, but I can relate to engineers who just want to build.

spzb 3 hours ago | parent | next [-]

If you can’t be bothered to write it, why should I bother to read it?

cannonpr an hour ago | parent [-]

Because it contains information of value to you ? I mean if it doesn’t, just don’t read it.

selfhoster11 3 hours ago | parent | prev | next [-]

I could spend 100% doing the work with my own Claude, and 0% reading yours. That's a negative-sum outcome. I do think that the 80%/20% split is better (though anything that is mostly human voice is fine for me).

Groxx 2 hours ago | parent | prev [-]

Because the failures are so frequent and often load-bearing that it makes it a negative sum to even attempt to read stuff that appears generated.

skybrian 2 hours ago | parent | prev [-]

Why let an obsession with writing style prevent you from learning from a reasonably decent writeup?

JSR_FDED 25 minutes ago | parent [-]

He’s doing the author a favor

gdiamos 5 hours ago | parent | prev | next [-]

One of my lessons from using different accelerators, whether different NVIDIA generations or a GPU->TPU move, is that someone needs to do the work of indexing, partitioning, mapping, scheduling, and benchmarking. That work is labor intensive.

In this case, Google has already done it, and that will generally be true when a well-resourced accelerator company like Google is supporting the most popular operations, like attention.

As long as you use those operations, you are okay. But if you do something different, you need to be prepared to do all of this yourself.

refulgentis 5 hours ago | parent | prev [-]

It broke my heart to have a visceral "I'm being slop'd" reaction reading this: it's such good work, and AI's barely used AFAICT, but there's enough odd transitions and copy-pasta'd markdown that you get the subconscious "this is AI" reaction regardless.

Many sentences are 3x as long as they normally would be in subtle ways (to wit: "My flash attention is 35x slower than the fused standard at n=4096. Not a little worse. Catastrophically worse."), and it really wears on attention. (pun intended) It brings literary voice to a technical blog post, and a very difficult process-oriented technical blog post at that. I have to reallocate my unfortunately-limited brain cells from "maintaining state of where we are in the process" to "is this cutesy fluff or important", and I've never had to do that in 37 years with technical blog posts.

The Markdown gets bad. Bolding is used for important phrases (like a human would), then, all of a sudden, after the "Inside a TPU chip" header, it's being used every other sentence, on anything that is a proper noun/would have a Wikipedia article. It got so weird that at some point I was like "a human definitely didn't let this through...they must be links?" and tried clicking them.

It's doubly bad at that point, because markdown tables start coming in hot and heavy too. So you're left with "it's pretty apparent the LLM did it from here, and I can't keep trying to keep the state of the process in my head while trying to figure out if the bolding is important": reflexive close tab.

jacquesm 3 hours ago | parent [-]

You got a lot further than I did.